
Occlusion aware sensor fusion for early crossing pedestrian detection

Andras Palffy a,1, Julian F. P. Kooij a,2 and Dariu M. Gavrila a,3

Proc. of the IEEE Intelligent Vehicles Symposium, Paris (France), 2019.

Abstract— Early and accurate detection of crossing pedestrians is crucial in automated driving to execute emergency manoeuvres in time. This is a challenging task in urban scenarios, however, where people are often occluded (not visible) behind objects, e.g. other parked vehicles. In this paper, an occlusion aware multi-modal sensor fusion system is proposed to address scenarios with crossing pedestrians behind parked vehicles. Our proposed method adjusts the expected detection rate in different areas based on sensor visibility. We argue that using this occlusion information helps to evaluate the measurements. Our experiments on real world data show that fusing radar and stereo camera for such tasks is beneficial, and that including occlusion in the model helps to detect pedestrians earlier and more accurately.

I. INTRODUCTION

Densely populated urban areas are challenging locations for automated driving. One of the most critical reasons for that is the high number of Vulnerable Road Users (VRUs): pedestrians, cyclists, and moped riders. VRUs' locations are loosely regulated and they can change their speed and heading rapidly, which makes their detection and tracking complicated. At the same time, they are at high risk in case of a potential collision. E.g., of the approximately 1.3 million road traffic deaths every year, more than half are VRUs [1]. One particularly dangerous scenario is when a pedestrian crosses in front of the vehicle: 94% of injured pedestrians between 2010 and 2013 in the US were hit after such behaviour [2].

Detecting crossing pedestrians as early as possible is crucial, since it leaves more time to execute emergency braking or steering. To plan such a manoeuvre, a precise location and trajectory of the VRU is also needed. Thus, a pedestrian detection system has two aims: early detection and accurate tracking of the VRU. Intelligent vehicles have several sensors to cope with this task: cameras [3], [4], [5], LIDARs [6], and radars [7], [8], [9] have been used. Fusing different types of sensors, e.g. radar with camera [10] or LIDAR with camera [11], can add reliability and redundancy to such systems.

Unfortunately, pedestrian detection is often complicated in urban scenarios by severe occlusions, e.g. by parked vehicles. Consider the scene in Figure 1. The ego-vehicle (white) is passing a parked vehicle (blue). A pedestrian, partly or fully occluded by the parked vehicle, is stepping out onto the road in front of the approaching ego-vehicle. Such behaviour of the VRU is also called darting-out [2]. This situation is particularly dangerous since neither the driver nor the pedestrian has a clear view of the other road user.

a) Intelligent Vehicles Group, Delft Technical University, The Netherlands; 1) [email protected]; 2) [email protected]; 3) [email protected]

Fig. 1: Darting-out scenario: a pedestrian steps out from behind a parked car (blue), which blocks the line-of-sight of the ego-vehicle (white). The occlusion free/occluded area is marked with A/B. The checkerboard and the camera on the VRU are only used for ground truth collection.

The sensors of the ego-vehicle are also influenced by the parked car, since it blocks their direct line-of-sight to the VRU as well. However, the level of this influence is highly dependent on the sensor type. E.g., a camera may still see the upper part of a pedestrian behind a small car, but cannot see them at all behind a truck or a van. Radars, on the other hand, are able to detect reflections of a pedestrian even in full occlusion via multi-path propagation [12], i.e. the reflected radar signal can 'bounce' off other parked cars or the ground underneath the occluding vehicle. Such indirect reflections are weaker than direct ones [12], but they can still provide valuable information about a darting-out pedestrian.

Detecting the occluding object (i.e. the parked car, van, or truck) and estimating its size and position is already addressed in the literature, e.g. in [13]. This information can be used to create an occlusion model of the environment, which describes what kind of detections, and how many, are reasonable to expect in differently occluded regions of the scene. We argue that incorporating this occlusion model in a sensor fusion framework helps to detect darting-out pedestrians earlier and more accurately, and we show that camera and radar are suitable choices for such a system.

II. RELATED WORK

Pedestrian detection in intelligent vehicles is a widely researched topic. An extensive survey on vision based pedestrian detection can be found in [14]. In recent years, Convolutional Neural Networks and Deep Learning methods are often applied for pedestrian detection tasks [15].

Various camera based pedestrian detection systems explicitly address the problem of occlusion. E.g., [4] proposes to use a set of component-based classifiers; a generative stochastic neural network model is used to estimate the posterior probability of a pedestrian given its component scores. [5] uses Gradient Patch and a CNN to learn partial features and tackle occlusion without any prior knowledge. In [16], HOG features were used to create an occlusion likelihood map of the scanning window; regions with mostly negative scores are segmented and considered as occluded regions. [17] used a mixture-of-experts framework to handle partial occlusion of pedestrians. [18] applied motion based object detection and pedestrian recognition on stereo camera input to initiate emergency braking or evasive steering for darting-out scenarios. However, none of these methods use a global model of the scene to describe occlusions, which can change the detection quality or quantity at certain positions (e.g. no/fewer detections behind cars).

Using radars to detect VRUs is also interesting, as they are more robust to weather and lighting conditions (e.g. rain, snow, darkness) than camera and LIDAR sensors. Several radar based pedestrian detection systems have been described in the literature. A radar based multi-class classifier system (including pedestrian and group-of-pedestrians classes) was shown in [19]. [7] and [8] both aim to distinguish pedestrians from vehicles using radar features such as the size and velocity profiles of the objects. In [20], a radar based pedestrian tracking method is introduced using track-before-detect and particle filtering. They also tested their system on tracks where the VRU enters and leaves an occluded region behind a car. The radar was able to provide measurements even in the occlusion; however, the occlusion itself was not considered.

Several sensor fusion methods have been published with the aim of better pedestrian detection. Camera has been combined with LIDAR [11] using CNNs. A fused system of camera and radar is introduced in [10] for static indoor applications. All three sensors were fused in [21] in a multi-class moving object detection system. In [13], fusion of LIDAR and radar was used to detect pedestrians in occlusion in a static experimental setup. They used LIDAR both to detect unoccluded pedestrians and to map occluding objects to provide regions of interest for the radar.

Pedestrians are often tracked using some kind of Bayesian filter, e.g. Kalman Filters (KF) [22]. Another commonly used method is the Particle Filter [20], [23], which estimates the posterior distribution using a set of weighted particles. To fit our use-case (detection and tracking of a pedestrian), a filter should not only track an object, but also report a detection confidence that there is an object to track. [23], [24], [25], [26] give solutions to include this existence probability into Particle Filters.

Several datasets have been published to help the development and testing of autonomous vehicles, e.g. the well-known KITTI [27] or the recently published EuroCity dataset [28]. At the time of writing, NuScenes [29] is the only public automotive dataset to include radar measurements. Unfortunately, it does not include enough darting-out scenes for this research.

Fig. 2: Graphical model of the probabilistic dependencies in a single time slice t with K_t detections Z_t = {z_t^1, ..., z_t^{K_t}}. s_t is the state vector and λ_B, λ_F are the expected detection rates. The binary flag η_t^k denotes whether the k-th detection z_t^k comes from foreground or background. Discrete/real variables are shown with square/circle nodes. Observed variables are shaded.

In conclusion, pedestrian detection has been extensively researched using all three main sensors (camera, LIDAR, radar). However, to the best of our knowledge, this is the first research to propose a fused pedestrian detection system in an automotive setup that includes knowledge of occluded areas in the model, and the first to explicitly address the previously defined darting-out scenario using a fusion of stereo camera and radar.

Our contributions are as follows. 1) We propose a generic occlusion aware multi-sensor Bayesian filter for object detection and tracking. 2) We apply the proposed filter as a radar and camera based pedestrian detection and tracking system on challenging darting-out scenarios, and show that including occlusion information into our model helps to detect these occluded pedestrians earlier and more accurately.

III. METHOD

In this section, we present our occlusion aware sensor fusion framework. For modeling the state space and transitions we follow [23]. A graphical model of the method is shown in Figure 2. Subscripts are timestamps; superscripts are either indices or mark background/foreground.

Let the space S be a 2D (lateral and longitudinal) position and velocity, and a binary flag marking whether the tracked object (e.g. a VRU) exists. Let s be a state vector in S:

S : R^2 × R^2 × {0, 1},    (1)
s ∈ S,  s = (x, v, E),    (2)

where x and v are the object's position and velocity vectors on the ground plane, and E is its binary presence flag. E will be used to represent the existence probability, i.e. E = 1 means there is a pedestrian in the scene and E = 0 means its absence. We define a Bayesian filter for detection and tracking, for which we need the prior distribution P(s_t | Z_{1:t-1}) and the measurement likelihood function P(Z_t | s_t), where Z_t is the set of all sensor detections at timestamp t.
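As a concrete illustration, the state of Eq. (2) could be held in a small container such as the following sketch (our own, not the authors' code); the field names and units are assumptions.

```python
# Minimal sketch of the state s = (x, v, E) from Eq. (2); assumes 2D NumPy
# vectors for position and velocity on the ground plane and a boolean
# presence flag E. Names and units are our own choices.
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    x: np.ndarray  # 2D position (lateral, longitudinal) in metres
    v: np.ndarray  # 2D velocity in m/s
    E: bool        # presence flag: True if a pedestrian exists in the scene

s = State(x=np.array([1.0, 7.5]), v=np.array([1.2, 0.0]), E=True)
```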

A. Prediction step

In this subsection, we derive the prior distribution P(s_t | Z_{1:t-1}) for time t. To get this, we need the previous state and the state transition P(s_t | s_{t-1}). To handle the temporal changes in the presence flag E, we introduce two functions and a parameter in our model. An object stays in the scene with a probability of p_s(s_{t-1}). A new object can appear with a probability of p_n, and its entering position is distributed as p_e(s_t). Using these, we can set the probabilities of the E_t flag states given the previous state s_{t-1}:

P(E_t = 1 | E_{t-1} = 0, x_{t-1}, v_{t-1}) = p_n    (3)
P(E_t = 0 | E_{t-1} = 0, x_{t-1}, v_{t-1}) = 1 - p_n    (4)
P(E_t = 1 | E_{t-1} = 1, x_{t-1}, v_{t-1}) = p_s(s_{t-1})    (5)
P(E_t = 0 | E_{t-1} = 1, x_{t-1}, v_{t-1}) = 1 - p_s(s_{t-1}).    (6)

In case of a present object (E_t = 1), the x_t and v_t values are distributed as:

P(x_t, v_t | E_t = 1, E_{t-1} = 0, s_{t-1}) = p_e(x_t, v_t)    (7)
P(x_t, v_t | E_t = 1, E_{t-1} = 1, s_{t-1}) = P(x_t, v_t | x_{t-1}, v_{t-1}).

For this last term, we assume a linear dynamic model with a normally distributed acceleration noise.

Thus, we can finally write the full state transition as:

P(s_t | s_{t-1}) = P(E_t | s_{t-1}) · P(x_t, v_t | E_t, s_{t-1}).    (8)
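As a sketch of how Eqs. (3)-(8) could be turned into a sampling routine (our own illustration under assumed helper functions for p_s, p_n, and the entry distribution p_e, not the authors' implementation):

```python
# Sketch of sampling the state transition of Eqs. (3)-(8): survival/entry of
# the presence flag, then a linear constant-velocity model with Gaussian
# acceleration noise. p_n, p_s and sample_entry are assumed user-supplied.
import numpy as np

rng = np.random.default_rng(0)

def sample_transition(x, v, E, dt, p_n, p_s, sample_entry, accel_std=0.5):
    """Draw (x_t, v_t, E_t) given the previous state (x, v, E)."""
    if not E:
        if rng.random() < p_n:            # Eqs. (3)-(4): a new object appears
            x_t, v_t = sample_entry()     # Eq. (7): entry distribution p_e
            return x_t, v_t, True
        return x, v, False
    if rng.random() > p_s(x, v):          # Eqs. (5)-(6): the object leaves
        return x, v, False
    a = rng.normal(0.0, accel_std, size=2)         # acceleration noise
    v_t = v + a * dt                               # linear dynamic model
    x_t = x + v * dt + 0.5 * a * dt**2
    return x_t, v_t, True
```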

B. Update step

Now we describe the likelihood P(Z_t | s_t). We assume conditional independence for our sensors, thus the update step here is described for one sensor. The sensor returns K_t detections at once: Z_t = {z_t^1, ..., z_t^{K_t}}. Each detection z_t^k contains a 2D location. The total number of detections K_t is the sum of foreground (K_t^F) and background (K_t^B) detections: K_t = K_t^B + K_t^F. We model the number of foreground (true positive) and background (false positive) detections with two Poisson distributions. Let us denote the detection rates with λ_B and λ_F(x_t, E_t) for the background and foreground detections respectively. K_t^B, K_t^F are then distributed as Poisson distributions parametrized by λ_B, λ_F:

K_t^B ~ Pois(λ_B),    (9)
K_t^F ~ Pois(λ_F(x_t, E_t)).    (10)

Together, the number of detections K_t is distributed as:

P(K_t | x_t, E_t) ~ Pois(λ_B + λ_F(x_t, E_t)).    (11)

Note that the number of true positive (foreground) detections depends both on the object's presence and location, thus we can incorporate occlusion information here. E.g., more true detections are expected if the pedestrian is unoccluded than if the pedestrian is occluded:

λ_F(x_t, E_t = 1) = λ_unocc if x_t ∈ A,
                    λ_occ   if x_t ∈ B,    (12)

where λ_unocc, λ_occ stand for the expected detection rates in the unoccluded (A) and occluded (B) areas respectively (see Figure 1). In a filter that is not occlusion aware (naive filter), λ_F is constant and assumes the unoccluded case:

naive approach: λ_F(x_t, E_t = 1) = λ_unocc,    (13)

as occlusion is not incorporated. Our occlusion aware filter (OAF) behaves the same as a naive one in unoccluded cases, but in occluded positions it adapts its expected rate λ_F.
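A minimal sketch of this occlusion-dependent rate is given below (our own code; `occluded` is a hypothetical predicate derived from the occlusion model, and the λ values are user parameters).

```python
# Sketch of the occlusion aware foreground rate of Eqs. (12)-(13).
def expected_foreground_rate(x, E, occluded, lam_unocc, lam_occ,
                             occlusion_aware=True):
    """Expected number of true positive detections, lambda_F(x, E)."""
    if not E:
        return 0.0                       # no pedestrian, no true positives
    if occlusion_aware and occluded(x):  # region B in Fig. 1
        return lam_occ
    return lam_unocc                     # region A, or the naive filter of (13)
```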

Derived from the properties of Poisson distributions, the numbers of false and true positive detections given K_t are distributed as Binomial distributions parametrized by the ratio of λ_B and λ_F. Thus, the probability of a detection z_t^k being foreground/background is (given K_t detections):

P(η_t^k = 1 | E_t, x_t, K_t) = λ_F(x_t, E_t) / (λ_F(x_t, E_t) + λ_B),    (14)
P(η_t^k = 0 | E_t, x_t, K_t) = λ_B / (λ_F(x_t, E_t) + λ_B),    (15)

where the binary flag η_t^k denotes whether the k-th detection z_t^k comes from the tracked object, i.e. is a true positive detection.

Now we have to define the likelihood function P(z_t^k | η_t^k, x_t) for true positive (η_t^k = 1) and false positive (η_t^k = 0) cases. We assume that true positive detections are distributed around the object's position x_t as described by some distribution L(z_t^k | x_t), and that false detections are distributed as described by some distribution D(z_t^k):

P(z_t^k | x_t, η_t^k = 1) = L(z_t^k | x_t),    (16)
P(z_t^k | η_t^k = 0) = D(z_t^k).    (17)

Hence, the whole likelihood of one measurement is given by:

P(z_t^k | E_t, x_t, K_t) = P(z_t^k | x_t, η_t^k = 1) · P(η_t^k = 1 | E_t, x_t, K_t)
                          + P(z_t^k | η_t^k = 0) · P(η_t^k = 0 | E_t, x_t, K_t).    (18)

We assume that all K_t detections are conditionally independent given x_t and E_t, thus we can update with them individually using (11):

P(Z_t | s_t) = [ ∏_{k=1}^{K_t} P(z_t^k | E_t, x_t, K_t) ] · P(K_t | E_t, x_t).    (19)

The existence probability of a pedestrian in the scene given all measurements is then:

P(E_t | Z_{1:t}) = ∫∫ P(s_t | Z_{1:t}) dx_t dv_t.    (20)
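To make the update step concrete, the following sketch evaluates P(Z_t | s_t) of Eqs. (14)-(19) under the assumptions of a Gaussian L(z | x) and a uniform clutter density D(z) over the ROI; the function and variable names are ours, not the authors'.

```python
# Sketch of the per-sensor measurement likelihood of Eqs. (14)-(19).
import numpy as np
from scipy.stats import multivariate_normal, poisson

def measurement_likelihood(Z, x, lam_F, lam_B, Sigma, roi_area):
    """Z: (K, 2) array of detections; x: 2D object position hypothesis."""
    K = len(Z)
    lik = poisson.pmf(K, lam_B + lam_F)        # Eq. (11): number of detections
    p_fg = lam_F / (lam_F + lam_B)             # Eq. (14): foreground probability
    p_bg = lam_B / (lam_F + lam_B)             # Eq. (15): background probability
    D = 1.0 / roi_area                         # uniform false positive density D(z)
    for z in Z:
        L = multivariate_normal.pdf(z, mean=x, cov=Sigma)  # Eq. (16): L(z|x)
        lik *= p_fg * L + p_bg * D             # Eq. (18): mixture per detection
    return lik                                 # Eq. (19): product over detections
```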

IV. IMPLEMENTATION

In this section, we discuss the implementation of the proposed framework in our experiments.

A. Particle filtering

For inference, we use a particle filter to represent the posterior distribution in our model with a set of samples (i.e. particles), because it is straightforward to include occlusion information, i.e. particles in occluded areas are handled differently than those in unoccluded areas.

To include the existence probability in the filter, we follow [23]. The method is briefly explained in this paragraph. From N particles, the first one (index 0) is assigned to all the hypotheses with a non-present pedestrian, and is called the negative particle. The remaining N - 1 = N_s particles (called the positive ones) represent the case of a present pedestrian:

E_t = 0 → w_t^(0),    (21)
E_t = 1 → (s_t^(i), w_t^(i)) for i = 1 ... N_s,    (22)

where s_t^(i) is the state of the i-th particle and w_t^(i) is its assigned weight. Thus, the probability of a non-present/present pedestrian given all detections is the normalized weight of the first particle/the summed weights of all remaining ones, see (20):

P(E_t = 0 | Z_{1:t}) = w_t^(0),  P(E_t = 1 | Z_{1:t}) = ∑_{i=1}^{N_s} w_t^(i).    (23)

To obtain the estimated state of the pedestrian, we use the weighted average of the particles along the hypothesis space.
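In code form, Eqs. (21)-(23) and the weighted state estimate could look like the following sketch (our own layout: index 0 holds the negative particle, indices 1..N_s the positive ones).

```python
# Sketch of the particle representation and the read-outs of Eqs. (21)-(23).
import numpy as np

def existence_probability(weights):
    """weights: normalized array of length Ns+1; weights[0] is the negative particle."""
    return 1.0 - weights[0]                   # Eq. (23)

def estimated_state(states, weights):
    """Weighted average of the positive particles; states: (Ns, 4) rows [x, y, vx, vy]."""
    w = weights[1:]
    return (w[:, None] * states).sum(axis=0) / w.sum()
```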

1) Initialization: Particles' positions are initialized uniformly across the Region of Interest (ROI). Their velocity is drawn from a normal distribution around walking pace, and their direction is drawn uniformly from an angular region of ±22.5°, where 0° is the direction perpendicular to the movement of the ego-vehicle.

2) Predict step: The input of the prediction step is N_s uniformly weighted particles representing the present pedestrian, and one particle representing the E_t = 0 hypothesis. First, we estimate the next weight of the first particle as follows:

P(E_t = 0 | Z_{1:t-1}) ~ w_t^(0) = w_np / (w_np + w_p),    (24)

where w_p, w_np are the cumulative weights of the present and not-present predicted states using equations (3)-(6):

w_p = p_n · w_{t-1}^(0) + ∑_{i=1}^{N_s} p_s(s_t^(i)) · w_{t-1}^(i),    (25)
w_np = (1 - p_n) · w_{t-1}^(0) + ∑_{i=1}^{N_s} (1 - p_s(s_t^(i))) · w_{t-1}^(i).    (26)

Afterwards, we sample N_s new positive particles; each is either a mutation of an existing particle moved by the dynamic model, or a completely new (entering) one (7). An existing particle stays in the scene with probability p_s(s_t^(i)), or is replaced by a new one with probability 1 - p_s(s_t^(i)):

s_{t-1}^(i) → s_t^(i) ~ P(s_t | s_{t-1}^(i)) if moved particle,
              s_t^(i) ~ p_e(s_t)             if new particle.    (27)

All positive particles' weights are normalized and set uniformly:

w_t^(i) = (1 - w_t^(0)) / N_s  for i = 1 ... N_s.    (28)
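A sketch of this predict step, Eqs. (24)-(28), is given below (our own code; `p_s`, `move_particle`, and `sample_entry` are hypothetical callables for the survival probability, the dynamic model, and the entry distribution p_e).

```python
# Sketch of the predict step of Eqs. (24)-(28): re-weight the negative
# particle, then mutate or replace each positive particle.
import numpy as np

rng = np.random.default_rng(0)

def predict(states, weights, p_n, p_s, move_particle, sample_entry):
    """states: (Ns, 4) array of positive particles; weights: length Ns+1."""
    Ns = len(states)
    surv = np.array([p_s(s) for s in states])                   # p_s(s_t^(i))
    w_pos = weights[1:]
    w_p = p_n * weights[0] + np.sum(surv * w_pos)               # Eq. (25)
    w_np = (1 - p_n) * weights[0] + np.sum((1 - surv) * w_pos)  # Eq. (26)
    w0 = w_np / (w_np + w_p)                                    # Eq. (24)
    new_states = np.array([
        move_particle(s) if rng.random() < p_s_i else sample_entry()  # Eq. (27)
        for s, p_s_i in zip(states, surv)
    ])
    new_weights = np.full(Ns + 1, (1.0 - w0) / Ns)              # Eq. (28)
    new_weights[0] = w0
    return new_states, new_weights
```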

3) Update step: Particles are updated with new detections based on (19):

w_t^(i) ∝ w_t^(i) · P(Z_t | s_t^(i)).    (29)

After the update, all weights are normalized. To avoid sample degeneracy, we resample the positive particles if the Effective Sample Size (ESS) drops below a threshold, as in e.g. [30].
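The following sketch illustrates Eq. (29) together with the ESS-based resampling mentioned above (our own code; `likelihood` and `likelihood_empty` are hypothetical callables for P(Z_t | s) and for the E_t = 0 hypothesis, and the threshold of half the particle count is an assumption).

```python
# Sketch of the update step, Eq. (29), with ESS-triggered resampling.
import numpy as np

rng = np.random.default_rng(0)

def update(states, weights, Z, likelihood, likelihood_empty, ess_fraction=0.5):
    w = weights.copy()
    w[0] *= likelihood_empty(Z)                 # negative particle: only clutter
    for i, s in enumerate(states, start=1):
        w[i] *= likelihood(Z, s)                # Eq. (29) for positive particles
    w /= w.sum()                                # renormalize all weights
    w_pos = w[1:] / w[1:].sum()
    ess = 1.0 / np.sum(w_pos ** 2)              # effective sample size
    if ess < ess_fraction * len(states):        # resample if degenerate
        idx = rng.choice(len(states), size=len(states), p=w_pos)
        states = states[idx]
        w[1:] = (1.0 - w[0]) / len(states)
    return states, w
```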

Parameter       | Short description                      | In our experiments
p_s(s_t^(i))    | Probability for a particle to stay     | 0.95 / 0.0 (in / out of ROI)
p_n             | Probability of an entering pedestrian  | 0.2
λ_unocc         | Exp. # of true detections (unoccluded) | sensor specific
λ_occ           | Exp. # of true detections (occluded)   | sensor specific
λ_B             | Exp. # of false detections             | sensor specific
D(z_t^k)        | Exp. distribution of noise             | uniform in the ROI
L(z_t^k | x_t)  | Exp. distribution of true detections   | N(z_t^k | x_t, Σ_s)
Σ_s             | Covariance matrix of true detections   | sensor specific

TABLE I: List of model parameters and functions chosen by the user in the framework. See the text for sensor specific values.

Our framework can be 'tuned' by the following parameters and functions. The function p_s(s_t^(i)) returns how likely it is for that particle's hypothesis to stay in the scene. p_n is the chance of an entering pedestrian. Their ratio tunes how sceptical the system is about the presence of a pedestrian. D(z_t^k), L(z_t^k | x) tune the expected spatial distributions of false and true positive detections. The parameter λ_B gives the assumed rate of false positive detections. Finally, λ_unocc, λ_occ are the expected rates of true positive detections in unoccluded and occluded areas respectively. The ratio of these tunes how sceptical the system is about a measurement at that position in the first place. For a summary of these parameters see Table I.
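One possible way to group these user-chosen parameters in code is sketched below; the defaults shown are only placeholders (the camera values of Section IV-B), not a prescribed configuration.

```python
# Sketch of a configuration object for the tunable parameters of Table I.
from dataclasses import dataclass

@dataclass
class FilterParams:
    p_stay_in_roi: float = 0.95   # p_s inside the ROI
    p_stay_out_roi: float = 0.0   # p_s outside the ROI
    p_new: float = 0.2            # p_n, probability of an entering pedestrian
    lam_unocc: float = 1.0        # expected # of true detections, unoccluded
    lam_occ: float = 0.1          # expected # of true detections, occluded
    lam_bg: float = 0.05          # expected # of false detections
    # D(z): uniform over the ROI; L(z|x): Gaussian N(z | x, Sigma_s), sensor specific
```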

B. Sensors

Plausible values for the different λ parameters and Σ_s were chosen after examining the dataset. A ROI of 5 m × 15 m × 3 m was applied in front of the vehicle.

Our vision based detection's input is a stereo camera (1936 × 1216 px) mounted behind the windshield of the research vehicle. Detections are fetched from the Single Shot Multibox Detector (SSD) [31]. Depth is estimated by projecting the bounding boxes into the stereo point cloud computed by the Semi-Global Matching algorithm (SGM) [32], and taking the median distance of the points inside them. For the camera, we used λ_unocc = 1, as SSD is reliable at this range in unoccluded regions. λ_occ was set to 0.1 for partial occlusion scenes, and to 0 for fully occluded ones, as no visual clue is expected if the view is fully blocked. Few false positives occurred in the ROI, thus λ_B was set to 0.05. SSD is also used to compute occluded regions. Objects from the car, bus, truck, and van classes are considered as occlusions. Using the projection described before, their 3D positions are calculated. Areas behind them are marked as occluded, as shown in Figure 1 and Figure 3.
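As an illustration of this last step, the occluded region can be approximated as the ground-plane 'shadow' cast by the occluding vehicle seen from the sensor origin. The sketch below is our own simplification (representing the vehicle by the two endpoints of its near side in ego coordinates), not the authors' code.

```python
# Sketch: occluded area as the 'shadow' of a detected vehicle on the ground
# plane, seen from the sensor at the origin of the ego frame.
import numpy as np
from matplotlib.path import Path

def occlusion_shadow(corner_left, corner_right, max_range=15.0):
    """Polygon (list of 2D points) of the area hidden behind the vehicle side."""
    def extend(p):
        p = np.asarray(p, dtype=float)
        return p / np.linalg.norm(p) * max_range  # push the corner to the ROI edge
    return [corner_left, corner_right, extend(corner_right), extend(corner_left)]

def is_occluded(x, polygon):
    """True if the 2D point x lies inside the occluded polygon."""
    return Path(np.asarray(polygon, dtype=float)).contains_point(x)
```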

Our radar sensor is a Continental 400 mounted behind the front bumper. It outputs a list of reflections, each consisting of a 2D position, a Radar Cross Section (RCS) value, and a relative speed to the ego-vehicle. In a preprocessing step we used the RCS and speed values (after compensating for ego-motion) to filter the reflections. We expect λ_unocc = 1.5 detections for unoccluded positions, as often multiple reflections are received from the same pedestrian. Behind occlusions (vehicles), we still assume λ_occ = 0.3 reflections because of the multi-path propagation described earlier. An average of λ_B = 0.1 false positive reflections was expected.
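A minimal sketch of such a preprocessing step is shown below; the thresholds and the per-reflection ego-motion compensation are our own assumptions for illustration, not the values used in the experiments.

```python
# Sketch: keep radar reflections whose ego-motion-compensated radial speed and
# RCS are plausible for a pedestrian; discard static clutter.
import numpy as np

def filter_reflections(points, rcs, radial_speed, ego_radial_speed,
                       min_speed=0.2, rcs_range=(-20.0, 10.0)):
    """points: (N, 2) positions; radial_speed: measured Doppler per reflection;
    ego_radial_speed: ego velocity projected onto each reflection's line of sight."""
    compensated = radial_speed + ego_radial_speed          # remove ego-motion
    moving = np.abs(compensated) > min_speed               # moving targets only
    plausible = (rcs > rcs_range[0]) & (rcs < rcs_range[1])
    return points[moving & plausible]
```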


V. DATASET

Our dataset consists of 83 recordings. Each contains one pedestrian stepping out from behind a vehicle; see Figure 3 for an example sequence. In total, nine different occluding vehicles were used, of which four were passenger cars (partial occlusion) and five were vans (full occlusion), see Table II. Three persons, with different heights, clothes, and body types, acted as pedestrians in our dataset. To record the position of the pedestrian, we mounted an additional camera on their chest. A calibration board was placed in the field of view of both the car and pedestrian cameras, as in Figure 1. The 6-DoF positions of the sensors relative to the board were obtained by off-the-shelf pose estimation methods.

          | Cars | Vans | Total
moving    |  19  |  27  |  48
standing  |  23  |  12  |  35
Total     |  42  |  39  |  83

TABLE II: Number of sequences of each type in the dataset.

VI. EXPERIMENTS

In the experiments, we evaluated how the fusion of the two sensors performs compared to the individual sensors in the aspects from Section I. Firstly, in subsection VI-A we examine the change of the existence probability over time. Secondly, in subsection VI-B, we discuss the spatial accuracy of the methods. We tested the following four methods: camera, radar, naive fusion, and OAF fusion. The first three are naive filters using camera/radar/both sensors to update, see (13). The fourth method is the proposed occlusion aware fusion. All methods run with a processing speed of ∼10 Hz using 1000 particles in an (un-optimized) Matlab environment on a high-end PC.

As stated by Eq. (13), OAF fusion works identically to the naive one in unoccluded positions. To validate this, we simulated a scene without any occlusions or detections, on which the existence probabilities reported by the OAF fusion and naive fusion methods both converged to the exact same value (0.021). This means that comparing the methods is only reasonable when occlusions occur, thus this section will focus on occluded scenes only. Sequences of the dataset were aligned temporally by marking the last moment where the ground truth position is in the occluded region as T0 = 0.

A. Existence probability

We executed the methods on all sequences and recorded the probability outputs as in equation (20). Subfigure 4a and Subfigure 4b show the results from the fully (vans) and partially occluded (cars) scenes.

The results show that detection in car scenarios happens sooner than in the van cases by 0.25-0.5 seconds. This is expected for the camera, as in contrast to vans, partial visual detections (e.g. the head above the car) occur. In the case of radar this was surprising, as the height of a vehicle has no trivial effect on the radar signal. This shift may be caused by the length of the vehicles: vans tend to be longer than cars, which can influence the propagation below the vehicle.

In general, including radar (naive/OAF fusion) helped to detect the pedestrian earlier in both cases, i.e. any chosen probability threshold is reached sooner by the OAF fusion than by any other method. E.g., on car scenarios, the threshold of 0.8 is reached by OAF fusion 0.3 seconds earlier than by the camera and 0.12 seconds earlier than by the naive fusion.

The advantages of occlusion awareness are seen by comparing the naive and OAF fusions. OAF fusion reports a higher probability of a present pedestrian than the naive fusion at all timestamps at which the pedestrian is occluded (T < 0). The reason for this is twofold. Firstly, OAF fusion 'acknowledges' that parts of the scene are occluded and cannot be measured, causing uncertainty. That is, the lack of detections from these areas is not considered as evidence for the lack of a pedestrian, in contrast to the naive fusion. Instead, particles behind the occlusion get higher weights compared to the unoccluded particles to represent this uncertainty. This results in higher a priori awareness even before any detections occur. Secondly, detections received from these regions are valued more than in the naive fusion, as the number of received detections fits the expectations better in Eq. (14). As a consequence, the likelihoods are higher for the same detections than in the naive fusion (19). In contrast, the naive fusion reports the lowest probability at the beginning of the tracks. This is expected, as it receives more evidence for a non-present pedestrian than the two sensors individually, and ignores the uncertainty of occluded regions. In the unoccluded area both fusions yield very similar probabilities as expected, caused by (13).

B. Spatial accuracy

The spatial error is calculated as the Euclidean distance between the ground truth and the estimated position. Subfigures 4c and 4d present the results from the fully and partially occluded scenes. Errors are high at the beginning, as no detections have occurred yet and the particles are distributed uniformly, see Subfigure 3a.

Including radar in the system improves the accuracy while the pedestrian is still occluded. The average error of the radar converges faster than that of the camera. Naive fusion yields smaller errors than radar, but not drastically. This is reasonable, as most detections in the occluded area come from the radar.

OAF fusion (see Figure 3) outperforms all other methods and reaches its minimal error the fastest. The reason for this is twofold. Firstly, occlusion awareness means the filter acknowledges that occluded areas are impossible or hard to observe. However, it can measure that the pedestrian is not in the unoccluded area, see Subfigure 3a. This results in particles in the occluded region having higher weights, and thus being resampled more often than particles in the open area. The estimated position (the weighted average of the particles) drifts in the direction of the occlusion. We argue that this is a valid prior, as the area where the particles converge is the expected entrance point of the pedestrian. Secondly, it values detections received from occluded regions more than the naive fusion does (explained in Subsection VI-A).

All methods converge to a low (∼30 cm) spatial error after the pedestrian has left the occlusion, since the occlusion aware methods work identically to the naive ones in unoccluded positions (13).


(a) No detection. Occluded particles have larger weights. (b) Radar (red filled circle) detects the pedestrian. (c) Both radar and camera (yellow filled circle) detect the pedestrian.

Fig. 3: Camera and top views of the scene at consecutive timestamps using OAF fusion. Occlusion (orange) is calculated as the 'shadow' of the blue bounding box from SSD. Initially, particles (blue to red for small to high relative weights) are uniformly distributed (3a), but occluded ones have larger weights. They converge on the location of the pedestrian (ground truth track in green) after first the radar (3b) and then the camera (3c) detects him. LIDAR reflections are shown in black for reference.

[Figure 4: four plots over time (s), with curves for radar, camera, OAF fusion, and naive fusion. Panels: (a) Probability of a present pedestrian (full occlusion); (b) Probability of a present pedestrian (partial occlusion); (c) Spatial error of the estimated state, in meters (full occlusion); (d) Spatial error of the estimated state, in meters (partial occlusion).]

Fig. 4: First/second row: Existence probability/Spatial error averaged over all sequences, with variance around the mean. Adding radar resulted in earlier detection than using only the camera. OAF fusion performed with the highest confidence/accuracy.


VII. CONCLUSIONS

In this paper, we proposed a generic occlusion aware multi-sensor Bayesian filter to address the challenge of detecting pedestrians. We applied the filter on a real life dataset of darting-out pedestrians recorded with camera and radar sensors.

Our results show that the inclusion of the radar sensor and occlusion information is beneficial, as pedestrians are detected earlier and more accurately. E.g., on car scenarios, the existence probability threshold of 0.8 is reached 0.3 seconds earlier by our occlusion aware fusion than by a naive camera based detector, and 0.12 seconds earlier than by the naive fusion method.

In this paper, some simplifications were applied. E.g., our method to build the occlusion model was suitable for our experiments; however, a more precise 3D estimation of the vehicles could provide better a priori knowledge to the framework. We assumed uniformly distributed background noise, although false detections might not be spatially independent. Including further sensors (e.g. LIDAR or additional radars) in our modular framework is straightforward and is also an interesting research topic.

ACKNOWLEDGEMENT

This work received support from the Dutch Science Foundation NWO-TTW within the SafeVRU project (nr. 14667).

REFERENCES

[1] World Health Organization, "Road traffic injuries," 2018.
[2] R. Sherony and C. Zhang, "Pedestrian and Bicyclist Crash Scenarios in the U.S.," IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 1533–1538, 2015.
[3] C. G. Keller and D. M. Gavrila, "Will the pedestrian cross? A study on pedestrian path prediction," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 494–506, 2014.
[4] S. Aly, L. Hassan, A. Sagheer, and H. Murase, "Partially occluded pedestrian classification using part-based classifiers and Restricted Boltzmann Machine model," in IEEE Conference on Intelligent Transportation Systems (ITSC), 2013, pp. 1065–1070.
[5] S. Kim and M. Kim, "Occluded pedestrian classification using gradient patch and convolutional neural networks," in Lecture Notes in Electrical Engineering, vol. 421, 2017, pp. 198–204.
[6] K. Granstrom, S. Reuter, M. Fatemi, and L. Svensson, "Pedestrian tracking using Velodyne data: stochastic optimization for extended object tracking," IEEE Intelligent Vehicles Symposium (IV), 2017.
[7] S. Heuel and H. Rohling, "Pedestrian classification in automotive radar systems," Proceedings International Radar Symposium, pp. 39–44, 2012.
[8] S. Heuel and H. Rohling, "Pedestrian recognition in automotive radar sensors," International Radar Symposium (IRS), vol. 2, pp. 732–739, 2013.
[9] H. Rohling, S. Heuel, and H. Ritter, "Pedestrian detection procedure integrated into an 24 GHz automotive radar," IEEE National Radar Conference - Proceedings, pp. 1229–1232, 2010.
[10] R. Streubel and B. Yang, "Fusion of Stereo Camera and MIMO-FMCW Radar for Pedestrian Tracking in Indoor Environments," Fusion, 2016.
[11] J. Schlosser, C. K. Chow, and Z. Kira, "Fusing LIDAR and images for pedestrian detection using convolutional neural networks," IEEE International Conference on Robotics and Automation (ICRA), pp. 2198–2205, 2016.
[12] A. Bartsch, F. Fitzek, and R. H. Rasshofer, "Pedestrian recognition using automotive radar sensors," Advances in Radio Science, vol. 10, pp. 45–55, 2012.
[13] S. K. Kwon, E. Hyun, J.-H. Lee, J. Lee, and S. H. Son, "Detection scheme for a partially occluded pedestrian based on occluded depth in lidar-radar sensor fusion," Optical Engineering, vol. 56, no. 11, p. 1, 2017.
[14] R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Ten Years of Pedestrian Detection, What Have We Learned?" in Computer Vision - ECCV 2014 Workshops, L. Agapito, M. M. Bronstein, and C. Rother, Eds. Cham: Springer International Publishing, 2015, pp. 613–627.
[15] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, "Computer vision and deep learning techniques for pedestrian detection and tracking: A survey," Neurocomputing, pp. 1–17, 2018.
[16] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in IEEE 12th International Conference on Computer Vision, 2009.
[17] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila, "Multi-cue pedestrian classification with partial occlusion handling," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 990–997, 2010.
[18] C. G. Keller, T. Dang, H. Fritz, A. Joos, C. Rabe, and D. M. Gavrila, "Active pedestrian safety by automatic braking and evasive steering," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1292–1304, 2011.
[19] O. Schumann, M. Hahn, J. Dickmann, and C. Wohler, "Comparison of random forest and long short-term memory network performances in classification tasks using radar," Sensor Data Fusion: Trends, Solutions, Applications (SDF), pp. 1–6, 2017.
[20] M. Heuer, A. Al-Hamadi, A. Rain, and M. M. Meinecke, "Detection and tracking approach using an automotive radar to increase active pedestrian safety," IEEE Intelligent Vehicles Symposium (IV), pp. 890–893, 2014.
[21] R. O. Chavez-Garcia and O. Aycard, "Multiple Sensor Fusion and Classification for Moving Object Detection and Tracking," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 2, pp. 525–534, 2016.
[22] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, "Context-based pedestrian path prediction," European Conference on Computer Vision (ECCV), pp. 618–633, 2014.
[23] S. Munder, C. Schnorr, and D. Gavrila, "Pedestrian Detection and Tracking Using a Mixture of View-Based Shape-Texture Models," IEEE Transactions on Intelligent Transportation Systems, pp. 1–25, 2008.
[24] D. Salmond and H. Birch, "A particle filter for track-before-detect," Proceedings of the 2001 American Control Conference, vol. 5, pp. 3755–3760, 2001.
[25] G. Wang, S. Tan, C. Guan, N. Wang, and Z. Liu, "Multiple model particle filter track-before-detect for range ambiguous radar," Chinese Journal of Aeronautics, vol. 26, no. 6, pp. 1477–1487, 2013.
[26] Z. Radosavljevic, D. Musicki, B. Kovacevic, W. Kim, and T. Song, "Integrated particle filter for target tracking in clutter," IET Radar, Sonar and Navigation, vol. 9, no. 8, pp. 1–2, 2015.
[27] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[28] M. Braun, S. Krebs, F. Flohr, and D. Gavrila, "EuroCity Persons: A Novel Benchmark for Person Detection in Traffic Scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, May 2019.
[29] NuTonomy, "The nuScenes dataset." [Online]. Available: https://www.nuscenes.org/
[30] T. Li, S. Sun, T. P. Sattar, and J. M. Corchado, "Fight sample degeneracy and impoverishment in particle filters: A review of intelligent approaches," Expert Systems with Applications, vol. 41, no. 8, pp. 3944–3954, 2014.
[31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," Lecture Notes in Computer Science, vol. 9905, pp. 21–37, 2016.
[32] H. Hirschmuller, "Stereo Processing by Semi-Global Matching and Mutual Information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 328–341, 2008.