
Performance Evaluation Method for Mobile Computer Vision Systems using Augmented Reality

Jonas Nilsson†‡∗, Anders C.E. Ödblom†, Jonas Fredriksson‡, Adeel Zafar‡, Fahim Ahmed‡

†Department of Vehicle Dynamics and Active Safety, Volvo Car Corporation, 405 31 Göteborg, Sweden

‡Department of Signals and Systems, Chalmers University of Technology, 412 96 Göteborg, Sweden

ABSTRACT

This paper describes a framework which uses augmented reality for evaluating the performance of mobile computer vision systems. Computer vision systems primarily use image data to interpret the surrounding world, e.g. to detect, classify and track objects. The performance of mobile computer vision systems acting in unknown environments is inherently difficult to evaluate since obtaining ground truth data is often problematic. The proposed novel framework exploits the possibility to add virtual agents into a real data sequence collected in an unknown environment, thus making it possible to efficiently create augmented data sequences, including ground truth, to be used for performance evaluation. Varying the content in the data sequence by adding different virtual agents is straightforward, making the proposed framework very flexible.

The method has been implemented and tested on a pedestrian detection system used for automotive collision avoidance. Preliminary results show that the method has the potential to replace and complement physical testing, for instance by creating collision scenarios which are difficult to test in reality.

Keywords: Augmented reality, computer vision, performance evaluation, active safety, collision avoidance

Index Terms: I.4.8 [Image Processing and Computer Vision]: Scene Analysis; I.3.8 [Computer Graphics]: Application

1 INTRODUCTION

In recent years, Active Safety (AS) systems have become increasingly common in road vehicles since they provide an opportunity to significantly reduce traffic fatalities by active vehicle control. Some AS systems utilize sensor information on the current vehicle state, such as wheel speeds and yaw rate, to stabilize an unstable vehicle. By sensing the surrounding environment, other AS systems perform autonomous braking or steering interventions to help the driver to stay in the current lane [13] or avoid or mitigate the consequences of a collision [12].

A key challenge when developing AS systems is to verify that the systems make correct decisions to meet the function requirements. For the system to be effective, the number of missed interventions, i.e. when the system does not intervene when supposed to, must be minimized. Unnecessary interventions must also be avoided since they can be a nuisance to, or unacceptable for, the driver. As AS systems evolve to support the driver in an increased number of complex traffic situations, to further save lives by avoiding severe traffic accidents, the interventions will become more intrusive, which in turn puts stricter requirements on the acceptable rate of unnecessary interventions. This, together with the increasing amount of available sensor information, implies that AS test methods must cover large sets of possible scenarios to verify that the system performance is adequate. More efficient test methods are thus needed to enable the development of more complex AS systems.

∗Corresponding author, e-mail: [email protected]

Image analysis methods which extract information regarding depicted items in images are often referred to as Computer Vision (CV). CV technology has become widely used in AS applications, largely due to the availability of cost-efficient camera technology. CV techniques are used for many purposes, e.g. detecting, classifying and tracking objects such as vehicles, pedestrians, lane markings and traffic signs, see e.g. [9]. CV algorithms often exploit not only image data but also data from additional sensors.

Verifying the performance of an AS-CV system requires collection of image data, as well as additional sensor data, for a representative set of traffic scenarios. As traffic environments have unlimited variability, large field tests are conducted for this purpose in addition to dedicated tests on test tracks, see e.g. [6]. These physical test methods are both expensive and time consuming.

Alternative test methods based on virtual generation of data require models that describe relevant scenarios and sensor behaviour. In [4], a Virtual Reality (VR) solution is proposed to generate image data for the purpose of evaluating mobile CV systems. More recent VR work, focused on the evaluation of CV systems using networks of distributed stationary cameras, is presented in [15] and [14]. In [3], a static CV system is evaluated using two complementary approaches. Data generated using VR is complemented by data generated by adding virtual agents into real image sequences. Methods for combining virtually generated data with real image data are known as Augmented Reality (AR).

Compared to physical test methods, VR/AR test methods are cheap and efficient, enabling repeatable and reproducible evaluation of large input sets in early phases of system development. Furthermore, VR/AR test methods are flexible in the sense that once a VR/AR environment has been created, variations in the scenario, such as multiple object trajectories and motion patterns, can be created with relatively small effort, including ground truth data. In real data, ground truth is often problematic to acquire and some scenarios are dangerous or difficult to reproduce on demand, e.g. for AS applications different weather conditions, animals crossing the road or scenarios where vehicles collide or nearly collide.

The drawback compared to physical testing is that AR/VR test results are merely a prediction of the system performance. The quality, in terms of realism, of the generated images defines with what confidence the test results can be used to draw conclusions on the system performance in real operating conditions.

By inserting virtual agents in real image sequences using AR techniques, the inherent realism of real-world data collected in field tests can be combined with the efficiency and flexibility offered by computer-generated images. The remainder of this paper describes a novel AR-based framework for performance evaluation of mobile CV systems acting in unknown environments. Also, preliminary results from an AS application example are presented.


2 PERFORMANCE EVALUATION FRAMEWORK

Consider a camera moving along a trajectory C = [c_1 c_2 ... c_n], as shown in Figure 1, where c_j describes the position and orientation, i.e. pose, of the camera at time t_j in a three-dimensional (3D) world-fixed coordinate system. At each time t_j the image i_j is recorded by the camera and forms the image sequence I = [i_1 i_2 ... i_n]. Let a data set sampled from the scenario S be defined as D = {I, X}, where X is sensor data from additional sensors, e.g. laser, radar, gyro, GPS. Each image frame, together with additional sensor input, is processed by a CV algorithm forming the output Y = [y_1 y_2 ... y_n], which contains detected, classified and tracked objects.

Figure 1: Definitions. (The figure shows the camera pose trajectory C = [c_1 c_2 ... c_n], the image sequence I = [i_1 i_2 ... i_n], the additional sensor data X and the CV output Y = [y_1 y_2 ... y_n].)
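As an aid to the notation above, the following Python sketch expresses the quantities D = {I, X}, c_j and Y as simple data structures. The class and function names are illustrative only and are not part of the evaluated system, whose CV algorithm is proprietary.

```python
# Minimal sketch of the notation in Section 2 as Python data structures.
# All names (DataSet, CameraPose, run_cv) are illustrative.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class DataSet:
    """D = {I, X}: image sequence plus additional sensor data."""
    images: List[np.ndarray]        # I = [i_1 ... i_n]
    sensors: Dict[str, np.ndarray]  # X, e.g. radar, gyro, GPS, wheel speeds


@dataclass
class CameraPose:
    """c_j: camera position and orientation at time t_j."""
    t: float
    position: np.ndarray            # 3D world-fixed position
    rotation: np.ndarray            # 3x3 rotation matrix


def run_cv(cv_algorithm, data: DataSet) -> List[dict]:
    """Process each frame i_j together with sensor data, forming Y = [y_1 ... y_n]."""
    return [cv_algorithm(img, data.sensors) for img in data.images]
```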

The proposed method is visualized in Figure 2. Consider a set of virtual agents A = [a_1 a_2 ... a_k], with k being the number of agents. Each agent a_i is represented by a state trajectory defined as a_i = [a_{i,1} a_{i,2} ... a_{i,n}], where a_{i,j} is the agent state vector at time t_j. By using augmented reality, A can be added to a real data set D and a new augmented data set D_A = {I_A, X_A}, corresponding to the augmented scenario S_A, can be created. The augmented data set D_A is processed by the CV algorithm, delivering the output Y_A, which can be compared to a ground truth Y_GT for performance evaluation of the CV system.
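A minimal sketch of how Y_A could be compared against Y_GT is given below. The particular metrics (detection rate and position RMSE within an assumed match radius) are chosen only for illustration; the case study in Section 3 instead reports detection delay and qualitative tracking behaviour.

```python
# Illustrative comparison of CV output Y_A against ground truth Y_GT for one
# augmented agent. Metric choices and the match radius are assumptions.
import numpy as np


def evaluate(y_a, y_gt, match_radius=1.0):
    """y_a: per-frame detected position of the agent (or None if no detection).
    y_gt: per-frame ground-truth agent positions.
    Returns detection rate and position RMSE over matched detections."""
    errors, detected = [], 0
    for det, gt in zip(y_a, y_gt):
        if det is None:
            continue
        err = float(np.linalg.norm(np.asarray(det) - np.asarray(gt)))
        if err <= match_radius:          # count as a detection of the agent
            detected += 1
            errors.append(err)
    rate = detected / len(y_gt)
    rmse = float(np.sqrt(np.mean(np.square(errors)))) if errors else float("nan")
    return rate, rmse
```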

Figure 2: Performance evaluation framework. (The upper part of the figure shows the scenario S passing through the sensors and the data augmentation block, whose outputs I_A, X_A and Y_GT feed the computer vision algorithm producing Y_A. The lower part details the data augmentation block: camera tracking produces the feature points M_F and camera trajectory C from I and X; 3D scene modelling produces M_G; animation of the agents A produces M_A and Y_GT; rendering produces I_A; and sensor data augmentation produces X_A.)

The data augmentation process is described in more detail in the lower part of Figure 2. Camera tracking is the first step in the data augmentation process and the first basic step in transforming a sequence of 2D images into a 3D world. For a camera moving in an unknown environment, the problem is to reconstruct both the motion of the camera, i.e. the extrinsic camera parameters, and the structure of the scene using the image and additional sensor data sequences. The reconstruction depends on the intrinsic camera parameters, which link the pixel coordinates of an image point with the corresponding 3D coordinates in the camera reference frame. The process of reconstructing the 3D structure from an image sequence is known as Structure from Motion (SfM) and can be divided into two parts, feature point tracking and camera parameter estimation; [10] provides a comprehensive treatment of the subject. The output from the camera tracking process is a set of 3D feature points located in the scene, M_F, and the camera pose trajectory, C.
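The sketch below illustrates the feature point tracking and camera parameter estimation idea for a single pair of frames using OpenCV. It is not the procedure used in the case study, which relies on the Voodoo Camera Tracker over full sequences, and it assumes, unlike the case study, that the intrinsic matrix K is known.

```python
# Two-view sketch of camera tracking: track corners between frames and
# estimate the relative camera pose. Translation is recovered only up to
# scale; the case study fixes the scale from wheel speed data instead.
import cv2
import numpy as np


def relative_pose(img0_gray, img1_gray, K):
    """img0_gray, img1_gray: consecutive grayscale frames; K: 3x3 intrinsics."""
    # Detect corners in the first frame and track them into the second.
    p0 = cv2.goodFeaturesToTrack(img0_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=7)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(img0_gray, img1_gray, p0, None)
    good0 = p0[status.ravel() == 1]
    good1 = p1[status.ravel() == 1]
    # Estimate the essential matrix with RANSAC and recover rotation R and
    # (up-to-scale) translation t of the camera between the two frames.
    E, inliers = cv2.findEssentialMat(good0, good1, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, good0, good1, K, mask=inliers)
    return R, t, good1
```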

Since the feature points in M_F are located around the edges of objects, it is possible to fit planes or other geometric shapes to the points. These geometric shapes form a 3D scene model M_G which is used as a playground for the virtual agents. There exist methods to do this automatically, as mentioned in [5]. Given M_G it is possible to add virtual agents A into the scene. For each virtual agent a state trajectory a_i is created which aligns the agent with the planes in M_G. Once the state trajectories are created in the scene, the virtual agents (e.g. humans, animals, vehicles or road signs) can be placed and animated in the scene using the trajectories as reference. The outcome is an animated 3D model M_A. The ground truth Y_GT can be obtained from the state trajectories of the virtual agents a_i. In the rendering step the original image sequence I is joined together with the animated 3D scene model M_A to produce the final augmented image sequence I_A. If the CV system does not rely exclusively on image data, X_A needs to be generated by combining information in M_A with X.
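As an illustration of how a plane could be fitted automatically to the feature points in M_F, the following sketch runs a simple RANSAC loop over 3D points. In the case study the planes are instead fitted manually in 3ds Max; automatic approaches are discussed in [5]. The iteration count and inlier threshold are assumed values.

```python
# Hedged sketch: RANSAC fit of a single plane (e.g. the ground surface) to a
# set of 3D feature points, as a stand-in for the manual plane fitting used
# in the case study.
import numpy as np


def fit_plane_ransac(points, n_iter=500, inlier_dist=0.05, seed=0):
    """points: (N, 3) array of 3D feature points.
    Returns (normal, d) of the plane n.x + d = 0 with the largest inlier set."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                  # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal.dot(sample[0])
        dist = np.abs(points @ normal + d)
        inliers = int((dist < inlier_dist).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```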

3 CASE STUDY

The purpose of the present case study is to illustrate that augmentation techniques provide a viable method for addressing performance evaluation of mobile CV systems. An augmented scenario is compared to a similar real traffic scenario to show that they produce similar results for detection, classification and tracking of a crossing pedestrian.

3.1 System Description

A collision avoidance (CA) prototype system under development at Volvo Car Corporation (VCC) is used as a test subject. A forward-looking camera is mounted behind the windshield of a test vehicle. A CV system uses image data I from the camera to detect, classify and track objects. Additional inputs X are data from a forward-looking radar and sensors measuring the state of the vehicle, e.g. wheel speeds and accelerations. The CA system uses radar and CV system outputs to identify situations when the vehicle is close to colliding with detected objects. When judged necessary, the CA system will warn the driver and, if the driver does not respond, autonomously brake the vehicle to avoid the impending collision. The CV algorithm is not public and can be executed offline on recorded or augmented image sequences using [7].
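The decision logic of the CA system is not public. Purely to illustrate the warn-then-brake staging described above, the sketch below applies a simple time-to-collision threshold to a confirmed object track; the thresholds and the structure of the logic are assumptions, not the VCC implementation.

```python
# Illustrative staged collision-avoidance decision based on time-to-collision
# (TTC). Threshold values are made up for the example.
def ca_decision(range_m, closing_speed_mps, warn_ttc=2.5, brake_ttc=1.2):
    """Return 'none', 'warn' or 'brake' for one confirmed object track."""
    if closing_speed_mps <= 0.0:        # object not closing: no action
        return "none"
    ttc = range_m / closing_speed_mps   # time to collision in seconds
    if ttc < brake_ttc:
        return "brake"                  # autonomous braking stage
    if ttc < warn_ttc:
        return "warn"                   # driver warning stage
    return "none"
```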

3.2 Scenario Description

Two scenarios are addressed in this case study. In the first scenario S^1, a pedestrian is initially occluded at the side of the road and then crosses the road in front of the moving vehicle. This is chosen as it is a common scenario in real traffic. A subset of the images from I^1 is displayed in Figure 3. The second scenario S^2 is used as a basis to create augmented scenarios. S^2 is recorded on the same day and at the same location as S^1, but without a pedestrian crossing the road. The absence of moving objects makes it ideal for adding virtual agents. To be able to compare the augmented scenario S^2_A with the real scenario S^1, the set of virtual agents A^2 is chosen to be a single crossing pedestrian. Adding A^2 to the empty scene in S^2 means that S^1 and S^2_A are qualitatively similar.

Figure 3: Image frames from the reference image sequence I^1. For privacy reasons, the face of the pedestrian has been blurred.

3.3 Implementation

This section briefly describes how the data augmentation process shown in Figure 2 is implemented to create the augmented data set D^2_A = {I^2_A, X^2_A}. 3D features M^2_F and camera pose C^2 are estimated using the software tool Voodoo Camera Tracker (VCT) [8]. In this implementation, no a priori information on intrinsic camera parameters, e.g. focal length and distortion, is available for VCT, meaning that VCT estimates the intrinsic parameters directly from image data. If the intrinsic parameters are known, the tracking process can be improved. The size of the 3D models is inferred from vehicle velocity data obtained from on-board wheel speed sensors.

The software tool 3ds Max [1] is used for the geometric modelling of the 3D scene M^2_G. This modelling is done by manually fitting planes to clusters of feature points obtained from M^2_F. The ground surface is modelled for the purpose of adding virtual agents at a correct position in the scene as well as creating realistic agent shadows. Also, planes are added to simulate agent occlusions. Figure 4 shows examples of feature points and modelled planes.

The geometry and motion pattern of the agent are obtained from Metropoly 3D Humans [2], which is a human model library. The animation is performed in the scene by manually setting a reference trajectory for the agent. The world-fixed position of the virtual agent in the animated 3D scene model M^2_A can be directly used as ground truth data Y^2_GT. Finally, the augmented image sequence I^2_A has been rendered from M^2_A using 3ds Max.

As the CV system utilizes information from a forward-looking radar, which is capable of detecting a real pedestrian, the augmented sensor data X^2_A needs to be generated. The radar is assumed ideal, meaning that the virtual agent is detected without measurement error, given that the agent is within the sensor field of view.
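The ideal-radar assumption can be expressed as in the following sketch, which reports the ground-truth agent position in sensor coordinates whenever the agent lies inside an assumed field of view and maximum range. The specific field-of-view and range values are illustrative, not the parameters of the radar used in the case study.

```python
# Sketch of ideal radar augmentation for X^2_A: the virtual agent is reported
# without measurement error whenever it is inside the sensor field of view.
# FOV half-angle and maximum range are assumed values.
import numpy as np


def ideal_radar_detection(agent_pos_world, sensor_pose,
                          fov_half_angle_deg=30.0, max_range_m=60.0):
    """agent_pos_world: 3D ground-truth agent position (from Y_GT).
    sensor_pose: (R, t) rotation matrix and position of the radar in world
    coordinates, with the sensor x-axis along the boresight.
    Returns the agent position in sensor coordinates, or None if outside FOV."""
    R, t = sensor_pose
    p = R.T @ (np.asarray(agent_pos_world) - np.asarray(t))  # world -> sensor
    rng = float(np.linalg.norm(p[:2]))                        # planar range
    bearing = float(np.degrees(np.arctan2(p[1], p[0])))       # 0 deg = boresight
    if rng <= max_range_m and abs(bearing) <= fov_half_angle_deg:
        return p                                              # ideal detection
    return None
```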

3.4 Results

A subset of the augmented image sequence I^2_A is shown in Figure 5. When compared to I^1, shown in Figure 3, it can be noted that the pedestrians in S^1 and S^2_A walk across the road using similar oblique paths. They are both initially occluded by the commercial sign on the right side of the road and then cross the road in front of the moving vehicle. The pedestrian in S^1 has a different direction when crossing the road than the pedestrian in S^2_A. Also, the appearance of the pedestrians differs with regard to clothing and size.

Figure 4: 3D feature points and 3D scene model

Figure 5: Image frames from the augmented image sequence I^2_A

Figure 6 illustrates the camera and object position estimates, for each detection, in world-fixed coordinates for S^1 and S^2_A. For S^1, all detections shown in the figure have been classified by the CV system as pedestrian detections. The world-fixed camera trajectory C^1 is estimated by integrating host vehicle velocity data obtained from on-board wheel speed sensors. There is no ground truth data Y^1_GT available for this scenario. By visual inspection of I^1 and Y^1, the initial detection delay T^1_det, defined as the time between the visual appearance of the pedestrian from behind the occlusion and the first detection, can be determined. An object is termed visible when more than 50% of the object area is visible in the image. The time for the visual appearance is defined as the first time instance where the object is visible. Starting at the initial detection, the pedestrian is tracked in every image frame until the pedestrian is no longer visible, i.e. the pedestrian is tracked as he crosses the road.
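The detection delay defined above can be computed as in the sketch below, given per-frame visibility fractions (e.g. from manual labelling of I) and per-frame detection flags from Y. The function and argument names are illustrative.

```python
# Sketch of the initial detection delay T_det: the time from the first frame
# where more than 50% of the object's area is visible to the first frame with
# a CV detection of the object.
def detection_delay(times, visible_fraction, detected, vis_threshold=0.5):
    """times: frame timestamps; visible_fraction: fraction of the object area
    visible per frame; detected: per-frame booleans from the CV output.
    Raises StopIteration if the object is never visible or never detected."""
    t_visible = next(t for t, v in zip(times, visible_fraction) if v > vis_threshold)
    t_detect = next(t for t, d in zip(times, detected) if d)
    return t_detect - t_visible
```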

Figure 6: Camera position, CV detections and ground truth agent positions. The upper panel shows C^1 and Y^1 for scenario S^1 and the lower panel shows C^2, Y^2_A and Y^2_GT for scenario S^2_A, both in world-fixed coordinates with x in the range 0 to 22 m and y in the range -2 to 2 m. Data originating from the same time instance are linked with a gray line.

For S^2_A, all detections shown in Figure 6 have been classified by the CV system as pedestrian detections. C^2 is obtained by camera tracking. By visual inspection of I^2_A and Y^2_A, it can be concluded that the virtual agent is tracked in every image frame until no longer visible, with an initial detection delay T^2_det = T^1_det / 2.

4 DISCUSSION

4.1 Discussion on Method

When compared to VR methods, a limitation with AR methods is that the camera trajectory cannot be modified without recording a new image sequence. If considering a CV system which influences the future path of the camera, this is a major limitation. When verifying that unnecessary interventions are not made by an AS system, this is usually not the case since the system has no influence on the motion of the vehicle prior to the intervention.

A key challenge for both AR and VR performance evaluation methods is to ensure that the CV system produces the same output for an augmented/virtual scenario as it would for an equivalent real scenario, i.e. that the generated data set is sufficiently realistic. Modelling complex 3D environments with a high degree of realism is challenging as well as time consuming, including for instance lighting effects, motion patterns, geometries and materials. When comparing AR to VR methods, this challenge is in one sense reduced significantly. AR supplements reality, rather than completely replacing it as VR does. Therefore fewer virtual objects need to be developed. Also, the background image is taken from real data and is therefore highly representative, both with respect to the camera hardware used and the scenario from which the image sequence originates. For AR, the agents A need to be merged into I with sufficient realism. There is great potential to automate steps in the process to create the AR environment, see e.g. [5] and [11].

The needed realism is difficult to quantify for the general case due to its dependence on the choice of CV algorithm and camera hardware. However, the required level of realism of the virtual images is bounded since it does not need to exceed the resolution of the real image data used by the system. Furthermore, in many applications there is no detailed knowledge of the CV algorithm being evaluated. A solution to this issue could be to validate the method experimentally for each type of system to be evaluated.

4.2 Discussion on Case Study Results

As pointed out in Section 3.2, the scenarios S^1 and S^2_A are similar but not identical. Differences in, for instance, vehicle speed, pedestrian trajectory, motion pattern and clothing between the two scenarios imply that differences between the CV system outputs Y^1 and Y^2_A are expected. The similar augmented and real oblique-crossing pedestrian scenarios yield qualitatively similar results for CV detection, classification and tracking.

Regarding quantitative positioning, there seem to be deviations in the estimates of the virtual agent position as compared to ground truth, as shown in Figure 6. These position discrepancies may be due to errors in the CV output Y^2_A as well as in the ground truth position Y^2_GT. Y^2_A is affected by uncertainties in augmented image quality, such as exposure and blending errors, or errors caused by inaccuracy in M^2_A. The accuracy of Y^2_GT depends, according to Figure 2, on the precision of the feature points M^2_F, which are used to create the 3D models M^2_G and M^2_A, and on the camera pose estimate C^2. As mentioned in Section 3.3, the camera tracking has no a priori information regarding the intrinsic parameters of the camera, e.g. focal length and distortion. Therefore, there is uncertainty on the accuracy of M^2_A. Due to the uncertainties imposed by the data augmentation process on both Y^2_A and Y^2_GT, it is hard to draw conclusions on the positioning performance of the CV system in the augmented scenario S^2_A.

By visually inspecting I^2_A, the agent trajectory can be approximated satisfactorily with a straight line, in agreement with Y^2_GT in Figure 6, meaning that the initial detections in Y^2_A are likely to include a position error. It can also be observed that the latter part of Y^2_A is consistent with a straight-line trajectory. Even so, far from the x-axis there is an unexplained discrepancy between the straight-line parts of the ground truth trajectory Y^2_GT and the tracked trajectory Y^2_A. The difference in position is small when the detections are made close to the x-axis, which coincides with the optical axis of the camera. This indicates that inaccurate intrinsic parameters, e.g. those describing the camera distortion, could explain some of the difference between Y^2_A and Y^2_GT, as the effects of inaccurate distortion estimation would not affect Y^2_GT close to the optical axis.
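The straight-line comparison used in this discussion can be made concrete with the following sketch, which fits a line to the ground-truth positions and measures the perpendicular deviation of each tracked position from that line. It is an illustration of the analysis only, not the exact procedure used to produce Figure 6.

```python
# Sketch: fit a straight line to the ground-truth trajectory (total least
# squares via PCA) and compute the perpendicular distance of each tracked CV
# position from that line.
import numpy as np


def line_residuals(gt_xy, tracked_xy):
    """gt_xy, tracked_xy: (N, 2) and (M, 2) arrays of positions.
    Returns per-detection distances from the line fitted to the ground truth."""
    gt_xy = np.asarray(gt_xy, dtype=float)
    tracked_xy = np.asarray(tracked_xy, dtype=float)
    mean = gt_xy.mean(axis=0)
    _, _, vt = np.linalg.svd(gt_xy - mean)
    direction = vt[0]                              # unit vector along the line
    normal = np.array([-direction[1], direction[0]])
    return np.abs((tracked_xy - mean) @ normal)
```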

5 CONCLUSIONS

An AR-based framework for performance evaluation of mobile CV systems acting in unknown environments has been presented. The proposed method has been applied to an automotive collision avoidance application. Comparison of two similar oblique-crossing pedestrian scenarios, one real and one augmented, yields qualitatively similar CV results for detection, classification and tracking. Further studies are needed, including multiple scenarios, to validate that the realism of the augmented data is sufficient for its purpose.

ACKNOWLEDGEMENTS

Financial support from the Swedish Automotive Research Program (FFP contract Dnr 2007-01733 and FFI contract Dnr 2008-04110 at VINNOVA) and the Vehicle and Traffic Safety Centre (SAFER) is gratefully acknowledged. Furthermore, the authors wish to thank Gunnar Bergstrom and Henrik Moren at XDIN, Göteborg, Sweden, for initial discussions on augmented reality techniques, and Robert Jacobson at VCC for 3ds Max support.

REFERENCES

[1] Autodesk. Autodesk 3ds Max. http://www.autodesk.com/, Sept 2009.

[2] aXYZ design. Metropoly 3D Humans. http://www.axyz-design.com/, Sept 2009.

[3] P. Baiget, X. Roca, and J. Gonzalez. Autonomous virtual agents for performance evaluation of tracking algorithms. In Articulated Motion and Deformable Objects, 5th International Conference, AMDO 2008, pages 299-308, Berlin, Germany, 2008. Springer-Verlag.

[4] W. Burger, M. J. Barth, and W. Strzlinger. Immersive simulation for computer vision. In K. Solina, editor, Joint 19th AGM and 1st SDRV Workshop Visual Modules, pages 160-168, Maribor, Slovenia, 1995. Oldenbourg Press.

[5] D. Chekhlov, A. Ge, A. Calway, and W. Mayol-Cuevas. Ninja on a plane: Automatic discovery of physical planes for augmented reality using visual SLAM. In International Symposium on Mixed and Augmented Reality (ISMAR), 2007.

[6] E. Coelingh, H. Lind, W. Birk, and D. Wetterberg. Collision warning with auto brake. In FISITA World Congress, F2006VI30, 2006.

[7] Delphi Corporation. Cwm2 resimulator, 2008.

[8] Digilab. Voodoo Camera Tracker. University of Hannover, http://www.digilab.uni-hannover.de/docs/manual.html, Sept 2009.

[9] D. M. Gavrila and V. Philomin. Real-time object detection for 'smart' vehicles. In Proceedings of the IEEE International Conference on Computer Vision, volume 1, pages 87-93, 1999.

[10] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2nd edition, 2003.

[11] K. Jacobs, J.-D. Nahmias, C. Angus, A. Reche, C. Loscos, and A. Steed. Automatic generation of consistent shadows for augmented reality. In GI '05: Proceedings of Graphics Interface 2005, pages 113-120, Waterloo, Ontario, Canada, 2005. Canadian Human-Computer Communications Society.

[12] J. Jansson. Collision Avoidance Theory with Application to Automotive Collision Mitigation. Dissertation No. 950, Dept. of Electrical Engineering, Linköping University, Linköping, 2005.

[13] J. Pohl, W. Birk, and L. Westervall. A driver-distraction-based lane-keeping assistance system. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 221(4):541-552, 2007.

[14] F. Z. Qureshi and D. Terzopoulos. Towards intelligent camera networks: a virtual vision approach. In 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pages 177-184, 2005.

[15] A. Santuari, O. Lanz, and R. Brunelli. Synthetic movies for computer vision applications. Proceedings of the 3rd IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2003), 1:1-6, 2003.
