
3DTown: The Automatic Urban Awareness Project

Eduardo R. Corral-Soto∗ Ron Tal† Larry Wang‡ Ravi Persad§ Luo Chao¶ Chan Solomon‖

Bob Hou∗∗ Gunho Sohn†† James H. Elder‡‡

Dept. of Computer Science and Engineering / Dept. of Earth and Space Science and Engineering, York University

ABSTRACT

In this work the goal is to develop a distributed system for sensing, interpreting and visualizing the real-time dynamics of urban life within the 3D context of a city, focusing on typical, useful dynamic information such as walking pedestrians and moving vehicles captured by pan-tilt-zoom (PTZ) video cameras. Three-dimensionalization of the data extracted from video cameras is achieved by an algorithm that uses the Manhattan structure of the urban scene to automatically estimate the camera pose. Thus, if the pose of the video camera changes, our system will automatically update the corresponding projection matrix to maintain accurate geo-location of the scene dynamics.

Index Terms: H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—Artificial, augmented, and virtual realities; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Video analysis; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Tracking

1 INTRODUCTION

In recent years, tools such as Google Earth and Microsoft Virtual Earth have become essential for street finding and visualization. However, these tools provide a stationary virtual world, which limits their use in surveillance applications where object dynamics are important. On the other hand, the number of surveillance video cameras installed in urban areas such as universities, airports and shopping malls grows every year as cameras become less expensive and as the demand for security and monitoring systems grows. With the availability of large sets of video images comes the problem of viewing and analyzing the image data, which is traditionally done by teams of human camera operators who switch and control the cameras remotely in order to view the separate images on a grid of 2D displays. This problem has been addressed in the past by researchers [10, 14, 15, 11, 8] in different ways, through cooperative multi-sensor video surveillance and monitoring systems that map the dynamics from different videos of a scene onto the appropriate areas of a common static 3D model of the corresponding scene, by texture-mapping 3D model areas with the projected live video images from the cameras, and by

∗e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]
§e-mail: [email protected]
¶e-mail: [email protected]
‖e-mail: [email protected]
∗∗e-mail: [email protected]
††e-mail: [email protected]
‡‡e-mail: [email protected]

detecting and tracking moving objects in the scene, thus allowing the user to have a broad overview and awareness of the current situation by looking at the projected scene dynamics simultaneously on the 3D model. In our work the goal is to develop a distributed system for sensing, interpreting and visualizing the real-time dynamics of urban life within the 3D context of a city, focusing on typical, useful dynamic information such as walking pedestrians and moving vehicles captured by pan-tilt-zoom (PTZ) video cameras, and environmental signals provided by temperature sensors. To this end we have created a system called 3DTown, which is composed of the following discrete modules: 1) A 3D modeling module that allows for the efficient reconstruction of building models and integration with indoor architectural plans; 2) A GeoWeb server that indexes a 3D urban database to render perspective views of both outdoor and indoor environments from any requested vantage; 3) Sensor modules that receive and cache real-time data; 4) Tracking modules that detect and track pedestrians and vehicles in urban spaces and access highways; 5) Camera pose modules that automatically estimate camera pose relative to the urban environment; 6) Three-dimensionalization modules that receive information from the GeoWeb server, tracking and camera pose modules in order to back-project image tracks to geolocate pedestrians and vehicles within the 3D model; 7) An animation module that represents geo-located dynamic agents as sprites; and 8) A web-based visualization module that allows a user to explore the resulting dynamic 3D visualization in a number of interesting ways.
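To make the division of labour among these modules concrete, the sketch below outlines in Python how their outputs might be wired together. The class and function names are hypothetical illustrations of the data flow; they are not the actual 3DTown interfaces.

```python
# Hypothetical sketch of how the 3DTown modules exchange data; all names
# here are illustrative only, not the system's real API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Track:
    """Output of a tracking module for one moving object."""
    camera_id: str
    kind: str                                  # "pedestrian" or "vehicle"
    pixels: List[Tuple[float, float]]          # image positions over time


@dataclass
class SensorCache:
    """Sensor module: caches the most recent real-time readings."""
    readings: Dict[str, float] = field(default_factory=dict)

    def update(self, sensor_id: str, value: float) -> None:
        self.readings[sensor_id] = value


@dataclass
class GeoAgent:
    """Geo-located dynamic agent handed to the animation module."""
    kind: str
    positions: List[Tuple[float, float, float]]   # world coordinates over time


def three_dimensionalize(tracks: List[Track],
                         backproject: Callable[[str, float, float],
                                               Tuple[float, float, float]]
                         ) -> List[GeoAgent]:
    """Combine tracker output with a per-camera back-projection function
    (supplied by the camera pose and GeoWeb modules) to geolocate agents."""
    agents = []
    for t in tracks:
        pts = [backproject(t.camera_id, u, v) for u, v in t.pixels]
        agents.append(GeoAgent(kind=t.kind, positions=pts))
    return agents
```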

2 3DTOWN SYSTEM DESCRIPTION

We construct our 3D virtual world by augmenting Google Earth 3D maps with photorealistic prismatic 3D building models of a university campus, which we created from LIDAR data and optical images using standard methods described in [12, 13]. In our tracking algorithm, which is based on [9] and [6], we model the colour of each image pixel as a mixture of two Gaussians corresponding to background and foreground, where the parameters of the Gaussians and the weights are re-estimated for each incoming image by means of the incremental version of the Expectation Maximization (EM) algorithm used in [9]. In order to discard persistent background responses (e.g. slow-moving trees), we take the temporal differences of the foreground posterior probability maps and apply a sensitivity threshold. The result is filtered to produce smooth Gaussian-like blobs that are easy to track by means of a simple peak detection algorithm. We map the tracked image locations to the corresponding 3D points in the virtual world by finding the intersection of the back-projected ray with the known 3D ground plane provided by Google Earth.

The main novelty in our work, which is explained in the next section of this document, is the introduction of an automatic on-line method for updating the rotation matrix component of our virtual camera for the cases when the operator changes the pose of the video camera. We have also developed a friendly and intuitive web-based graphical user interface that allows the user to select a real-time surveillance video camera located in the 3D model, activate the 3D visualization of tracked pedestrians and vehicles, change the 3D view of the scene, and perform indoor temperature monitoring. In our system the tracked objects are presented as sprites in the case of pedestrians, and as simple 3D car models in the case of vehicles.
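The ray/ground-plane intersection mentioned above can be written out explicitly. The following Python sketch shows the computation for a calibrated pinhole camera and a horizontal ground plane of known elevation; the interface and the numbers in the example are illustrative assumptions, not values from the 3DTown system.

```python
import numpy as np


def backproject_to_ground(u, v, K, R, C, ground_z):
    """Intersect the viewing ray through pixel (u, v) with the horizontal
    ground plane z = ground_z.

    K        : 3x3 camera intrinsic matrix
    R        : 3x3 rotation from camera to world coordinates
    C        : 3-vector camera centre in world coordinates
    ground_z : elevation of the ground plane
    Returns the 3D world point, or None if no valid intersection exists.
    """
    # Direction of the back-projected ray, expressed in world coordinates.
    d = R @ np.linalg.inv(K) @ np.array([u, v, 1.0])

    if abs(d[2]) < 1e-9:            # ray (nearly) parallel to the ground plane
        return None

    # Solve C_z + s * d_z = ground_z for the scale s along the ray.
    s = (ground_z - C[2]) / d[2]
    if s <= 0:                      # intersection lies behind the camera
        return None
    return C + s * d


# Example (made-up values): a pedestrian detected at pixel (640, 480) by a
# camera 20 m above a ground plane at elevation 150 m, looking straight down.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])   # 180-degree rotation about x: optical axis points down
C = np.array([0.0, 0.0, 170.0])
print(backproject_to_ground(640.0, 480.0, K, R, C, ground_z=150.0))
```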

2.1 Estimating the Camera Pose via Manhattan Frames

In order to compute the rotation matrix R component of our camera we employ a model-free method that estimates the rotation independently for every frame by considering R as the product of two matrices

$R = R_{M \to UTM} \cdot R_M, \qquad (1)$

where $R_M$ defines the rotation of the camera relative to the Manhattan frame (the canonical coordinate system defined by the orthogonal man-made structures in the scene) and $R_{M \to UTM}$ defines the orientation of man-made structures with respect to the Universal Transverse Mercator (UTM) coordinate system. Since buildings are in general static, $R_{M \to UTM}$ need only be computed once. We estimate the pose of the camera with respect to the 3D scene structure, $R_M$, at every frame by exploiting the inherent geometrical structure of urban environments [3, 4]. The principle behind such methods is to optimize for the parameter $\Psi$ that defines the rotation between the camera and the Manhattan frame of reference by maximizing its likelihood function over a set of linear perspective cues $E = \{\vec{E}_1, \ldots, \vec{E}_N\}$, where in our case $N$ is the number of lines detected in an image:

$\Psi^* = \arg\max_{\Psi} \, p(E \mid \Psi) = \arg\max_{\Psi} \prod_i p(\vec{E}_i \mid \Psi). \qquad (2)$

The association of observations with the Manhattan directions is expressed through a mixture model:

$p(\vec{E}_i \mid \Psi) = \sum_{m_i} p(\vec{E}_i \mid \Psi, m_i)\, p(m_i), \qquad (3)$

where $m_i$ is the 'Manhattan cause' of the line (vertical, horizontal(1), horizontal(2), background) and $p(m_i)$ is the prior over causes. In our method, edge locations and gradients are first estimated to sub-pixel accuracy using the Elder-Zucker method [7] and then grouped into lines using a Hough-transform-based technique [5] that uses a kernel-based voting scheme to propagate the uncertainty of edge observations onto the parameter domain, as proposed in [16]. False detections are avoided by employing a soft voting scheme that probabilistically subtracts the contributions of edge observations that correspond to previously detected lines. The set of detected 2D lines can be used to recover $R_M$ by considering the Gauss sphere representation of the problem [2]. A detected line in the image plane, together with the optical centre of the camera, defines an interpretation plane. The space formed by the normal vectors of all possible interpretation planes is called the Gauss sphere. Under perspective projection, the interpretation plane normals of parallel world lines are coplanar, and the normal vector to this plane is parallel to the direction of the lines in 3D. Thus, $p(\vec{E}_i \mid \Psi, m_i)$ is modeled using the angular error formed between the line segment's interpretation plane and the 3D orientation vector on the Gauss sphere. The values of the priors $p(m_i)$ and the parameters of the distributions of the error functions are learned using a ground-truth database [4]. The optimal solution is found using a gradient-descent algorithm [1] that uses the Euler angles defining the valid Manhattan frame as a search space. The output of the algorithm is the rotation matrix that defines the pose of the camera with respect to the Manhattan frame, and its inverse is the matrix $R_M$, which defines the transformation from the camera coordinate system to the Manhattan coordinate system.
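To make the estimation procedure concrete, the sketch below implements a simplified version of this search in Python: lines are represented by their interpretation-plane normals, the likelihood of Eqs. (2)-(3) is evaluated with assumed priors and a simple Gaussian angular-error model (rather than the parameters learned in [4]), and the Euler angles are refined by a coarse coordinate search instead of the gradient-descent optimizer [1]. It is an illustration of the idea, not the authors' implementation.

```python
import numpy as np


def euler_to_rotation(angles):
    """Rotation matrix from Euler angles (rotations about x, y, z, in radians)."""
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx


def interpretation_normal(p1, p2, K_inv):
    """Unit normal of the interpretation plane through the optical centre and
    the image line segment with pixel endpoints p1 and p2."""
    r1 = K_inv @ np.array([p1[0], p1[1], 1.0])
    r2 = K_inv @ np.array([p2[0], p2[1], 1.0])
    n = np.cross(r1, r2)
    return n / np.linalg.norm(n)


def log_likelihood(angles, normals,
                   priors=(0.3, 0.3, 0.3, 0.1), sigma=np.deg2rad(2.0)):
    """Mixture log-likelihood of Eqs. (2)-(3) under a simple angular-error
    model; the priors and sigma here are assumed values, not learned ones."""
    R = euler_to_rotation(angles)   # candidate camera -> Manhattan rotation
    total = 0.0
    for n in normals:
        # (R @ n) expresses the interpretation-plane normal in Manhattan
        # coordinates; a line parallel to Manhattan axis k has a normal
        # perpendicular to that axis, so component k should be ~0.
        errs = np.arcsin(np.clip(np.abs(R @ n), 0.0, 1.0))
        p_axes = np.exp(-0.5 * (errs / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        p_bg = 2.0 / np.pi                      # uniform error over [0, pi/2]
        mixture = np.dot(priors[:3], p_axes) + priors[3] * p_bg
        total += np.log(mixture + 1e-300)
    return total


def estimate_manhattan_rotation(normals, rounds=4, span=np.deg2rad(45.0)):
    """Coarse-to-fine coordinate search over the three Euler angles (a simple
    stand-in for the gradient-descent optimizer used in the paper)."""
    best = np.zeros(3)
    for _ in range(rounds):
        offsets = np.linspace(-span, span, 9)
        for k in range(3):                      # refine one angle at a time
            candidates = [best + np.eye(3)[k] * o for o in offsets]
            scores = [log_likelihood(c, normals) for c in candidates]
            best = candidates[int(np.argmax(scores))]
        span /= 3.0                             # shrink the search window
    return euler_to_rotation(best)              # estimate of R_M
```

In practice the normals would be computed with interpretation_normal from the detected line segments, and the resulting estimate of $R_M$ would then be combined with the precomputed $R_{M \to UTM}$ via Eq. (1).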

3 PRELIMINARY EVALUATIONS AND CONCLUSIONS

Related systems reported in the past have generally not been systematically and quantitatively evaluated, and there is no standard method for such an evaluation at this point. Ultimately, we intend to conduct a human-in-the-loop usability study within the context of a specific set of tasks. At this point, however, we can report some specific quantitative performance figures for individual modules of our system. The accuracy of our 3D building models is on the order of 5 cm. The average error of our automatic camera pose estimation module, measured on a standard public dataset (YorkUrbanDB), is 2.5 degrees, which compares favourably with other published single-frame approaches [4]. While our tracking module operates at 4 frames per second on a standard PC, our camera pose algorithm takes about 14 seconds to estimate the camera pose from a single frame; thus it is useful for intermittent pan/tilt operation, but not for continuous smooth pursuit.

Demonstrations of our 3DTown system for hundreds of people have yielded substantial feedback that has been helpful in planning future work, which will include expanding the system to include more cameras, improvements to our tracker, minimization of rendering delay, introduction of fully articulated avatars, incorporation of automatic action labelling, and code optimizations.

REFERENCES

[1] M. Avriel. Nonlinear Programming: Analysis and Methods. Prentice Hall, 1976.

[2] S. T. Barnard. Interpreting perspective images. Artificial Intelligence, 21(4):435–462, 1983.

[3] J. Coughlan and A. Yuille. Manhattan World: Compass Direction from a Single Image by Bayesian Inference. International Conference on Computer Vision, 2:941–947, 1999.

[4] P. Denis, J. Elder, and F. Estrada. Efficient Edge-Based Methods for Estimating Manhattan Frames in Urban Imagery. European Conference on Computer Vision, pages 197–210, 2008.

[5] R. Duda and P. Hart. Use of the Hough Transformation to Detect Lines and Curves in Pictures. Communications of the ACM, 15(1):11–15, 1972.

[6] J. Elder, S. Prince, Y. Hou, M. Sizintsev, and E. Olevskiy. Pre-Attentive and Attentive Detection of Humans in Wide-Field Scenes. International Journal of Computer Vision, 72(1):47–66, 2007.

[7] J. H. Elder and S. W. Zucker. Local Scale Control for Edge Detection and Blur Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(7):699–716, 1998.

[8] M. Pollefeys et al. Detailed Real-Time Urban 3D Reconstruction from Video. International Journal of Computer Vision, 78(2-3):143–167, 2008.

[9] N. Friedman and S. Russell. Image Segmentation in Video Sequences: A Probabilistic Approach. In Proc. UAI, pages 175–181, 1997.

[10] T. Kanade, R. Collins, A. Lipton, P. Burt, and L. Wixson. Advances in Cooperative Multi-Sensor Video Surveillance. Proc. of DARPA Image Understanding Workshop, pages 3–24, 1998.

[11] K. Kim, S. Oh, J. Lee, and I. Essa. Augmenting Aerial Earth Maps with Dynamic Information. IEEE International Symposium on Mixed and Augmented Reality 2009, Science and Technology Proceedings, pages 19–22, 2009.

[12] J. Li-Chee-Ming, D. Gumerov, T. Ciobanu, and C. Armenakis. Generation of Three-Dimensional Photo-Realistic Models from LIDAR and Image Data. Proceedings 2009 IEEE Toronto International Conference - Science and Technology for Humanity, pages 445–450, 2009.

[13] C. Poullis and S. You. Automatic Reconstruction of Cities from Remote Sensor Data. IEEE Computer Vision and Pattern Recognition, 2009, pages 2775–2782, 2009.

[14] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, S. Hsu, D. Nister, and K. Hanna. Video Flashlights - Real Time Rendering of Multiple Videos for Immersive Model Visualization. Thirteenth Eurographics Workshop on Rendering (2002), pages 157–168, 2002.

[15] I. Sebe, J. Hu, S. You, and U. Neumann. 3D Video Surveillance with Augmented Virtual Environments. IWVS '03, November 7, 2003, Berkeley, California, USA, pages 107–112, 2003.

[16] R. Tal. Line-based single-view methods for estimating 3D camera orientation in urban scenes. Master's thesis, York University, 2011.
