

A FAST IMAGE SEGMENTATION ALGORITHM USING COLOR AND DEPTH MAP

Emanuele Mirante1, Mihail Georgiev2, Atanas Gotchev2

1Università degli Studi Roma TRE, 2Tampere University of Technology

ABSTRACT

In this paper, a real-time image segmentation algorithm is presented. It utilizes both color and depth information retrieved from a multi-sensor capture system that combines a stereo camera pair with a time-of-flight range sensor. The algorithm targets low complexity and a fast implementation, which can be achieved through parallelization. Applications such as immersive videoconferencing and lecturer segmentation for augmented-reality lecture presentation can benefit from the designed algorithm.

Index Terms—Object segmentation, Real time systems, Videoconference, Image Processing.

1. INTRODUCTION

Image segmentation is an essential operation in many applications, such as action recognition and tracking [1], [2], immersive videoconferencing [3], and biomedical imaging, to name a few. The challenges in image segmentation stem from the requirements for precise object boundaries and real-time implementation. In general, segmentation algorithms are application-specific. Some rely only on color information [3], [4], [5], [6]; others utilize data from supplementary devices such as infrared sensors [7] or Time-of-Flight (ToF) cameras [1], [2]. Color-based algorithms can provide high-quality background subtraction at the price of high computational complexity [5], [6], [8], and their performance degrades under illumination changes or local motion in the background. The use of scene geometry, e.g. through depth-from-stereo [9], is an attractive addition to color segmentation, but it is also computationally demanding. Practitioners have therefore opted to include a dedicated range sensor to complement color-based segmentation, achieving better robustness at lower complexity. One example is the solution proposed in [7], where an infrared (IR) camera is employed. Another alternative is a ToF camera; this type of sensor has emerged as a reliable range device for indoor applications, with increasing precision and decreasing cost. However, such cameras suffer from rather low spatial resolution and capture noise due to multiple reflections, and they also impose synchronization issues.

This work aims at tackling some of those problems by combining color with depth information in a jointly-calibrated 2D/ToF camera system. The depth information is used to simplify the detection of the regions of interest (ROI), making the algorithm faster and less complex. The color information is then used to provide feedback about segmentation confidence, which further refines the output quality. The proposed solution is suitable for segmenting a foreground moving object against a background that is not strictly static. The main targeted applications are 3D videoconferencing (superimposing multiple participants in a common scene) and lecturer segmentation for augmented-reality lecture presentations (background change for an "immersive" lecture environment).

2. PROPOSED METHOD

The proposed method utilizes three kinds of information: motion, color, and depth. Motion is used first, to obtain a prior on the position of the ROI within the scene. The initial segmentation area is then obtained from the depth information, and a final refinement is done using the color information. The flow diagram in Fig. 1 shows the main steps of the algorithm, which are described in more detail in the next sections.

Fig. 1. Segmentation algorithmic flow diagram. [Blocks: Motion info and Depth info feed Region growing; Edge-based refining and Color-based refining follow; a Previous mask provides feedback; the output is the Final mask.]

2.1. Motion detection

The aim of this step is to detect object movements, which usually appear as changes between frames. Specifically, two consecutive frames are compared pixel-wise for color changes, while the changes in the depth channel are also followed:

|color(x, t) − color(x, t − 1)| > th   and   depth(x, t) − depth(x, t − 1) > 0,   (1)


This work was supported by the EC within FP7 (Grant 216503, acronym MOBILE3DTV).



where color(x, t) and color(x, t − 1) are the current and previous color-frame pixels at coordinates x = (x, y) and times t and t − 1, depth(x, t) and depth(x, t − 1) are the depth values of the corresponding pixels, and th is a threshold chosen to distinguish substantial movements from noise (e.g. 30% of the maximum change value). Depth information acts as a secondary source of motion detection, refining the areas of color-detected motion. This operation yields a mask of regions with detected motion.
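As an illustration, a minimal NumPy sketch of this test; the function name, the channel reduction, and the 8-bit color assumption behind the default threshold are ours, not from the paper:

```python
import numpy as np

def motion_mask(color_t, color_tm1, depth_t, depth_tm1, th=0.3 * 255):
    """Pixel-wise motion mask per Eq. (1): a substantial color change
    confirmed by a positive depth change between consecutive frames."""
    diff = np.abs(color_t.astype(np.float32) - color_tm1.astype(np.float32))
    if diff.ndim == 3:
        diff = diff.max(axis=2)          # reduce color channels to one change score
    moved = diff > th                    # e.g. th = 30% of the maximum change value
    closer = (depth_t.astype(np.float32) - depth_tm1.astype(np.float32)) > 0
    return moved & closer                # True where motion is detected
```

The resulting boolean mask provides the seed pixels for the region growing of Section 2.2.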

2.2. Creation of foreground segmentation mask

After motion detection, we estimate an initial foreground mask with a region-growing algorithm that uses so-called "seed pixels"; the initial seeds are the pixels obtained in the previous step. The idea of region growing is the following: a chosen seed pixel is compared with its neighboring pixels, and if these are similar to the seed, they are added to the seed region, thus growing it. The iteration ends when there are no more pixels to include in the region. Let R(t) denote the set of pixels in the current frame that belong to the growing region, and P(t) = {p1 = (x1, y1), p2 = (x2, y2), ...} a subset of R(t) formed only by points adjacent to the particular pixel (x, y). Similarity is calculated as follows:

|f(x, y) − f(pk)| < th,   (2)

where f(·) is the depth value, pk is any point belonging to P(t), and th is a threshold. A pixel is included in R(t) only when its depth value lies in the range [f(pk) − th, f(pk) + th]. As the region grows, this range is updated as well, because it is not tied to the initial value. In this way, starting from the points with the largest movements, the complete mask of the moving object can be obtained, assuming that the depth values change smoothly. The th parameter sets the tolerance for depth change; its value should be selected high enough to cover the whole area of the moving object, which reduces possible errors at the subsequent refining stages.
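A minimal breadth-first sketch of such a region grower over the depth map; 4-connectivity and the function names are our assumptions, since the paper does not specify the traversal order:

```python
import numpy as np
from collections import deque

def grow_region(depth, seeds, th):
    """Depth-based region growing per Eq. (2). `depth` is an (H, W) map,
    `seeds` an iterable of (row, col) motion pixels, `th` the depth tolerance.
    The acceptance test |f(x) - f(p_k)| < th is re-anchored at each accepted
    pixel, so the range follows smooth depth variations as the region grows."""
    seeds = list(seeds)
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    for r, c in seeds:
        mask[r, c] = True
    queue = deque(seeds)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-connected neighbors
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                if abs(float(depth[nr, nc]) - float(depth[r, c])) < th:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```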

Fig. 2. Edge-based refining flow diagram. [Blocks: initial foreground shape → projection → dilation; color frame → edge detection → dilation; the two branches are combined by a logical AND.]

2.3. Edge-based refining

The estimated foreground mask contains errors introduced by the region-growing algorithm when it is applied to noisy depth maps. Errors in the depth map lead to the false inclusion of large areas of background data in the foreground mask, as

illustrated in Fig. 3a. Errors of this kind can be tackled by using the more precise edge information available in the color data, which results in an improved foreground mask. Thus, an edge-based refining process is applied to reduce such errors. More specifically, we slightly enlarge the foreground mask by dilation so as to ensure that it contains all important edges. A Canny operator is applied to the color frame within the enlarged mask, and the detected edges are joined using a dilation operator. Finally, a logical AND is applied between the result and the initial foreground mask to exclude any redundant background data. In this way, the initial foreground mask fits only around the external edges of the ROI. Fig. 2 shows the algorithm and Fig. 3b illustrates the result after edge-based refining.
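One possible reading of this chain, sketched with OpenCV (real cv2 calls; the Canny thresholds, kernel size, and iteration counts are illustrative, as the paper does not report its parameter values, and the final fitting of the mask to the external edges is not shown):

```python
import cv2
import numpy as np

def edge_refine(fg_mask, color_frame, canny_lo=50, canny_hi=150):
    """Edge-based refining: constrain the region-grown mask by color edges.
    `fg_mask` is an (H, W) uint8 {0, 255} mask, `color_frame` a BGR image."""
    kernel = np.ones((3, 3), np.uint8)
    enlarged = cv2.dilate(fg_mask, kernel, iterations=2)   # slightly enlarge the mask
    gray = cv2.cvtColor(color_frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_lo, canny_hi)            # precise edges from color data
    edges = cv2.dilate(edges, kernel, iterations=1)        # join broken edge fragments
    return cv2.bitwise_and(enlarged, edges)                # AND excludes redundant background
```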

Fig. 3. (a) Initial foreground shape; (b) after edge-based refining; (c) continuation consistency.

2.4. Color-based refining

Another refinement step reduces boundary errors in the depth data by refining the foreground mask with color feedback. We generate a so-called "tri-map": pixels inside the foreground mask are marked as "certain foreground", those outside as "certain background", and pixels near the edges as "uncertain" (see Fig. 4a for an illustration).

Fig. 4. Color-based refinement: (a) tri-map: white pixels are "certain foreground", dark gray pixels are "certain background", medium gray pixels are "uncertain"; (b) decision on uncertain pixels; (c) resulting foreground mask.

A lower-precision depth map requires a wider "uncertain" area. Whether an uncertain pixel belongs to the foreground or the background is decided by comparing the color of the pixel to those of the nearest "certain foreground" and "certain background" pixels; in case of ambiguity, the pixel is classified by the closest spatial distance. Color comparison is done by nearest-neighbor (k-NN) search in the CIELAB color space. For faster comparison, we use a 3D kd-tree [10] in which all pixel color values are taken into account.
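A minimal sketch of this uncertain-pixel classification, assuming SciPy's cKDTree and scikit-image's rgb2lab (both real APIs); the spatial tie-break for ambiguous cases is omitted for brevity, and the tri-map encoding is our own:

```python
import numpy as np
from scipy.spatial import cKDTree
from skimage.color import rgb2lab

def classify_uncertain(image_rgb, trimap):
    """Resolve 'uncertain' tri-map pixels by k-NN color search in CIELAB.
    `trimap`: (H, W) ints with 0 = certain background, 1 = uncertain,
    2 = certain foreground. Returns the final boolean foreground mask."""
    lab = rgb2lab(image_rgb)                         # perceptually uniform color space [11]
    fg_tree = cKDTree(lab[trimap == 2])              # 3D kd-tree over certain-FG colors [10]
    bg_tree = cKDTree(lab[trimap == 0])              # ... and over certain-BG colors
    unc = lab[trimap == 1]
    d_fg, _ = fg_tree.query(unc)                     # color distance to nearest certain FG
    d_bg, _ = bg_tree.query(unc)
    out = trimap.copy()
    out[trimap == 1] = np.where(d_fg <= d_bg, 2, 0)  # assign by the closer color match
    return out == 2
```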



CIELAB was chosen among other color spaces because of its reported perceptual uniformity, which in our case provided better results [11]. Moreover, reducing the number of colors through quantization allowed a faster implementation and introduced fewer errors than any other tested color space.

2.5. Continuation consistency

For each frame of the input video, a foreground mask is produced as output. This information is available as a consistency check for the next frames and can be utilized in all described stages of the segmentation algorithm to improve overall performance and robustness. It improves the continuity of segmentation, e.g. when there are not enough detected initial seeds for region growing due to low motion (changes in corresponding pixels are mainly caused by noise), or when there are problems with the generation of the tri-maps. Another important case where this additional information helps is when the scene contains fast motion not linked to the segmented object in the other frames. This typically happens in teleconferencing systems, when changes in the background are larger than the speaker's motion, or when body movements near the camera (hands, a pen, etc.) are only loosely related to the speaker, as in Fig. 6a. In the first example, the previous segmentation mask helps to distinguish the proper current one. In the second example, the choice is made based on the depth values in the newly calculated mask: if they are close to the camera, they probably belong to the previously segmented object (e.g. the hands of the speaking person in the 'Bullinger' video).
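These rules are largely heuristic; the following simplified sketch shows one way they could be wired together (all thresholds, names, and the near-camera depth convention are our assumptions, not the paper's):

```python
import numpy as np

def consistency_check(new_mask, prev_mask, depth, near_th, min_ratio=0.05):
    """If the new mask is implausibly small (low motion yields too few seeds),
    fall back to the previous mask; otherwise keep only regions that overlap
    the previous mask or lie close to the camera (assuming smaller = closer)."""
    if new_mask.sum() < min_ratio * max(prev_mask.sum(), 1):
        return prev_mask.copy()                  # low motion: reuse the previous mask
    near_camera = depth < near_th                # e.g. the speaker's hands
    return new_mask & (prev_mask | near_camera)  # drop unrelated background motion
```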

Fig. 5. Hardware 2D/PMD system and real data acquisition.

3. EXPERIMENTAL RESULTS

Three video sequences have been used to test the algorithm, as shown in Fig. 6. The 'Bullinger' video, of 320x192-pixel resolution, shows a person speaking in front of the camera [12]. Moderate motion is observed for the foreground object only. The depth map for the video is obtained using Hybrid Recursive Stereo Matching (HRM) [13]. The data is rather noisy: visible artifacts occur in areas of homogeneous color and there is no clear depth separation of foreground from background. The 'Ballet' video, of 1024x768-pixel resolution, contains a scene with fast motion changes between consecutive frames and more than one object in the scene [14]. The third video, entitled 'Speaker', was collected by a hybrid color + ToF camera setup connected to a desktop computer, as illustrated in Fig. 5. The color camera was a Prosilica GE1900C by Allied Vision Technologies, and the ToF device was a Cam Cube 2.0 by PMDTek. The captured video has a resolution of 300x300 pixels at 15 fps. The controlling computer, used both to capture the video and to process all three videos, was a PC with a 3 GHz Core2Duo processor and 2 GB RAM.

Using an application developed in Matlab, we have achieved an average speed of about 0.2 sec/frame for ‘Bullinger’, about 0.8 sec/frame for ‘Ballet’, and about 0.25 sec/frame for ‘Speaker’.

Fig. 6. Segmentation results for the tested videos: (a) 'Bullinger', (b) 'Ballet', (c) 'Speaker'.

The tests included an objective evaluation of the proposed algorithm against ground-truth segmentation. The ground-truth segments were obtained by manual, precise selection of background and foreground. The comparison between the segments obtained by the algorithm and the ground truth was done in terms of the number of differing pixels between the two. We also compared our results with those provided by the SIOX method [8]. For the latter, we obtained the foreground mask by repeating the procedure several times and selecting the best result, as described in [8]. Fig. 7 summarizes the results. As seen in the figure, the proposed algorithm outperforms the SIOX method for 'Speaker' and 'Ballet', and is inferior for 'Bullinger'. For 'Speaker', the problem lies in the depth data provided by the PMD sensor: the lower-resolution depth images are not aligned very precisely near the edges of the moving object, and it is unlikely that depth and color are superimposed correctly for every pixel. This is, however, tackled by our algorithm. For 'Ballet', the reason for the much higher performance of our algorithm is the presence of artifacts in the depth data. For the SIOX method, it results in segmentation with


large areas of the background (e.g. the dance floor) marked as foreground.
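For reference, the per-frame error measure described above (the fraction of pixels differing from the ground truth, reported as a mean percentage) reduces to a few lines; the function name is ours:

```python
import numpy as np

def error_pixels(result_mask, ground_truth_mask):
    """Percentage of pixels where the computed and ground-truth masks differ."""
    diff = result_mask.astype(bool) ^ ground_truth_mask.astype(bool)
    return 100.0 * diff.mean()
```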

Fig. 7. Results on test data compared against SIOX [8]; vertical axis: error pixels [mean; %].

Fig. 8. Segmentation problems: (a) near object contact; (b) bad depth map and edges in the color map.

4. CONCLUSION AND FUTURE WORK

In this paper we presented a new approach for a real-time segmentation application utilizing both 2D color and depth information. The implemented algorithm demonstrates fast and robust performance in real-case scenarios. The main difference between our solution and others is that we use not only depth to calculate the tri-map, but also motion and color. This makes practical use possible under different capture conditions (different levels of depth-noise artifacts, ToF depth, etc.) and scenarios (multiple persons, change of background, etc.). Two main limitations can be observed in some cases. First, the proposed algorithm segments a moving object in the scene under the assumption of smooth, continuous depth data. Second, when the moving object interacts with other foreground objects, those objects will be considered part of the moving one (Fig. 8a). Another, minor issue is that in the edge-based refining stage, if the area that should be removed contains some background edges, they will be considered foreground as well (Fig. 8b). Our real-time implementation is of low complexity and is suitable for mobile devices. For example, a 3D mobile device could use the algorithm to track the region of interest in order to enhance the resolution of depth and color data for static and moving objects. We are currently studying techniques for automatically adapting the algorithm parameters to the scene content. We are also testing how to restore information in occluded areas of the projected PMD data by utilizing the same segmentation approach.

5. REFERENCES

[1] S. Oprisescu, C. Burlacu, V. Buzuloiu, "Action Recognition using Time of Flight Cameras", 8th Int. Conf. on Communications (COMM), 2010.
[2] S. Guðmundsson, J. Sveinsson, M. Pardas, "Model-Based Hand Gesture Tracking in ToF Image Sequences", in F.J. Perales, R.B. Fisher (Eds.), AMDO 2010, LNCS 6169, pp. 118-127, 2010.
[3] J. Civit, O. D. Escoda, "Robust Foreground Segmentation for GPU Architecture in an Immersive 3D Video-Conferencing Systems", IEEE Int. Workshop on MMSP, 2010.
[4] J. Fan, D. K. Y. Yau, A. K. Elmagarmid, W. G. Aref, "Automatic Image Segmentation by Integrating Color-Edge Extraction and Seeded Region Growing", IEEE Trans. Image Processing, Vol. 10, No. 10, pp. 1454-1466, October 2001.
[5] S. Kwak, G. Bae, H. Byun, "Moving-object segmentation using a foreground history map", Journal of the Optical Society of America A, Vol. 27, No. 2, pp. 180-187, February 2010.
[6] S. Chien, S. Ma, L. Chen, "Efficient Moving Object Segmentation Algorithm Using Background Registration Technique", IEEE Trans. CSVT, Vol. 12, No. 7, pp. 577-586, 2002.
[7] Q. Wu, P. Boulanger, W. Bischof, "Robust Real-Time Bi-Layer Video Segmentation Using Infrared Video", CRV '08, 2008.
[8] G. Friedland, K. Jantz, R. Rojas, "SIOX: Simple Interactive Object Extraction in Still Images", 7th IEEE Int. Symposium on Multimedia (ISM), 2005.
[9] Y. Ma, Q. Chen, "Stereo-Based Object Segmentation Combining Spatio-Temporal Information", Lecture Notes in Computer Science, Vol. 6455, pp. 229-238, 2010.
[10] J. L. Bentley, "Multidimensional binary search trees used for associative searching", Communications of the ACM, Vol. 18, pp. 509-517, 1975.
[11] A. D'Angelo, J. Dugelay, "A Statistical Approach to Culture Colors Distribution in Video Sensors", 5th Int. Workshop on VPQM, January 2010.
[12] Mobile 3DTV research, video-plus-depth sequences: http://sp.cs.tut.fi/mobile3dtv/video-plus-depth.
[13] N. Atzpadin, P. Kauff, "Stereo Analysis by Hybrid Recursive Matching for Real-Time Immersive Video Conferencing", IEEE Trans. CSVT, Vol. 14, No. 3, pp. 321-334, 2004.
[14] MSR 3D Video Sequences. Available at: www.research.microsoft.com/vision/ImageBasedRealities/3DVideoDownload.
