Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013

Urban 3D Semantic Modelling Using Stereo Vision

Sunando Sengupta (1), Eric Greveson (2), Ali Shahrokni (2), Philip H. S. Torr (1)

(1) Oxford Brookes Vision Group, (2) 2d3 Sensing

Urban 3D Semantic Modelling Road Scene

• Given a sequence of stereo images, we generate a dense 3D semantic model.

[Figure: input stereo image sequence; dense 3D semantic model]

Pipeline – Semantic Reconstruction

• Stereo images

• Depth map generation

• Camera estimation

• Surface reconstruction

• Semantic labelling of street view images

• Semantic model generation


Camera Estimation

• Features are tracked across each left-right stereo pair and across consecutive frames.

Camera Estimation

• Use the feature tracks to estimate camera poses.

• Use bundle adjustment

[a] A. Geiger et al., Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite, CVPR 2012.
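Bundle adjustment refines camera poses (and 3D points) by minimising reprojection error. The sketch below only evaluates that cost for one camera. It assumes a pinhole model with intrinsics `K` and a world-to-camera pose `(R_cw, t_cw)`. The function names and the pose convention are illustrative and not taken from the slides.

```python
import numpy as np

def project(point_world, K, R_cw, t_cw):
    """Project a 3D world point into the image (pinhole model, assumed)."""
    p = K @ (R_cw @ point_world + t_cw)
    return p[:2] / p[2]

def reprojection_cost(points_world, observations, K, R_cw, t_cw):
    """Sum of squared pixel residuals between tracked features and projections."""
    cost = 0.0
    for X, uv in zip(points_world, observations):
        residual = uv - project(X, K, R_cw, t_cw)
        cost += float(residual @ residual)
    return cost
```

A bundle adjuster would hand this cost to a non-linear least-squares solver and optimise over all poses and points jointly.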

Depth-Map Estimation

• Semi-global block matching [1] for disparity estimation

• Per-pixel depth computed as z = (B × f) / d

– B: baseline
– f: focal length
– d: pixel disparity

[1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information, PAMI 2008.
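As a minimal sketch of the depth formula above, the snippet below converts a disparity map to depth with NumPy. The baseline and focal length are made-up example values. Marking non-positive disparities as infinitely far is an assumption for illustration, not something stated on the slides.

```python
import numpy as np

def depth_from_disparity(disparity, baseline_m, focal_px):
    """Per-pixel depth z = B * f / d; pixels with d <= 0 are marked as inf."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth

# Hypothetical rig: 0.5 m baseline, 700 px focal length.
example_disparity = np.array([[0.0, 10.0], [20.0, 40.0]])
example_depth = depth_from_disparity(example_disparity, baseline_m=0.5, focal_px=700.0)
```

Depth is inversely proportional to disparity, so distant points (small d) carry the largest depth uncertainty.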

Depth Fusion

• Depth estimates are fused using the estimated camera poses.

• They are fused into a truncated signed distance function (TSDF) volumetric representation [1].

[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images, SIGGRAPH 1996.
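To fuse depth maps from different frames, each depth estimate first has to be expressed in a common world frame. The sketch below back-projects a depth map through a pinhole camera and applies a camera-to-world pose `(R_wc, t_wc)`. The function name, the pose convention, and the requirement that every depth value is finite are assumptions made for this example.

```python
import numpy as np

def backproject_to_world(depth, K, R_wc, t_wc):
    """Turn an (H, W) depth map into an (H*W, 3) array of world points."""
    h, w = depth.shape
    v, u = np.indices((h, w))                       # pixel rows (v) and columns (u)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T              # rays with unit z in the camera frame
    points_cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    return points_cam @ R_wc.T + t_wc               # camera frame -> world frame
```

Points are returned in row-major pixel order, so pixel (u, v) ends up at index v * W + u.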

TSDF Volume [1]

• The entire space is divided into a grid of voxels.

• For each voxel, the truncated signed distance to the surface is computed:

– positive and increasing when the voxel lies in free space
– negative when it lies behind the surface
– zero when it lies on the surface

• This is performed for all depth maps.

[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images, SIGGRAPH 1996.
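The sign convention above can be sketched along a single camera ray. The truncation distance, the uniform weights, and the function names below are illustrative assumptions. The running weighted average follows the general scheme of [1] and is not necessarily the exact update used in this work.

```python
import numpy as np

def truncated_sdf(voxel_depth, surface_depth, trunc):
    """Signed distance along the ray, scaled by trunc and clipped to [-1, 1].

    Positive in free space (voxel in front of the surface), negative behind it.
    """
    return np.clip((surface_depth - voxel_depth) / trunc, -1.0, 1.0)

def fuse(tsdf, weight, new_tsdf, new_weight=1.0):
    """Running weighted average of TSDF observations per voxel."""
    fused = (tsdf * weight + new_tsdf * new_weight) / (weight + new_weight)
    return fused, weight + new_weight

# Eight voxel centres along one ray; two depth maps see the surface at 4.0 and 4.4.
voxel_depth = np.arange(0.5, 8.5, 1.0)
tsdf = np.zeros_like(voxel_depth)
weight = np.zeros_like(voxel_depth)
for surface_depth in (4.0, 4.4):
    tsdf, weight = fuse(tsdf, weight, truncated_sdf(voxel_depth, surface_depth, trunc=2.0))
```

The fused surface lies at the zero crossing of `tsdf`, which a method such as marching cubes can extract as a mesh.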

TSDF Volume

[Figure: camera, actual surface, and TSDF volume; each voxel stores its truncated signed distance to the surface]

-1   -.8  -.3   .2   .8   1    1    1
-1   -.9  -.4   .1   .5   1    1    1
-1   -1   -.8  -.2   .1   1    1    1
-1   -1   -.9  -.3   .2   .8   1    1
-1   -1   -.9  -.4   .3   .9   1    1
-1   -1   -.8  -.3   .3   .9   1    1
-1   -1   -.9  -.5   .2   .8   1    1
-1   -1   -.6   .1   .7   1    1    1

Incremental Volume Update

• Road scenes are long sequences of arbitrary length.

• A 3x3x1 volume of voxel grids is initialised.

• Volume is added incrementally as the vehicle moves out of the region.
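One way to picture the incremental update is a sparse dictionary of voxel blocks that always covers the 3x3x1 neighbourhood of the vehicle. This is only a sketch. The block size, the voxel resolution, the `VolumeGrid` class, and the choice to allocate around the current vehicle block are all hypothetical and not taken from the slides.

```python
import numpy as np

BLOCK_SIZE_M = 10.0    # hypothetical side length of one voxel grid, in metres
VOXELS_PER_SIDE = 8    # hypothetical resolution of one voxel grid

class VolumeGrid:
    """Sparse collection of voxel grids keyed by integer block index."""

    def __init__(self):
        self.blocks = {}

    def block_index(self, position):
        x, y, _ = position
        return (int(np.floor(x / BLOCK_SIZE_M)), int(np.floor(y / BLOCK_SIZE_M)), 0)

    def ensure_neighbourhood(self, position):
        """Allocate any missing block in the 3x3x1 region around the vehicle."""
        cx, cy, cz = self.block_index(position)
        added = 0
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                key = (cx + dx, cy + dy, cz)
                if key not in self.blocks:
                    # New voxels start at +1 (treated as free space); an assumption.
                    shape = (VOXELS_PER_SIDE,) * 3
                    self.blocks[key] = np.ones(shape, dtype=np.float32)
                    added += 1
        return added

grid = VolumeGrid()
initial_blocks = grid.ensure_neighbourhood((5.0, 5.0, 0.0))   # initial 3x3x1 region
new_blocks = grid.ensure_neighbourhood((15.0, 5.0, 0.0))      # vehicle moved one block along x
```

Because only blocks near the trajectory are ever allocated, memory grows with the distance driven rather than with the bounding box of the whole scene.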

Semantic Image Segmentation

• We use a conditional random field (CRF) framework.

[Figure: input image, CRF construction, final segmentation]

• Each pixel is a node in a grid graph G = (V, E).

• Each node is a random variable x taking a label from the label set.

[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.

Semantic Image Segmentation

• Total energy E = E_pix + E_pair + E_region

• E_pix: models the cost of an individual pixel taking a label.

– Computed via the dense boosting approach
– Multi-feature variant of TextonBoost [1]

[Figure: example costs for pixel x: Car 0.2, Road 0.3]

[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.


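• E_pair: models each pixel's interaction with its neighbourhood.

– Encourages label consistency between adjacent pixels and is sensitive to edges.

– Contrast-sensitive Potts model

[Figure: E_pair for neighbouring pixels xi and xj with labels Car and Road: cost 0 when the labels agree, g(i,j) when they differ]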


• E_region: models the behaviour of a group of pixels.

– Encourages all the pixels in a region to take the same label.

– Groups of pixels are given by multiple mean-shift segmentations.

[Figure: example costs for region c: Car 0.3, Road 0.1]
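The three terms of the total energy can be evaluated for a candidate labelling as sketched below. Several parts are assumptions made for illustration. The form g(i,j) = theta_p + theta_v * exp(-beta * (I_i - I_j)^2) is one common contrast-sensitive choice. The region term is a simplified stand-in: a truncated count of pixels that disagree with the region's majority label. All parameter values and the toy data are hypothetical, and the actual potentials are those of the cited CRF work.

```python
import numpy as np

def pixel_energy(labels, unary):
    """E_pix: sum of per-pixel label costs; unary has shape (H, W, num_labels)."""
    rows, cols = np.indices(labels.shape)
    return float(unary[rows, cols, labels].sum())

def pair_energy(labels, image, theta_p=0.1, theta_v=1.0, beta=1.0):
    """E_pair: contrast-sensitive Potts cost over 4-connected neighbours."""
    neighbour_slices = (
        ((slice(None), slice(None, -1)), (slice(None), slice(1, None))),   # horizontal
        ((slice(None, -1), slice(None)), (slice(1, None), slice(None))),   # vertical
    )
    total = 0.0
    for a, b in neighbour_slices:
        g = theta_p + theta_v * np.exp(-beta * (image[a] - image[b]) ** 2)
        total += float(g[labels[a] != labels[b]].sum())   # cost 0 where labels agree
    return total

def region_energy(labels, segmentations, per_pixel=0.5, max_cost=2.0):
    """E_region (simplified): truncated penalty for pixels off the majority label."""
    total = 0.0
    for segments in segmentations:
        for seg_id in np.unique(segments):
            region_labels = labels[segments == seg_id]
            majority = np.bincount(region_labels).argmax()
            disagreeing = int((region_labels != majority).sum())
            total += min(per_pixel * disagreeing, max_cost)
    return total

# Toy 2x2 example: label 0 = Car, label 1 = Road.
labels = np.array([[0, 0], [0, 1]])
unary = np.empty((2, 2, 2))
unary[..., 0], unary[..., 1] = 0.2, 0.3
image = np.zeros((2, 2))                       # flat grey-level image
segmentations = [np.zeros((2, 2), dtype=int)]  # a single region covering the image
total = (pixel_energy(labels, unary)
         + pair_energy(labels, image)
         + region_energy(labels, segmentations))
```

Inference searches for the labelling with the lowest total energy. This sketch only scores a given labelling and does not perform that minimisation.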

Semantic Image Segmentation - Results

• Input images, output of our image-level CRF, and ground truth.

Mesh Face Labelling

• A histogram of labels (Z_f) is built for each mesh face by projecting points from the face into the labelled images.

• The majority label is taken as the label of the face.
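The face labelling step can be sketched as follows: project sample points of a face into each labelled image, accumulate a label histogram, and take the arg-max. The pinhole projection, the world-to-camera pose convention, the nearest-pixel lookup, and the `views` tuple layout are assumptions made for this example.

```python
import numpy as np

def face_label_histogram(face_points, views, num_labels):
    """Histogram Z_f of labels seen by the points of one mesh face.

    views is a list of (K, R_cw, t_cw, label_image) tuples (hypothetical layout).
    """
    hist = np.zeros(num_labels, dtype=int)
    for K, R_cw, t_cw, label_image in views:
        h, w = label_image.shape
        points_cam = face_points @ R_cw.T + t_cw
        for X in points_cam:
            if X[2] <= 0:                     # behind the camera: not visible
                continue
            u, v, _ = K @ X / X[2]
            col, row = int(round(u)), int(round(v))
            if 0 <= row < h and 0 <= col < w:
                hist[label_image[row, col]] += 1
    return hist

def face_label(face_points, views, num_labels):
    """Majority label of the face."""
    return int(np.argmax(face_label_histogram(face_points, views, num_labels)))
```

This sketch ignores occlusion. A fuller version would skip views in which the face is hidden behind other geometry.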

Semantic Model

Top: left, surface reconstruction; right, semantic model.
Bottom: left, input image; right, object label set.

Evaluation

• The model is projected back into the images using the estimated camera poses to create labelled images.

• Points in the model that are far from the camera are ignored in the projection.

Evaluation

• Metrics

– Recall = tp/(tp+fn)
– Intersection vs Union = tp/(tp+fn+fp)
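Both metrics can be computed per class from a predicted and a ground-truth label image, as in this minimal sketch. The function name and the choice to return NaN for a class that never occurs are assumptions.

```python
import numpy as np

def recall_and_iou(predicted, ground_truth, label):
    """Per-class recall tp/(tp+fn) and intersection vs union tp/(tp+fn+fp)."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    tp = int(np.sum((predicted == label) & (ground_truth == label)))
    fn = int(np.sum((predicted != label) & (ground_truth == label)))
    fp = int(np.sum((predicted == label) & (ground_truth != label)))
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    iou = tp / (tp + fn + fp) if tp + fn + fp > 0 else float("nan")
    return recall, iou
```

Intersection vs union is the stricter measure because it also penalises false positives, which recall ignores.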

Future Work

http://cms.brookes.ac.uk/research/visiongroup/projects

• Use semantics to build the structure.

• Real-time implementation.

• Combine image level information and geometric contextual information.

Thank you!!!

http://cms.brookes.ac.uk/research/visiongroup/projects/SemanticUrbanModelling/index.php





