3D Scanning Technology Overview: Kinect Reconstruction Algorithms Explained
DESCRIPTION
Primesense depth cameras are the new standard in 3D scanning technology. Since the debut of the Microsoft Kinect, which uses Primesense's infrared LightCoding structured-light technology, the sensors have been mass-produced and sold at a much lower price. In this slide deck, we describe the basics of Primesense-based 3D scanning technology from a physical and computational viewpoint.
TRANSCRIPT
3D Scanning Technology
Milwaukee 3D Printing Meetup & Voxel Metric
Agenda
• Demo
• Overview of scanning technologies
• Primesense sensors in-“depth”
• Freehand camera reconstruction core algorithms
• 3D freehand reconstruction algorithm, step-by-step
Demo
Overview of scanning technologies
Contact
• Precise
• Slow
• Touch not always practical
Line Laser
• Line appears distorted from the camera’s point of view
• Geometry inferred from distortion
Stereoscopic
• Two offset cameras, like human vision
• Computationally expensive
Time-of-Flight
• Single-point: Not a 3D scanner by itself, but underlies other technology
• LIDAR: Like single-point, but uses mirrors to rapidly take many measurements across the scene
• ToF camera: Modulated light and phase detection (see the sketch below)
• Gated/shuttered cameras
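For reference, the modulated-light principle can be summarized in a few lines: the sensor measures the phase shift between the emitted and returned light and converts it to distance. This is only a minimal sketch of the idea; the modulation frequency and phase value in the example are illustrative, and real sensors must also handle phase wrapping.

```python
# Minimal sketch of modulated-light time-of-flight: distance implied by the
# phase shift between emitted and received light (phase wrapping ignored).
import math

C = 299_792_458.0  # speed of light, m/s

def tof_distance(phase_shift_rad: float, modulation_hz: float) -> float:
    """Distance implied by a measured phase shift at a given modulation frequency."""
    return C * phase_shift_rad / (4.0 * math.pi * modulation_hz)

# Example: a 30 MHz modulated source with a pi/2 phase shift implies ~1.25 m.
print(tof_distance(math.pi / 2, 30e6))
```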
More scanning technologies
Photogrammetry
• Like stereoscopic, but with many “eyes” in undefined positions
• Images are stitched together and depth is inferred
• Doesn’t work well with concave surfaces
Volumetric
• Many see-through images
• Complex reconstruction algorithms
Structured Light
• Like a laser scanner on steroids
• Patterns projected onto the object
• A camera positioned at an offset captures images of the projection
• Distortions in the captured pattern are used to infer geometry
Primesense’s LightCoding™ Structured Light Scanners
IR pattern projector
• IR laser light reflects off a hologram to paint the pattern onto the scene
IR pattern
• Does not change over time
• Looks random, but there are a few markers
Pattern grid
• The pattern is repeated in a 3 × 3 grid
Primesense LightCoding™
The dot pattern, or information about it, is hardcoded on the Primesense chip.
The sensor’s IR camera takes video of the pattern as it is projected on the scene.
Objects in the scene distort the way the pattern looks to the camera. Depth is inferred from these distortions.
How exactly is depth determined?
• It’s proprietary!
• But that won’t stop people from guessing.
• There are at least two likely methods…
Pattern shifting with distance
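One common way to model the pattern-shift idea is as triangulation between the IR projector and the IR camera: each dot shifts horizontally (a disparity) relative to a stored reference pattern as the surface moves, and depth follows from depth = focal_length × baseline / disparity. This is a hedged sketch of that model only; the focal length and baseline values below are placeholders, not Primesense's actual calibration.

```python
# Hedged sketch of the pattern-shift hypothesis: treat the IR projector and
# IR camera as a stereo pair and triangulate depth from the horizontal shift
# (disparity) of each dot relative to a stored reference pattern.
# The focal length and baseline below are placeholders, not real calibration.

FOCAL_LENGTH_PX = 580.0   # IR camera focal length in pixels (assumed)
BASELINE_M = 0.075        # projector-to-camera baseline in meters (assumed)

def depth_from_disparity(disparity_px: float) -> float:
    """Depth in meters for a dot shifted `disparity_px` pixels from its reference position."""
    if disparity_px <= 0:
        raise ValueError("dot must shift toward the projector side")
    return FOCAL_LENGTH_PX * BASELINE_M / disparity_px

# A dot shifted 29 pixels would imply a surface roughly 1.5 m away.
print(round(depth_from_disparity(29.0), 2))
```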
Astigmatic Optics
• The camera has a lens with different focal lengths in the X and Y directions
Dots become blurry as the target surface moves away from the camera’s focal point, but in a specific way.
They may change shape or apparent orientation with distance.
What is computed on the sensor chip?
• The depth map is computed within the sensor; the host computer needs only to read the depth data.
• This reduces the required computation for the host device.
• It also probably helps to keep the Primesense algorithms secret.
• Skeletal tracking is run on the host computer.
• The algorithms were created through machine learning.
• Designed to be fast, for minimal stress on the Xbox 360.
• Many, many processor hours were spent “learning” these algorithms.
• The hard work is done ahead of time; the game only needs to run the depth data through the process and get skeletal data out.
What is sent from sensor to host?
IR Stream
• What the chip “sees” to produce the depth image
• Not often used by the host device
Color Stream
• Not used for computing depth
• Can be “registered” with the depth image for correspondence between depth and RGB pixels
• Used to produce XYZRGB point data
Depth Stream
• Result of computation on the IR camera video
• Like a normal image/video stream, except pixel intensity represents distance from the sensor, not color and brightness
Reconstruction algorithm used with Primesense cameras
How we get from a series of depth images to representations of real-world objects.
Examples: Kinect Fusion, KinFu, ReconstructMe, Skanect, Digifii
SLAM
• S.L.A.M. – Simultaneous Localization And Mapping
• Track where the camera is located while building a model of the scene at the same time
• Done simultaneously since there is much overlap in the types of information that are calculated
ICP – Iterative Closest Point
• The core of camera localization and point cloud alignment
• Algorithm summary (a simplified sketch follows this list):
• As the camera moves, it sees a different perspective of the scene at every frame, but there is some overlap.
• ICP repeatedly rotates and translates what the camera sees this frame until it finds the best overlap with what the camera saw in the last frame.
• When the best-matching rotation and translation are found, we not only know how to stitch the frames together, but also know how the camera moved between frames.
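For illustration, here is a minimal point-to-point ICP sketch in NumPy. The real pipeline typically uses a point-to-plane error with projective data association on the GPU; the function names and the brute-force nearest-neighbor search here are just illustrative.

```python
# Simplified point-to-point ICP sketch (NumPy). It repeatedly matches each new
# point to the closest point in the previous frame, then solves for the rigid
# rotation/translation that best aligns the matches.
import numpy as np

def best_fit_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(new_frame, prev_frame, iterations=10):
    """Align new_frame (Nx3) to prev_frame (Mx3); returns the cumulative R, t."""
    R_total, t_total = np.eye(3), np.zeros(3)
    src = new_frame.copy()
    for _ in range(iterations):
        # Data association: closest point in the previous frame for each point.
        d = np.linalg.norm(src[:, None, :] - prev_frame[None, :, :], axis=2)
        matches = prev_frame[d.argmin(axis=1)]
        R, t = best_fit_transform(src, matches)
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total  # also tells us how the camera moved between frames
```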
SLAM on GPU
• The full SLAM process should occur 30 times per second, at the rate depth frames are sent from the camera.
• If the process is too slow, frames must be skipped.
• Skipped frames make it harder for ICP to be successful.
• To have the algorithm run as fast as possible, the problem is broken up into chunks and run in parallel on a standard graphics card.
• Graphics cards contain thousands of processor units and excel at tasks which can be parallelized.
Reconstruction, step-by-step
Step 1: Receive a depth frame
The software receives a depth frame from the camera.
Raw depth data from the camera is often noisy.
Step 2: Bilateral filtering
Bilateral filtering removes noise from the image, while maintaining sharp transitions between pixels.
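A toy NumPy version of the filter, to show the idea: each output pixel is a weighted average of its neighbors, with weights that fall off with both spatial distance and depth difference, so noise is smoothed but sharp depth edges survive. A real implementation would use an optimized routine such as OpenCV's cv2.bilateralFilter; the parameter values here are illustrative.

```python
# Toy bilateral filter for a depth image: weights combine a spatial Gaussian
# with a depth-difference ("range") Gaussian, preserving sharp transitions.
import numpy as np

def bilateral_filter(depth, radius=3, sigma_space=2.0, sigma_depth=30.0):
    depth = depth.astype(np.float32)
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial_w = np.exp(-(xs**2 + ys**2) / (2 * sigma_space**2))
    padded = np.pad(depth, radius, mode="edge")
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            range_w = np.exp(-(window - depth[y, x])**2 / (2 * sigma_depth**2))
            weights = spatial_w * range_w
            out[y, x] = (weights * window).sum() / weights.sum()
    return out
```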
Step 3: Downsampling
The full size depth image is scaled down to half- and quarter-sized copies.
The copies are used to do 3 levels of ICP alignment.
The small image enables quick rough alignment.
As we repeat ICP with the more detailed images, the alignment is refined.
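A rough sketch of building that pyramid by 2 × 2 block averaging of valid (nonzero) depth pixels. This is a simplification; real implementations are more careful about not averaging across depth edges.

```python
# Sketch of the 3-level depth pyramid: full resolution plus half- and
# quarter-size copies, built by averaging each 2x2 block of valid pixels.
import numpy as np

def downsample_half(depth):
    h, w = depth.shape
    blocks = depth[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    valid = (blocks > 0).sum(axis=(1, 3))
    summed = blocks.sum(axis=(1, 3))
    return np.where(valid > 0, summed / np.maximum(valid, 1), 0)

def build_pyramid(depth):
    half = downsample_half(depth)
    quarter = downsample_half(half)
    return [depth, half, quarter]   # levels 0 (full), 1 (half), 2 (quarter)
```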
Step 4: Map the 3 depth frames
• Images of depth pixels are not easy to work with mathematically
• They visualize depth in parts of an image, but not explicit geometric coordinates
• To run ICP, we need to rotate and translate the scene
• Before ICP, convert the full and resized images from a pixel-based to a 3D coordinate-based representation
• Result: a vertex map and a normal map for each of the 3 depth images – a pair of maps at 3 different levels of accuracy
Step 4: Map the 3 depth frames
Vertex Map (point cloud) Normal Map
Step 4: Map the 3 depth frames (a sketch of both mappings follows this list)
• Vertex mapping
• Pixel brightness = distance
• Pixel position in image -> X/Y
• Also known: camera field-of-view
• With the angles and the distance, do trigonometry to get a 3D coordinate for each pixel
• Repeat for every pixel
• Do in parallel on GPU
• Normal mapping
• Find the orientation of the surface for each vertex just mapped
• Useful for ICP
• Look at the closest neighbors
• Implementations vary, but a common way is to compute a vector cross product between a vertex and two neighbors.
• Repeat for each vertex.
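A compact sketch of both mappings, assuming a pinhole camera model: back-project each depth pixel into a 3D vertex using the camera intrinsics, then estimate a normal from the cross product of vectors to neighboring vertices. The focal length and principal point values are placeholders, not the sensor's real calibration.

```python
# Sketch of Step 4: depth image -> vertex map (pinhole back-projection) and
# vertex map -> normal map (cross product of neighbor vectors).
import numpy as np

FX = FY = 570.0          # focal lengths in pixels (assumed)
CX, CY = 320.0, 240.0    # principal point for a 640x480 image (assumed)

def vertex_map(depth_m):
    h, w = depth_m.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    X = (xs - CX) / FX * depth_m
    Y = (ys - CY) / FY * depth_m
    return np.dstack([X, Y, depth_m])          # (h, w, 3) vertices in meters

def normal_map(verts):
    # Vectors to the right and downward neighbors; their cross product points
    # along the surface normal. Border pixels are left as zero.
    n = np.zeros_like(verts)
    dx = verts[1:-1, 2:] - verts[1:-1, 1:-1]
    dy = verts[2:, 1:-1] - verts[1:-1, 1:-1]
    cross = np.cross(dx, dy)
    norm = np.linalg.norm(cross, axis=2, keepdims=True)
    n[1:-1, 1:-1] = cross / np.maximum(norm, 1e-9)
    return n
```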
Step 5: Do ICP (a sketch of this schedule follows the list)
• Vertex and normal maps are used together, so each vertex has a position and an orientation
• First, align the low-res maps to get a rough estimate of alignment and camera position. Iterate 4 times.
• Next, align the medium-res maps to get a better estimate. Iterate 5 times.
• Finally, align the full-scale maps to get the final estimate for alignment and camera orientation. Iterate 10 times.
• Total of 19 iterations per frame – the most time-consuming step.
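A sketch of the coarse-to-fine schedule, assuming each pyramid level has already been converted to an N×3 vertex array and that `icp` is a single-level alignment routine like the earlier sketch. The pose found at the coarse level seeds the next finer level.

```python
# Sketch of the coarse-to-fine schedule on this slide: run ICP on the
# quarter-, half-, then full-resolution maps with 4, 5, and 10 iterations,
# carrying the pose estimate forward as the starting guess for the next level.
import numpy as np

ITERATIONS_PER_LEVEL = [(2, 4), (1, 5), (0, 10)]   # (pyramid level, iterations)

def coarse_to_fine_icp(new_pyramid, prev_pyramid, icp):
    R, t = np.eye(3), np.zeros(3)
    for level, iters in ITERATIONS_PER_LEVEL:
        src = new_pyramid[level] @ R.T + t          # apply current pose estimate
        R_step, t_step = icp(src, prev_pyramid[level], iterations=iters)
        R, t = R_step @ R, R_step @ t + t_step      # accumulate the refinement
    return R, t                                      # 4 + 5 + 10 = 19 iterations total
```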
Questions so far?
Representing the surface in memory
• Truncated Signed Distance Function (TSDF)
• Makes handheld scanning on personal computers feasible
• Faster than other high-accuracy methods
• Allows for continuous refinement of the model
• When scanning, a (usually cubic) virtual volume is defined. The real-world target object is reconstructed within this volume.
• The volume is subdivided into a grid of many smaller cubes, called voxels.
• Voxels are volumetric picture elements.
TSDF representation
Each voxel is assigned a distance to the surface
Negative is behind, positive is in front
Each distance is also assigned a weight (not shown)
Weights represent an estimate of accuracy for a voxel’s distance
For example: a surface facing the camera is likely more accurate than a surface at an angle, so those measurements are given a higher weight
Step 6: Calculating TSDF Values
A line is cast from each vertex toward the camera through the voxel grid.
Intersected voxels near the surface are updated.
The distance from the vertex to the voxel center becomes the distance value for that voxel.
A weight is assigned to the measurement based on the surface orientation.
The measurement weight and distance are used to update the voxel’s current value (a sketch of this update follows).
Repeat for the other intersected voxels.
Repeat for each vertex.
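A sketch of the per-voxel update as a truncated, weighted running average. The truncation distance and weight cap below are assumed values chosen for illustration, not the ones any particular implementation uses.

```python
# Sketch of a TSDF update for a single voxel: the new signed distance is
# truncated to [-TRUNC, +TRUNC], then blended into the stored value as a
# weighted running average, so the estimate is refined frame after frame.
TRUNCATION_M = 0.03   # truncation distance (assumed value)
MAX_WEIGHT = 128      # cap so old measurements can still be revised

def update_voxel(stored_dist, stored_weight, new_dist, new_weight):
    d = max(-TRUNCATION_M, min(TRUNCATION_M, new_dist))
    total = stored_weight + new_weight
    fused = (stored_dist * stored_weight + d * new_weight) / total
    return fused, min(total, MAX_WEIGHT)

# Example: a voxel previously at -0.010 m (weight 10) sees a new measurement
# of -0.004 m with weight 2; the fused distance moves slightly toward it.
print(update_voxel(-0.010, 10, -0.004, 2))
```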
TSDF Refinement
As the camera moves around and captures more vertices, the TSDF is continuously updated and refined.
Since TSDF voxels store distances to the surface rather than just marking which voxels lie on it, the surface is represented far more accurately than the raw voxel resolution would suggest.
Step 7: Raycasting
The TSDF Volume is raycasted and converted to an image.
The image gives the user feedback on how the scan is going and which areas still need to be refined.
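A sketch of marching a single ray through the TSDF grid and finding the zero crossing, where the stored signed distance flips from positive to negative, i.e. where the ray hits the reconstructed surface. The step size and maximum distance are illustrative values.

```python
# Sketch of raycasting a TSDF volume: step along one ray from the camera and
# report the depth at which the stored signed distance crosses zero.
import numpy as np

def raycast(tsdf, origin, direction, voxel_size, step=0.005, max_dist=3.0):
    direction = direction / np.linalg.norm(direction)
    prev_d, prev_t = None, 0.0
    t = 0.0
    while t < max_dist:
        p = origin + t * direction
        idx = tuple((p / voxel_size).astype(int))
        if all(0 <= i < n for i, n in zip(idx, tsdf.shape)):
            d = tsdf[idx]
            if prev_d is not None and prev_d > 0 >= d:
                # Interpolate between the last two samples for the exact crossing.
                return prev_t + step * prev_d / (prev_d - d)
            prev_d, prev_t = d, t
        t += step
    return None   # ray left the volume without hitting a surface
```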
Step 8: Extract points for next ICP
• The previous ICP alignment had a tiny amount of error
• If it were used as the reference for the next ICP round, errors would compound.
• Instead, in the final step of SLAM, we extract vertices and normals from the TSDF volume itself, so all ICP iterations have a common reference.
Do it again, 30x per sec
“Repeat as necessary”
Step 9: Export and Use the Data
Extract points from TSDF
Convert to mesh if desired
View
Measure
Print!
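As one possible export path, here is a short sketch that pulls the zero-level surface out of the TSDF volume as a triangle mesh with scikit-image's marching cubes and writes a plain OBJ file for viewing, measuring, or printing. The function name and output path are illustrative.

```python
# Sketch of Step 9: extract the zero-level surface from the TSDF volume with
# marching cubes and write a simple OBJ mesh file.
import numpy as np
from skimage import measure

def tsdf_to_obj(tsdf, voxel_size, path="scan.obj"):
    verts, faces, _, _ = measure.marching_cubes(tsdf, level=0.0,
                                                spacing=(voxel_size,) * 3)
    with open(path, "w") as f:
        for v in verts:
            f.write(f"v {v[0]} {v[1]} {v[2]}\n")
        for a, b, c in faces + 1:          # OBJ face indices are 1-based
            f.write(f"f {a} {b} {c}\n")
```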
Questions?