3D Scanning Technology Overview: Kinect Reconstruction Algorithms Explained
DESCRIPTION
Primesense depth cameras are the new standard in 3D scanning technology. Since the debut of the Microsoft Kinect, which uses Primesense's infrared LightCoding structured-light technology, the sensors have been mass-produced and sold at a much lower price. In this slide deck, we describe the basics of Primesense-based 3D scanning technology from a physical and computational viewpoint.
TRANSCRIPT
3D Scanning Technology
Milwaukee 3D Printing Meetup & Voxel Metric
Agenda
• Demo
• Overview of scanning technologies
• Primesense sensors in-“depth”
• Freehand camera reconstruction core algorithms
• 3D freehand reconstruction algorithm, step-by-step
Demo
Overview of scanning technologies
Contact
• Precise
• Slow
• Touch not always practical
Line Laser
• Line appears distorted from the camera’s point of view
• Geometry inferred from distortion
Stereoscopic
• Two offset cameras, like human vision
• Computationally expensive
Time-of-Flight
• Single-point: Not a 3D scanner by itself, but underlies other technology
• LIDAR: Like single-point, but uses mirrors to rapidly take many measurements across the scene
• ToF camera: Modulated light and phase detection (see the sketch below)
• Gated/shuttered cameras
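For reference, the modulated-light principle can be summarized in a few lines: the sensor measures the phase shift between the emitted and returned light and converts it to distance. This is only a minimal sketch of the idea; the modulation frequency and phase value in the example are illustrative, and real sensors must also handle phase wrapping.

```python
# Minimal sketch of modulated-light time-of-flight: distance implied by the
# phase shift between emitted and received light (phase wrapping ignored).
import math

C = 299_792_458.0  # speed of light, m/s

def tof_distance(phase_shift_rad: float, modulation_hz: float) -> float:
    """Distance implied by a measured phase shift at a given modulation frequency."""
    return C * phase_shift_rad / (4.0 * math.pi * modulation_hz)

# Example: a 30 MHz modulated source with a pi/2 phase shift implies ~1.25 m.
print(tof_distance(math.pi / 2, 30e6))
```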
More scanning technologies
Photogrammetry
• Like stereoscopic, but with many “eyes” in undefined positions
• Images are stitched together and depth is inferred
• Doesn’t work well with concave surfaces
Volumetric
• Many see-through images
• Complex reconstruction algorithms
Structured Light
• Like a laser scanner on steroids
• Patterns projected onto the object
• A camera positioned at an offset captures images of the projection
• Distortions in the captured pattern are used to infer geometry
Primesense’s LightCoding™ Structured Light Scanners
IR pattern projector
• IR laser light reflects off a hologram to paint the pattern onto the scene
IR pattern
• Does not change over time
• Looks random, but there are a few markers
Pattern grid
• The pattern is repeated in a 3 × 3 grid
Primesense LightCoding™
The dot pattern, or information about it, is hardcoded on the Primesense chip.
The sensor’s IR camera takes video of the pattern as it is projected on the scene.
Objects in the scene distort the way the pattern looks to the camera. Depth is inferred from these distortions.
How exactly is depth determined?
• It’s proprietary!
• But that won’t stop people from guessing.
• There are at least two likely methods…
Pattern shifting with distance
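One common way to model the pattern-shift idea is as triangulation between the IR projector and the IR camera: each dot shifts horizontally (a disparity) relative to a stored reference pattern as the surface moves, and depth follows from depth = focal_length × baseline / disparity. This is a hedged sketch of that model only; the focal length and baseline values below are placeholders, not Primesense's actual calibration.

```python
# Hedged sketch of the pattern-shift hypothesis: treat the IR projector and
# IR camera as a stereo pair and triangulate depth from the horizontal shift
# (disparity) of each dot relative to a stored reference pattern.
# The focal length and baseline below are placeholders, not real calibration.

FOCAL_LENGTH_PX = 580.0   # IR camera focal length in pixels (assumed)
BASELINE_M = 0.075        # projector-to-camera baseline in meters (assumed)

def depth_from_disparity(disparity_px: float) -> float:
    """Depth in meters for a dot shifted `disparity_px` pixels from its reference position."""
    if disparity_px <= 0:
        raise ValueError("dot must shift toward the projector side")
    return FOCAL_LENGTH_PX * BASELINE_M / disparity_px

# A dot shifted 29 pixels would imply a surface roughly 1.5 m away.
print(round(depth_from_disparity(29.0), 2))
```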
Astigmatic Optics
• The camera has a lens with different focal lengths in the X and Y directions
Dots become blurry as the target surface moves away from the camera’s focal point, but in a specific way.
They may change shape or apparent orientation with distance.
What is computed on the sensor chip?
• The depth map is computed within the sensor; the host computer needs only to read the depth data.
• This reduces the required computation for the host device.
• It also probably helps to keep the Primesense algorithms secret.
• Skeletal tracking is run on the host computer.
• The algorithms were created through machine learning.
• Designed to be fast, for minimal stress on the Xbox 360.
• Many, many processor hours were spent “learning” these algorithms.
• The hard work is done ahead of time; the game only needs to run the depth data through the process and get skeletal data out.
What is sent from sensor to host?
IR Stream
• What the chip “sees” to produce the depth image
• Not often used by the host device
Color Stream
• Not used for computing depth
• Can be “registered” with the depth image for correspondence between depth and RGB pixels
• Used to produce XYZRGB point data
Depth Stream
• Result of computation on the IR camera video
• Like a normal image/video stream, except pixel intensity represents distance from the sensor, not color and brightness
Reconstruction algorithm used with Primesense cameras
How we get from a series of depth images to representations of real-world objects.
Examples: Kinect Fusion, KinFu, ReconstructMe, Skanect, Digifii
SLAM
• S.L.A.M. – Simultaneous Localization And Mapping
• Track where the camera is located while building a model of the scene at the same time
• Done simultaneously since there is much overlap in the types of information that are calculated
ICP – Iterative Closest Point
• The core of camera localization and point cloud alignment
• Algorithm summary (a simplified sketch follows this list):
• As the camera moves, it sees a different perspective of the scene at every frame, but there is some overlap.
• ICP repeatedly rotates and translates what the camera sees this frame until it finds the best overlap with what the camera saw in the last frame.
• When the best-matching rotation and translation are found, we not only know how to stitch the frames together, but also know how the camera moved between frames.
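For illustration, here is a minimal point-to-point ICP sketch in NumPy. The real pipeline typically uses a point-to-plane error with projective data association on the GPU; the function names and the brute-force nearest-neighbor search here are just illustrative.

```python
# Simplified point-to-point ICP sketch (NumPy). It repeatedly matches each new
# point to the closest point in the previous frame, then solves for the rigid
# rotation/translation that best aligns the matches.
import numpy as np

def best_fit_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(new_frame, prev_frame, iterations=10):
    """Align new_frame (Nx3) to prev_frame (Mx3); returns the cumulative R, t."""
    R_total, t_total = np.eye(3), np.zeros(3)
    src = new_frame.copy()
    for _ in range(iterations):
        # Data association: closest point in the previous frame for each point.
        d = np.linalg.norm(src[:, None, :] - prev_frame[None, :, :], axis=2)
        matches = prev_frame[d.argmin(axis=1)]
        R, t = best_fit_transform(src, matches)
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total  # also tells us how the camera moved between frames
```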
SLAM on GPU
• The full SLAM process should occur 30 times per second, at the rate depth frames are sent from the camera.
• If the process is too slow, frames must be skipped.
• Skipped frames make it harder for ICP to be successful.
• To have the algorithm run as fast as possible, the problem is broken up into chunks and run in parallel on a standard graphics card.
• Graphics cards contain thousands of processor units and excel at tasks which can be parallelized.
Reconstruction, step-by-step
Step 1: Receive a depth frame
The software receives a depth frame from the camera.
Raw depth data from the camera is often noisy.
Step 2: Bilateral filtering
Bilateral filtering removes noise from the image, while maintaining sharp transitions between pixels.
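A toy NumPy version of the filter, to show the idea: each output pixel is a weighted average of its neighbors, with weights that fall off with both spatial distance and depth difference, so noise is smoothed but sharp depth edges survive. A real implementation would use an optimized routine such as OpenCV's cv2.bilateralFilter; the parameter values here are illustrative.

```python
# Toy bilateral filter for a depth image: weights combine a spatial Gaussian
# with a depth-difference ("range") Gaussian, preserving sharp transitions.
import numpy as np

def bilateral_filter(depth, radius=3, sigma_space=2.0, sigma_depth=30.0):
    depth = depth.astype(np.float32)
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial_w = np.exp(-(xs**2 + ys**2) / (2 * sigma_space**2))
    padded = np.pad(depth, radius, mode="edge")
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            range_w = np.exp(-(window - depth[y, x])**2 / (2 * sigma_depth**2))
            weights = spatial_w * range_w
            out[y, x] = (weights * window).sum() / weights.sum()
    return out
```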
Step 3: Downsampling
The full size depth image is scaled down to half- and quarter-sized copies.
The copies are used to do 3 levels of ICP alignment.
The small image enables quick rough alignment.
As we repeat ICP with the more detailed images, the alignment is refined.
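A rough sketch of building that pyramid by 2 × 2 block averaging of valid (nonzero) depth pixels. This is a simplification; real implementations are more careful about not averaging across depth edges.

```python
# Sketch of the 3-level depth pyramid: full resolution plus half- and
# quarter-size copies, built by averaging each 2x2 block of valid pixels.
import numpy as np

def downsample_half(depth):
    h, w = depth.shape
    blocks = depth[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    valid = (blocks > 0).sum(axis=(1, 3))
    summed = blocks.sum(axis=(1, 3))
    return np.where(valid > 0, summed / np.maximum(valid, 1), 0)

def build_pyramid(depth):
    half = downsample_half(depth)
    quarter = downsample_half(half)
    return [depth, half, quarter]   # levels 0 (full), 1 (half), 2 (quarter)
```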
Step 4: Map the 3 depth frames
• Images of depth pixels are not easy to work with mathematically
• They visualize depth in parts of an image, but not explicit geometric coordinates
• To run ICP, we need to rotate and translate the scene
• Before ICP, convert the full and resized images from a pixel-based to a 3D coordinate-based representation
• Result: a vertex map and a normal map for each of the 3 depth images – a pair of maps at 3 different levels of accuracy
Step 4: Map the 3 depth frames
Vertex Map (point cloud) Normal Map
Step 4: Map the 3 depth frames (a sketch of both mappings follows this list)
• Vertex mapping
• Pixel brightness = distance
• Pixel position in image -> X/Y
• Also known: camera field-of-view
• With the angles and the distance, do trigonometry to get a 3D coordinate for each pixel
• Repeat for every pixel
• Do in parallel on GPU
• Normal mapping
• Find the orientation of the surface for each vertex just mapped
• Useful for ICP
• Look at the closest neighbors
• Implementations vary, but a common way is to compute a vector cross product between a vertex and two neighbors.
• Repeat for each vertex.
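A compact sketch of both mappings, assuming a pinhole camera model: back-project each depth pixel into a 3D vertex using the camera intrinsics, then estimate a normal from the cross product of vectors to neighboring vertices. The focal length and principal point values are placeholders, not the sensor's real calibration.

```python
# Sketch of Step 4: depth image -> vertex map (pinhole back-projection) and
# vertex map -> normal map (cross product of neighbor vectors).
import numpy as np

FX = FY = 570.0          # focal lengths in pixels (assumed)
CX, CY = 320.0, 240.0    # principal point for a 640x480 image (assumed)

def vertex_map(depth_m):
    h, w = depth_m.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    X = (xs - CX) / FX * depth_m
    Y = (ys - CY) / FY * depth_m
    return np.dstack([X, Y, depth_m])          # (h, w, 3) vertices in meters

def normal_map(verts):
    # Vectors to the right and downward neighbors; their cross product points
    # along the surface normal. Border pixels are left as zero.
    n = np.zeros_like(verts)
    dx = verts[1:-1, 2:] - verts[1:-1, 1:-1]
    dy = verts[2:, 1:-1] - verts[1:-1, 1:-1]
    cross = np.cross(dx, dy)
    norm = np.linalg.norm(cross, axis=2, keepdims=True)
    n[1:-1, 1:-1] = cross / np.maximum(norm, 1e-9)
    return n
```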
Step 5: Do ICP (a sketch of this schedule follows the list)
• Vertex and normal maps are used together, so each vertex has a position and an orientation
• First, align the low-res maps to get a rough estimate of alignment and camera position. Iterate 4 times.
• Next, align the medium-res maps to get a better estimate. Iterate 5 times.
• Finally, align the full-scale maps to get the final estimate for alignment and camera orientation. Iterate 10 times.
• Total of 19 iterations per frame – the most time-consuming step.
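A sketch of the coarse-to-fine schedule, assuming each pyramid level has already been converted to an N×3 vertex array and that `icp` is a single-level alignment routine like the earlier sketch. The pose found at the coarse level seeds the next finer level.

```python
# Sketch of the coarse-to-fine schedule on this slide: run ICP on the
# quarter-, half-, then full-resolution maps with 4, 5, and 10 iterations,
# carrying the pose estimate forward as the starting guess for the next level.
import numpy as np

ITERATIONS_PER_LEVEL = [(2, 4), (1, 5), (0, 10)]   # (pyramid level, iterations)

def coarse_to_fine_icp(new_pyramid, prev_pyramid, icp):
    R, t = np.eye(3), np.zeros(3)
    for level, iters in ITERATIONS_PER_LEVEL:
        src = new_pyramid[level] @ R.T + t          # apply current pose estimate
        R_step, t_step = icp(src, prev_pyramid[level], iterations=iters)
        R, t = R_step @ R, R_step @ t + t_step      # accumulate the refinement
    return R, t                                      # 4 + 5 + 10 = 19 iterations total
```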
Questions so far?
Representing the surface in memory
• Truncated Signed Distance Function (TSDF)
• Makes handheld scanning on personal computers feasible
• Faster than other high-accuracy methods
• Allows for continuous refinement of the model
• When scanning, a (usually cubic) virtual volume is defined. The real-world target object is reconstructed within this volume.
• The volume is subdivided into a grid of many smaller cubes, called voxels.
• Voxels are volumetric picture elements.
TSDF representation
Each voxel is assigned a distance to the surface
Negative is behind, positive is in front
Each distance is also assigned a weight (not shown)
Weights represent an estimate of accuracy for a voxel’s distance
For example: a surface facing the camera is likely more accurate than a surface at an angle, so those measurements are given a higher weight
Step 6: Calculating TSDF Values
A line is cast from each vertex toward the camera through the voxel grid.
Intersected voxels near the surface are updated.
The distance from the vertex to the voxel center becomes the distance value for that voxel.
A weight is assigned to the measurement based on the surface orientation.
The measurement weight and distance are used to update the voxel’s current value (a sketch of this update follows).
Repeat for the other intersected voxels.
Repeat for each vertex.
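A sketch of the per-voxel update as a truncated, weighted running average. The truncation distance and weight cap below are assumed values chosen for illustration, not the ones any particular implementation uses.

```python
# Sketch of a TSDF update for a single voxel: the new signed distance is
# truncated to [-TRUNC, +TRUNC], then blended into the stored value as a
# weighted running average, so the estimate is refined frame after frame.
TRUNCATION_M = 0.03   # truncation distance (assumed value)
MAX_WEIGHT = 128      # cap so old measurements can still be revised

def update_voxel(stored_dist, stored_weight, new_dist, new_weight):
    d = max(-TRUNCATION_M, min(TRUNCATION_M, new_dist))
    total = stored_weight + new_weight
    fused = (stored_dist * stored_weight + d * new_weight) / total
    return fused, min(total, MAX_WEIGHT)

# Example: a voxel previously at -0.010 m (weight 10) sees a new measurement
# of -0.004 m with weight 2; the fused distance moves slightly toward it.
print(update_voxel(-0.010, 10, -0.004, 2))
```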
TSDF Refinement
As the camera moves around and captures more vertices, the TSDF is continuously updated and refined.
Since TSDF voxels store distances to the surface rather than just marking which voxels lie on it, the surface is represented far more accurately than the raw voxel resolution would suggest.
Step 7: Raycasting
The TSDF Volume is raycasted and converted to an image.
The image gives the user feedback on how the scan is going and which areas still need to be refined.
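A sketch of marching a single ray through the TSDF grid and finding the zero crossing, where the stored signed distance flips from positive to negative, i.e. where the ray hits the reconstructed surface. The step size and maximum distance are illustrative values.

```python
# Sketch of raycasting a TSDF volume: step along one ray from the camera and
# report the depth at which the stored signed distance crosses zero.
import numpy as np

def raycast(tsdf, origin, direction, voxel_size, step=0.005, max_dist=3.0):
    direction = direction / np.linalg.norm(direction)
    prev_d, prev_t = None, 0.0
    t = 0.0
    while t < max_dist:
        p = origin + t * direction
        idx = tuple((p / voxel_size).astype(int))
        if all(0 <= i < n for i, n in zip(idx, tsdf.shape)):
            d = tsdf[idx]
            if prev_d is not None and prev_d > 0 >= d:
                # Interpolate between the last two samples for the exact crossing.
                return prev_t + step * prev_d / (prev_d - d)
            prev_d, prev_t = d, t
        t += step
    return None   # ray left the volume without hitting a surface
```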
Step 8: Extract points for next ICP
• The previous ICP alignment had a tiny amount of error
• If it were used as the reference for the next ICP round, errors would compound.
• Instead, in the final step of SLAM, we extract vertices and normals from the TSDF volume itself, so all ICP iterations have a common reference.
Do it again, 30x per sec
“Repeat as necessary”
Step 9: Export and Use the Data
Extract points from TSDF
Convert to mesh if desired
View
Measure
Print!
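As one possible export path, here is a short sketch that pulls the zero-level surface out of the TSDF volume as a triangle mesh with scikit-image's marching cubes and writes a plain OBJ file for viewing, measuring, or printing. The function name and output path are illustrative.

```python
# Sketch of Step 9: extract the zero-level surface from the TSDF volume with
# marching cubes and write a simple OBJ mesh file.
import numpy as np
from skimage import measure

def tsdf_to_obj(tsdf, voxel_size, path="scan.obj"):
    verts, faces, _, _ = measure.marching_cubes(tsdf, level=0.0,
                                                spacing=(voxel_size,) * 3)
    with open(path, "w") as f:
        for v in verts:
            f.write(f"v {v[0]} {v[1]} {v[2]}\n")
        for a, b, c in faces + 1:          # OBJ face indices are 1-based
            f.write(f"f {a} {b} {c}\n")
```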
Questions?