Unsolved Problems in Optical Flow and Stereo Estimation
Richard Szeliski
Microsoft Research
and
Daniel Scharstein
Middlebury College
This work was supported in part by NSF grants IIS-0413169 and IIS-0917109
Outline
Prior work: Middlebury benchmarks
Recent work: handling reflections
What are current challenges?
Future evaluation efforts?
Collaborators – Benchmarks
Steve Seitz, U Washington
Brian Curless, U Washington
James Diebel, Stanford
Simon Baker, Microsoft Research
Michael Black, Brown U
JP Lewis, Weta Digital Ltd
Stefan Roth, TU Darmstadt
Heiko Hirschmüller, DLR Germany
Chris Pal, U Rochester
Collaborators – Middlebury students
Anna Blasiak ’07
Padma Ugbabe ’03
Alexander Vandenberg-Rodes
Jiaxin (Lily) Fu ’03
Sarri Al-Nashashibi ’08
Gonzalo Alonso ’06
Jeff Wehrwein ’08
Brad Hiebert-Treuer ’07
Alan Lim ’09
Nera Nesic ’13
Xi Wang ’14
Computer Vision
Goal: Extract information from images (both 2D and 3D)
Hard problem:
Noisy data
Lots of it
Need additional assumptions
Our focus: image matching
Stereo vision
Multi-view stereo
Image motion / optical flow
Applications – Stereo
Video conferencing
Game control
Intelligent cars
Applications – Multiview stereo
3D reconstruction
3D printing
Applications – Optical flow
Video interpolation and compression
Vehicle and people tracking
Stereo vision
Infer 3D structure from 2 (or more) images of a scene
Seems easy for humans…
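For a rectified two-camera setup, the link between matching and 3D structure is the standard triangulation relation Z = f · B / d: depth is inversely proportional to disparity. A minimal sketch, assuming a rectified pair; the function name and the camera values below are hypothetical and not taken from the talk:

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (depth in metres)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 1000-pixel focal length, 0.1 m baseline.
near = depth_from_disparity(50.0, 1000.0, 0.1)  # large disparity, close point
far = depth_from_disparity(5.0, 1000.0, 0.1)    # small disparity, distant point
```

Because depth varies as 1/d, a fixed matching error in pixels costs more depth accuracy for distant points than for near ones.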
Why is matching hard?
Untextured areas
Noisy data / aliasing
Depth discontinuities
Occlusions
Reflections / specularities
Different camera responses
Imperfect calibration
…
Datasets with ground truth
Ground truth = true answer (e.g. true disparities)
GT needed for quantitative analysis of algorithms (benchmarks)
Middlebury benchmarks:
http://vision.middlebury.edu/
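One common way to use ground-truth disparities quantitatively is the percentage of "bad" pixels, i.e. pixels whose disparity error exceeds a threshold. The sketch below is illustrative only: the function name, the use of None for pixels without ground truth, and the toy values are assumptions, not the evaluator's actual code.

```python
def bad_pixel_percentage(computed, ground_truth, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds threshold.

    Both inputs are 2D lists of the same shape. Pixels whose ground-truth
    value is None (unknown) are skipped.
    """
    bad = 0
    valid = 0
    for row_c, row_gt in zip(computed, ground_truth):
        for d_c, d_gt in zip(row_c, row_gt):
            if d_gt is None:
                continue
            valid += 1
            if abs(d_c - d_gt) > threshold:
                bad += 1
    if valid == 0:
        raise ValueError("no pixels with known ground truth")
    return 100.0 * bad / valid


# Toy 2x3 example (made-up values): one pixel has no ground truth.
gt = [[10.0, 10.0, 20.0], [10.0, None, 20.0]]
est = [[10.5, 12.0, 20.0], [10.0, 15.0, 17.0]]
score = bad_pixel_percentage(est, gt)  # 2 of 5 valid pixels are off by more than 1
```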
1. Middlebury Stereo Page
(Scharstein & Szeliski – CVPR 2001, IJCV 2002)
vision.middlebury.edu/stereo
Evaluator with web interface
v.1 by Lily Fu ’03, v.2 by Anna Blasiak ’07
Left views
GT disps
Currently 135 entries
2. Multiview Stereo Evaluation
(Seitz, Curless, Diebel, Scharstein, Szeliski – CVPR 2006)
vision.middlebury.edu/mview
Create 3D model from 100s of views
One view
GT
Surface mesh
Currently 58 entries
3. Optical Flow Evaluation
(Baker, Scharstein, Lewis, Roth, Black, Szeliski – ICCV 2007)
vision.middlebury.edu/flow
Input: video sequence
Output: flow vectors
Where do pixels move from frame to frame?
Currently 75 entries
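With true motion vectors available, a flow estimate can be scored by its average endpoint error: the mean Euclidean distance between estimated and true flow vectors. A minimal sketch, assuming flow fields stored as 2D lists of (u, v) tuples; the function name and the toy values are hypothetical:

```python
import math


def average_endpoint_error(flow, gt_flow):
    """Mean Euclidean distance between estimated and true flow vectors."""
    total = 0.0
    count = 0
    for row, row_gt in zip(flow, gt_flow):
        for (u, v), (u_gt, v_gt) in zip(row, row_gt):
            total += math.hypot(u - u_gt, v - v_gt)
            count += 1
    if count == 0:
        raise ValueError("empty flow field")
    return total / count


# Toy 1x2 flow field (made-up values): first vector exact, second off by (3, 4).
estimate = [[(1.0, 0.0), (3.0, 4.0)]]
truth = [[(1.0, 0.0), (0.0, 0.0)]]
aee = average_endpoint_error(estimate, truth)
```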
How to get ground truth?
1. Stereo – true disparities
2. Multiview stereo – true surface mesh
3. Optical flow – true motion vectors
Setup 2005 / 2006
7 views
3 ambient light setups
3 exposures
2005: 9 datasets
2006: 21 datasets
see vision.middlebury.edu/stereo/data
Version 3 – soon?
Current work: new datasets
Specular surfaces
Point-and-shoot cameras
Possibly outdoor scenes
“Space-time stereo” techniques
Stereo video?
Unpublished datasets
Work in progress on specular scenes
Spray-paint the motorcycle after the color photos are acquired, to enable ranging with active lighting
Mobile acquisition system, 2012
DSLR Cameras
Point & Shoot Cameras
Projector
Laptop for Processing
Motorcycle Scene - Original
Motorcycle Scene - Painting
Motorcycle Scene - Painted
Motorcycle Disparity Map
Motorcycle Scene - Original
What can we do about specular scenes?
A1: treat reflections as separate layers
Image-Based Rendering for Scenes with Reflections
Sudipta N. Sinha
Johannes Kopf
Michael Goesele
Daniel Scharstein
Richard Szeliski
2. Multiview stereo: range data
Use laser scanner
Merge 100s of scans
Fill holes
Align with image data
Version 2 – current work
Version 2 – soon?
Have high-quality CT scans
Need better reference views
Need highly accurate camera locations
Include objects from industrial setting
Collaborate with NIST
3. Optical flow: Hidden texture
Can’t use structured light (objects move)
Idea: make pixels “trackable” with
High resolution (downsample by 6)
Hidden fluorescent texture
Very slow motion
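One reading of "downsample by 6": motion is tracked at full resolution and then reduced to the benchmark resolution, so a displacement of d high-resolution pixels becomes d/6 low-resolution pixels. The sketch below assumes simple box averaging over non-overlapping 6×6 blocks; the actual filter and pipeline are not described here, and both function names are hypothetical.

```python
def box_downsample(image, factor=6):
    """Average non-overlapping factor x factor blocks of a 2D list."""
    height = len(image) // factor
    width = len(image[0]) // factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [
                image[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def downsample_flow_component(component, factor=6):
    """Block-average one flow component, then rescale to low-res pixel units."""
    return [[v / factor for v in row] for row in box_downsample(component, factor)]


# A uniform horizontal motion of 3 high-res pixels over a 6x12 region
# becomes 0.5 low-res pixels over a 1x2 region.
u_high = [[3.0] * 12 for _ in range(6)]
u_low = downsample_flow_component(u_high)
```

Errors in the high-resolution tracking shrink by the same factor, which is what makes the downsampled vectors usable as sub-pixel ground truth under this assumption.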
Value of benchmarks
Enables quantitative comparison
Summarizes state of the art
Stimulates new research
Challenging data “pushes envelope”
Pitfalls
Overfitting to test data
Focus on ranking
Deemphasizes aspects not evaluated
“Rest” after initial “push”
Solutions
Provide separate training data
Provide diverse datasets
Avoid single ranking
Update benchmarks periodically
Other uses of GT data
Algorithm design
Evaluate algorithm components
Robust data term
Smoothness priors
Machine learning
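The "robust data term" and "smoothness priors" items refer to the two parts of a typical matching energy: a per-pixel matching cost plus a penalty on disparity changes between neighbors. The sketch below is a generic one-scanline formulation with truncated absolute differences, not the specific cost functions or models evaluated in the papers cited here; all names and parameter values are assumptions.

```python
def scanline_energy(left, right, disparities, smoothness_weight=1.0,
                    data_truncation=20.0, smooth_truncation=2.0):
    """Energy of an integer disparity assignment along one rectified scanline.

    Data term: truncated absolute intensity difference between left[x] and
    right[x - d]. Smoothness term: truncated absolute disparity difference
    between horizontal neighbours.
    """
    data = 0.0
    for x, d in enumerate(disparities):
        xr = x - d
        if 0 <= xr < len(right):
            cost = abs(left[x] - right[xr])
        else:
            cost = data_truncation  # no corresponding pixel in the right image
        data += min(cost, data_truncation)
    smooth = 0.0
    for d0, d1 in zip(disparities, disparities[1:]):
        smooth += min(abs(d0 - d1), smooth_truncation)
    return data + smoothness_weight * smooth


# Toy scanline (made-up intensities) whose true disparity is 1 everywhere.
left = [10, 20, 30, 40, 50]
right = [20, 30, 40, 50, 60]
e_true = scanline_energy(left, right, [1, 1, 1, 1, 1])
e_zero = scanline_energy(left, right, [0, 0, 0, 0, 0])
e_noisy = scanline_energy(left, right, [1, 1, 0, 1, 1])
```

With ground-truth disparities available, one can check whether the true assignment actually scores lower than wrong ones, which is one way to evaluate individual components such as the data term or the prior.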
Evaluation of Cost Functions for Stereo Matching (Hirschmüller & Scharstein, CVPR 2007, PAMI 2009)
Learning Conditional Random Fields for Stereo (Scharstein & Pal, CVPR 2007; Pal et al. IJCV 2010)
Moebius – trained on other 5
Moebius – trained on self
Why is matching hard?
Untextured areas
Noisy data / aliasing
Depth discontinuities
Occlusions
Reflections / specularities
Different camera responses
Imperfect calibration
… what about higher-level semantics?
Semantic scene reconstruction
Conclusion
Benchmarks are important, stimulate research
Creating ground-truth data is challenging, fun
Rolling benchmarks
Code archival: source, binaries, and Web services (Web Vision Workshop)