Creating Good Test Data
Oliver Zendel
AIT Austrian Institute of Technology
www.vitro-testing.com
ECCV 2016 Workshop on Datasets and Performance Analysis in Early Vision
Saturday, 2016-10-08 10h15, Oudemanhuispoort of the University of Amsterdam
What should be present in my test data?
Goal: Good test data includes:
Enough variation
Systematically organized
Low redundancy
Outline
Basics: Why is validation of CV hard?
CV-HAZOP: Tool for evaluation and planning of test data
Outlook: Relevant complementing topics
SW Quality Assurance in General
Verification
System meets specification
Implementation is correct
“Are we building the solution right?”
Validation
System fulfills its intended purpose
Algorithm is suitable/robust
“Are we building the right solution?”
Quality Assurance in CV (1)
Verification
Interested in code coverage
Formal methods with proofs -> can guarantee completeness
A rich tool set exists
Generic SW tools are applicable to CV implementations
Examples: clang static code analysis; Event-B
Quality Assurance in CV (2)
Validation
Interested in data coverage
Based on experience – no 100% guarantees
Very application/intent-specific; special considerations for CV
Amount of possible input data >> amount of possible test data
State of the Art
Example: Test data in stereo vision
Middlebury stereo database: Scharstein et al. [Sch2002], [Sch2014]
KITTI Vision Benchmark Suite: Geiger et al. [Gei2012], [Men2015]
Synthetic: Sintel [But2012]; VKITTI [Gai2016]; SYNTHIA [Ros2016]
Evaluation of dataset bias: Ponce et al. [Pon2006], Pinto et al. [Pin2008], Torralba and Efros [Tor2011]
Systematic high-level approach?
Equivalence Classes
Partitioning of input data
Cluster/segment input data into distinct classes
Represent each class by a finite number of test cases
Continuous parameters -> finite number of test cases (see the sketch below)
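To make the partitioning idea concrete, here is a small Python sketch (not from the talk) that splits one hypothetical continuous input parameter, e.g. a brightness value in [0, 1], into equal-width equivalence classes and draws a finite number of representatives per class; the class count and representative count are illustrative assumptions.

```python
# A minimal sketch of equivalence-class partitioning for one continuous
# input parameter (hypothetical brightness in [0, 1]). Boundaries and
# representative counts are illustrative assumptions, not from the talk.
import random

def make_classes(lo, hi, n_classes):
    """Split the range [lo, hi] into n_classes equal-width intervals."""
    width = (hi - lo) / n_classes
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_classes)]

def representatives(interval, k, rng):
    """Draw k representative test values from one equivalence class."""
    lo, hi = interval
    return [rng.uniform(lo, hi) for _ in range(k)]

rng = random.Random(0)
for cls in make_classes(0.0, 1.0, 5):      # 5 classes over brightness
    reps = representatives(cls, 2, rng)    # finite test cases per class
    print(f"class {cls}: representatives {reps}")
```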
Equivalent Test Images?
Mathematical definition for:
“All images showing trees”
“All images without a reindeer”
Not feasible!
Use Semantics
Test images classified using two semantics:
Domain aspects: open-world vs. closed-world
Vulnerability aspects: situations/relations known to cause problems for CV
Examples: low contrast, reflections, glare, shadows, image noise, occlusions, …
Mapping to the Goals
Good test data includes:
Enough variation -> enough equivalence classes
Systematically organized -> meaningful organization also for equivalence classes
Low redundancy -> right amount of representatives per class
Outline
Basics: Why is validation of CV hard?
CV-HAZOP: Tool for evaluation and planning of test data
Outlook: Relevant complementing topics
Our Approach: Checklist!
List all critical situations that decrease CV output quality
Just tick off entries for your test data
Coverage = ticked entries vs. all entries (see the sketch below)
Systematic approach: risk analysis
Analyze a complex system and its interactions
Chosen method: HAZOP (Hazard and Operability Study)
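A minimal sketch of how such checklist coverage could be computed, assuming each test image is tagged with the ids of the checklist entries it exercises; the entry ids and file names below are made up for illustration.

```python
# A minimal coverage sketch: checklist entries covered by the test data,
# divided by all entries. Entry ids and image tags are made-up examples.
checklist = {"LS-Intensity-More", "Medium-Particles-Faster", "Obj-Texture-Less"}

test_data_tags = {
    "img_0001.png": {"LS-Intensity-More"},
    "img_0002.png": {"Obj-Texture-Less", "LS-Intensity-More"},
}

covered = set().union(*test_data_tags.values())
coverage = len(covered & checklist) / len(checklist)
print(f"covered {len(covered & checklist)}/{len(checklist)} entries "
      f"({coverage:.0%}); missing: {checklist - covered}")
```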
The Generic CV-Model
One generic model
Identify subcomponents valid for all CV applications
Information flow as a flow diagram
HAZOP Ingredients
Locations: smallest components/parts of the model
Parameters: descriptive for each location
Guide Words: how parameters can deviate from the expected
Locations
Model parts -> locations
Recursion = own location “Objects”
Observer as two locations: “Opto-mechanics” and “Electronics”
Parameters: Characterize Locations
Example: Light Sources
Number
Position
Area
Spectrum
Texture
Intensity
Beam properties
Wave properties
Guide Words
[Table of the 17 guide words]
HAZOP Ingredients
7 Locations: subcomponents of the model
52 Parameters: descriptive for each location
E.g. for Light Sources: Number, Position, Area, Spectrum, Texture, Intensity, Beam properties, and Wave properties
17 Guide Words: how parameters can deviate from the expected
E.g. More, Less, Other Than, Faster, … (combinations sketched below)
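A small sketch of the underlying combinatorics; the parameter and guide-word lists below are shortened stand-ins for the full 52 x 17 catalogue.

```python
# A minimal sketch of enumerating HAZOP combinations; the lists are
# shortened stand-ins for the full 52 parameters and 17 guide words.
from itertools import product

parameters = ["Number", "Position", "Intensity"]     # per location
guide_words = ["No", "More", "Less", "Other Than"]

combinations = list(product(parameters, guide_words))
print(f"{len(parameters)} parameters x {len(guide_words)} guide words "
      f"= {len(combinations)} combinations")
for param, word in combinations[:3]:
    print(f"  examine: {word} / {param}")
```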
CV-HAZOP Execution
Experts assign meanings to each Parameter / Guide Word combination
Derive consequences and risks from each meaning
CV-HAZOP Example
Location: Light Sources x Parameter: Intensity x Guide Word: More
Meaning: A light source shines stronger than expected
Consequence: Too much light is in the scene
Risk: Overexposure
CV-HAZOP Example (2)
Location: Medium x Parameter: Particles x Guide Word: Faster
Meaning: Particles move faster than expected
Consequence: Motion blur of particles
Risk: Scene is severely occluded
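One possible way to represent such an entry in code, sketched below; the field names are assumptions, while the example values are taken from the first example slide above.

```python
# A minimal sketch of how one CV-HAZOP entry could be represented;
# field names are assumptions, values come from the slide example.
from dataclasses import dataclass

@dataclass
class HazopEntry:
    location: str     # model part, e.g. "Light Sources"
    parameter: str    # characteristic of the location
    guide_word: str   # deviation from the expected
    meaning: str      # expert interpretation of the combination
    consequence: str  # effect on the observed scene
    risk: str         # resulting hazard for the CV algorithm

entry = HazopEntry(
    location="Light Sources", parameter="Intensity", guide_word="More",
    meaning="A light source shines stronger than expected",
    consequence="Too much light is in the scene",
    risk="Overexposure",
)
print(f"[{entry.location} / {entry.parameter} / {entry.guide_word}] "
      f"-> {entry.risk}")
```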
HAZOP Combinations
52 Parameters x 17 Guide Words = 884 combinations
Valid Combinations
Results
Nine experts, one year -> 947 unique entries
See vitro-testing.com
Evaluation
Proof-of-concept: generic to specific
Entries in list -> algorithm output quality decrease
Applied to stereo vision test data sets:
Middlebury (Original) [Sch2002]
Middlebury 2014 [Sch2014]
KITTI [Gei2012]
Found any Risks? Where?
About 500 entries valid for the stereo vision task
Examples: Glare, No Texture, Mirroring, Interlens Reflection, Underexposure

Dataset   Image Pairs   Images w. Risks   Found Risks   Number of Annotations
MB 06     26            19                34            55
MB 14     23            17                57            80
KITTI     194           62                76            101
Evaluating our Approach
Test influence on output quality of:
Shape of risk region (shape)
Only position and area (box)
In comparison to two controls:
Random position, same area (rand)
The entire image (all)
[Figure: example annotation for No Texture]
Stereo Vision Evaluation
[Chart: percentage of erroneous disparity output per region type (shape, box, rand, all), 0%–100%]
Identified risks indeed increase test data difficulty
Areas identified by checklist result in higher error ratios (sketched below)
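A minimal sketch of this mask-based comparison, using random stand-ins for the per-pixel error map and the annotated region; only the error-ratio logic reflects the evaluation idea, everything else is placeholder data.

```python
# A minimal sketch: compare the error ratio inside an annotated risk
# region against a random control region of the same area. The error
# map and masks are random stand-ins, not real disparity results.
import numpy as np

rng = np.random.default_rng(0)
h, w = 100, 150
error = rng.random((h, w)) < 0.1          # stand-in per-pixel error flags
risk_mask = np.zeros((h, w), dtype=bool)
risk_mask[40:60, 50:90] = True            # stand-in annotated "shape" region

def error_ratio(err, mask):
    """Percentage of erroneous pixels inside the given region."""
    return err[mask].mean() * 100

# Control: random region with the same area as the annotation.
rand_mask = np.zeros(h * w, dtype=bool)
rand_mask[rng.choice(h * w, risk_mask.sum(), replace=False)] = True
rand_mask = rand_mask.reshape(h, w)

print(f"shape: {error_ratio(error, risk_mask):.1f}%  "
      f"rand: {error_ratio(error, rand_mask):.1f}%  "
      f"all: {error_ratio(error, np.ones((h, w), bool)):.1f}%")
```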
Multiple Algorithms and Datasets
[Chart: percentage of erroneous disparity output for SAD [Kon1998], CENSUS [Hum2010], SGBM [Hir2008], CVF [Rhe2011], PatchMatch [Ble2011], and ST-2 [Mei2013]]
Identified risks indeed increase test data difficulty
Areas identified by checklist result in higher error ratios
How do I apply this?
Start with the whole list
Visit vitro-testing.com
Filter out specific entries
Concretize entries
CV-HAZOP entries are generic (up to hazard level)
Interpret entries for your application (intent + domain)
Guide test data creation
Use the list to guide and categorize test data
Test-driven development: iterate!
Change focus / test data amount based on results from the evaluation
Example:
Concretize entry for stereo vision: Object / Less / Texture
Meaning: Object has less texture than expected
Consequence: Texture correlation quality is reduced
Hazard (for stereo vision): Images with large textureless surfaces on the same epipolar lines prevent correct correlation
Find in existing test data sets or create anew
Starting distribution: based on experience, e.g. 10 images per equivalence class (= hazard entry)
Evaluation: indeed problematic -> create/use more test data here (see the sketch below)
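A toy sketch of this test-driven iteration, with made-up hazard entries, error scores, and threshold: start from a uniform budget per hazard class, then shift the budget toward entries the evaluation flagged as problematic.

```python
# A minimal sketch of the test-driven iteration: uniform starting budget
# per hazard entry, then more data for problematic entries. The entries,
# error scores, and threshold are made-up placeholders.
budget = {"Obj/Less/Texture": 10, "LS/More/Intensity": 10,
          "Medium/Faster/Particles": 10}

# Pretend evaluation result: observed error ratio per hazard entry.
observed_error = {"Obj/Less/Texture": 0.45, "LS/More/Intensity": 0.12,
                  "Medium/Faster/Particles": 0.30}

threshold = 0.25
for hazard, err in observed_error.items():
    if err > threshold:          # indeed problematic:
        budget[hazard] += 5      # create/use more test data here
print(budget)
```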
Goals in Regard to Vulnerabilities
Good test data includes:
Enough variation -> checklist enforces many different scenarios
Systematically organized -> test cases are organized by checklist order/categories
Low redundancy -> process of applying domain/intent gives insights into priorities
Outline
Basics: Why is validation of CV hard?
CV-HAZOP: Tool for evaluation and planning of test data
Outlook: Relevant complementing topics
Ground Truth Quality
Label noise [Bow2001]
People interpret the same data differently
Robust statistics over many annotations
Use crowd-sourcing for mass annotation [Don2013]
Measurement errors distort GT
Add error bars
Stereo Ground Truth With Error Bars [Kon2015]:
LIDAR-based data transformed to disparity
Combination of multiple influences: 2D feature annotation, pose estimation, bundle adjustment, stereo camera calibration (intrinsic and extrinsic)
Synthetic vs. Real
Is artificial test data valid for real-world applications?
Rendering artifacts can create false alarms
Clean data can be too easy (e.g. missing sensor noise)
Realism must match the algorithm, not human perception
Computer graphics improve realism rapidly
Physically correct rendering is getting faster; at least for offline data generation it is feasible
Many benefits:
Perfect ground truth without label noise
Safe simulation of dangerous situations
Generation of specific scenes
Systematic sampling of parameters
Domain aspects
Building blocks of our scenery
Possible objects (expected and unexpected ones!)
Static scenery, background, clutter
Relations / rules of the environment
Physics
Behavior of actors
Interaction between actors
For more: Workshop on Quality Assurance in Computer Vision at ICTSS-2016, 2016-10-19, Graz, Austria
Outlook CV-HAZOP
Apply checklist to existing test data sets
Create a merged test data set spotlighting hazards
Create new test data for missing hazards
Create test data sets for parameter sweeping (see the sketch below)
Sample a specific parameter
Find the breaking point of an algorithm
Deep learning: training data vs. test data
Different ratio of corner cases vs. normality
Too many difficult cases will prevent learning
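A minimal sketch of such a parameter sweep; run_algorithm is a hypothetical stand-in whose quality score degrades as the swept parameter (here: a noise level) grows, and the quality floor is an assumed acceptance threshold.

```python
# A minimal parameter-sweep sketch to locate an algorithm's breaking
# point. run_algorithm and the quality floor are hypothetical stand-ins.
import numpy as np

def run_algorithm(noise_level):
    """Hypothetical quality score that degrades with increasing noise."""
    return max(0.0, 1.0 - 1.5 * noise_level)

quality_floor = 0.5                    # minimum acceptable output quality
for noise in np.linspace(0.0, 1.0, 21):
    score = run_algorithm(noise)
    if score < quality_floor:
        print(f"breaking point near noise level {noise:.2f} "
              f"(score {score:.2f})")
        break
```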
Conclusion
Validation for CV has unique needs
Usage of checklists increases the quality of new test data
CV-HAZOP is a useful tool to create checklists for robustness testing
Better test data -> better systems
Participate, access CV-HAZOP and data sets: www.vitro-testing.com
Contact:
References
[Bow2001] K. Bowyer, C. Kranenburg, and S. Dougherty. Edge detector evaluation using empirical ROC curves. Computer Vision and Image Understanding, 84, 2001.
[Ble2011] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch stereo – stereo matching with slanted support windows. In British Machine Vision Conference, 2011.
[But2012] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, 2012.
[Don2013] A. Donath and D. Kondermann. Is crowdsourcing for optical flow ground truth generation feasible? In Proceedings of the International Conference on Computer Vision Systems, 2013.
[Gai2016] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Computer Vision and Pattern Recognition, 2016.
[Gei2012] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition, 2012.
[Hir2008] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.
[Hum2010] M. Humenberger, C. Zinner, M. Weber, W. Kubinger, and M. Vincze. A fast stereo matching algorithm suitable for embedded real-time systems. Computer Vision and Image Understanding, 2010.
[Kon1998] K. Konolige. Small vision systems: Hardware and implementation. In Robotics Research. Springer, 1998.
[Kon2015] D. Kondermann, R. Nair, S. Meister, W. Mischler, B. Güssefeld, K. Honauer, S. Hofmann, C. Brenner, and B. Jähne. Stereo ground truth with error bars. In Asian Conference on Computer Vision, 2015.
[Mei2013] X. Mei, X. Sun, W. Dong, H. Wang, and X. Zhang. Segment-tree based cost aggregation for stereo matching. In Computer Vision and Pattern Recognition, pages 313–320, 2013.
[Men2015] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Computer Vision and Pattern Recognition, 2015.
[Pin2008] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1), 2008.
[Pon2006] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, et al. Dataset issues in object recognition. In Toward Category-Level Object Recognition, pages 29–48. Springer, 2006.
[Rhe2011] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. In Computer Vision and Pattern Recognition, pages 3017–3024, 2011.
[Ros2016] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA dataset. In Computer Vision and Pattern Recognition, 2016.
[Sch2002] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3):7–42, 2002.
[Sch2014] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition, pages 31–42. Springer, 2014.
[Tor2011] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition, pages 1521–1528, 2011.
Annotation
Per dataset and for each entry in the HAZOP list:
Look for the first image that includes the identified hazard
Annotation of the shape is saved together with the risk entry id
Images are randomly ordered (no bias from the starting point; see the sketch below)
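A tiny sketch of the shuffling step in this protocol, with placeholder file names and a placeholder hazard check standing in for the manual inspection.

```python
# A minimal sketch of the annotation protocol: shuffle the dataset once
# so that searching for the "first image with hazard X" is not biased by
# file order. File names and the hazard check are placeholders.
import random

images = [f"img_{i:04d}.png" for i in range(200)]
random.Random(42).shuffle(images)   # fixed seed keeps the order reproducible

def shows_hazard(image_path):       # placeholder for manual inspection
    return image_path.endswith("7.png")

first = next((img for img in images if shows_hazard(img)), None)
print("first image with hazard:", first)
```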
Thank You, Experts
For your contributions:
Lawitzky G., Wichert G., Feiten W. (Siemens Munich)
Köthe U. (HCI Heidelberg)
Fischer J. (Fraunhofer IPA)
Zinner C. (AIT)
Parameters (1) / (2)
[Backup slides: tables listing the 52 parameters per location]
The Main Question:
Which situations should be covered by the test data?
Domain aspects: elements and situations from the actual real world
Vulnerability aspects: situations and relations known to cause problems
Validation <-> Experience!
When have we tested enough to reach a conclusion?
Reduce redundancies
Don't use 1 million kilometers of road data showing the same stuff!