TRANSCRIPT
Slide 1
Visual Search of Dance Archives
John Collomosse, [email protected]
Centre for Vision, Speech and Signal Processing, University of Surrey
Oxford Robotics Group. May 2013.
Slide 2
Motivation: Add value to curated Dance collections
Dance archives are currently searchable by text (curated metadata).
What if you want to search on the content e.g. choreography itself?
Metadata
Slide 3
Research Landscape: Visual Search of Dance
- iWeave: 3D Costume Archive
- Pose driven visual search
- Sketch driven Choreography
Slide 4
Research Landscape: Visual Search of Dance
- iWeave: 3D Costume Archive
- Sketch driven Choreography
- Pose driven visual search
Visual sentences for pose retrieval over low-resolution cross-media dance collections. R. Ren and J. Collomosse. IEEE Trans. Multimedia 14(6). Dec 2012.
Slide 5
UK-NRCD Archival Dance Footage
Digital Dance Archive (DDA) spanning ~100 years of UK dance history. Videos transferred between several analogue formats prior to digitisation.
Challenges in the footage:
- Grainy/noisy
- Contrast bleaching
- Blur / poor definition
- Small performer, e.g. 100 px
- Illumination artifacts
- Featureless background
- Inter- & intra-occlusion
http://www.dance-archives.ac.uk
Slide 6
Characterizing HPE on Archival Footage
Explicit Human Pose Estimation (HPE) fails on typical NRCD archival footage.
NRCD Footage
Eichner et al. [CVPR’09]
Andriluka et al. [CVPR’09]
Slide 7
Contributions
- Cross-media pose retrieval on archival data
- Match pose implicitly rather than explicitly
- New representation, "Visual Sentences", using self-similarity (SSIM) and LDA
- Built into a Bag of Words framework, with tweaks e.g. stop-word removal
- Fusing Vision and Information Retrieval concepts: diversity re-ranking
Contact Sheets (Photos)
Performance videos
Slide 8
Performer Detection
- Dalal/Triggs-like pedestrian detection [CVPR 2005]
- Trained across six videos (~5 hrs): 5k positive annotations; 5k negatives sampled randomly outside BBs
- Horizontal poses included, but rotated (twice)
- Output BBs rescaled to 64x128 for retrieval
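The core of a Dalal/Triggs-style detector is a gradient-orientation histogram pooled over small cells across the 64x128 window. A minimal pure-Python sketch of that building block (illustrative only; the real detector adds block normalisation and a linear SVM classifier, both omitted here):

```python
import math

def cell_hog(gray, x0, y0, cell=8, bins=9):
    """Orientation histogram for one cell (unsigned gradients, 0-180 deg),
    the building block of a Dalal/Triggs-style descriptor."""
    hist = [0.0] * bins
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, len(gray[0]) - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, len(gray) - 1)][x] - gray[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / (180.0 / bins)) % bins] += mag
    return hist

def window_descriptor(gray, cell=8, bins=9):
    """Concatenate cell histograms over a detection window, e.g. the
    64x128 size the talk rescales detections to."""
    h, w = len(gray), len(gray[0])
    desc = []
    for y0 in range(0, h - cell + 1, cell):
        for x0 in range(0, w - cell + 1, cell):
            desc.extend(cell_hog(gray, x0, y0, cell, bins))
    return desc
```

For a 64x128 window with 8-pixel cells and 9 bins this yields 8 x 16 x 9 = 1152 values per window, which the SVM would then score.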
Slide 9
Visual Sentence Representation
Based on the Self-similarity (SSIM) descriptor*:
1) Compute a correlation surface Sq local to (x,y) using SSD.
2) Bin Sq into a log-polar representation using local maxima.
3) Discard invalid (very low/high variance) features.
* “Matching Local Self-similarities across Images and Video”. E. Shechtman and M. Irani. CVPR 2007.
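Steps 1 and 2 above can be sketched in pure Python as follows (parameter values are illustrative, and the step-3 variance-based rejection is omitted):

```python
import math

def ssim_descriptor(img, x, y, patch=2, region=10,
                    radial_bins=3, angular_bins=8, var_noise=100.0):
    """Sketch of a Shechtman/Irani-style self-similarity descriptor:
    1) SSD of the centre patch against every patch in a surrounding region,
       mapped to a correlation surface Sq = exp(-SSD / var_noise);
    2) Sq binned into a log-polar grid, keeping the maximum per bin."""
    def ssd(cx, cy):
        s = 0.0
        for dy in range(-patch, patch + 1):
            for dx in range(-patch, patch + 1):
                s += (img[y + dy][x + dx] - img[cy + dy][cx + dx]) ** 2
        return s

    bins = [0.0] * (radial_bins * angular_bins)
    for cy in range(y - region, y + region + 1):
        for cx in range(x - region, x + region + 1):
            if cx == x and cy == y:
                continue
            corr = math.exp(-ssd(cx, cy) / var_noise)
            r = math.hypot(cx - x, cy - y)
            rb = min(int(math.log1p(r) / math.log1p(region) * radial_bins),
                     radial_bins - 1)
            ab = int((math.atan2(cy - y, cx - x) % (2 * math.pi))
                     / (2 * math.pi) * angular_bins) % angular_bins
            i = rb * angular_bins + ab
            bins[i] = max(bins[i], corr)  # keep the local maximum per bin
    return bins
```

On a perfectly uniform image every correlation is 1, so the descriptor saturates; on real footage the pattern of the log-polar bins captures local structure largely independently of appearance, which is why it transfers across media.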
Slide 10
SSIM for Dance
Using a star ensemble, SSIM showcased* results including dance pose detection.
* “Matching Local Self-similarities across Images and Video”. E. Shechtman and M. Irani. CVPR 2007.
1) Ensemble approach scales at best O(n)
- we need to search >>100k BBs
2) SSIM not characterized well for our data
- cross-domain, cross-performance
3) However most promising approach tested
- vs. SIFT/SURF, HOG, Shape Context
Query
CVPR’05
CVPR’07
Slide 11
Implicit Pose Representation
Self-similarity (SSIM) codebooked (HKM, hard assignment), aggregated over scale.
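Codebooking by hard assignment maps each SSIM descriptor to its nearest codeword, giving a bag of visual words per scale. A minimal sketch (a flat nearest-centroid lookup stands in for the hierarchical k-means tree):

```python
def quantize(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword (the leaf step of
    a hierarchical k-means lookup), yielding visual word indices."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: d2(d, codebook[i]))
            for d in descriptors]

def bovw_histogram(words, k):
    """Aggregate word indices into a bag-of-visual-words histogram."""
    hist = [0] * k
    for w in words:
        hist[w] += 1
    return hist
```

In the full system one such histogram would be built per scale layer and the layers combined by the PSFs that follow.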
Slide 12
Representations and strategies (PSF 1 of 4)
Pose similarity function (PSF) 1 serves as baseline: Multi-scale BoVW.
Given a dataset of ROIs D and a query ROI q, evaluate s(q, d) for all d in D and rank:
s(q, d) = (1/|d|) * sum_{i=1..n} tf(w_i, d)
where w_i is the ith of n visual words present in q, tf(.) yields word frequency within d, and |d| is d's word count.
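A sketch of this baseline score in Python (an assumed reading: sum, over the visual words present in q, their term frequency in d, normalised by d's word count):

```python
from collections import Counter

def psf1(query_words, doc_words):
    """Assumed reading of the PSF1 baseline: the sum of tf(w, d) over each
    distinct visual word w present in the query, divided by |d|."""
    tf = Counter(doc_words)  # term frequency of each word in the document
    if not doc_words:
        return 0.0
    return sum(tf[w] for w in set(query_words)) / len(doc_words)
```

Ranking is then just sorting the dataset ROIs by this score in descending order; an inverted index (as used later in the talk for scaling) avoids scoring documents that share no words with the query.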
Slide 13
Representations and strategies (PSF 2 of 4)
Pose similarity function (PSF) 2 is a variant of MS-BoVW that individually weights the importance of each layer (up to 5):
s(q, d) = sum_l alpha_l * s_l(q, d)
where alpha_l are normalised weights bootstrapped via SVM over a small training set of 50; in practice the learned weights indicate a ~linear increase with finer scales.
Slide 14
Representations and strategies (PSF 3 of 4)
The visual sentence (VS) representation encodes fine-scale features plus structural context.
- Semantic body zones are unlikely to map explicitly to regions in the structural hierarchy.
- The set of VS captures membership implicitly over latent variables; topic discovery via LDA.
- Topic set learned via Gibbs sampling over 1k training samples using 48 topics (c.f. Choreutics).
- Variable-length sentences are padded; spatial relationships are implicitly encoded via context.
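A hedged sketch of how such sentences might be assembled from the scale hierarchy (the construction here is an illustrative assumption: each finest-scale cell yields one sentence listing its own visual word together with its ancestors' words at every coarser scale, padded to a fixed length so a topic model such as LDA can consume them):

```python
PAD = -1  # padding token for variable-length sentences

def visual_sentences(layers, length=None):
    """Illustrative visual-sentence construction.  layers[s] is a 2D grid of
    visual words at scale s, each scale doubling the grid resolution; one
    sentence is emitted per finest-scale cell, containing the words of that
    cell and of the coarser cells that contain it (structural context)."""
    finest = layers[-1]
    n_scales = len(layers)
    length = length or n_scales
    sentences = []
    for y in range(len(finest)):
        for x in range(len(finest[0])):
            s = []
            for lvl in range(n_scales):
                shift = n_scales - 1 - lvl  # coarser layers halve the index
                s.append(layers[lvl][y >> shift][x >> shift])
            s += [PAD] * (length - len(s))  # pad to a fixed length
            sentences.append(s)
    return sentences
```

Because each sentence threads through the hierarchy, words that co-occur within a sentence share spatial context implicitly, which is what the LDA topics then exploit.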
Slide 15
Representations and strategies (PSF 4 of 4)
Explicitly encode spatial relationships via a sliding-window approach over q and d:
- 2 x 2 window (at the coarsest level, i.e. 4 x 8 pixels); compare all VS within the footprint.
- Similarity between a window pair (50-100 VS): randomly sample 1/3 of the VS in the window pair and search for the pair of sentences minimising ||.||, where ||.|| is a count of in-place differences between VS.
Reminiscent of text passage retrieval
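The window comparison above can be sketched as follows (assumed reading: sample a third of the sentences from each window and keep the best-matching cross-pair under the in-place difference count):

```python
import random

def sentence_diff(a, b):
    """Count of in-place differences between two equal-length sentences."""
    return sum(1 for x, y in zip(a, b) if x != y)

def window_similarity(win_q, win_d, frac=1.0 / 3.0, seed=0):
    """Illustrative PSF4-style window comparison: randomly sample a fraction
    of the sentences in each window and return the smallest in-place
    difference over the sampled cross-pairs (lower = more similar)."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    sq = rng.sample(win_q, max(1, int(len(win_q) * frac)))
    sd = rng.sample(win_d, max(1, int(len(win_d) * frac)))
    return min(sentence_diff(a, b) for a in sq for b in sd)
```

Sampling keeps the cross-pair comparison tractable when each window footprint holds 50-100 sentences, at the cost of a slightly noisy similarity estimate.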
Slide 16
Comparative Results (PSF 1-4)
- Initial evaluation over 32 works across 4 cross-media collections.
- Video subsampled @ 5s = 6.3k video stills + 1.7k photos ≈ 8k BBs.
- No stop-word identification at this stage.
- Independent treatment of scales is significantly better; VS outperforms Layered by ~10%.
- PSF3 best (+4%) but drops sharply after 1k.
Slide 17
Query set and Ground Truth
- Mark-up task distributed over 3 professional archivists in UK-NRCD.
- 65 queries, each a single BB (2/3 contact-sheet photos, 1/3 video frames).
- 8k BBs marked up as relevant/non-relevant with respect to each query.
contact sheets video
Slide 18
Comparative Results (PSF 1-4)
Effect of stop-word removal on the BoVW codebook, comparing the best performing VS (PSF3) and Layered (PSF2) strategies. Stop-word identification via frequency distribution under a Bernoulli or Poisson model.
Results indicate PSF3 (LDA) over k=1000, with Bernoulli stop-word removal at 0.85.
Slide 19
Diversity Re-ranking (PSF 5 = PSF 3 re-ranked)
Direct presentation of results can lead to unsatisfactory visual repetition (e.g. temporally adjacent video frames), which is not ideal for archive discovery. A run of poor results can also reduce precision.
Re-rank via Kruskal clustering of the affinity graph A of the top n results (the scope of DR). A is computed pairwise using PSF4 (the sliding-window approach). Spanning trees are iteratively identified in the graph to form a cluster set; each cluster is ranked independently under the PSF3 score, and the ranks are merged.
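A sketch of this re-ranking step (Kruskal's algorithm with union-find, stopped when k clusters remain; the cluster ordering and round-robin merge policy here are illustrative assumptions):

```python
def kruskal_clusters(n, edges, k):
    """Cluster n results by growing spanning trees over the affinity graph
    (Kruskal with union-find), stopping when k clusters remain.
    edges: list of (cost, i, j) with lower cost = higher affinity."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    clusters = n
    for cost, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj and clusters > k:
            parent[ri] = rj
            clusters -= 1
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def diversity_rerank(scores, clusters):
    """Rank each cluster by its members' scores (higher = better), order
    clusters by their best member, then merge round-robin so adjacent
    results come from different clusters."""
    ranked = [sorted(c, key=lambda i: -scores[i]) for c in clusters]
    ranked.sort(key=lambda c: -scores[c[0]])
    out, depth = [], 0
    while any(depth < len(c) for c in ranked):
        for c in ranked:
            if depth < len(c):
                out.append(c[depth])
        depth += 1
    return out
```

Interleaving clusters is what breaks up runs of near-duplicate frames: the second result now comes from a different spanning tree than the first.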
Slide 20
Results - Qualitative
Slide 21
Results - Qualitative
Serendipitous recovery from failed BB isolation!
Slide 22
Results - Quantitative
Comparison vs. BoVW (single and multiple scales) and variants including SPK.
Slide 23
Scaling the dataset
- Initial dataset plus Siobhan Davies archive (200 videos, 562 contact sheets) ≈ 68k BBs.
- Inverse index used for PSF 1, 2, 3, 5.
- Comparison to explicit HPE: Pictorial Structures Revisited [Andriluka ’09]; Pose Search [Eichner ’09].
Slide 24
Conclusions on Pose Search
- SoA pose search relies on explicit HPE, which is impractical on low-resolution, cross-domain footage.
- Visual sentences + LDA (PSF3) reach ~32% MAP, well above the explicit-HPE state of the art.
- They encode local appearance with spatial context, at a sufficient level of abstraction to match diverse footage.
- Diversity re-ranking improves results by ~4%.
- Query time <2s for 68k records; could pre-compute at this scale.
Pose driven visual search
Slide 25
Research Landscape: Visual Search of Dance
- iWeave: 3D Costume Archive
- Pose driven visual search
- Sketch driven Choreography
Slide 26
Research Landscape: Visual Search of Dance
- iWeave: 3D Costume Archive
- Pose driven visual search
- Sketch driven Choreography
Slide 27
ReEnact: Contributions
A major driver for the use of dance archives is the development of new choreography.
ReEnact is a sketch based interface to the NRCD archives enabling this
Visual Narrative: A set of key-frame poses linked with gestures that describe a movement.
Conceptual extension of ‘storyboard’ sketches [Collomosse et al. ICCV 09]
Slide 28
Related Work
- No prior work on sketch based pose retrieval.
- Several works on sketch based shape retrieval, but these are aimed at inter- not intra-class variation.
A Performance Evaluation of the Gradient Field HOG Descriptor for Sketch based Image Retrieval. R. Hu and J. Collomosse. Computer Vision and Image Understanding (CVIU). February 2013.
Slide 29
ReEnact: Pose retrieval pipeline
[Pipeline diagram] Training: a video parse and a sketch parse produce training pairs, used to learn a manifold mapping. Query: the query sketch is parsed, mapped into the manifold, and matched via geodesic k-NN against all video.
Slide 30
ReEnact: Sketch Parsing
Sketches are converted into stick figures (joint-angle representation):
1. Ellipse detection for the head.
2. Torso detection: proximity to extreme points of other strokes; centre of mass.
3. Intersections with the torso are potential limbs.
4. Heuristics select limb pairs for arms/legs.
5. The user may manipulate left/right labellings, as these are ambiguous in a sketch.
Sketch parse
Skeletons from Sketches of Dancing Poses. M. Fonseca, S. James and J. Collomosse. Proc. VL/HCC. Nov 2012.
Slide 31
ReEnact: Performer Extraction
Extracting a silhouette of the performer within the bounding box.
Pipeline: saliency, FG/BG texton, and motion difference fields -> MRF / solve.
Unary: weighted sum of the three fields.
Pairwise: standard Boykov’01 term.
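A hedged sketch of that labelling problem (assumptions on my part: the three fields are normalised per-pixel foreground costs, the pairwise term is reduced to a constant Potts penalty without Boykov's contrast weighting, and a simple ICM sweep stands in for the graph-cut solver):

```python
def segment_icm(saliency, texton, motion, weights=(1.0, 1.0, 1.0),
                lam=0.5, sweeps=5):
    """Illustrative FG/BG labelling.  Unary: weighted sum of three per-pixel
    foreground cost fields.  Pairwise: constant penalty per disagreeing
    4-neighbour (a Potts stand-in for the contrast-weighted Boykov term).
    Solved approximately by iterated conditional modes (ICM)."""
    h, w = len(saliency), len(saliency[0])
    a, b, c = weights
    total = a + b + c
    unary = [[a * saliency[y][x] + b * texton[y][x] + c * motion[y][x]
              for x in range(w)] for y in range(h)]
    # Initial labels from the unary term alone (1 = foreground).
    label = [[1 if unary[y][x] < total / 2 else 0 for x in range(w)]
             for y in range(h)]
    for _ in range(sweeps):
        for y in range(h):
            for x in range(w):
                nb = [label[yy][xx]
                      for yy, xx in ((y - 1, x), (y + 1, x),
                                     (y, x - 1), (y, x + 1))
                      if 0 <= yy < h and 0 <= xx < w]
                cost_fg = unary[y][x] + lam * sum(1 for n in nb if n != 1)
                cost_bg = (total - unary[y][x]) + lam * sum(1 for n in nb if n != 0)
                label[y][x] = 1 if cost_fg <= cost_bg else 0
    return label
```

ICM only finds a local minimum of the energy; the graph cut used in the talk solves this binary MRF exactly, but the energy being minimised has the same unary-plus-pairwise shape.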
Slide 32
ReEnact: Descriptor Formation
- Skeleton -> joint-angle representation.
- Silhouette -> gridded Zernike moments: concatenate 22-D moments over a 2x2 grid; affine invariant (each cell).
Match
Slide 33
Learning a mapping between the manifolds
Geodesic distance as a shortest-path over the graph
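Geodesic distance over the graph reduces to single-source shortest paths, e.g. Dijkstra's algorithm:

```python
import heapq

def geodesic(adj, src):
    """Geodesic distances from src as shortest paths over a weighted graph.
    adj: {node: [(neighbour, edge_length), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The k-NN retrieval step then amounts to taking the k smallest geodesic distances from the query's entry point into the graph.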
Slide 34
Constructing the Graph (G) in space D
Around 150 training pairs of sketches and video frames are gathered to seed G.
Training frames:
Training weights:
Test video frames are subsequently indexed by extending G with new poses:
- attached to the nearest N training nodes
- N=1 for unconfident frames, N>1 for confident frames
- confidence determined by temporal coherence (covariance) of descriptors
Slide 35
Domain Transfer S -> D
Given a training sketch s we can now infer its similarity to any video pose in D.
So, given an arbitrary query q, and assuming local linearity in S:
nx – candidate video pose
nd – connection into D from S
a,b pairs of nodes on shortest path through D
Slide 36
Retrieval Results
Trained on 150 frames, tested over ~6k. AP @ [1,80] averaged over 6 queries.
- Training (Blueprint) MAP 60%
- Test (ThreeD) MAP 47%
Slide 37
Choreography Synthesis
Web UI for generating visual narratives via sketch / semantic label annotation.
For free:
Can run inference backward from D->S to produce stick men from video.
Useful for visualizing / exploring alternative retrieval results
Slide 38
Video Synthesis
Inspired by Video Textures [Schodl’00] (a video form of a Motion Graph [Kovar’02]).
Slide 39
Video Path Optimization
The motion graph is formed by identifying transitions between frame pairs:
- pose similarity via our geodesic distance;
- down-weighted by poor optical flow correspondence [Brox’04];
- low-pass filtered to encourage motion coherence.
Slide 40
Video Path Optimization
The motion graph is duplicated and linked via “virtual” nodes (sketched poses).
Slide 41
Video Path Optimization
The shortest path across the graph is a function of three costs: pose similarity, gesture similarity, and duration of sequence (fidelity to the visual narrative; visual smoothness).
- Gesture: mean gesture similarity over the path; a sliding-window SVM trained for gesture recognition (black box).
- Duration: count of frames along the path; penalise deviation from an idealised duration (user specified with action labels).
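The combined objective can be sketched on a toy motion graph as follows (weights are illustrative, and exhaustive enumeration of short simple paths stands in for the graph-search solver used in the talk; the function names are mine):

```python
def best_path(adj, pose_cost, gesture_dissim, src, dst, ideal_len,
              w_pose=1.0, w_gest=1.0, w_dur=0.5, max_len=12):
    """Illustrative path objective: summed pose transition cost along the
    path, plus mean gesture dissimilarity of its frames, plus a penalty on
    deviation from an idealised duration.  Returns (cost, path)."""
    best = (float("inf"), None)
    stack = [(src, [src])]
    while stack:
        u, path = stack.pop()
        if u == dst:
            cost = (w_pose * sum(pose_cost[(a, b)]
                                 for a, b in zip(path, path[1:]))
                    + w_gest * sum(gesture_dissim[n] for n in path) / len(path)
                    + w_dur * abs(len(path) - ideal_len))
            if cost < best[0]:
                best = (cost, path)
            continue
        if len(path) >= max_len:
            continue  # bound the enumeration
        for v in adj.get(u, []):
            if v not in path:  # simple paths only
                stack.append((v, path + [v]))
    return best
```

Because the duration term depends on path length rather than on individual edges, the real optimisation is not a plain shortest path; duplicating the graph per narrative stage (previous slide) is one way to keep it tractable.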
Slide 42
Video Synthesis: Results
Representative run for a 3-stage visual narrative over Three-D.
Gradient domain compositing used against an “infinite background”
Slide 43
Video composited
Slide 44
Video composited
Slide 45
ReEnact: Conclusion
- Sketch based pose search using a learnable piecewise-linear manifold mapping.
- Temporally coherent pose descriptor based on gridded Zernike moments; 47% MAP on unseen video.
- Visual narratives to generate archival choreography; motion graph optimization fusing pose/action cost.
Future work:
- Improve compositing of the performer (unwanted scale changes due to BB detection).
- Alternative ways to specify intermediate gestures.
Sketch driven Choreography
Slide 46
Research Landscape: Visual Search of Dance
- iWeave: 3D Costume Archive
- Pose driven visual search
- Sketch driven Choreography
Slide 47
Research Landscape: Visual Search of Dance
- Pose driven visual search
- Sketch driven Choreography
- iWeave: 3D Costume Archive
Slide 48
iWeave – Interactive Wearable Archive
Ongoing project enabling users to experience costume and choreography from circa the 1920s.
- Captured dance performance in a 3D studio.
- Create an animated character that is interactively controlled by a human using Microsoft Kinect.
Slide 49
iWeave – Performance Capture (Raw)
Daffodil dress from the Natural Movement collection (1920s).
Slide 50
iWeave – Performance Capture (4D Video)
Daffodil dress from the Natural Movement collection (1920s).
Slide 51
iWeave – 4D Mesh and Skeleton Estimation
Slide 52
Interactive Animation
Match
Now showing
Target pose
Motion graph
Performance display
Slide 53
iWeave: Interactive Animation
Pose similarity via joint angles:
- quaternions;
- weighted toward outer joints.
Path search as in ReEnact.
Random walk when idle
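The quaternion joint-angle comparison can be sketched as follows (conventions and weights are illustrative; w-first unit quaternions are assumed, and the per-joint weights would be chosen to emphasise outer joints such as hands and feet):

```python
import math

def quat_angle(q1, q2):
    """Geodesic angle (radians) between two unit quaternions.
    abs() handles the double cover: q and -q are the same rotation."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))

def pose_distance(pose_a, pose_b, joint_weights):
    """Weighted sum of per-joint quaternion angles; larger weights on
    outer joints make end-effector differences count more."""
    return sum(w * quat_angle(qa, qb)
               for qa, qb, w in zip(pose_a, pose_b, joint_weights))
```

The target pose from the Kinect is scored against motion-graph nodes with this distance, and the path search (as in ReEnact) steers playback toward the best match.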
Slide 54
Conclusions (Pose Search)
Pose search in 2D is possible over low-resolution video using our new visual sentence descriptor.
Visual sentences outperform explicit pose estimation based search on this footage
Pose search in 2D or 3D coupled with Motion Graphs enables interactive animated characters
Cultural heritage application to historic costume archives.
Slide 55
Contributors
John Collomosse, Reede Ren, Rui Hu, Stuart James
iWeave
Thanks for your attention
Qizhi Yu
Visual Sentences / ReEnact