Visual Search of Dance Archives

John Collomosse
J.Collomosse@surrey.ac.uk

Centre for Vision, Speech and Signal Processing
University of Surrey

Oxford Robotics Group. May 2013.


Motivation: Add value to curated Dance collections

Dance archives are currently searchable by text (curated metadata).

What if you want to search on the content, e.g. the choreography itself?

Metadata


Research Landscape: Visual Search of Dance

- Pose driven visual search
- Sketch driven choreography
- iWeave: 3D Costume Archive


Visual sentences for pose retrieval over low-resolution cross-media dance collections. R. Ren and J. Collomosse. IEEE Trans. Multimedia 14(6). Dec 2012.


UK-NRCD Archival Dance Footage

Digital Dance Archive (DDA) spanning ~100 years of UK dance history.
Videos were transferred between several analogue formats prior to digitisation.

- Grainy / noisy
- Contrast bleaching
- Blur / poor definition
- Small performer, e.g. 100 px
- Illumination artifacts
- Featureless background
- Inter- and intra-occlusion

http://www.dance-archives.ac.uk


Characterizing HPE on Archival Footage

Explicit Human Pose Estimation (HPE) fails on typical NRCD archival footage

NRCD Footage

Eichner et al. [CVPR’09]

Andriluka et al. [CVPR’09]


Contributions

- Cross-media pose retrieval on archival data
- Match pose implicitly rather than explicitly
- New representation, “Visual Sentences”, using self-similarity (SSIM) and LDA
- Built into a Bag of Words framework, with tweaks, e.g. stop-word removal
- Fusing vision and information retrieval concepts, e.g. diversity re-ranking

Contact Sheets (Photos)

Performance videos


Performer Detection

- Dalal/Triggs-like pedestrian detection [CVPR 2005]
- Trained across six videos (~5 hrs): 5k positive annotations, 5k negatives sampled randomly outside BBs
- Horizontal poses included but rotated (twice)
- Output BBs rescaled to 64x128 for retrieval


Visual Sentence Representation

Based on the Self-similarity (SSIM) descriptor*

1) Computes a correlation surface Sq local to (x,y) using SSD.

2) Bins Sq into a log-polar representation using local maxima.

3) Discards invalid (very low/high variance) features.

* “Matching Local Self-similarities across Images and Video”. E. Shechtman and M. Irani. CVPR 2007.
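
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the descriptor as published: the patch size, search radius, bin counts, `noise_var` and the variance thresholds are all assumed values, and the function names are hypothetical.

```python
import math

def ssd(img, ax, ay, bx, by, half):
    """Sum of squared differences between two (2*half+1)^2 patches."""
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            diff = img[ay + dy][ax + dx] - img[by + dy][bx + dx]
            total += diff * diff
    return total

def self_similarity(img, x, y, patch_half=1, radius=4,
                    n_angles=8, n_radii=2, noise_var=1.0):
    """Log-polar self-similarity descriptor around (x, y); values in [0, 1]."""
    bins = [0.0] * (n_angles * n_radii)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = math.hypot(dx, dy)
            if r == 0 or r > radius:
                continue
            # Step 1: correlation surface value from the SSD of two patches.
            s = math.exp(-ssd(img, x, y, x + dx, y + dy, patch_half) / noise_var)
            # Step 2: log-polar bin; keep only the maximum falling in each bin.
            angle = (math.atan2(dy, dx) + math.pi) / (2 * math.pi)
            a = int(angle * n_angles) % n_angles
            rho = int(math.log(r + 1) / math.log(radius + 1) * n_radii)
            idx = min(rho, n_radii - 1) * n_angles + a
            bins[idx] = max(bins[idx], s)
    return bins

def is_valid(descriptor, low=1e-4, high=0.2):
    """Step 3: reject descriptors with very low or very high variance."""
    mean = sum(descriptor) / len(descriptor)
    var = sum((v - mean) ** 2 for v in descriptor) / len(descriptor)
    return low < var < high
```

The caller is assumed to keep (x, y) at least `radius + patch_half` pixels away from the image border.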


SSIM for Dance

Using a star ensemble, SSIM showcased* results including dance pose detection.

* “Matching Local Self-similarities across Images and Video”. E. Shechtman and M. Irani. CVPR 2007.

1) Ensemble approach scales at best O(n)

- we need to search >>100k BBs

2) SSIM not characterized well for our data

- cross-domain, cross-performance

3) However most promising approach tested

- vs. SIFT/SURF, HOG, Shape Context

[Figure panels: Query, CVPR’05, CVPR’07]


Implicit Pose Representation

Self-similarity (SSIM) descriptors are quantised into a codebook (hierarchical k-means, hard assignment) and aggregated over scale.


Representations and strategies (PSF 1 of 4)

Pose similarity function (PSF) 1 serves as the baseline: multi-scale BoVW (MS-BoVW)

Given a dataset of ROIs and a query ROI, evaluate the similarity for every ROI d in the dataset and rank

where is the ith of n visual words present in

tf(.) yields word frequency within d, and |d| is word count
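
The scoring formula itself did not survive in this transcript, so the following is only a plausible sketch of a term-frequency BoVW score built from the quantities named above: tf(.) and |d|. Summing over the distinct words of the query is an assumption, and `bovw_score` and `rank` are hypothetical names.

```python
from collections import Counter

def bovw_score(query_words, doc_words):
    """Sum, over the distinct query words, of tf(word, d) / |d|."""
    if not doc_words:
        return 0.0
    tf = Counter(doc_words)  # word frequency within d
    return sum(tf[w] for w in set(query_words)) / len(doc_words)

def rank(query_words, dataset):
    """Return dataset keys ordered from most to least similar to the query."""
    scores = {name: bovw_score(query_words, words)
              for name, words in dataset.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ROIs, each a list of visual word ids.
dataset = {"a": [1, 2, 3, 4], "b": [1, 1, 2, 9], "c": [7, 8, 9, 9]}
ranking = rank([1, 2], dataset)
```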


Representations and strategies (PSF 2 of 4)

Pose similarity function (PSF) 2 is a variant of MS-BoVW that individually weights the importance of each layer (up to 5)

The layer weights are normalised and bootstrapped via SVM over a small training set of 50

In practice the weights indicate a ~linear increase with finer scales.
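
A minimal sketch of layer weighting, assuming the per-layer similarity scores are already available. The linear weights 1..5 below only mimic the reported trend; they are not the SVM-learned values, and the function names are hypothetical.

```python
def normalise(weights):
    """Scale weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def layered_score(layer_scores, weights):
    """Weighted sum of per-layer scores, ordered coarse to fine."""
    if len(layer_scores) != len(weights):
        raise ValueError("one weight is needed per layer")
    return sum(w * s for w, s in zip(normalise(weights), layer_scores))

# Assumed weights: a linear increase towards the finest of 5 layers.
weights = [1, 2, 3, 4, 5]
fine_only = layered_score([0.0, 0.0, 0.0, 0.0, 1.0], weights)
coarse_only = layered_score([1.0, 0.0, 0.0, 0.0, 0.0], weights)
```

With these weights a match at the finest layer counts five times as much as a match at the coarsest.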


Representations and strategies (PSF 3 of 4)

Visual sentence (VS) representation encodes fine-scale features + structural context

Semantic body zones are unlikely to map explicitly to regions in the structural hierarchy.

- Set of VS capture membership implicitly over latent variables
- Topic discovery via LDA
- Topic set learned via Gibbs sampling over 1k training samples using 48 topics (cf. Choreutics)
- Variable length sentences are padded
- Spatial relationships implicitly encoded via context

Variable length


Representations and strategies (PSF 4 of 4)

Explicit encoding of spatial relationships via a sliding window approach over the two ROIs being compared

A 2 x 2 window (at the coarsest level, i.e. 4 x 8 pixels) slides over each ROI; all VS within the footprint are compared

Similarity between a window pair (50-100 VS):

- Randomly sample 1/3 of the VS in the window pair and search for the pair of sentences minimising ||.||
- ||.|| is a count of in-place differences between VS
- Reminiscent of text passage retrieval
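
The window comparison can be sketched as follows, treating each visual sentence as an equal-length (padded) tuple of word ids. Sampling one third of each window independently is an assumption about how the slide's sampling is applied, and the function names are hypothetical.

```python
import random

def in_place_differences(a, b):
    """Count positions at which two equal-length visual sentences differ."""
    if len(a) != len(b):
        raise ValueError("sentences must be padded to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def window_distance(window_a, window_b, fraction=1 / 3, rng=None):
    """Smallest in-place difference over a random sample of sentence pairs."""
    rng = rng or random.Random(0)

    def sample(window):
        k = max(1, min(len(window), round(len(window) * fraction)))
        return rng.sample(window, k)

    return min(in_place_differences(s, t)
               for s in sample(window_a) for t in sample(window_b))

# Hypothetical windows of padded sentences (0 is the padding word).
window_a = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
window_b = [(1, 2, 0), (4, 5, 6), (0, 0, 0)]
exhaustive = window_distance(window_a, window_b, fraction=1.0)
sampled = window_distance(window_a, window_b)
```

Sampling trades accuracy for speed: with `fraction=1.0` every sentence pair is compared, while the default compares roughly one ninth of them.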


Comparative Results (PSF1-4)

- Initial evaluation over 32 works across 4 cross-media collections
- Video subsampled @ 5s = 6.3k video stills + 1.7k photos ~= 8k BBs
- No stop-word identification at this stage

- Independent treatment of scales significantly better
- VS outperforming Layered by ~10%
- PSF3 best (+4%) but drops sharply after 1k


Query set and Ground Truth

- Mark-up task distributed over 3 professional archivists at UK-NRCD
- 65 queries, each a single BB (2/3 contact sheet photos, 1/3 video frames)
- 8k BBs marked up as relevant/non-relevant with respect to each query

contact sheets video


Comparative Results (PSF1-4)

- Effect of stop-word removal on the BoVW codebook
- Comparing the best performing VS (PSF3) and Layered (PSF2) strategies
- Stop-word identification via frequency distribution under a Bernoulli or Poisson model

Results indicate PSF3 (LDA) over k=1000, with Bernoulli stop-word removal at 0.85
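
One way to read the Bernoulli variant is sketched below: a visual word is modelled as present or absent in each ROI, and words present in more than a threshold fraction of ROIs are treated as stop-words. Interpreting 0.85 as that fraction is an assumption, as are the function names.

```python
from collections import Counter

def bernoulli_stop_words(documents, threshold=0.85):
    """Words whose presence probability across documents exceeds threshold."""
    n = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))  # presence/absence, not raw counts
    return {w for w, count in doc_freq.items() if count / n > threshold}

def remove_stop_words(words, stop_words):
    """Drop stop-words from one ROI's list of visual words."""
    return [w for w in words if w not in stop_words]

# Hypothetical ROIs: word 1 occurs everywhere and carries no information.
rois = [[1, 2], [1, 3], [1, 4], [1, 2, 5]]
stops = bernoulli_stop_words(rois)
cleaned = remove_stop_words(rois[3], stops)
```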


Diversity Re-ranking (PSF5 = PSF3 re-ranked)

Direct presentation of results can lead to unsatisfactory visual repetition (e.g. temporally adjacent video frames)

- Not ideal for archive discovery.
- A run of poor results can also reduce precision.

Re-rank via Kruskal clustering of affinity graph A of the top n results (the scope of DR)

- A computed pairwise using PSF4 (sliding window approach)
- Spanning trees iteratively identified in the graph to form a cluster set; each cluster is ranked independently under the PSF3 score
- Ranks merged
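
A minimal sketch of the re-ranking idea, assuming the affinity edges and PSF3 scores are given. The affinity cut-off `min_affinity` and the round-robin merge of cluster rankings are assumptions, since the slide does not state how clusters are cut or how ranks are merged.

```python
def kruskal_clusters(n_items, edges, min_affinity):
    """Join items in decreasing affinity order (Kruskal with union-find)."""
    parent = list(range(n_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for affinity, i, j in sorted(edges, reverse=True):
        if affinity < min_affinity:
            break  # remaining edges are too weak to join clusters
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def diversity_rerank(scores, edges, min_affinity):
    """Rank within each cluster by score, then interleave the clusters."""
    clusters = kruskal_clusters(len(scores), edges, min_affinity)
    for cluster in clusters:
        cluster.sort(key=lambda i: scores[i], reverse=True)
    clusters.sort(key=lambda c: scores[c[0]], reverse=True)
    merged, depth = [], 0
    while len(merged) < len(scores):
        merged.extend(c[depth] for c in clusters if depth < len(c))
        depth += 1
    return merged

# Hypothetical top-4 results; items 0 and 1 are near-duplicate frames.
scores = [0.9, 0.85, 0.8, 0.5]
edges = [(0.95, 0, 1), (0.2, 1, 2), (0.1, 2, 3)]
reranked = diversity_rerank(scores, edges, min_affinity=0.5)
```

In this example the near-duplicate of the top result is pushed to the end, so the first three results shown are visually distinct.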


Results - Qualitative


Serendipitous recovery from failed BB isolation!


Results - Quantitative

Comparison vs. BoVW (single and multiple scales) and variants including SPK


Scaling the dataset

- Initial dataset plus Siobhan Davies archive (200 videos, 562 contact sheets) ~= 68k BBs
- Inverted index used for PSF1,2,3,5

Comparison to explicit HPE

- Pictorial Structures Revisited [Andriluka ‘09]
- Pose Search [Eichner ‘09]
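
The index used at this scale can be illustrated with a small sketch: each visual word maps to the set of ROIs containing it, so only ROIs sharing a word with the query need to be scored. The data layout and function names are assumptions for illustration only.

```python
from collections import defaultdict

def build_inverted_index(dataset):
    """Map each visual word to the set of ROI ids that contain it."""
    index = defaultdict(set)
    for roi_id, words in dataset.items():
        for w in words:
            index[w].add(roi_id)
    return index

def candidates(index, query_words):
    """ROIs sharing at least one visual word with the query."""
    found = set()
    for w in set(query_words):
        found |= index.get(w, set())
    return found

# Hypothetical ROIs keyed by id.
dataset = {"a": [1, 2], "b": [2, 3], "c": [4]}
index = build_inverted_index(dataset)
shortlist = candidates(index, [2])
```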


Conclusions on Pose Search

SoA pose search relies on explicit HPE

- This is impractical on low-resolution, cross-domain footage.

Visual sentences + LDA (PSF3) reach ~32% MAP >> SoA

- Encode local appearance with a spatial context

- Sufficient level of abstraction to match diverse footage.

Diversity re-ranking improves results by ~4%

Query time <2s for 68k records

Given could pre-compute at this scale

Pose driven visual search


ReEnact: Contributions

A major driver for the use of Dance Archives is the development of new choreography

ReEnact is a sketch based interface to the NRCD archives enabling this

Visual Narrative: A set of key-frame poses linked with gestures that describe a movement.

Conceptual extension of ‘storyboard’ sketches [Collomosse et al. ICCV 09]


Related Work

- No prior work on sketch based pose retrieval
- Several works on sketch based shape retrieval, but these are aimed at inter- not intra-class variation.

A Performance Evaluation of the Gradient Field HOG Descriptor for Sketch based Image Retrieval. R. Hu and J. Collomosse. Computer Vision and Image Understanding (CVIU). February 2013.


ReEnact: Pose retrieval pipeline

[Pipeline diagram; stages: Training (training pairs: video parse, sketch parse; learn), Manifold Mapping (map), Query (sketch parse, geodesic k-NN over all video)]


ReEnact: Sketch Parsing

Sketches are converted into stick figures (joint angle representation)

1. Ellipse detection for head

2. Torso detection
- Proximity to extreme points of other strokes
- Centre of mass

3. Intersections with torso are potential limbs

4. Heuristics select limb pairs for arms/legs

5. User may manipulate left/right labellings as these are ambiguous in sketch

Sketch parse

Skeletons from Sketches of Dancing Poses. M. Fonseca, S. James and J. Collomosse. Proc. VL/HCC. Nov 2012.


ReEnact: Performer Extraction

Extracting a silhouette of the performer within the bounding box

Three fields (saliency, FG/BG texton, motion difference) feed an MRF, which is then solved:

- Unary: weighted sum of the three fields
- Pairwise: standard Boykov’01 term


ReEnact: Descriptor Formation

- Skeleton -> joint angle representation
- Silhouette -> gridded Zernike moments
- Concatenate 22-D moments over a 2x2 grid
- Affine invariant (each cell)

Match


Learning a mapping between the manifolds

Geodesic distance as a shortest-path over the graph
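
Geodesic distance as a shortest path can be sketched with Dijkstra's algorithm over a small neighbourhood graph. The graph layout (node mapped to a list of neighbour and edge length pairs) and the function names are assumptions; how edge lengths are derived from pose descriptors is not shown here.

```python
import heapq

def geodesic_distances(graph, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, length in graph.get(node, []):
            candidate = d + length
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist

def geodesic_knn(graph, source, k):
    """The k nodes closest to source under geodesic distance."""
    dist = geodesic_distances(graph, source)
    others = sorted((d, n) for n, d in dist.items() if n != source)
    return [n for _, n in others[:k]]

# Hypothetical pose graph: the direct a-c edge is longer than going via b.
graph = {
    "a": [("b", 1.0), ("c", 5.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("a", 5.0), ("b", 1.0), ("d", 2.0)],
    "d": [("c", 2.0)],
}
nearest = geodesic_knn(graph, "a", 2)
```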


Constructing the Graph (G) in space D

Around 150 training pairs of sketches and video frames are gathered to seed G

Training frames:

Training weights:

Test video frames are subsequently indexed by extending G with new poses

- attached to nearest N training nodes

- N=1 for unconfident frames, N>1 for confident frames

- Confidence determined by temporal coherence (covariance) of descriptors


Domain Transfer S -> D

Given a training sketch s we can now infer similarity to any video pose in D (i.e. )

So given an arbitrary query q, and assuming local linearity in S:

nx – candidate video pose

nd – connection into D from S

a,b pairs of nodes on shortest path through D


Retrieval Results

Trained on 150 frames, tested over ~6k. AP @ [1,80] averaged over 6 queries.

- Training (Blueprint) MAP 60%

- Test (ThreeD) MAP 47%
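
For reference, average precision at a cut-off and its mean over queries can be computed as below. Normalising by the number of relevant results found within the cut-off is one common convention and is an assumption here; the slides do not say which variant was used.

```python
def average_precision(relevance, k):
    """AP over the top k results; relevance is a ranked list of booleans."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(relevance_lists, k):
    """Mean of per-query AP values."""
    return sum(average_precision(r, k) for r in relevance_lists) / len(relevance_lists)

# Hypothetical judgements for two queries, best-ranked result first.
queries = [[True, False, True], [False, False, False]]
overall = mean_average_precision(queries, k=80)
```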


Choreography Synthesis

Web UI for generating visual narratives via sketch / semantic label annotation

For free:

Can run inference backward from D->S to produce stick men from video.

Useful for visualizing / exploring alternative retrieval results


Video Synthesis

Inspired by Video Textures [Schodl’00] (a video form of Motion Graph [Kovar’02])


Video Path Optimization

The motion graph is formed by identifying transitions between frame pairs

- Pose similarity via our geodesic distance
- Down-weighted by poor optical flow correspondence [Brox’04]
- Low-pass filtered to encourage motion coherence


Video Path Optimization

The motion graph is duplicated and linked via “virtual” nodes (sketched poses)


Video Path Optimization

The shortest path across the graph is a function of three costs (fidelity to visual narrative, visual smoothness):

- Pose similarity
- Gesture similarity: mean gesture similarity over the path, from a sliding window SVM trained for gesture recognition (black box)
- Duration of sequence: count of frames along the path; penalise deviation from an idealised duration (user specified with action labels)
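
The cost equation is not preserved in this transcript, so the sketch below only shows one plausible way of combining the three terms. It scores candidate paths that have already been enumerated rather than searching the graph, and the linear combination, the weights and all names are assumptions.

```python
def path_cost(pose_costs, gesture_sims, ideal_frames, weights=(1.0, 1.0, 1.0)):
    """Combine pose, gesture and duration terms into one path cost."""
    n_frames = len(pose_costs)
    pose_term = sum(pose_costs) / n_frames            # mean pose dissimilarity
    gesture_term = 1.0 - sum(gesture_sims) / len(gesture_sims)
    duration_term = abs(n_frames - ideal_frames) / ideal_frames
    w_pose, w_gesture, w_duration = weights
    return w_pose * pose_term + w_gesture * gesture_term + w_duration * duration_term

def best_path(candidate_paths, ideal_frames):
    """Pick the candidate path with the lowest combined cost."""
    return min(candidate_paths,
               key=lambda p: path_cost(p["pose"], p["gesture"], ideal_frames))

# Two hypothetical candidates: a smooth short path and an over-long one.
short = {"name": "short", "pose": [0.1, 0.1], "gesture": [0.9, 0.9]}
long_ = {"name": "long", "pose": [0.1] * 6, "gesture": [0.9] * 6}
chosen = best_path([short, long_], ideal_frames=2)
```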


Video Synthesis: Results

Representative run for a 3 stage visual narrative over Three-D

Gradient domain compositing used against an “infinite background”


Video composited



ReEnact: Conclusion

- Sketch based pose search using a learnable piecewise linear manifold mapping
- Temporally coherent pose descriptor based on gridded Zernike moments
- 47% MAP on unseen video
- Visual narratives to generate archival choreography
- Motion graph optimization fusing pose/action cost

Future work

- Improve compositing of the performer
- Unwanted scale changes due to BB detection
- Alternative ways to specify intermediate gestures

Sketch driven Choreography


iWeave – Interactive Wearable Archive

Ongoing project enabling users to experience costume and choreography from circa 1920s

- Captured dance performance in a 3D studio.
- Create an animated character that is interactively controlled by a human using Microsoft Kinect.


iWeave – Performance Capture (Raw)

Daffodil dress from Natural Movement collection (1920s)


iWeave – Performance Capture (4D Video)

Daffodil dress from Natural Movement collection (1920s)


iWeave – 4D Mesh and Skeleton Estimation


Interactive Animation

[Diagram labels: Target pose, Match, Motion graph, Now showing, Performance display]


iWeave: Interactive Animation

- Pose similarity: joint angles (quaternions), weighted to outer joints
- Path search as in ReEnact
- Random walk when idle
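
A joint-angle pose distance of this kind might look like the sketch below, where each joint rotation is a unit quaternion (w, x, y, z) and outer joints receive larger weights. The joint names, the weight values and the use of a weighted mean are assumptions.

```python
import math

def quaternion_angle(q1, q2):
    """Rotation angle in radians between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))  # q and -q are the same rotation
    return 2.0 * math.acos(min(1.0, dot))

def pose_distance(pose_a, pose_b, joint_weights):
    """Weighted mean rotation difference; poses map joint name -> quaternion."""
    total = sum(weight * quaternion_angle(pose_a[joint], pose_b[joint])
                for joint, weight in joint_weights.items())
    return total / sum(joint_weights.values())

identity = (1.0, 0.0, 0.0, 0.0)
half_turn_z = (0.0, 0.0, 0.0, 1.0)  # 180 degrees about the z axis

# Hypothetical two-joint skeleton; the wrist (outer joint) is weighted higher.
weights = {"shoulder": 1.0, "wrist": 3.0}
pose_a = {"shoulder": identity, "wrist": identity}
pose_b = {"shoulder": identity, "wrist": half_turn_z}
distance = pose_distance(pose_a, pose_b, weights)
```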


Conclusions (Pose Search)

Pose search in 2D is possible over low resolution video using our new visual sentence descriptor

Visual sentences outperform explicit pose estimation based search on this footage

Pose search in 2D or 3D coupled with Motion Graphs enables interactive animated characters

Cultural heritage application to historic costume archives.


Contributors

John Collomosse, Reede Ren, Rui Hu, Stuart James, Qizhi Yu

Projects: Visual Sentences, ReEnact, iWeave

Thanks for your attention

J.Collomosse@surrey.ac.uk