
1John Collomosse

Visual Search of Dance Archives

John Collomosse

[email protected]

Centre for Vision, Speech and Signal Processing

University of Surrey

Oxford Robotics Group. May 2013.

Motivation: Add value to curated Dance collections

Dance archives are currently searchable by text (curated metadata).

What if you want to search on the content e.g. choreography itself?

Metadata

Research Landscape: Visual Search of Dance

iWeave: 3D Costume Archive

Pose driven visual search

Sketch driven Choreography

Pose driven visual search

Visual sentences for pose retrieval over low-resolution cross-media dance collections. R. Ren and J. Collomosse. IEEE Trans. Multimedia 14(6). Dec 2012.

UK-NRCD Archival Dance Footage

Digital Dance Archive (DDA) spanning ~100 years of UK dance history.

Videos transferred between several analogue formats prior to digitisation.

Grainy/noisy

Contrast bleaching

Blur / poor definition

Small performer e.g. 100 px

Illumination artifacts

Featureless background

Inter & intra-occlusion

http://www.dance-archives.ac.uk

Characterizing HPE on Archival Footage

Explicit Human Pose Estimation (HPE) fails on typical NRCD archival footage.

NRCD Footage

Eichner et al. [CVPR’09]

Andriluka et al. [CVPR’09]

Contributions

Cross-media pose retrieval on archival data

- Match pose implicitly rather than explicitly

New representation “Visual Sentences”

- Using Self-similarity (SSIM) and LDA

Built into Bag of Words framework

- With tweaks e.g. stop-word removal

Fusing Vision and Information Retrieval concepts

- Diversity re-ranking

Contact Sheets (Photos)

Performance videos

Performer Detection

Dalal/Triggs-like pedestrian detection [CVPR 2005]

Trained across six videos (~5hrs)

5k positive annotations. 5k negatives sampled randomly outside BBs.

Horizontal poses included but rotated (twice)

Output BBs rescaled to 64x128 for retrieval

Visual Sentence Representation

Based on Self-similarity (SSIM) descriptor*

1) Computes a correlation surface Sq local to (x,y) using SSD.

2) Bins Sq into a log-polar representation using local maxima.

3) Discards invalid (v. low/high variance) features.

* “Matching Local Self-similarities across Images and Video”. E. Shechtman and M. Irani. CVPR 2007.
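The three descriptor steps listed above can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the patch size, search radius, bin counts, `sigma` and the variance thresholds in `is_valid` are all assumed values, and the image is a plain list of rows of grey levels.

```python
import math

def ssd(img, y0, x0, y1, x1, half):
    """Sum of squared differences between two square patches of an image."""
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            diff = img[y0 + dy][x0 + dx] - img[y1 + dy][x1 + dx]
            total += diff * diff
    return total

def self_similarity(img, y, x, half=1, radius=4, n_rad=2, n_ang=4, sigma=1.0):
    """Log-polar self-similarity descriptor centred on pixel (y, x).

    The caller must keep (y, x) at least radius + half pixels from the border.
    """
    bins = [0.0] * (n_rad * n_ang)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = math.hypot(dy, dx)
            if r == 0 or r > radius:
                continue
            # Step 1: correlation surface value from the patch SSD.
            corr = math.exp(-ssd(img, y, x, y + dy, x + dx, half) / sigma)
            # Step 2: log-polar bin, keeping the maximum value per bin.
            rad_bin = min(n_rad - 1, int(n_rad * math.log1p(r) / math.log1p(radius)))
            angle = math.atan2(dy, dx) % (2 * math.pi)
            ang_bin = min(n_ang - 1, int(angle / (2 * math.pi) * n_ang))
            idx = rad_bin * n_ang + ang_bin
            bins[idx] = max(bins[idx], corr)
    return bins

def is_valid(desc, lo=1e-4, hi=0.2):
    """Step 3: reject descriptors whose variance is very low or very high
    (both thresholds are assumed values)."""
    mean = sum(desc) / len(desc)
    var = sum((v - mean) ** 2 for v in desc) / len(desc)
    return lo < var < hi
```

On a featureless region every patch matches equally well, so the descriptor carries no structure; that is the kind of feature the validity check is meant to discard.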

SSIM for Dance

Using a star ensemble, SSIM showcased* results including Dance pose detection.

* “Matching Local Self-similarities across Images and Video”. E. Shechtman and M. Irani. CVPR 2007.

1) Ensemble approach scales at best O(n)

- we need to search >>100k BBs

2) SSIM not characterized well for our data

- cross-domain, cross-performance

3) However most promising approach tested

- vs. SIFT/SURF, HOG, Shape Context

Figure panels: Query, CVPR’05, CVPR’07

Implicit Pose Representation

Self-similarity (SSIM) codebooked (HKM, hard-assignment), aggregated over scale
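A minimal sketch of hard-assignment codebooking follows. It assumes a flat nearest-codeword search in place of the hierarchical k-means (HKM) tree named on the slide, and it assumes that "aggregated over scale" means summing word counts from every scale into one histogram. The function names are hypothetical.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bovw_histogram(descriptors_by_scale, codebook):
    """Hard-assign every descriptor to one word; pool counts over all scales."""
    hist = [0] * len(codebook)
    for descriptors in descriptors_by_scale:
        for descriptor in descriptors:
            hist[nearest_word(descriptor, codebook)] += 1
    return hist
```

An HKM tree would replace the linear scan in `nearest_word` with a descent through cluster centres, which matters when the codebook is large.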


Representations and strategies (PSF1 of 4)

Pose similarity function (PSF) 1 serves as baseline – Multi-scale BoVW

Given a dataset of ROIs and a query ROI, evaluate the similarity for every dataset ROI and rank

where is the ith of n visual words present in

tf(.) yields word frequency within d, and |d| is word count
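The exact PSF1 expression did not survive on this slide, so the following is only a sketch of a baseline bag-of-visual-words ranking under stated assumptions: term frequency is the word count divided by |d|, and ROIs are compared by histogram intersection of their tf values. The intersection measure and the function names are assumptions, not the published formula.

```python
from collections import Counter

def tf(words):
    """Normalised term frequency: count of each word divided by |d|."""
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for an empty ROI
    return {word: count / total for word, count in counts.items()}

def similarity(query_words, doc_words):
    """Histogram intersection of tf values (an assumed similarity measure)."""
    tq, td = tf(query_words), tf(doc_words)
    return sum(min(freq, td.get(word, 0.0)) for word, freq in tq.items())

def rank(query_words, dataset):
    """Indices of dataset ROIs, most similar to the query first."""
    return sorted(range(len(dataset)),
                  key=lambda i: -similarity(query_words, dataset[i]))
```

For example, `rank([1, 2, 2, 3], [[1, 2, 2, 3], [4, 5], [2, 2]])` places the identical ROI first and the ROI sharing no words last.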

Representations and strategies (PSF2 of 4)

Pose similarity function (PSF) 2 is a variant of MS-BoVW that individually weights the importance of each layer (up to 5).

The layer weights are normalised and bootstrapped via SVM over a small training set of 50; in practice they show a roughly linear increase with finer scales.
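As a rough illustration of layer weighting, the sketch below combines per-layer similarity scores with normalised weights. The weights `[1, 2, 3, 4, 5]` in the usage note are a hypothetical stand-in for the SVM-bootstrapped values, chosen only to mimic the reported near-linear growth towards finer scales.

```python
def layered_similarity(per_layer_sims, weights):
    """Weighted sum of per-layer similarities, with weights normalised to 1."""
    total = sum(weights)
    return sum((w / total) * s for w, s in zip(weights, per_layer_sims))
```

With weights `[1, 2, 3, 4, 5]`, a match found only at the finest layer contributes 5/15 of the maximum score, while one found only at the coarsest layer contributes 1/15.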

Representations and strategies (PSF3 of 4)

Visual sentence (VS) representation encodes fine-scale features + structural context

Semantic body zones unlikely to map explicitly to regions in structural hierarchy.

Set of VS capture membership implicitly over latent variables

Topic discovery via LDA

Topic set learned via Gibbs over 1k training samples using 48 topics (c.f. Choreutics)

Variable length sentences padded

Spatial relationships implicitly encoded via context

Variable length

Representations and strategies (PSF4 of 4)

Explicit encoding of spatial relationships via a sliding window approach over both ROIs being compared

2 x 2 window (at coarsest level, i.e. 4 x 8 pixels); compare all VS within footprint

Similarity between window pair (50-100 VS)

Randomly sample 1/3 of VS in window pair and search for the pair of sentences minimising their difference ||.||, where ||.|| is a count of in-place differences between VS

Reminiscent of text passage retrieval
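The window comparison can be sketched as follows, assuming each visual sentence is a padded, equal-length list of word ids, so that the count of in-place differences is a Hamming distance. The sampling fraction follows the slide; the seeded random generator and function names are assumptions for illustration.

```python
import random

def hamming(sentence_a, sentence_b):
    """Count of in-place differences between two padded, equal-length sentences."""
    return sum(a != b for a, b in zip(sentence_a, sentence_b))

def window_distance(window_q, window_d, frac=1 / 3, rng=None):
    """Smallest sentence-to-sentence distance over a random sample of each window."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    k_q = min(len(window_q), max(1, round(len(window_q) * frac)))
    k_d = min(len(window_d), max(1, round(len(window_d) * frac)))
    sample_q = rng.sample(window_q, k_q)
    sample_d = rng.sample(window_d, k_d)
    return min(hamming(a, b) for a in sample_q for b in sample_d)
```

Sampling a third of each window reduces the number of sentence pairs compared to roughly a ninth of the exhaustive count, at the price of occasionally missing the best-matching pair.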

Comparative Results (PSF1-4)

Initial evaluation over 32 works over 4 cross-media collections

Video subsampled @ 5s = 6.3k video stills + 1.7k photos ~= 8k BBs

No stop-word identification at this stage

Independent treatment of scales significantly better

VS outperforming Layered by ~10%

PSF3 best (+4%) but drops sharply after 1k

Query set and Ground Truth

Mark-up task distributed over 3 professional archivists in UK-NRCD

65 queries – single BB (2/3 contact sheet photos, 1/3 video frames)

8k BB marked up as relevant/non-relevant with respect to each query

contact sheets video

Comparative Results (PSF1-4)

Effect of stop-word removal on BoVW codebook

Comparing best performing VS (PSF3) and Layered (PSF2) strategies.

Stop-word identification via freq. distribution under Bernoulli or Poisson model

Indicates PSF3 (LDA) over k=1000, with Bernoulli stop-word removal at 0.85
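The slide does not say how the 0.85 figure is applied. The sketch below assumes one plausible reading: under a Bernoulli model each visual word either occurs in an ROI or not, and a word is treated as a stop-word when the fraction of ROIs containing it exceeds the threshold. Both that reading and the function names are assumptions.

```python
def bernoulli_stop_words(docs, threshold=0.85):
    """Words whose occurrence probability across ROIs exceeds the threshold."""
    doc_freq = {}
    for doc in docs:
        for word in set(doc):  # presence/absence only, as in a Bernoulli model
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word for word, count in doc_freq.items() if count / len(docs) > threshold}

def remove_stop_words(doc, stop_words):
    """Drop stop-words from one ROI's word list, keeping the original order."""
    return [word for word in doc if word not in stop_words]
```

A Poisson variant would model how many times a word occurs per ROI rather than whether it occurs at all.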

Diversity Re-ranking (PSF5 = PSF3 re-ranked)

Direct presentation of results can lead to unsatisfactory visual repetition (e.g. temporally adjacent video frames)

Not ideal for archive discovery. A run of poor results can also reduce precision.

Re-rank via Kruskal clustering of affinity graph A of top n results (scope of DR)

A computed pairwise using PSF4 (sliding window approach)

Spanning trees iteratively identified in graph to form cluster set – each is ranked independently under the PSF3 score. Ranks merged.
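One way to picture the re-ranking step is the sketch below. It assumes Kruskal-style merging of the closest result pairs until `k` clusters remain, ranks each cluster by score, and merges the ranks round-robin. The stopping rule, the round-robin merge and the parameter `k` are assumptions; the slide only states that clusters are ranked independently and the ranks merged.

```python
def kruskal_clusters(n, dist, k):
    """Join the closest pairs first (Kruskal order) until k clusters remain."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    components = n
    for _, i, j in edges:
        if components <= k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def diversity_rerank(scores, dist, k):
    """Rank within each cluster by score, then interleave the clusters."""
    clusters = kruskal_clusters(len(scores), dist, k)
    for cluster in clusters:
        cluster.sort(key=lambda i: -scores[i])
    clusters.sort(key=lambda cluster: -scores[cluster[0]])
    merged, depth = [], 0
    while len(merged) < len(scores):
        for cluster in clusters:
            if depth < len(cluster):
                merged.append(cluster[depth])
        depth += 1
    return merged
```

Interleaving means the second result shown comes from a different cluster than the first, which is what breaks up runs of near-identical adjacent video frames.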

Results - Qualitative

Serendipitous recovery from failed BB isolation!

Results - Quantitative

Comparison vs. BoVW (single and multiple scales) and variants including SPK

Scaling the dataset

Initial dataset plus Siobhan Davies archive (200 videos, 562 contact sheets) ~= 68k BBs

Inverted index used for PSF1,2,3,5

Comparison to explicit HPE

Pictorial Structures Revisited [Andriluka ‘09]

Pose Search [Eichner ‘09]

Conclusions on Pose Search

SoA pose search relies on explicit HPE

- This is impractical on low-resolution, cross-domain footage.

Visual sentences + LDA (PSF3) reach ~32% MAP >> SoA

- Encode local appearance with a spatial context

- Sufficient level of abstraction to match diverse footage.

Diversity re-ranking improves results by ~4%

Query time <2s for 68k records

Given could pre-compute at this scale


ReEnact: Contributions

Major driver for the use of Dance Archives is the development of new choreography

ReEnact is a sketch based interface to the NRCD archives enabling this

Visual Narrative: A set of key-frame poses linked with gestures that describe a movement.

Conceptual extension of ‘storyboard’ sketches [Collomosse et al. ICCV 09]

Related Work

No prior work on sketch based pose retrieval

Several works on sketch based shape retrieval but these are aimed at inter- not intra-class variation.

A Performance Evaluation of the Gradient Field HOG Descriptor for Sketch based Image Retrieval. R. Hu and J. Collomosse. Computer Vision and Image Understanding (CVIU). February 2013.

ReEnact: Pose retrieval pipeline

Training: video parse and sketch parse of training pairs; learn the manifold mapping.

Query: sketch parse; map; geodesic k-NN over all video.

ReEnact: Sketch Parsing

Sketches are converted into stick figures (joint angle representation)

1. Ellipse detection for head

2. Torso detection

- Proximity to extreme points of other strokes

- Centre of mass

3. Intersections with torso are potential limbs

4. Heuristics select limb pairs for arms/legs

5. User may manipulate left/right labellings as these are ambiguous in sketch

Skeletons from Sketches of Dancing Poses. M. Fonseca, S. James and J. Collomosse. Proc. VL/HCC. Nov 2012.

ReEnact: Performer Extraction

Extracting a silhouette of the performer within the bounding box

Saliency

FG/BG Texton

Motion diff.

MRF / Solve

Unary: weighted sum of three fields

Pairwise: standard Boykov’01 term

ReEnact: Descriptor Formation

Skeleton -> Joint angle representation

Silhouette -> Gridded Zernike moments

Concatenate 22-D moments over a 2x2 grid

Affine invariant (each cell)

Match


Learning a mapping between the manifolds

Geodesic distance as a shortest-path over the graph
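Geodesic distance over a pose graph can be sketched with Dijkstra's algorithm. The adjacency-dictionary layout and node names are assumptions for illustration; the edge weights would be descriptor distances between neighbouring poses.

```python
import heapq

def geodesic(graph, source, target):
    """Shortest-path length between two nodes of a weighted graph.

    graph maps each node to a dict of {neighbour: non-negative edge weight}.
    Returns float('inf') when the target cannot be reached.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return float("inf")
```

Two poses that are far apart in descriptor space can still be close geodesically if a chain of intermediate poses connects them, which is the reason for measuring distance along the graph.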

Constructing the Graph (G) in space D

Around 150 training pairs of sketches and video frames are gathered to seed the graph

Training frames:

Training weights:

Test video frames are subsequently indexed by extending the graph with new poses

- attached to nearest N training nodes

- N=1 for unconfident frames, >1 for confident frames

- Confidence determined by temporal coherence (covariance) of descriptors

Domain Transfer S -> D

Given a training sketch s we can now infer similarity to any video pose in D

So given an arbitrary query q, and assuming local linearity in S:

nx – candidate video pose

nd – connection into D from S

a,b pairs of nodes on shortest path through D

Retrieval Results

Trained on 150 frames, tested over ~6k. AP @ [1,80] averaged over 6 queries.

- Training (Blueprint) MAP 60%

- Test (ThreeD) MAP 47%
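For reference, mean average precision over the top 80 ranks can be computed as in the sketch below. It assumes AP is normalised by the number of relevant results found within the cutoff, which is one common convention; the slides do not state which normalisation was used.

```python
def average_precision(relevant_flags, cutoff=80):
    """AP over the top `cutoff` results; flags are True for relevant results."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(relevant_flags[:cutoff], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs, cutoff=80):
    """Mean of the per-query AP values."""
    return sum(average_precision(run, cutoff) for run in runs) / len(runs)
```

A ranking with relevant results at ranks 1 and 3 scores (1/1 + 2/3) / 2, so early mistakes cost more than late ones.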

Choreography Synthesis

Web UI for generating visual narratives via sketch / semantic label annotation

For free:

Can run inference backward from D->S to produce stick men from video.

Useful for visualizing / exploring alternative retrieval results

Video Synthesis

Inspired by Video Textures [Schodl’00] (a video form of Motion Graph [Kovar’02])

e.g.

time

Video Path Optimization

The motion graph is formed by identifying transitions between frame pairs

Pose similarity via our geodesic distance

Down-weighted by poor optical flow correspondence [Brox’04]

Low-pass filtered to encourage motion coherence

Video Path Optimization

The motion graph is duplicated and linked via “virtual” nodes (sketched poses)

Video Path Optimization

Shortest path across the graph is a function of three costs:

Pose similarity (fidelity to visual narrative, visual smoothness)

Gesture similarity: mean gesture similarity over path; sliding window SVM trained for gesture recognition (black box)

Duration of sequence: count of frames along path; penalise deviation from an idealised duration (user specified with action labels).
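The cost equations are not legible on this slide, so the following is only a sketch of how three such terms might be combined into one path cost. The linear combination, the weights and the exact form of each term are assumptions; only the three ingredients (pose cost, mean gesture similarity, deviation from an ideal frame count) come from the slide.

```python
def path_cost(pose_costs, gesture_sims, ideal_frames,
              w_pose=1.0, w_gesture=1.0, w_duration=1.0):
    """Combined cost of one candidate path through the motion graph.

    pose_costs   - per-frame pose dissimilarity along the path
    gesture_sims - per-frame gesture similarity in [0, 1] (e.g. classifier scores)
    ideal_frames - desired number of frames for the sequence
    """
    n_frames = len(pose_costs)
    pose_term = sum(pose_costs)
    gesture_term = 1.0 - sum(gesture_sims) / len(gesture_sims)
    duration_term = abs(n_frames - ideal_frames) / ideal_frames
    return w_pose * pose_term + w_gesture * gesture_term + w_duration * duration_term
```

Raising `w_duration` favours paths close to the requested length even when they match the sketched poses less well.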

Video Synthesis: Results

Representative run for a 3 stage visual narrative over Three-D

Gradient domain compositing used against an “infinite background”

Video composited

ReEnact: Conclusion

Sketch based pose search using a learnable piecewise linear manifold mapping

Temporally coherent pose descriptor based on gridded Zernike moments

47% MAP on unseen video

Visual narratives to generate archival choreography

Motion graph optimization fusing pose/action cost

Future work

Improve compositing of the performer

Unwanted scale changes due to BB detection

Alternative ways to specify intermediate gestures


Research Landscape: Visual Search of Dance

Pose driven visual search

Sketch driven Choreography

iWeave: 3D Costume Archive

iWeave – Interactive Wearable Archive

Ongoing project enabling users to experience costume and choreography from circa 1920s

Captured dance performance in 3D studio.

Create an animated character that is interactively controlled by a human using Microsoft Kinect.

iWeave – Performance Capture (Raw)

Daffodil dress from Natural Movement collection (1920s)

iWeave – Performance Capture (4D Video)

Daffodil dress from Natural Movement collection (1920s)


iWeave – 4D Mesh and Skeleton Estimation


Interactive Animation

Match

Now showing

Target pose

Motion graph

Performance display

iWeave: Interactive Animation

Pose similarity - joint angles

- Quaternions

- Weighted to outer joints

Path search as ReEnact

Random walk when idle
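A joint-angle pose distance over quaternions might look like the sketch below. It assumes unit quaternions in (w, x, y, z) order, measures the rotation angle between corresponding joints, and takes a weighted mean. The weights in the usage note are hypothetical values that favour outer joints.

```python
import math

def quat_angle(q1, q2):
    """Rotation angle in radians between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))  # abs: q and -q are the same rotation
    return 2.0 * math.acos(min(1.0, dot))

def pose_distance(pose_a, pose_b, weights):
    """Weighted mean joint rotation difference between two poses."""
    total = sum(w * quat_angle(qa, qb) for w, qa, qb in zip(weights, pose_a, pose_b))
    return total / sum(weights)
```

For instance, weighting a wrist joint three times as heavily as the spine makes a 90 degree wrist difference dominate the distance, in line with the emphasis on outer joints.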

Conclusions (Pose Search)

Pose search in 2D possible over low resolution video using our new visual sentence descriptor

Visual sentences outperform explicit pose estimation based search on this footage

Pose search in 2D or 3D coupled with Motion Graphs enables interactive animated characters

Cultural heritage application to historic costume archives.

Contributors

John Collomosse, Reede Ren, Rui Hu, Stuart James, Qizhi Yu

Projects: Visual Sentences, ReEnact, iWeave

Thanks for your attention

[email protected]
