CERTH at MediaEval 2015
Synchronization of Multi-User Event
Media Task
Konstantinos Apostolidis, Vasileios Mezaris Information Technologies Institute / Centre for Research and Technology Hellas
Overview
• Aims and objectives
• Proposed approach
• Experimental setup
• Experiments and results
• Conclusions & Future Work
Aims and objectives
• People attending large events collect dozens of photos and
video clips with their smartphones, tablets and cameras.
• Complications:
– The time information may be wrong.
– Geolocation may be missing.
• SEM Task Challenge: align and present the media galleries of
different users in a consistent way, so as to preserve the
temporal evolution of the event.
Proposed Method
1. Media Similarity Assessment: Assess the
similarity of media items (photos, videos and
audio files) using multiple similarity measures.
2. Media Similarity Combination: Combine all
similarity measures to construct a global
similarity matrix.
3. Temporal Synchronization: Very similar media
define links between different collections.
Create a graph with galleries as nodes and those
links as edges, and traverse it starting from the
reference gallery.
4. Sub-event Clustering: Cluster media into
sub-events using the corrected timestamps,
geolocation information and DCNN features.
1. Media Similarity Assessment
• Photo Similarity Measures:
1. Geometric Consistency of Local Features (GC): We check the geometric
consistency of SIFT keypoints for each pair of photos, using geometric
coding. The GC similarity can discover near-duplicate photos.
2. Scene Similarity (S): We calculate the pairwise cosine distances between
the GIST descriptors extracted from each photo. High S similarity indicates
photos captured in similar scenery (indoor, urban, nature).
3. Color Allocation Similarity (CA): We divide each image into three equal, non-
overlapping horizontal strips and extract the HSV histogram of each. We
calculate the pairwise cosine distances between the concatenated
HSV histograms. High CA similarity indicates photos with similar colors.
4. DCNN Concept Scores (DCS): We use the Caffe DCNN framework and the GoogLeNet
pre-trained model to extract concept scores for photos. We use the
Euclidean distance to calculate pairwise distances between the concept-score
vectors of photos. High DCS similarity indicates semantically similar photos.
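The Color Allocation measure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin count and the use of cosine similarity on L2-normalized histograms are assumptions.

```python
import numpy as np

def color_allocation_similarity(hsv_a, hsv_b, bins=8):
    """Color Allocation (CA) similarity sketch between two images.

    Each image (H x W x 3 array, HSV values in [0, 1]) is divided into
    three equal, non-overlapping horizontal strips; an HSV histogram is
    extracted per strip and the histograms are concatenated.  The CA
    similarity is the cosine similarity of the concatenated histograms.
    The bin count is a placeholder, not a value from the paper.
    """
    def descriptor(hsv):
        strips = np.array_split(hsv, 3, axis=0)  # three horizontal strips
        hists = [np.histogramdd(s.reshape(-1, 3),
                                bins=(bins, bins, bins),
                                range=((0, 1),) * 3)[0].ravel()
                 for s in strips]
        d = np.concatenate(hists)
        return d / (np.linalg.norm(d) or 1.0)

    return float(np.dot(descriptor(hsv_a), descriptor(hsv_b)))
```

Identical images yield a similarity of 1, and since histograms are non-negative the value always lies in [0, 1].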
• Video Similarity Measures:
– We construct a video barcode for each video clip (construction
illustrated in the slide figures).
– We perform Normalized Cross Correlation of the video barcodes.
• Audio Similarity Measures:
– We reduce the sampling rate of the audio files to 11 kHz.
– We perform Normalized Cross Correlation on the raw audio data.
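The audio step above can be sketched as a peak-picking normalized cross-correlation; this is a minimal illustration (the paper's exact normalization and lag handling are not specified), assuming both signals are already resampled to the common 11 kHz rate.

```python
import numpy as np

def audio_ncc_peak(x, y):
    """Peak normalized cross-correlation of two raw audio signals.

    Returns the peak NCC value and the lag (in samples) at which it
    occurs; identical, aligned signals give a peak of 1.0 at lag 0.
    """
    x = (x - x.mean()) / (x.std() * len(x))
    y = (y - y.mean()) / y.std()
    corr = np.correlate(x, y, mode="full")
    lag = int(np.argmax(corr)) - (len(y) - 1)
    return float(corr.max()), lag
```

The lag at the correlation peak is what a synchronization step would convert into a temporal offset (lag / sampling rate, in seconds).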
2. Media Similarity Combination
• We combine the information of the similarity measures for
photos using the following procedure:
– Initially, the similarity O(i,j) of photos i and j is set equal to GC(i,j).
– If S(i,j) > ts and S(i,j) > GC(i,j), then O(i,j) is updated as O(i,j) = S(i,j).
– The same update process is subsequently repeated using the CA
and DCS similarities (and the respective thresholds tc, td).
– We weigh each similarity value so that the similarity of photos whose
capture locations are closer than a threshold m is emphasized.
– The threshold m is calculated by fitting a Gaussian mixture model
of two Gaussian distributions to the histogram of all photos' pairwise
capture-location distances. The Gaussian distribution with the lowest
mean (m) presumably signifies photos captured in the same sub-
event.
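The update procedure above can be sketched as follows; the threshold values and the emphasis weight are placeholders, not the values used in the actual runs.

```python
import numpy as np

def combine_photo_similarities(GC, S, CA, DCS, dist, ts, tc, td, m, w=2.0):
    """Sketch of the photo similarity combination step.

    Starts from the geometric-consistency matrix GC and lets each of
    the other measures (S, CA, DCS) overwrite an entry when it exceeds
    both its own threshold (ts, tc, td) and the current value; finally,
    pairs whose capture-location distance is below m are emphasized by
    a weight w (w = 2.0 is a placeholder, not a value from the paper).
    """
    O = GC.copy()
    for M, t in ((S, ts), (CA, tc), (DCS, td)):
        mask = (M > t) & (M > O)   # measure passes its threshold AND improves
        O[mask] = M[mask]
    O[dist < m] *= w               # emphasize spatially close pairs
    return O
```

Because each measure only overwrites entries it strictly improves, the result is an element-wise "best passing measure" matrix rather than an average.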
2. Media Similarity Combination
• All photo similarity measures information is combined into a
single similarity matrix.
• If different media types are present (audio or video) the
respective similarity matrices are normalized and added into
the same similarity matrix.
• Media that:
1. exhibit similarity above a threshold t, and
2. belong to different user galleries,
are treated as potential links between these galleries.
• The threshold t is computed empirically from the training
datasets.
3. Temporal Synchronization
• Having identified potential links, we construct a weighted
graph with:
– nodes representing the galleries,
– edges representing the links between galleries,
– weights calculated as the sum of similarities of the media linking the
two galleries.
• We compute the temporal offset of each gallery by traversing
the minimum spanning tree (MST) of the galleries graph.
3. Temporal Synchronization
• Two approaches to traversing the graph:
1. MSTt:
a) Start from the reference gallery node.
b) Select the next node of the MST.
c) Compute the temporal offset of the node as the median of the
capture-time differences of the pairs of similar media.
d) Repeat for all nodes in the MST.
2. MSTx:
a) Detect fully-connected triplets of nodes and average the offset of the
shortest path with that of the alternative path in each triplet.
b) Follow the procedure of the MSTt method on the averaged offsets.
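The MSTt traversal above can be sketched as a breadth-first walk over the spanning tree; the input structures here (adjacency dict, per-link lists of capture-time differences) are hypothetical representations, not the paper's data structures.

```python
from collections import deque
from statistics import median

def propagate_offsets(mst, time_diffs, reference):
    """MSTt-style offset propagation (sketch with assumed inputs).

    mst        : adjacency dict of the minimum spanning tree,
                 {gallery: [neighbour, ...]}.
    time_diffs : {(a, b): [capture-time differences, t_b - t_a in
                 seconds, of the similar media pairs linking a and b]}.
    Returns the temporal offset of every gallery relative to the
    reference gallery, accumulating the median link difference
    along the tree.
    """
    offsets = {reference: 0.0}
    queue = deque([reference])
    while queue:
        a = queue.popleft()
        for b in mst[a]:
            if b in offsets:           # already synchronized
                continue
            diffs = time_diffs.get((a, b)) or [-d for d in time_diffs[(b, a)]]
            offsets[b] = offsets[a] + median(diffs)
            queue.append(b)
    return offsets
```

Using the median of the per-link time differences makes each edge's offset robust to a few mismatched media pairs.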
3. Temporal Synchronization
• Comparison of the MSTt and MSTx traversal methods (illustrated in the slide figure).
4. Sub-event Clustering
• Two different approaches for sub-event clustering:
1. MPC sub-event clustering approach:
a) Split the media timeline where consecutive photos have a temporal
distance above the mean of all temporal distances.
b) Within each cluster, all media whose capture-location distance
exceeds m are split off into a different cluster. The process is
repeated until no further splitting occurs.
c) Merge neighboring clusters where the maximum intra-cluster
temporal distances of the media in each of these clusters are above the
minimum inter-cluster temporal distance.
Intra-cluster distances are the pairwise distances between all the
media in a cluster; inter-cluster distances are the pairwise distances
between the media of one cluster and those of another. (Illustrated in
the slide figure with clusters X and Y.)
4. Sub-event Clustering
d) Merge neighboring clusters where the maximum intra-cluster
spatial distances of media in each of these clusters are above the
minimum inter-cluster spatial distance.
e) For neighboring clusters in which none of the media carry
geolocation information, merging continues by checking whether the
pairwise inter-cluster concept-score distance is above an empirically
set threshold.
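Step (a) of the MPC approach can be sketched as a simple gap-based split of the sorted timeline; this is a minimal illustration of that one step, not the full MPC pipeline.

```python
def split_timeline(timestamps):
    """Split a sorted media timeline wherever the gap between
    consecutive items exceeds the mean of all gaps (MPC step (a))."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = sum(gaps) / len(gaps)
    clusters, current = [], [timestamps[0]]
    for gap, t in zip(gaps, timestamps[1:]):
        if gap > mean_gap:         # unusually large gap: start a new cluster
            clusters.append(current)
            current = []
        current.append(t)
    clusters.append(current)
    return clusters
```

The subsequent MPC steps would then refine these initial temporal clusters using the geolocation and concept-score criteria described above.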
4. Sub-event Clustering
2. APC sub-event clustering approach:
a) Augment the DCNN feature vectors with the normalized time
information.
b) Cluster the media using Affinity Propagation.
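The APC approach can be sketched as follows; the min-max time normalization and the `time_weight` balance factor are assumptions for illustration, not values from the paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def apc_clusters(dcnn_features, timestamps, time_weight=1.0):
    """APC sketch: append the min-max-normalized capture time to each
    DCNN feature vector, then cluster with Affinity Propagation."""
    t = np.asarray(timestamps, dtype=float)
    t = (t - t.min()) / ((t.max() - t.min()) or 1.0)   # normalize time to [0, 1]
    X = np.hstack([np.asarray(dcnn_features), time_weight * t[:, None]])
    return AffinityPropagation(random_state=0).fit_predict(X)
```

A convenient property of Affinity Propagation here is that the number of sub-events need not be fixed in advance; it emerges from the data.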
Experimental setup
• We selected the Tour De France 2014 (TDF14), The NAMM
Show 2015 (NAMM15) and the Salford Test Shoot (SAL)
datasets to evaluate our method.
• For Temporal Synchronization we used the following
evaluation metrics:
– Precision (the percentage of synchronized galleries).
– Accuracy (the average temporal offset calculated over the
synchronized collections).
• For Sub-event Clustering we used the F-Score metric.
Experiments and results
Time Synchronization and Sub-event Clustering Evaluation:

Dataset   Run          Precision   Accuracy   F-Score
NAMM15    MSTt + APC   0.8333      0.9083     0.241
          MSTt + MPC   0.8333      0.9083     0.3658
          MSTx + APC   0.8333      0.9083     0.241
          MSTx + MPC   0.8333      0.9083     0.3658
TDF14     MSTt + APC   0.125       0.8446     0.1134
          MSTt + MPC   0.125       0.8446     0.0167
          MSTx + APC   0.125       0.8446     0.1134
          MSTx + MPC   0.125       0.8446     0.0167
SAL       MSTt + APC   0.4242      0.9998     0.1229
          MSTt + MPC   0.4242      0.9998     0.164
          MSTx + APC   0.4242      0.9998     0.1229
          MSTx + MPC   0.4242      0.9998     0.164
Experiments and results
• Our method achieved very good accuracy but managed to
synchronize only a small number of galleries, particularly in the
TDF14 dataset.
• The offset averaging of the MSTx method takes place only if the
difference of the two paths' offsets is lower than a maxDiff
threshold. In our experiments we set maxDiff = 10, i.e. we perform
the averaging only if the offsets of the two paths in a triplet
differ by less than 10 seconds. The MSTt and MSTx methods
performed the same because maxDiff was set too low and
allowed only minute adjustments, degenerating the MSTx
method into MSTt.
Conclusions & Future Work
• Better fine-tuning of the algorithm parameters is required to
achieve consistently good performance on diverse datasets.
• Extend the algorithm with automatic parameter selection
(e.g. select more links between galleries to improve
precision), and experiment with different values of the maxDiff
threshold.
• Perform cross correlation on extracted audio and video
features rather than directly on the raw data.
• Apply a more sophisticated method to combine the different
similarity measures.
Thank you for your attention!
Questions?
More information and contact:
Dr. Vasileios Mezaris
http://www.iti.gr/~bmezaris