CERTH at MediaEval 2015
Synchronization of Multi-User Event
Media Task
Konstantinos Apostolidis, Vasileios Mezaris Information Technologies Institute / Centre for Research and Technology Hellas
Overview
• Aims and objectives
• Proposed approach
• Experimental setup
• Experiments and results
• Conclusions & Future Work
Aims and objectives
• People attending large events collect dozens of photos and
video clips with their smartphones, tablets and cameras.
• Complications:
– The time information may be wrong.
– Geolocation may be missing.
• SEM Task Challenge: align and present the media galleries of
different users in a consistent way, so as to preserve the
temporal evolution of the event.
Proposed Method
1. Media Similarity Assessment: Assess the
similarity of media items (photos, videos and
audio files) using multiple similarity measures.
2. Media Similarity Combination: Combine all
similarity measures to construct a global
similarity matrix.
3. Temporal Synchronization: Very similar media
define links between different collections.
Create a graph with galleries as nodes and those
links as edges, and traverse it starting from the
reference gallery.
4. Sub-event Clustering: Cluster media into
sub-events using the corrected timestamps,
geolocation information and DCNN features.
1. Media Similarity Assessment
• Photo Similarity Measures:
1. Geometric Consistency of Local Features (GC): We check the geometric
consistency of SIFT keypoints for each pair of photos, using geometric
coding. The GC similarity can discover near-duplicate photos.
2. Scene Similarity (S): We calculate the pairwise cosine distances between
the GIST descriptors extracted from each photo. High S similarity indicates
photos captured in similar scenery (indoor, urban, nature).
3. Color Allocation Similarity (CA): We divide each image into three equal, non-
overlapping horizontal strips and extract the HSV histogram of each. We
calculate the pairwise cosine distances between the concatenated
HSV histograms. High CA similarity indicates photos with similar colors.
4. DCNN Concept Scores (DCS): We use the Caffe DCNN framework and the GoogLeNet
pre-trained model to extract concept scores for photos. We use the
Euclidean distance to calculate pairwise distances between the concept-score
vectors of photos. High DCS similarity indicates semantically similar photos.
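The Color Allocation measure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin count and the use of cosine similarity on L2-normalized histograms are assumptions.

```python
import numpy as np

def color_allocation_similarity(hsv_a, hsv_b, bins=8):
    """Color Allocation (CA) similarity sketch between two images.

    Each image (H x W x 3 array, HSV values in [0, 1]) is divided into
    three equal, non-overlapping horizontal strips; an HSV histogram is
    extracted per strip and the histograms are concatenated.  The CA
    similarity is the cosine similarity of the concatenated histograms.
    The bin count is a placeholder, not a value from the paper.
    """
    def descriptor(hsv):
        strips = np.array_split(hsv, 3, axis=0)  # three horizontal strips
        hists = [np.histogramdd(s.reshape(-1, 3),
                                bins=(bins, bins, bins),
                                range=((0, 1),) * 3)[0].ravel()
                 for s in strips]
        d = np.concatenate(hists)
        return d / (np.linalg.norm(d) or 1.0)

    return float(np.dot(descriptor(hsv_a), descriptor(hsv_b)))
```

Identical images yield a similarity of 1, and since histograms are non-negative the value always lies in [0, 1].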
• Video Similarity Measures:
– We construct a video barcode for each video clip (construction
illustrated in the slide figures).
– We perform Normalized Cross Correlation of the video barcodes.
• Audio Similarity Measures:
– We reduce the sampling rate of the audio files to 11 kHz.
– We perform Normalized Cross Correlation on the raw audio data.
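The audio step above can be sketched as a peak-picking normalized cross-correlation; this is a minimal illustration (the paper's exact normalization and lag handling are not specified), assuming both signals are already resampled to the common 11 kHz rate.

```python
import numpy as np

def audio_ncc_peak(x, y):
    """Peak normalized cross-correlation of two raw audio signals.

    Returns the peak NCC value and the lag (in samples) at which it
    occurs; identical, aligned signals give a peak of 1.0 at lag 0.
    """
    x = (x - x.mean()) / (x.std() * len(x))
    y = (y - y.mean()) / y.std()
    corr = np.correlate(x, y, mode="full")
    lag = int(np.argmax(corr)) - (len(y) - 1)
    return float(corr.max()), lag
```

The lag at the correlation peak is what a synchronization step would convert into a temporal offset (lag / sampling rate, in seconds).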
2. Media Similarity Combination
• We combine the information of the similarity measures for
photos using the following procedure:
– Initially, the similarity O(i,j) of photos i and j is set equal to GC(i,j).
– If S(i,j) > ts and S(i,j) > GC(i,j), then O(i,j) is updated as O(i,j) = S(i,j).
– The same update process is subsequently repeated using the CA
and DCS similarities (and the respective thresholds tc, td).
– We weigh each similarity value so that the similarity of photos whose
capture locations are closer than a threshold m is emphasized.
– The threshold m is calculated by fitting a Gaussian mixture model
of two Gaussian distributions to the histogram of all photos' pairwise
capture-location distances. The Gaussian distribution with the lowest
mean (m) presumably signifies photos captured in the same sub-
event.
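The update procedure above can be sketched as follows; the threshold values and the emphasis weight are placeholders, not the values used in the actual runs.

```python
import numpy as np

def combine_photo_similarities(GC, S, CA, DCS, dist, ts, tc, td, m, w=2.0):
    """Sketch of the photo similarity combination step.

    Starts from the geometric-consistency matrix GC and lets each of
    the other measures (S, CA, DCS) overwrite an entry when it exceeds
    both its own threshold (ts, tc, td) and the current value; finally,
    pairs whose capture-location distance is below m are emphasized by
    a weight w (w = 2.0 is a placeholder, not a value from the paper).
    """
    O = GC.copy()
    for M, t in ((S, ts), (CA, tc), (DCS, td)):
        mask = (M > t) & (M > O)   # measure passes its threshold AND improves
        O[mask] = M[mask]
    O[dist < m] *= w               # emphasize spatially close pairs
    return O
```

Because each measure only overwrites entries it strictly improves, the result is an element-wise "best passing measure" matrix rather than an average.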
2. Media Similarity Combination
• All photo similarity measures information is combined into a
single similarity matrix.
• If different media types are present (audio or video) the
respective similarity matrices are normalized and added into
the same similarity matrix.
• Media that:
1. exhibit similarity above a threshold t, and
2. belong to different user galleries,
are treated as potential links between these galleries.
• The threshold t is computed empirically from the training
datasets.
3. Temporal Synchronization
• Having identified potential links, we construct a weighted
graph with:
– nodes representing the galleries,
– edges representing the links between galleries,
– weights calculated as the sum of similarities of the media linking the
two galleries.
• We compute the temporal offset of each gallery by traversing
the minimum spanning tree (MST) of the galleries graph.
3. Temporal Synchronization
• Two approaches to traversing the graph:
1. MSTt:
a) Start from the reference gallery node.
b) Select the next node of the MST.
c) Compute the temporal offset of the node as the median of the
capture-time differences of the pairs of similar media.
d) Repeat for all nodes in the MST.
2. MSTx:
a) Detect fully-connected triplets of nodes and average the offset of the
shortest path with that of the alternative path in each triplet.
b) Follow the procedure of the MSTt method on the averaged offsets.
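The MSTt traversal above can be sketched as a breadth-first walk over the spanning tree; the input structures here (adjacency dict, per-link lists of capture-time differences) are hypothetical representations, not the paper's data structures.

```python
from collections import deque
from statistics import median

def propagate_offsets(mst, time_diffs, reference):
    """MSTt-style offset propagation (sketch with assumed inputs).

    mst        : adjacency dict of the minimum spanning tree,
                 {gallery: [neighbour, ...]}.
    time_diffs : {(a, b): [capture-time differences, t_b - t_a in
                 seconds, of the similar media pairs linking a and b]}.
    Returns the temporal offset of every gallery relative to the
    reference gallery, accumulating the median link difference
    along the tree.
    """
    offsets = {reference: 0.0}
    queue = deque([reference])
    while queue:
        a = queue.popleft()
        for b in mst[a]:
            if b in offsets:           # already synchronized
                continue
            diffs = time_diffs.get((a, b)) or [-d for d in time_diffs[(b, a)]]
            offsets[b] = offsets[a] + median(diffs)
            queue.append(b)
    return offsets
```

Using the median of the per-link time differences makes each edge's offset robust to a few mismatched media pairs.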
3. Temporal Synchronization
• Comparison of the MSTt and MSTx traversal methods (illustrated in the slide figure).
4. Sub-event Clustering
• Two different approaches for sub-event clustering:
1. MPC sub-event clustering approach:
a) Split the media timeline where consecutive photos have a temporal
distance above the mean of all temporal distances.
b) Within each cluster, all media whose capture-location distance
exceeds m are split off into a different cluster. The process is
repeated until no further splitting occurs.
c) Merge neighboring clusters where the maximum intra-cluster
temporal distances of the media in each of these clusters are above the
minimum inter-cluster temporal distance.
Intra-cluster distances are the pairwise distances between all the
media in a cluster; inter-cluster distances are the pairwise distances
between the media of one cluster and those of another. (Illustrated in
the slide figure with clusters X and Y.)
4. Sub-event Clustering
d) Merge neighboring clusters where the maximum intra-cluster
spatial distances of media in each of these clusters are above the
minimum inter-cluster spatial distance.
e) For neighboring clusters in which none of the media carry
geolocation information, merging continues by checking whether the
pairwise inter-cluster concept-score distance is above an empirically
set threshold.
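Step (a) of the MPC approach can be sketched as a simple gap-based split of the sorted timeline; this is a minimal illustration of that one step, not the full MPC pipeline.

```python
def split_timeline(timestamps):
    """Split a sorted media timeline wherever the gap between
    consecutive items exceeds the mean of all gaps (MPC step (a))."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = sum(gaps) / len(gaps)
    clusters, current = [], [timestamps[0]]
    for gap, t in zip(gaps, timestamps[1:]):
        if gap > mean_gap:         # unusually large gap: start a new cluster
            clusters.append(current)
            current = []
        current.append(t)
    clusters.append(current)
    return clusters
```

The subsequent MPC steps would then refine these initial temporal clusters using the geolocation and concept-score criteria described above.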
4. Sub-event Clustering
2. APC sub-event clustering approach:
a) Augment the DCNN feature vectors with the normalized time
information.
b) Cluster the media using Affinity Propagation.
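The APC approach can be sketched as follows; the min-max time normalization and the `time_weight` balance factor are assumptions for illustration, not values from the paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def apc_clusters(dcnn_features, timestamps, time_weight=1.0):
    """APC sketch: append the min-max-normalized capture time to each
    DCNN feature vector, then cluster with Affinity Propagation."""
    t = np.asarray(timestamps, dtype=float)
    t = (t - t.min()) / ((t.max() - t.min()) or 1.0)   # normalize time to [0, 1]
    X = np.hstack([np.asarray(dcnn_features), time_weight * t[:, None]])
    return AffinityPropagation(random_state=0).fit_predict(X)
```

A convenient property of Affinity Propagation here is that the number of sub-events need not be fixed in advance; it emerges from the data.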
Experimental setup
• We selected the Tour De France 2014 (TDF14), The NAMM
Show 2015 (NAMM15) and the Salford Test Shoot (SAL)
datasets to evaluate our method.
• For Temporal Synchronization we used the following
evaluation metrics:
– Precision (the percentage of synchronized galleries).
– Accuracy (the average temporal offset calculated over the
synchronized collections).
• For Sub-event Clustering we used the F-Score metric.
Experiments and results
Time Synchronization and Sub-event Clustering Evaluation:

Dataset   Run          Precision   Accuracy   F-Score
NAMM15    MSTt + APC   0.8333      0.9083     0.241
          MSTt + MPC   0.8333      0.9083     0.3658
          MSTx + APC   0.8333      0.9083     0.241
          MSTx + MPC   0.8333      0.9083     0.3658
TDF14     MSTt + APC   0.125       0.8446     0.1134
          MSTt + MPC   0.125       0.8446     0.0167
          MSTx + APC   0.125       0.8446     0.1134
          MSTx + MPC   0.125       0.8446     0.0167
SAL       MSTt + APC   0.4242      0.9998     0.1229
          MSTt + MPC   0.4242      0.9998     0.164
          MSTx + APC   0.4242      0.9998     0.1229
          MSTx + MPC   0.4242      0.9998     0.164
Experiments and results
• Our method achieved very good accuracy but managed to
synchronize only a small number of galleries, particularly in the
TDF14 dataset.
• The offset averaging of the MSTx method takes place only if the
difference of the two paths' offsets is lower than a maxDiff
threshold. In our experiments we set maxDiff = 10, i.e. we perform
the averaging only if the offsets of the two paths in a triplet
differ by less than 10 seconds. The MSTt and MSTx methods
performed the same because maxDiff was set too low and
allowed only minute adjustments, degenerating the MSTx
method into MSTt.
Conclusions & Future Work
• Better fine-tuning of the algorithm parameters is required to
achieve consistently good performance on diverse datasets.
• Extend the algorithm with automatic parameter selection
(e.g. select more links between galleries to improve
precision), and experiment with different values of the maxDiff
threshold.
• Perform cross correlation on extracted audio and video
features rather than directly on the raw data.
• Apply a more sophisticated method to combine the different
similarity measures.
Thank you for your attention!
Questions?
More information and contact:
Dr. Vasileios Mezaris
http://www.iti.gr/~bmezaris