Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Giorgos Kordopatis-Zilos¹, Adrian Popescu², Symeon Papadopoulos¹ and Yiannis Kompatsiaris¹

¹ Information Technologies Institute (ITI), CERTH, Greece
² CEA LIST, 91190 Gif-sur-Yvette, France

MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

Summary

Tag-based location estimation (1 run)
• Built upon the scheme of our 2015 participation (Kordopatis-Zilos et al., MediaEval 2015)
• Based on a refined probabilistic Language Model

Visual-based location estimation (1 run)
• Extract PCA-reduced VGG features to compute image similarities
• Geospatial clustering scheme of the most visually similar images

Hybrid location estimation (3 runs)
• Combination of the textual and visual approaches using a set of rules

Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in test set (≈40M geotagged items)
• External data derived from gazetteers, i.e. Geonames and OpenStreetMap

G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015

Tag-based location estimation

• Processing steps of the approach
– Offline: language model construction
– Online: location estimation


Pre-processing

• Tags and titles of the training set items are processed
• Apply
– URL decoding
– lowercase transformation
– tokenization
• Remove
– accents
– symbols
– punctuation
• Multi-word tags are split into their individual terms, which are also included in the item's term set
• Discard terms that are numeric or shorter than three characters
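
The pre-processing steps above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and it assumes that spaces inside tags are encoded as `+` and that a multi-word tag is kept alongside its individual words.

```python
import re
import unicodedata
from urllib.parse import unquote_plus


def strip_accents(text):
    """Drop combining marks after Unicode decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(text):
    """URL-decode, lowercase, remove accents, symbols and punctuation."""
    text = strip_accents(unquote_plus(text).lower())
    text = re.sub(r"[^\w\s]|_", " ", text)  # symbols and punctuation become spaces
    return re.sub(r"\s+", " ", text).strip()


def keep(term):
    """Discard numeric terms and terms shorter than three characters."""
    return len(term) >= 3 and not term.isdigit()


def item_terms(tags, title):
    """Build the term set of one item from its tags and title."""
    terms = set()
    for tag in tags:
        cleaned = clean(tag)
        words = cleaned.split()
        if len(words) > 1:
            terms.add(cleaned)  # assumption: the multi-word tag itself is kept
        terms.update(words)     # its individual terms are added as well
    terms.update(clean(title).split())
    return {t for t in terms if keep(t)}
```

For example, `item_terms(["Eiffel+Tower", "caf%C3%A9", "2016"], "A day in Paris!")` keeps the multi-word tag, its two words, the accent-free `cafe`, and the title words of at least three characters, while `2016` is dropped.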

Language Model (LM)

• LM generation scheme
– divide the earth's surface into rectangular cells with a side length of 0.01°
– calculate term-cell probabilities

• LM-based estimation
– the Most Likely Cell (MLC) is the cell with the highest probability and is used to produce the estimation

Inspired by (Popescu, MediaEval 2013)

A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013
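
A minimal sketch of the language model construction and the Most Likely Cell lookup is given below. The slide does not state how term-cell probabilities are computed, so the sketch assumes a user-count estimate (the fraction of a term's distinct users who used it inside the cell) and assumes that the cell score is the sum of the probabilities of the query terms. All names are hypothetical.

```python
import math
from collections import defaultdict

CELL_SIDE = 0.01  # cell side length in degrees, as on the slide


def cell_of(lat, lon, side=CELL_SIDE):
    """Map a coordinate pair to the integer index of its grid cell."""
    return (math.floor(lat / side), math.floor(lon / side))


def build_language_model(items, side=CELL_SIDE):
    """items: iterable of (user_id, lat, lon, terms).

    Returns {term: {cell: probability}}. Assumed estimate: distinct users of
    the term in the cell divided by all distinct users of the term.
    """
    users_in_cell = defaultdict(lambda: defaultdict(set))
    users_of_term = defaultdict(set)
    for user, lat, lon, terms in items:
        cell = cell_of(lat, lon, side)
        for term in terms:
            users_in_cell[term][cell].add(user)
            users_of_term[term].add(user)
    return {
        term: {cell: len(users) / len(users_of_term[term])
               for cell, users in cells.items()}
        for term, cells in users_in_cell.items()
    }


def most_likely_cell(model, terms):
    """Return the cell with the highest summed term-cell probability, or None."""
    scores = defaultdict(float)
    for term in terms:
        for cell, prob in model.get(term, {}).items():
            scores[cell] += prob
    return max(scores, key=scores.get) if scores else None
```

Counting users rather than items limits the influence of a single prolific uploader, which is one plausible reason for this kind of estimate; the slide itself does not say which estimate is used.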

Feature Selection and Weighting

Feature Selection
• Calculate term locality using a grid of 0.01°×0.01°
• When a user uses a given term, he/she is assigned to the entire cell neighborhood instead of a unique cell
• Terms with non-zero locality score form the term set T

Feature Weighting
• Locality weight function, based on the term's relative position in T
• Spatial Entropy weight function, a Gaussian function based on the term's spatial entropy
• Linear combination of the two weights
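
The weighting scheme can be illustrated with the sketch below. Several details are assumptions, since the slide gives no formulas: the locality scores are taken as given, the locality weight is a simple rank-based ratio, the Gaussian is unnormalised and centred on the mean entropy of the selected terms, and the mixing parameter `omega` is a hypothetical name with an arbitrary default.

```python
import math


def spatial_entropy(cell_probs):
    """Shannon entropy of a term's probability distribution over cells."""
    return -sum(p * math.log(p) for p in cell_probs if p > 0)


def locality_weights(locality):
    """Select terms with non-zero locality and weight them by rank in T.

    The most local term gets 1.0, the least local one gets 1/|T|.
    """
    ranked = sorted((t for t, s in locality.items() if s > 0),
                    key=lambda t: locality[t], reverse=True)
    n = len(ranked)
    return {t: (n - rank) / n for rank, t in enumerate(ranked)}


def entropy_weights(entropies):
    """Gaussian weight around the mean entropy (assumed centre and width)."""
    if not entropies:
        return {}
    values = list(entropies.values())
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values)) or 1.0
    return {t: math.exp(-((e - mu) ** 2) / (2 * sigma ** 2))
            for t, e in entropies.items()}


def combined_weights(locality, entropies, omega=0.5):
    """Linear combination of the spatial entropy and locality weights."""
    w_loc = locality_weights(locality)
    w_ent = entropy_weights({t: entropies[t] for t in w_loc})
    return {t: omega * w_ent[t] + (1 - omega) * w_loc[t] for t in w_loc}
```

A term with zero locality never enters the result, which mirrors the selection rule on the slide; how the locality score itself is computed from the cell neighborhoods is not reproduced here.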

Refinements

• Multiple Grids
– build an additional LM using a finer grid (cell side length of 0.001°)
– combine the MLC of the individual language models

• Similarity search (Van Laere et al., ICMR 2011)
– determine the most similar training images in the MLC
– their center-of-gravity is the final location estimation

From: (Kordopatis-Zilos et al., PAISI 2015)

G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015
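
The two refinements can be sketched as follows. The slide names neither the similarity measure nor the rule for combining the grids, so this sketch assumes Jaccard similarity over term sets, an unweighted center-of-gravity, and a combination rule that prefers the fine-grid cell only when it lies inside the coarse-grid MLC. Function names and the default `k` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def center_of_gravity(points):
    """Unweighted mean of (lat, lon) pairs; adequate inside a small cell."""
    lats, lons = zip(*points)
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def refine_in_cell(query_terms, cell_items, k=5):
    """cell_items: list of (terms, lat, lon) for training items in the MLC.

    Returns the center-of-gravity of the k most similar items, or None.
    """
    ranked = sorted(cell_items,
                    key=lambda item: jaccard(query_terms, item[0]),
                    reverse=True)
    top = ranked[:k]
    if not top:
        return None
    return center_of_gravity([(lat, lon) for _, lat, lon in top])


def combine_grids(coarse_cell, fine_cell, ratio=10):
    """Assumed rule: use the fine cell only if it falls inside the coarse MLC.

    Cells are integer index pairs; ratio is coarse side / fine side.
    """
    if fine_cell is None:
        return coarse_cell, "coarse"
    parent = (fine_cell[0] // ratio, fine_cell[1] // ratio)
    return (fine_cell, "fine") if parent == coarse_cell else (coarse_cell, "coarse")
```

With side lengths of 0.01° and 0.001°, `ratio=10` maps a fine cell index to the coarse cell that contains it.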

Visual-based location estimation

Main Objectives
• Ensure that the visual features are generic and transferable
• Provide a compact representation of the features

Model building
• CNN features extracted by fine-tuning the VGG model
• Training: ~5K Points Of Interest (POIs), over 7M Flickr images collected using queries with:
– the POI name and a radius of 5km around its coordinates
– the POI name and the associated city name
• Outputs of the fc7 layer (4096d) compressed to 128d using PCA, learned on a subset of 250,000 training images
• Similarity search based on the PCA-reduced CNN features

O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM
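
The PCA compression and the similarity search over the reduced features can be sketched with NumPy as below. Feature extraction from the fine-tuned network is out of scope here, so random vectors stand in for fc7 outputs in any usage. L2 normalisation and cosine similarity are assumptions, as the slide does not name the similarity measure, and the function names are hypothetical.

```python
import numpy as np


def fit_pca(features, n_components):
    """Learn a PCA projection from a (n_samples, n_dims) feature matrix."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Project features onto the PCA basis and L2-normalise each row."""
    reduced = (features - mean) @ components.T
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.maximum(norms, 1e-12)


def most_similar(query, index, k):
    """Return indices and cosine similarities of the k nearest index rows."""
    sims = index @ query
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```

In the setting described on the slide, `fit_pca` would be called with 4096-dimensional features and `n_components=128`, fitted on the training subset and then applied to every training and test image.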

Visual-based location estimation

Location Estimation
• Geospatial clustering of the most visually similar images
• The largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate

Visual Confidence
• The confidence metric for the visual estimation is based on the size of the largest cluster
• It depends on two quantities:
– the number of neighbors in the largest cluster of image i
– a configuration parameter that sets the "strictness" of the confidence score

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015
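
The visual estimation step can be sketched as below. The clustering procedure and the confidence formula are not recoverable from the slide, so both are assumptions: an incremental clustering with a distance threshold (`radius_km`), and a confidence that grows linearly with the size of the largest cluster above a strictness threshold (`n_threshold`). All names and defaults are hypothetical.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def incremental_clusters(points, radius_km=1.0):
    """Assign each point to the nearest cluster within radius_km, else start one."""
    clusters = []
    for p in points:
        best, best_d = None, radius_km
        for cluster in clusters:
            d = min(haversine_km(p, q) for q in cluster)
            if d <= best_d:
                best, best_d = cluster, d
        if best is None:
            clusters.append([p])
        else:
            best.append(p)
    return clusters


def visual_estimate(points, radius_km=1.0):
    """Return the centroid of the largest cluster and that cluster's size."""
    clusters = incremental_clusters(points, radius_km)
    largest = max(clusters, key=len)  # max keeps the first cluster on ties
    lat = sum(p[0] for p in largest) / len(largest)
    lon = sum(p[1] for p in largest) / len(largest)
    return (lat, lon), len(largest)


def visual_confidence(n_largest, k, n_threshold):
    """Assumed form: 0 at or below the threshold, 1 when all k neighbors agree."""
    return max(0.0, (n_largest - n_threshold) / (k - n_threshold))
```

Here `points` are the locations of the most visually similar training images, in order of decreasing similarity, so that ties between equally large clusters resolve to the one containing the more similar images first.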

Hybrid-based location estimation

• A set of rules determines the source of the estimation between the text and visual approaches
• The visual estimation is chosen when:
→ no estimation could be produced by the text approach
→ the visual estimation fell inside the borders of the MLC
→ a comparison of the text and visual confidence scores favors the visual estimation
• Otherwise the text estimation is selected
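
The rule set above can be written as a short decision function. This is a sketch under assumptions: the confidence comparison is taken to be a strict "visual greater than text" test, the MLC is an integer cell index on the 0.01° grid, and the function and parameter names are hypothetical.

```python
import math


def in_cell(point, cell, side=0.01):
    """True if a (lat, lon) point falls inside the given grid cell."""
    return (math.floor(point[0] / side), math.floor(point[1] / side)) == cell


def hybrid_estimate(text_est, text_conf, mlc, visual_est, visual_conf, side=0.01):
    """Pick between the text and visual estimates; returns (estimate, source)."""
    if visual_est is None:
        return text_est, "text"
    if text_est is None:
        return visual_est, "visual"      # rule 1: no text estimation available
    if mlc is not None and in_cell(visual_est, mlc, side):
        return visual_est, "visual"      # rule 2: visual estimate inside the MLC
    if visual_conf > text_conf:
        return visual_est, "visual"      # rule 3: assumed confidence comparison
    return text_est, "text"              # otherwise keep the text estimation
```

The second rule treats the visual estimate as a refinement: when both approaches agree on the cell, the visual one can only move the estimate within that cell.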

Runs and Results

RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
RUN-E: Visual-based location estimation + entire YFCC dataset

[Results figure: Images (RUN-1 to RUN-5 and RUN-E)]

[Results figure: Videos (RUN-1 to RUN-5)]

References

G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015

G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015

A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015

O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM

Thank you!

Data/Code:
– https://github.com/MKLab-ITI/multimedia-geotagging/

Get in touch:
– Giorgos Kordopatis-Zilos: [email protected]
– Symeon Papadopoulos: [email protected] / @sympap

With the support of:
