Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features
Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1 and Yiannis Kompatsiaris1
1 Information Technologies Institute (ITI), CERTH, Greece
2 CEA LIST, 91190 Gif-sur-Yvette, France
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
Summary
Tag-based location estimation (1 run)
• Built upon the scheme of our 2015 participation (Kordopatis-Zilos et al., MediaEval 2015)
• Based on a refined probabilistic Language Model
Visual-based location estimation (1 run)
• Extract PCA-reduced VGG features to compute image similarities
• Geospatial clustering scheme of the most visually similar images
Hybrid location estimation (3 runs)
• Combination of the textual and visual approaches using a set of rules
Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in test set (≈40M geotagged items)
• External data derived from gazetteers, i.e. Geonames and OpenStreetMap
Tag-based location estimation
• Processing steps of the approach
– Offline: language model construction
– Online: location estimation
Pre-processing
• Tags and titles of the training set items are processed
• Apply
– URL decoding
– lowercase transformation
– tokenization
• Remove
– accents
– symbols
– punctuation
• Multi-word tags are split into their individual terms, which are also included in the item's term set
• Discard terms that are purely numeric or shorter than three characters
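The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `tokenize` and `preprocess` are hypothetical, and the choice to keep the cleaned multi-word tag alongside its individual terms is an assumption based on the wording of the slide.

```python
import re
import unicodedata
from urllib.parse import unquote_plus


def strip_accents(text):
    """Drop combining marks after Unicode decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(raw):
    """URL-decode, lowercase, strip accents, replace symbols/punctuation, split."""
    text = strip_accents(unquote_plus(raw).lower())
    return re.sub(r"[^\w\s]|_", " ", text).split()


def preprocess(tags, title=""):
    """Return the term set of one item (hypothetical helper)."""
    candidates = set()
    for tag in tags:
        words = tokenize(tag)
        candidates.update(words)
        if len(words) > 1:
            # Assumption: the multi-word tag is kept next to its individual terms.
            candidates.add(" ".join(words))
    candidates.update(tokenize(title))
    # Discard purely numeric terms and terms shorter than three characters.
    return {t for t in candidates if len(t) >= 3 and not t.isdigit()}


terms = preprocess(["Eiffel%20Tower", "caf%C3%A9", "2016", "NY"], "Paris, France!")
```

Here `terms` contains the cleaned tags and title words, while "2016" and "ny" are filtered out.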
Language Model (LM)
• LM generation scheme
– divide the earth's surface into rectangular cells with a side length of 0.01°
– calculate term-cell probabilities
• LM-based estimation
– the Most Likely Cell (MLC) is the cell with the highest probability; it is used to produce the estimation
Inspired by (Popescu, MediaEval 2013)
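A small sketch may help make the two phases concrete. The probability formula is not preserved in this transcript, so the code below assumes a user-count estimate (the share of a term's users that fall in each cell) and scores a cell by summing the probabilities of the query terms. The names `cell_of`, `build_language_model` and `most_likely_cell` are hypothetical.

```python
import math
from collections import defaultdict

CELL_SIDE = 0.01  # cell side length in degrees, as on the slide


def cell_of(lat, lon, side=CELL_SIDE):
    """Map coordinates to the integer index of their grid cell."""
    return (math.floor(lat / side), math.floor(lon / side))


def build_language_model(items, side=CELL_SIDE):
    """items: iterable of (user_id, lat, lon, terms).

    Assumption: p(term | cell) is the fraction of the term's users in that cell.
    """
    users = defaultdict(lambda: defaultdict(set))  # term -> cell -> user ids
    for user, lat, lon, terms in items:
        cell = cell_of(lat, lon, side)
        for term in terms:
            users[term][cell].add(user)
    model = {}
    for term, cells in users.items():
        total = sum(len(u) for u in cells.values())
        model[term] = {cell: len(u) / total for cell, u in cells.items()}
    return model


def most_likely_cell(model, terms):
    """Return the cell with the highest summed term probability, or None."""
    scores = defaultdict(float)
    for term in terms:
        for cell, prob in model.get(term, {}).items():
            scores[cell] += prob
    return max(scores, key=scores.get) if scores else None


training = [
    ("u1", 48.8584, 2.2945, {"eiffel", "paris"}),
    ("u2", 48.8580, 2.2950, {"eiffel"}),
    ("u3", 48.8606, 2.3376, {"louvre", "paris"}),
]
model = build_language_model(training)
mlc = most_likely_cell(model, {"eiffel", "paris"})
```

Counting users rather than raw items is one way to limit the influence of bulk uploads by a single user; whether the original model does exactly this is not stated on the slide.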
Feature Selection and Weighting
Feature Selection
• Calculate each term's locality using a grid of 0.01°×0.01° cells
• When a user uses a given term, he/she is assigned to the entire cell neighborhood instead of a unique cell
• Terms with a non-zero locality score form the term set T
Feature Weighting
• Locality weight function: based on the term's relative position in T
• Spatial Entropy weight function: a Gaussian function of the term's spatial entropy
• Linear combination of the two weights
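The weighting scheme can be illustrated with a short sketch. The exact formulas are not preserved in this transcript, so everything below is an assumption: the locality weight is taken as a rank-based value in (0, 1], the entropy weight as an unnormalised Gaussian centred on the mean entropy of all terms, and `omega` is a hypothetical mixing parameter.

```python
import math


def spatial_entropy(cell_probs):
    """Shannon entropy of a term's distribution over cells."""
    return -sum(p * math.log(p) for p in cell_probs.values() if p > 0)


def entropy_weights(entropies):
    """Gaussian of each term's entropy around the mean entropy (assumed form)."""
    values = list(entropies.values())
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var) or 1.0  # avoid division by zero when all are equal
    return {t: math.exp(-((e - mu) ** 2) / (2 * sigma ** 2))
            for t, e in entropies.items()}


def locality_weights(locality):
    """Rank terms by locality; the most local term gets weight 1 (assumed form)."""
    ranked = sorted(locality, key=locality.get, reverse=True)
    n = len(ranked)
    return {t: (n - i) / n for i, t in enumerate(ranked)}


def combined_weights(locality, entropies, omega=0.5):
    """Linear combination of the two weights; omega is a hypothetical parameter."""
    w_loc = locality_weights(locality)
    w_ent = entropy_weights(entropies)
    return {t: omega * w_loc[t] + (1 - omega) * w_ent[t] for t in locality}


locality = {"eiffel": 10.0, "paris": 5.0, "photo": 1.0}
entropies = {"eiffel": 0.0, "paris": 1.0, "photo": 2.0}
weights = combined_weights(locality, entropies)
```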
Refinements
• Multiple grids
– build an additional LM using a finer grid (cell side length of 0.001°)
– combine the MLCs of the individual language models
• Similarity search (Van Laere et al., ICMR 2011)
– determine the most similar training images in the MLC
– their center-of-gravity is the final location estimation
From: (Kordopatis-Zilos et al., PAISI 2015)
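The similarity search refinement could look roughly like the sketch below. The slide does not state the similarity measure or the number of neighbours, so Jaccard similarity over term sets and `k=5` are assumptions, and `refine_in_cell` is a hypothetical name. The center-of-gravity is approximated by the arithmetic mean of the coordinates, which is reasonable inside a cell this small.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refine_in_cell(query_terms, cell_items, k=5):
    """cell_items: list of (lat, lon, terms) for training items inside the MLC.

    Returns the mean position of the k most similar items, or None when no
    item shares a term with the query.
    """
    scored = [(jaccard(query_terms, terms), lat, lon)
              for lat, lon, terms in cell_items]
    scored.sort(key=lambda s: s[0], reverse=True)
    top = [s for s in scored[:k] if s[0] > 0]
    if not top:
        return None
    lat = sum(s[1] for s in top) / len(top)
    lon = sum(s[2] for s in top) / len(top)
    return (lat, lon)


cell_items = [
    (10.000, 20.000, {"a", "b"}),
    (10.002, 20.004, {"a"}),
    (10.009, 20.009, {"z"}),
]
refined = refine_in_cell({"a", "b"}, cell_items, k=2)
```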
Visual-based location estimation
Main Objectives
• Ensure that the visual features are generic and transferable
• Provide a compact representation of the features
Model building
• CNN features extracted by fine-tuning the VGG model
• Training: ~5K Points Of Interest (POIs), over 7M Flickr images collected using queries with:
– the POI name and a radius of 5km around its coordinates
– the POI name and the associated city name
• Compressed outputs of fc7 layer (4096d) to 128d using PCA, learned on a subset of 250,000 training images
• Similarity search based on the PCA-reduced CNN features
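As an illustration of the final step, the sketch below runs a brute-force nearest-neighbour search over already reduced feature vectors. The slides do not name the similarity measure, so cosine similarity is an assumption here, and `top_k_similar` is a hypothetical helper. Real vectors would have 128 dimensions; the toy example uses 3 for readability.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_similar(query, index, k=20):
    """index: dict image_id -> reduced feature vector.

    Returns the k (image_id, cosine similarity) pairs with highest similarity.
    """
    q = l2_normalize(query)
    scored = []
    for image_id, vec in index.items():
        v = l2_normalize(vec)
        scored.append((image_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
neighbours = top_k_similar([1.0, 0.0, 0.0], index, k=2)
```

A linear scan like this only suits small collections; at the scale of millions of training images an approximate index would normally replace it.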
Visual-based location estimation
Location Estimation
• Geospatial clustering of the most visually similar images
• The largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate
Visual Confidence
• The confidence metric for the visual estimation is based on the size of the largest cluster
• It depends on two quantities:
– the number of neighbors in the largest cluster of image i
– a configuration parameter that sets the "strictness" of the confidence score
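The clustering and confidence steps can be sketched as follows. The clustering algorithm and the confidence formula are not preserved in this transcript, so both are assumptions: a simple incremental clustering with a hypothetical 1 km radius, and a confidence that grows linearly with the size of the largest cluster once it exceeds a hypothetical threshold `n_t`.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cluster_neighbours(points, radius_km=1.0):
    """Assign each point to the closest existing cluster within radius_km,
    otherwise start a new cluster (assumed scheme)."""
    clusters = []
    for p in points:
        best, best_d = None, radius_km
        for cluster in clusters:
            d = min(haversine_km(p, q) for q in cluster)
            if d <= best_d:
                best, best_d = cluster, d
        if best is None:
            clusters.append([p])
        else:
            best.append(p)
    return clusters


def estimate_location(points, radius_km=1.0, n_t=5):
    """Return (centroid of largest cluster, confidence in [0, 1])."""
    if not points:
        return None, 0.0
    clusters = cluster_neighbours(points, radius_km)
    largest = max(clusters, key=len)  # max() keeps the first cluster on ties
    lat = sum(p[0] for p in largest) / len(largest)
    lon = sum(p[1] for p in largest) / len(largest)
    k = len(points)
    # Assumed confidence form: 0 below the threshold n_t, 1 when all k agree.
    conf = max(0.0, (len(largest) - n_t) / (k - n_t)) if k > n_t else 0.0
    return (lat, lon), conf


neighbour_coords = [
    (48.8580, 2.2940), (48.8581, 2.2941), (48.8579, 2.2939), (40.0, -74.0),
]
location, confidence = estimate_location(neighbour_coords, n_t=1)
```

A larger `n_t` makes the score stricter, since more neighbours must agree before the confidence rises above zero.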
Hybrid-based location estimation
• A set of rules determines the source of the estimation between the text and visual approaches
• The visual estimation is chosen when:
→ no estimation could be produced by the text approach
→ the visual estimation falls inside the borders of the MLC
→ the comparison of the text and visual confidence scores favors the visual estimation
• Otherwise the text estimation is selected
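The rule set above could be expressed as a small decision function. This is only a sketch under stated assumptions: the three conditions are treated as alternatives, the confidence comparison is taken to be "visual strictly greater than text", and `select_estimate` with its parameters is a hypothetical interface.

```python
def select_estimate(text_est, visual_est, mlc_bounds=None,
                    conf_text=0.0, conf_visual=0.0):
    """Pick between the text and visual estimates.

    text_est, visual_est: (lat, lon) tuples or None.
    mlc_bounds: (lat_min, lat_max, lon_min, lon_max) of the MLC, or None.
    Returns (estimate, source) where source is "text" or "visual".
    """
    if text_est is None:
        return visual_est, "visual"
    if visual_est is None:
        return text_est, "text"
    if mlc_bounds is not None:
        lat_min, lat_max, lon_min, lon_max = mlc_bounds
        inside = (lat_min <= visual_est[0] < lat_max
                  and lon_min <= visual_est[1] < lon_max)
        if inside:
            return visual_est, "visual"
    # Assumption: the visual estimate wins only with strictly higher confidence.
    if conf_visual > conf_text:
        return visual_est, "visual"
    return text_est, "text"


mlc = (48.85, 48.86, 2.29, 2.30)
choice = select_estimate((48.855, 2.295), (48.858, 2.294), mlc_bounds=mlc)
```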
Runs and Results
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
RUN-E: Visual-based location estimation + entire YFCC dataset
Images
Runs and Results
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
Videos
References
G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. Socialsensor at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015.
G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015.
A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.
Thank you!
Data/Code:
– https://github.com/MKLab-ITI/multimedia-geotagging/
Get in touch:
– Giorgos Kordopatis-Zilos: [email protected]
– Symeon Papadopoulos: [email protected] / @sympap