CVPR 2010 Tutorial: Video Search Engines
TRANSCRIPT
Cees Snoek & Arnold Smeulders, University of Amsterdam
14‐6‐2010
Video Search Engines ‐ CVPR 2010 1
Video Search Engines
Cees G.M. Snoek & Arnold W.M. Smeulders
Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands
A brief history of television
• From broadcasting to narrowcasting (~1955, ~1985, ~2005)
• …to thincasting (~2008, ~2010)
The international business case
• Everybody with a message uses video for delivery
• Growing, unmanageable amounts of video
Example from YouTube
• YouTube users are uploading 24 hours of video every minute
[Chart: hours of video uploaded to YouTube per minute]
Example from the Netherlands
• Yearly ingest
  – 15,000 hours of video
  – 40,000 hours of radio
• Next 6 years
  – 137,200 hours of video
  – 22,510 hours of film
  – 123,900 hours of audio
  – 2,900,000 photos
• Europe's largest digitization project
  – >1 petabyte per year
Lack of metadata
Expert-driven search
http://e-culture.multimedian.nl
Crowd-given search
What others say is in the video.
Raw-driven search
www.science.uva.nl/research/isla
MultimediaN project
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
Problem 1: Variation in appearance
So many images of one thing, due to minor differences in: illumination, background, occlusion, viewpoint, …
• This is the sensory gap
Problem 2: What defines things?
[Figure: raw bit streams from multimedia archives versus the concepts humans recognize in them: suit, basketball, table, tree, US flag, building, machine, aircraft, dog, tennis, mountain, fire, …]
Problem 3: Many things in the world
• This is the model gap
Problem 4: Vocabulary problem
• Query modalities: query-by-keyword, query-by-concept, query-by-example, query-by-humming, query-by-gesture
• Any combination, any sequence?
• Example query: "Find shots of people shaking hands"
• This is the query-context gap
Problem 5: Use is open-ended
• This is the interface gap
[Figure: interface dimensions such as screen, scope, and keywords]
Conclusion on problems
• Video search is a diverse and challenge-rich research topic
  – Sensory gap
  – Semantic gap
  – Model gap
  – Query-context gap
  – Interface gap
Today's promise
• You will be acquainted with the theory and practice of the concept-based video search paradigm.
• You will be able to recall the five major scientific problems in video retrieval and explain and evaluate the present-day solutions.
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
A million appearances
There are a million appearances to one concept
Where are the patterns (of the same shoe)?
Invariance: the need for ~
Somewhere the variance must be removed. The illumination and the viewing direction are removed as soon as the image has entered.
Common transformations
• Illumination transformations: contrast, intensity and shadow, color
• Viewpoint: rotation and lateral, distance, viewing angle, projection
• Cover: specular or matte
• Occlusion & clutter, wear & tear, aging, night & day, and so on into increasingly complex transformations
More than one transformation
Features of selected points may be good enough to describe the object, iff the selection & the feature set are both invariant for scene-accidental conditions.
Gevers, TIP 2000
Design of invariants: Orbits
For a property variant under W, observations of a constant property are spread over the orbit. The purpose of an invariant is to capture all of the orbit into one value.
Example: invariance
Slide credit: Theo Gevers
The (R,G,B)-space projects to (c1, c2, c3)-space:

c1(R,G,B) = arctan( R / max{G, B} )
c2(R,G,B) = arctan( G / max{R, B} )
c3(R,G,B) = arctan( B / max{R, G} )
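The slides contain no code; as a minimal sketch (NumPy is my choice, and the `eps` guard against division by zero on black pixels is my addition), the c1, c2, c3 invariants can be computed per pixel like this:

```python
import numpy as np

def color_invariants(rgb):
    """Compute the c1, c2, c3 color invariants of the slide above.

    rgb: float array of shape (..., 3) with R, G, B in the last axis.
    Each channel is divided by the max of the other two before the
    arctan, which cancels a global intensity scaling of the pixel.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-10  # guard against division by zero on black pixels
    c1 = np.arctan(R / (np.maximum(G, B) + eps))
    c2 = np.arctan(G / (np.maximum(R, B) + eps))
    c3 = np.arctan(B / (np.maximum(R, G) + eps))
    return np.stack([c1, c2, c3], axis=-1)
```

Multiplying a pixel by a constant (a pure intensity change) leaves the three values unchanged, which is exactly the invariance the slide claims.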
Color invariance
Gevers, PRL 1999; Geusebroek, PAMI 2001

Invariance (+) or sensitivity (-) to scene-accidental conditions:

            shadows   shading   highlights   ill. intensity   ill. color
  E            -         -          -              -               -
  W            -         +          -              +               -
  C            +         +          -              +               -
  M            +         +          -              +               +
  N (σ = 1)    +         +          -              +               +
  N (σ = 2)    +         +          +              +               +
  L            +         +          +              +               -
  H            +         +          +              +               -

Retained from 1000 colours:

        σ = 1   σ = 2
  H      300     315
  N      550     900
  C      600     850
  W      900     995
  E      950     990
Local shape motivation
• Perceptual importance
• Concise data
• Robust to occlusion & clutter
Tuytelaars FTCGV 2008
Meet the Gaussians
• Taylor expansion at x
• Robust additive differentials
• For a discretely sampled signal, use the Gaussians
• Dimensions separable
• No maxima introduced
Meet the Gaussians (cont'd)
The basic video observables are local receptive fields f(x): the receptive fields up to first order, for grey value as well as opponent color sets.
Taxonomy of image structure
Slide credit: Theo Gevers
[Figure: image structures such as corner, junction, T-junction, and highlight]
Meet Gabor
The 2D Gabor function is:

h(x, y) = 1/(2π σ_x σ_y) · exp( -1/2 ( x²/σ_x² + y²/σ_y² ) ) · exp( j 2π (u x + v y) )

Tuning parameters: u, v, σ; the usual invariants follow by combination.
Manjunath and Ma on Gabor for texture, as seen in F-space.
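A minimal NumPy sketch of the equation above (my simplification: an isotropic σ instead of separate σ_x, σ_y) builds the complex Gabor kernel directly:

```python
import numpy as np

def gabor_kernel(u, v, sigma, size=21):
    """Complex 2-D Gabor filter: a Gaussian envelope times a complex
    sinusoid with spatial frequency (u, v). `size` is the kernel
    width in pixels; isotropic sigma is a simplification of the
    separate sigma_x, sigma_y on the slide.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    carrier = np.exp(2j * np.pi * (u * x + v * y))
    return envelope * carrier
```

With u = v = 0 the carrier is 1, so the kernel degenerates to a plain Gaussian that sums to roughly one, a handy check of the normalization.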
Local receptive fields F(x)
The receptive fields for (u, v) measured locally.
Grey value as well as opponent color sets.
Hoang 2003
Gabor filters: texture
Hoang, ECCV 2002
Original image K-means clustering Segmentation
Gabor filters: texture (cont'd)
Local receptive field in f(x, t)
Gaussian equivalent over x and t: zero order, first order over t
Burghouts TIP 2006
Receptive fields: overview
All observables up to first-order color, second-order spatial scales, eight frequency bands & first order in time.
Good observables > easy algorithms
Periodicity: detect periodic motion by one steered filter. Deadly simple algorithm…
Burghouts TIP 2006
Meet the Loweans
So far we paid respect to the spatial order. Now we will weakly follow the spatial order and form histograms on all directions we encounter locally, better known as (the second part of) SIFT.
Lowe IJCV 2004
Meet the Loweans (cont'd)
• 4 x 4 gradient window after thresholding
• Histogram of 4 x 4 samples per window in 8 directions
• Gaussian weighting around center (σ is 1/2 of keypoint σ)
• 4 x 4 x 8 = 128-dimensional feature vector
Image: Jonas Hurrelmann
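The 4 x 4 x 8 layout above can be sketched in NumPy. This is deliberately simplified (real SIFT adds Gaussian weighting, trilinear interpolation, and clipped renormalization at 0.2, all omitted here); it only shows how the 128 dimensions arise:

```python
import numpy as np

def sift_like_descriptor(patch):
    """Sketch of the SIFT descriptor layout: split a 16x16 patch
    into a 4x4 grid of cells and histogram gradient directions
    into 8 bins per cell, giving 4*4*8 = 128 dimensions.
    """
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)           # [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros(128)
    for cy in range(4):
        for cx in range(4):
            cell = np.s_[cy * 4:(cy + 1) * 4, cx * 4:(cx + 1) * 4]
            hist = np.bincount(bins[cell].ravel(),
                               weights=mag[cell].ravel(), minlength=8)
            desc[(cy * 4 + cx) * 8:(cy * 4 + cx) * 8 + 8] = hist
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc
```

A textureless patch yields the zero vector; any other patch yields a unit-norm 128-dimensional descriptor.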
SIFT detection
Slide credit: Jepson 2005
Enriching SIFT (in a nutshell)
Mikolajczyk, IJCV 2005; Ke, CVPR 2004; Van de Sande, PAMI 2010
• Affine SIFT
  – Choose prominent direction in SIFT
• PCA-SIFT
  – Robust and compact representation
• ColorSIFT
  – Add several invariant color descriptors
• TimeSIFT
  – Anyone?
Quality of properties
Original, blurring, JPEG, illumination direction, viewpoint, spectrum
1 in 1000 Harris patches
In the experiment, manual matching of the Harris points.
Quality of properties
Quality of properties
Good invariants are very powerful.
Tracking invariant appearance
Nguyen PAMI 2003
Where to sample in the video?
• A video shot is a set of frames representing a continuous camera action in time and space
  – Analysis typically on a single key frame per shot
Shot Key Frame
Where to sample in the frame?
Tuytelaars 2008 FTCGV
Where to sample? Context
What is the object in the middle?
No segmentation… no pixel values of the object…
Context in codebooks
[Figure: each image region is described by its similarity to K prototype textures, e.g. sky, grass, and road]
Features: Weibull textures
Interest point examples
Mikolajczyk, CVPR 2006; van de Weijer, PAMI 2006
Original image Harris Laplacian Color salient points
Dense sampling example
Fast dense descriptors
Uijlings et al, CIVR 2009
Image Patch
Reuse sub-regions: 16x speed-up
Conclusion on measuring features
• Invariance is crucial when designing features
  – More invariance means less stability…
  – …but more robustness to the sensory gap
• Effective features strike a balance between invariance and discriminatory power
  – And for video search, efficiency is helpful too…
And there is always more…
For example:
Local Invariant Feature Detectors: A Survey
Tinne Tuytelaars & Krystian Mikolajczyk
FTCGV 3:3 (177-280)
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
The semantic gap
"The semantic gap is the lack of coincidence between the information that one can extract from the sensory data and the interpretation that the same data has for a user in a given situation."
Arnold Smeulders, PAMI, 2000
The science of labeling
• To understand anything in science, things need a name that is universally recognized
• Worldwide endeavor in naming visual information: living organisms, chemical elements, the human genome, text 'categories'
How difficult is the problem?
• Human vision consumes 50% of brain power…
Van Essen, Science 1992
Semantic concept detection
• The patient approach
  – Building detectors one-at-a-time
A face detector for frontal faces
A simple face detector
Video document
Concept detection overview
[Figure: a taxonomy of semantic concepts in video, from semiotic classes (entertainment, information, communication) through genres (sport, music, documentary, commercial, cartoon, soap, sitcom, feature film, talk show, news), sub-genres (volleyball, basketball, tennis, car racing, wildlife, football, soccer, ice hockey, financial), and logical units (guest, host, anchor, report, weather, interview, dialogue, story, play, break, action) to named events (hunts, violence, car chase, highlights, walking, gathering, graphical, dramatic events, sport events)]
One PhD per detector requires too many students…
So how about these?
and the thousands of others ………
Basic concept detection
Training: feature extraction on labeled examples (e.g. aircraft, outdoor) feeds a supervised learner.
Testing: feature measurement on unseen video feeds the learned classifier, e.g. "it is an aircraft with probability 0.7; it is outdoor with probability 0.95".
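The train-then-score loop above can be sketched with scikit-learn (my choice of library; the data, the concept name "aircraft", and all parameters are toy stand-ins, not the tutorial's system):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: rows are shot feature vectors, labels say
# whether the labeled example contains the concept "aircraft".
rng = np.random.RandomState(0)
pos = rng.normal(+1.0, 1.0, size=(50, 8))   # positive examples
neg = rng.normal(-1.0, 1.0, size=(50, 8))   # negative examples
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# Training: a supervised learner on the labeled examples.
clf = SVC(probability=True, random_state=0).fit(X, y)

# Testing: the classifier returns a concept probability per shot.
test_shot = rng.normal(+1.0, 1.0, size=(1, 8))
p_aircraft = clf.predict_proba(test_shot)[0, list(clf.classes_).index(1)]
```

The output `p_aircraft` plays the role of the "aircraft with probability 0.7" score in the slide.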
Demo: concept detection
Visualization by Jasper Schulte
Support vector machine
Vapnik, 1995
• Learns a statistical model from provided positive and negative examples
• Maximizes the margin between two classes in high-dimensional feature space
[Figure: margin separating the positive (*) and negative (+) classes]
Support vector machine (cont'd)
Vapnik, 1995
• Depends on many parameters
  – Select the best of multiple parameter combinations
  – Using cross-validation
[Diagram: feature vector → SVM → semantic concept probability; C = cost of misclassification, K(·,·) = kernel function]
Causes of poor generalization
• Over-fitting
  – Separate your data
• Curse of dimensionality
  – Information fusion helps
Feature fusion
[Diagram: shot-segmented video is synchronized; concept confidences are normalized, transformed, and concatenated into a single feature vector]
+ Only one learning phase
- Combination often ad hoc
- One feature may dominate
- Curse of dimensionality
Avoiding the dimensionality curse
Leung and Malik, IJCV 2001; Sivic and Zisserman, ICCV 2003
• Codebook, aka bag-of-words model
  – Create a codeword vocabulary
  – Discretize the image with codewords
  – Represent the image as a codebook histogram
[Figure: example codebook histogram over ~500 codewords]
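The three bullets above map directly onto a few lines of scikit-learn (my choice of library; the codebook size and data are toy values, far below the 4,000 codewords used later in the tutorial):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=50, seed=0):
    """Create a codeword vocabulary: k-means over local descriptors."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)

def bow_histogram(codebook, descriptors):
    """Discretize an image's descriptors with the codewords and
    represent the image as a normalized codebook histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so images are comparable
```

However many descriptors an image has, the result is a fixed-length vector of codebook size, which is what sidesteps the dimensionality issue.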
Emphasizing spatial configurations
Grauman, ICCV 2005; Lazebnik, CVPR 2006; Marszalek, VOC 2007
• Codebook ignores geometric correspondence
• Solution: spatial pyramid
  – Aggregate statistics of local features over fixed subregions
• For video:
  – 1x1 entire image
  – 2x2 image quarters
  – 1x3 horizontal bars
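A minimal NumPy sketch of the 1x1 + 2x2 + 1x3 pyramid above (function name and interface are my own; 1 + 4 + 3 = 8 regions give an 8k-dimensional vector for a k-word codebook):

```python
import numpy as np

def spatial_pyramid_bow(words, positions, shape, k):
    """Concatenate codeword histograms over the fixed regions used
    for video: the 1x1 whole frame, 2x2 quarters, and 1x3
    horizontal bars (8 regions -> 8*k dimensions).

    words: codeword index per sampled point
    positions: integer (row, col) per point, shape (n, 2)
    shape: frame size (height, width); k: codebook size
    """
    h, w = shape
    feats = []
    for gr, gc in [(1, 1), (2, 2), (1, 3)]:
        for r in range(gr):
            for c in range(gc):
                inside = ((positions[:, 0] * gr // h == r) &
                          (positions[:, 1] * gc // w == c))
                feats.append(np.bincount(words[inside], minlength=k))
    return np.concatenate(feats).astype(float)
```

Each point falls in exactly one cell per pyramid level, so every descriptor is counted once per level, three times in total.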
Codebook model
• A codebook consists of codewords
  – k-means clustering of descriptors
  – Commonly 4,000 codewords per codebook
[Diagram: dense sampling + OpponentSIFT → cluster → assign → feature vector (length 4,000)]
Codebook assignment
van Gemert, PAMI 2010
[Figure: codewords (●); hard assignment counts only the nearest codeword, soft assignment spreads weight over nearby codewords]
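A minimal NumPy sketch of the soft (kernel codebook) assignment contrasted above; the Gaussian weighting follows the idea of van Gemert et al., but the function name and the single `sigma` parameter are my simplifications:

```python
import numpy as np

def soft_assign(descriptors, codewords, sigma=1.0):
    """Soft codebook assignment: instead of counting only the nearest
    codeword (hard assignment), each descriptor spreads a
    Gaussian-weighted vote over all codewords.
    Returns a normalized histogram of length len(codewords)."""
    d2 = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)  # each descriptor votes with weight 1
    hist = w.sum(axis=0)
    return hist / hist.sum()
```

Descriptors midway between codewords, which hard assignment forces into one bin, here contribute to both, which is what makes the representation more robust to codeword uncertainty.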
Fast quantization
Moosman, PAMI 2008; Uijlings, CIVR 2009
• Random forests
  – Randomized process makes them very fast to build
  – Tree structure allows fast vector quantization
  – Logarithmic rather than linear projection time
• Real-time BoW (!)
  – When used with fast dense sampling
  – SURF 2x2 descriptor instead of 4x4
  – RBF kernel
GPU-empowered quantization
Van de Sande, TMM 2011
• Achieve data-parallelism by writing the Euclidean distance in vector form
[Chart: vector quantization time per image versus the number of SIFT descriptors per image, for four CPUs (Xeon 3.4 GHz, Opteron 250 2.4 GHz, Core 2 Duo 6400 2.13 GHz, Core i7 2.66 GHz) and two GPUs (GeForce 8800GTX with 128 cores, GeForce GTX260 with 216 cores); the GPUs give a 17x speed-up]
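The "vector form" of the Euclidean distance is the identity ||a - b||² = ||a||² + ||b||² - 2 a·b; a NumPy sketch (my illustration, not the paper's GPU code) shows how it turns all pairwise distances into one matrix multiply, the operation GPUs and BLAS parallelize best:

```python
import numpy as np

def pairwise_sq_dist(A, B):
    """All pairwise squared Euclidean distances via
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    so the bulk of the work is a single matrix multiply.
    A: (n, d), B: (m, d) -> result (n, m)."""
    sq_a = (A ** 2).sum(axis=1)[:, None]
    sq_b = (B ** 2).sum(axis=1)[None, :]
    d2 = sq_a + sq_b - 2.0 * (A @ B.T)
    return np.maximum(d2, 0.0)  # clamp tiny negatives from round-off
```

The result matches a naive double loop up to floating-point round-off, while replacing n·m distance evaluations with one `A @ B.T`.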
Codebook library
• A single codebook is a choice of sampling method, descriptor, construction, and assignment:

  #    Sampling method            Descriptor     Construction   Assignment
  #1   Dense                      OpponentSIFT   K-means        Soft
  #2   Harris-Laplace             SIFT           Radius-based   Soft
  #3   Dense                      rgSIFT         K-means        Hard
  …    Dense + spatial pyramid    C-SIFT         K-means        Hard

• A codebook library is a configuration of several codebooks
Codebook library (cont'd)
• Concatenate multiple codebooks
  – Spatial pyramid adds more dimensions:
    o 1x1 = 4K
    o 2x2 = 16K
    o 1x3 = 12K
  – Feature vector length easily >100K…
SVM pre-computed kernel trick
• Use the distance between feature vectors (!)

  K(x, y) = e^(-γ · dist(x, y))

• Increases efficiency significantly
  – Pre-compute the SVM kernel matrix
  – Long vectors are possible, as only two need to be in memory
  – Parameter optimization re-uses the pre-computed matrix
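The trick above maps onto scikit-learn's `kernel='precomputed'` option (my choice of library; the toy data, γ, and C are stand-in values, not the tutorial's):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

# Toy features standing in for long codebook-library vectors.
rng = np.random.RandomState(1)
X_train = np.vstack([rng.normal(0, 1, (40, 30)), rng.normal(3, 1, (40, 30))])
y_train = np.array([0] * 40 + [1] * 40)

# Pre-compute the kernel matrix once: K = exp(-gamma * dist(x, x')).
gamma = 0.01
K_train = np.exp(-gamma * euclidean_distances(X_train, X_train))

# SVC with kernel='precomputed' takes the matrix instead of vectors,
# so a parameter search (e.g. over C) re-uses the same matrix.
clf = SVC(kernel='precomputed', C=1.0).fit(K_train, y_train)

# At test time, the kernel is between test and training vectors.
X_test = np.vstack([rng.normal(0, 1, (10, 30)), rng.normal(3, 1, (10, 30))])
K_test = np.exp(-gamma * euclidean_distances(X_test, X_train))
y_pred = clf.predict(K_test)
```

Because the expensive distance computation happens once, cross-validating C afterwards costs almost nothing extra.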
GPU-empowered pre-computed kernel
Van de Sande, TMM 2011
1. Compute average distances per N² kernel sub-block
2. Compute kernel function values
[Chart: kernel computation time versus total feature vector length, for a single Core i7 (2.66 GHz) and 1, 4, 16, and 25 Opteron 250 (2.4 GHz) CPUs; distributing over 25 CPUs gives a 65x speed-up, and the GPU a further 3x speed-up]
Feature fusion (cont'd)
Van de Sande, PAMI 2010
[Diagram: a point sampling strategy (dense sampling or Harris-Laplace salient points), color feature extraction, and the codebook model turn an image into a bag-of-features; the spatial pyramid yields multiple bags-of-features per image]
+ Codebook reduces dimensionality
- Combination still ad hoc
- One codebook may dominate
Classifier fusion
Classifier fusion (cont'd)
+ Focus on feature strength
+ Fusion in semantic space
- Expensive learning effort
- Loss of feature correlation
Unsupervised fusion of classifiers
Snoek, TRECVID 2006; Wang, ACM MIR 2007
[Diagram: global, regional, and keypoint image feature extraction feed a support vector machine, a Fisher linear discriminant, and logistic regression; their concept probabilities are combined by a geometric mean]
+ Aggregation functions reduce learning effort
+ Efficient use of training examples
- Linear function unlikely to be optimal
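The geometric-mean aggregation in the diagram needs no training at all; a minimal NumPy sketch (the `eps` clip guarding against a single zero probability is my addition):

```python
import numpy as np

def geometric_mean_fusion(prob_lists):
    """Unsupervised classifier fusion: combine per-classifier concept
    probabilities by their geometric mean, an aggregation function
    that requires no extra training examples.
    prob_lists: array-like (n_classifiers, n_shots) of probabilities."""
    p = np.asarray(prob_lists, dtype=float)
    eps = 1e-12  # keep one zero score from vetoing a shot entirely
    return np.exp(np.log(np.clip(p, eps, 1.0)).mean(axis=0))
```

For two classifiers scoring a shot 0.9 and 0.4, the fused score is sqrt(0.9 * 0.4) = 0.6; a shot both classifiers doubt stays low.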
Fusing concepts
Naphade, Trans. MM 2001
• Exploitation of concept co-occurrence
  – Concepts do not occur in a vacuum
[Diagram: linked concepts 1, 2, 3; e.g. Aircraft co-occurs with Sky]
How to fuse concepts?
• Learning spatial models
• Learning temporal models
• Include ontologies
Learning spatial models - explicitly
• Using graphical models
  – Computationally complex
  – Limited scalability
Qi, TOMCCAP 2009
Learning spatial models - implicitly
• Using a support vector machine, or data mining
  – Assumes the classifier learns relations
  – Suffers from error propagation
Weng, ACM MM 2008
Learning temporal models
• Extend spatial models with a time dimension
  – Common approach is the Hidden Markov Model
  – Relatively few have actually considered time…
Ebadollahi, ICME 2006
Including knowledge
• Can ontologies help?
  – Symbolic ontologies vs uncertain detectors
Wu, ICME 2004
Concept detection pipeline
IBM 2003
[Diagram: feature fusion → classifier fusion → concept fusion]
Video diver
Wang, ACM MIR 2007
[Diagram: feature fusion → classifier fusion → concept fusion]
Semantic Pathfinder
Snoek, PAMI 2006
[Diagram: three analysis steps, each ending in a supervised learner, with the best of the three paths selected after validation.
– Content analysis step: layout, capture, and content feature extraction combine into visual features (example concepts: animal, vehicle, flag, fire).
– Style analysis step: semantic features and their combination (example concepts: sports, vehicle).
– Context analysis step: textual and context feature extraction combine into multimodal features (example concepts: entertainment, monologue, weather news, Hu Jintao).
The pipeline again instantiates feature fusion, classifier fusion, and concept fusion.]
State-of-the-Art
Snoek et al, TRECVID 2008-2009; Van Gemert et al, PAMI 2010; Van de Sande et al, PAMI 2010
[Diagram: the pipeline reduces to feature fusion and classifier fusion]
Software available for download at http://colordescriptors.com
Conclusion on: Detecting semantic concepts in video
• We started with invariance and manual labor
• We generalized with machine learning
  – …but needed several abstractions to do so appropriately
• For the moment, no one-size-fits-all solution
  – Learn the optimal machinery per concept
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
Problem 3: Many things in the world
• This is the model gap
Trial 1: counting dictionary words
Biederman, Psychological Review 1987
Slide credit: Li Fei-Fei
Trial 2: reverse-engineering
Hauptmann, PIEEE 2008
• Estimation by Hauptmann et al.: 5000 concepts
  – Using manually labeled queries and concepts
  – But speculative, and questionable assumptions
[Figure: an oracle combination of concepts plus noise approaches 'Google performance', while a 'realistic' combination stays below it]
How to obtain labeled examples?
• Massive amounts of examples available (3 billion)
  – …but only human experts provide good quality examples
Experts start with concept definition
• MM078 - Police/Security Personnel
  – Shots depicting law enforcement or private security agency personnel.
Expert annotation tools
Volkmer, ACM MM 2005
• Balance between:
  – Spatiotemporal level of annotation detail
  – Number of concepts
  – Number of positive and negative examples
LSCOM (Large Scale Concept Ontology for Multimedia)
Naphade, IEEE MM 2006
• Provides manual annotations for 449 concepts
  – In international broadcast TV news
• Connection to Cyc ontology
• LSCOM-Lite
  – 39 semantic concepts
http://www.lscom.org/
Verified positive examples
• ImageNet (11M images)
  – 4,000 categories
  – >100 examples each
  Deng et al, CVPR 2009
• SUN (130K images)
  – 397 scene categories
  – >100 examples each
  Xiao et al, CVPR 2010
Bridging the model gap
• Requirements
  – Generic concept detection method
  – Massive amounts of labeled examples
  – Evaluation method
  – Fair amount of computation
Model gap best treated by TRECVID
• Situation in 2000
  – Various concept definitions
  – Specific and small data sets
  – Hard to compare methodologies
• Since 2001, worldwide evaluation by NIST
  – TRECVID benchmark
NIST TRECVID benchmark
• Promote progress in video retrieval research
  – Provide common dataset
  – Challenging tasks
  – Independent evaluation protocol
  – Forum for researchers to compare results
http://trecvid.nist.gov/
Video data sets
• US TV news (`03/`04)
• International TV news (`05/`06)
• Dutch TV infotainment (`07/`08/`09)
TRECVID 2010: Internet Archive web videos
Expert annotation efforts
[Chart: number of expertly annotated concepts per TRECVID edition (17, 32, 39, 101, 374, 500), contributed by LSCOM, MediaMill - UvA, and others]
Measuring performance
• Precision: fraction of the retrieved items that are relevant
• Recall: fraction of the relevant items that are retrieved
  – The two are in an inverse relationship
[Figure: ranked results 1-5, set of retrieved items versus set of relevant items]
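As a minimal sketch (with hypothetical item IDs), the two measures can be computed over the sets of retrieved and relevant shots:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving more items can only raise recall but typically lowers precision, which is the inverse relationship the slide refers to.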
Evaluation measure
• Average Precision (AP)
  – Combines precision and recall
  – Averages precision after each relevant shot
  – Top of ranked list most important
• AP = (1/1 + 2/3 + 3/4 + …) / number of relevant documents
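The AP formula above translates directly to code; a small sketch (shot IDs are hypothetical):

```python
def average_precision(ranked, relevant):
    """AP = (1/1 + 2/3 + 3/4 + ...) / number of relevant documents:
    precision is recorded at the rank of each relevant item found,
    so hits near the top of the list weigh most."""
    relevant = set(relevant)
    hits, precision_at_hit = 0, []
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precision_at_hit.append(hits / rank)
    return sum(precision_at_hit) / len(relevant) if relevant else 0.0
```

Relevant shots that are never retrieved add zero to the sum, which is how AP also penalizes low recall.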
De facto evaluation standard
Concept examples
• Aircraft
• Beach
• Mountain
• People marching
• Police/Security
• Flower
Note the variety in visual appearance
TRECVID concept detection task results (2003-2009)
• Hard to compare
  – Different video data
  – Different concepts
• Clear top performers
  – Median skews to left
  – Learning effect
  – Plenty of variation
UvA-MediaMill @ TRECVID
Snoek et al, TRECVID 04-09
• 900 other detection systems
1,000,000 frames analyzed
Snoek, ICME 2005
• Multi-frame analysis biggest improvement in 2008 / 2009
  – We analyze up to 10 extra i-frames/shot
  – For 2009, yields 1M frames to analyze for the test set
• Need to speed up by being "smart and strong"
  – Speed-up feature extraction
  – Speed-up quantization
  – Speed-up kernel-based learning
  – Speed-up by computing
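One generic way to speed up codebook quantization (a sketch of the idea, not necessarily the authors' implementation) is to replace the naive pairwise-distance computation with the expansion ||x - c||² = ||x||² - 2x·c + ||c||², evaluated blockwise with matrix products:

```python
import numpy as np

def quantize_naive(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word (brute force).
    Materializes an (n, k, d) intermediate array, which is slow and large."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def quantize_fast(descriptors, codebook, block=1024):
    """Same assignment via the ||x - c||^2 expansion: one matrix product
    per block of descriptors, with bounded memory use."""
    c2 = (codebook ** 2).sum(axis=1)
    words = np.empty(len(descriptors), dtype=np.int64)
    for start in range(0, len(descriptors), block):
        x = descriptors[start:start + block]
        d2 = (x ** 2).sum(axis=1)[:, None] - 2.0 * (x @ codebook.T) + c2[None, :]
        words[start:start + block] = d2.argmin(axis=1)
    return words
```

The matrix-product form maps well to optimized BLAS routines and, in the same spirit, to the GPU computing mentioned on the next slide.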
Computing
• Best 2009 system much more efficient than 2008 system
  – 6x more visual data analyzed using less compute power
• Some best estimates:
  – Visual feature extraction: 8,400 Processor-Node-Hours (PNH)
  – Training concept detectors: 4,000 PNH
  – Applying concept detectors: ~1 week GPU
MediaMill Challenge
http://www.mediamill.nl/challenge/
• The Challenge provides
  – Manually annotated lexicon of 101 semantic concepts
  – Pre-computed low-level features
  – Trained classifier models
  – 5 experiments
  – Implementation + results
• The Challenge allows you to
  – Gain insight in intermediate video analysis steps
  – Foster repeatability of experiments
  – Optimize video analysis systems on a component level
  – Compare and improve
• The Challenge lowers the threshold for novice researchers
Columbia374 + VIREO374
• Baseline detectors for 374 concepts
http://www.ee.columbia.edu/ln/dvmm/columbia374/
http://www.cs.cityu.edu.hk/~yjiang/vireo374/
http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/
Community myths or facts?
• Chua et al., ACM Multimedia 2007
  – Video search is practically solved and progress has only been incremental
• Yang and Hauptmann, ACM CIVR 2008
  – Current solutions are weak and generalize poorly
We have done an experiment
• Two video search engines from 2006 and 2009
  – MediaMill Challenge 2006 system
  – MediaMill TRECVID 2009 system
• How well do they detect 36 LSCOM concepts?
Four video data set mixtures
• Training on TRECVID 2005 broadcast news or TRECVID 2007 documentary video
• Testing on TRECVID 2005 broadcast news or TRECVID 2007 documentary video
• Yields within-domain and cross-domain combinations
Performance doubled in just 3 years
Snoek & Smeulders, IEEE Computer 2010
• 36 concept detectors
  – Even when using training data of different origin
  – Vocabulary still limited
500 detectors, a closer look
The number of labeled image examples used at training time seems decisive in concept detector accuracy.
Demo time
500 detectors, a closer look
Learning social tag relevance by neighbor voting
Xirong Li, TMM 2009
• Exploit consistency in tagging behavior of different users for visually similar images
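A toy sketch of the neighbor-voting idea (the data structures and the exact prior correction are simplifications of the published algorithm): a tag receives one vote from each visually similar image that also carries the tag, and the number of votes expected by chance is subtracted:

```python
def tag_relevance(image_id, tag, visual_neighbors, tags_of):
    """Neighbor-voting score: votes from visual neighbors carrying the tag,
    minus the votes the tag would get by chance (its global prior).
    A positive score suggests the tag describes the visual content."""
    neighbors = visual_neighbors[image_id]
    votes = sum(1 for n in neighbors if tag in tags_of[n])
    prior = sum(1 for tags in tags_of.values() if tag in tags) / len(tags_of)
    return votes - len(neighbors) * prior
```

Subtracting the prior is what keeps globally frequent but visually meaningless tags from scoring high.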
Algorithm: tag-relevance learning
Why is this useful?
Image retrieval experiments
• User-tagged image database
  – 3.5 million labeled Flickr images
• Visual feature
  – A global color-texture descriptor
• Evaluation set
  – 20 concepts
• Evaluation criterion
  – Average precision
Image retrieval experiments
• A standard tag-based retrieval framework
  – Ranking function: OKAPI-BM25
• Comparison
  – Baseline: retrieval using original tags
  – Neighbor: retrieval using learned tag relevance as tag frequency
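The slide names OKAPI-BM25 as the ranking function; a generic BM25 sketch over tag lists (textbook form, not necessarily the authors' exact configuration). In the Neighbor variant, the learned tag relevance would replace the raw term frequency `tf` below:

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 over a list of documents, each a list of terms
    (here: an image's tag list). Returns one score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {}  # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            if t not in df:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            tf = d.count(t)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

The saturation in the tf term is why a tag repeated (or strongly neighbor-supported) helps, but with diminishing returns.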
Results
• 24% relative improvement
Updated tag relevance
Results suggest…
• Relevance of a tag can be predicted based on the 'wisdom' of crowds
  – Even with a light-weight visual feature
  – And a small database of 3.5M images
Conclusion on lexicon learning
• Requires
  – Invariant features
  – Concept detection
  – Many, many annotations
  – Measuring performance
  – Lots of computation
• Suffers most from
  – Weakly labeled visual data
  – Transfer across domains
1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Query prediction
1.5 Video browsing
Problem 4: Vocabulary problem
• This is the query-context gap
• Query methods: query-by-keyword, query-by-concept, query-by-example, query-by-humming, query-by-gesture
  – Any combination, any sequence? Prediction
• Example query: Find shots of people shaking hands
Traditional approaches
• Parse topic-text, reframe as query-by-keyword
  – Using speech recognition or closed captions
• Use possible images accompanying the topic for query-by-example
  – Using shot-based keyframes
A new hope?
Quote: "We are now seeing researchers starting to use the confidence values from concept detectors within the shot retrieval process and this appears to be the roadmap for future work in this area."
Alan Smeaton, Inf. Sys., 2007
Video query examples
• Find shots of a hockey rink with at least one of the nets fully visible from some point of view.
• Find shots of one or more helicopters in flight.
• Find shots of a group including at least four people dressed in suits, seated, and with at least one flag.
• Find shots of an office setting, i.e., one or more desks/tables and one or more computers and one or more people.
Typical 'oracle' results
• Find shots of a graphic map of Iraq, location of Baghdad marked - not a weather map.
  – Best / 2nd best detectors: Maps; Maps + Overlayed Text
• Find shots of George Bush entering or leaving a vehicle (e.g., car, van, airplane, helicopter, etc.) (he and vehicle both visible at the same time)
  – Selected detectors: Iyad Allawi + rocket propelled grenades
• How to select relevant detectors automatically?
Detector selection strategies
Video query: Find shots of an office setting
• Text-based
• Visual-based
• Ontology-based
Text-based selection
• Represent concept descriptions as term vectors
  – Exact matching to link queries to detectors
  – Vector space model to match queries to descriptions
  – Corpus-driven query expansion methods
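A minimal vector-space sketch (no query expansion; detector names and definitions are illustrative): queries and concept descriptions become term-count vectors, and detectors are ranked by cosine similarity:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_detectors(query, descriptions):
    """Rank concept detectors by how well their textual description
    matches the query text (descriptions: name -> definition string)."""
    q = Counter(query.lower().split())
    score = {name: cosine(q, Counter(text.lower().split()))
             for name, text in descriptions.items()}
    return sorted(score, key=score.get, reverse=True)
```

Exact matching is the degenerate case where only a perfect term overlap links a query to a detector; the cosine relaxes that.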
Recap: concept definition
• MM078 - Police/Security Personnel
  – Shots depicting law enforcement or private security agency personnel.
Visual-based selection
Rasiwasia et al, TMM 2007
• Training stage: feature extraction, then supervised learner
• Testing stage: feature extraction, then classify the query image
• Example detector confidences for a query image: face p=0.97, outdoor p=0.98, helicopter p=0.43, …
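In code, visual-based selection amounts to scoring the query image with every detector and keeping the confident ones. A sketch with stand-in detector functions (the probabilities mirror the example above; the threshold is an assumption):

```python
def select_by_visual(query_image, detectors, threshold=0.5):
    """Run each concept detector on the query example; keep concepts whose
    confidence exceeds the threshold, most confident first."""
    scored = {name: detect(query_image) for name, detect in detectors.items()}
    confident = [name for name, p in scored.items() if p >= threshold]
    return sorted(confident, key=scored.get, reverse=True)
```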
Ontology-based selection
"Find a report from the desert showing a house or car on fire."
1. Identify objects in WordNet: car, desert, house, fire
Slide credit: Bouke Huurnink
2. Identify related concept detectors: car, desert, house, fire
3. Find the most similar and specific detector using an ontology measure: car maps to vehicle, house to building, desert to desert, fire to fire
Wei, TMM 2008
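The ontology measure can be sketched with a Wu-Palmer-style similarity over a toy hypernym tree (the tree below is hypothetical; a real system would use WordNet, e.g. via NLTK, as in Wei, TMM 2008):

```python
def path_to_root(term, parent_of):
    """Hypernym chain from the ontology root down to the term."""
    path = [term]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]

def similarity(t1, t2, parent_of):
    """Wu-Palmer-style score: twice the depth of the deepest shared
    ancestor, divided by the summed depths of both terms."""
    p1, p2 = path_to_root(t1, parent_of), path_to_root(t2, parent_of)
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    return 2.0 * shared / (len(p1) + len(p2))

def best_detector(query_term, detector_terms, parent_of):
    """Pick the detector whose concept lies closest in the ontology."""
    return max(detector_terms, key=lambda d: similarity(query_term, d, parent_of))
```

Dividing by both depths favors the more specific of two equally related detectors, which is the "most similar and specific" criterion above.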
Search strategy combination
1. Parallel combination
  – All retrieval results are taken into account simultaneously
  – Weighted average of individual results
2. Sequential combination
  – Update video retrieval results in succession
  – Pseudo-relevance feedback variants
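Parallel combination reduces to a weighted sum of per-module scores; a sketch (module names, shot IDs, and weights are illustrative):

```python
def combine_parallel(module_scores, weights):
    """Fuse results from several retrieval modules by a weighted average
    of their shot scores; shots a module did not score contribute zero."""
    fused = {}
    for scores, weight in zip(module_scores, weights):
        for shot, score in scores.items():
            fused[shot] = fused.get(shot, 0.0) + weight * score
    return sorted(fused, key=fused.get, reverse=True)
```

Choosing the weights is exactly the open problem the next slide discusses.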
Parallel combination
Paul Natsev, 2005 - 2007
• Estimating weights of individual retrieval modules is the main problem
  – Predefine by expert
  – Learn from training data
  – Oracle
  – …
Sequential combination
Hsu, IEEE MM 2007
• Quality depends on ranking in first stage
Tackling the query-context gap
• Requirements
  – Several video retrieval methods (by-keywords / by-example / by-many-concepts)
  – Detector selection and combination method
  – Training data
  – Search topics
  – Evaluation method
  – Well-defined application domain?
Query-context gap is reasonably addressed at TRECVID
• Automatic search task
  – Automatically solve 20+ search topics
  – Return 1,000 ranked shot-based results per topic
  – Evaluate using Average Precision
• Drawbacks
  – Queries tend to be overly complex, limited in number, drifting away from real-world usage
  – Inclusion of recall lowers performance
  – Lack of training data
http://trecvid.nist.gov/
Query prediction at TRECVID
• Performance is humble
  – Lack of training data
  – Especially in 2007-2008
• Most pronounced
  – 2005: Large semantic lexicon
[Chart: Mean Average Precision per TRECVID edition]
Conclusion on query prediction
• Retrieval tasks (in)directly covered by concepts in the lexicon benefit from detector selection
• Retrieval tasks not covered by the lexicon result in humble performance only
• We need a better evaluation setup
1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Query prediction
1.5 Video browsing
Problem 5: Use is open-ended
• This is the interface gap
[Figure: interface dimensions such as screen, scope, and keywords]
So many choices for retrieval…
• Why not let the user decide interactively?
  – Navigate through query methods
  – Visualize video retrieval results
  – Learn from browsing behavior
Video search 1.0
Note the influence of textual meta data, such as the video title, on the search results.
Query selection
'Classic' Informedia system
Carnegie Mellon University
• First multimodal video search engine
Físchlár
Dublin City University
• Optimized for use by "real" users
IBM iMARS
IBM Research
• A web-based system
http://mp7.watson.ibm.com/
MediaMagic
FxPal
• Focus on the story level
VisionGo
NUS & ICT-CAS
• Extremely fast and efficient
CrossBrowsing through results
Snoek, TMM 2007
• Browse along rank and time dimensions
• Sphere variant
Demo: MediaMill video search engine
http://www.mediamill.nl
The RotorBrowser
de Rooij, TMM 2009
Extreme video retrieval = very demanding!
Carnegie Mellon University
• Observation
  – Correct results are retrieved, but not optimally ranked
  – If the user has time to scan results exhaustively, retrieval is a matter of watching, selecting, and sorting quickly
Learning from the user
• Two common approaches
  1. Relevance feedback
  2. Active learning
Relevance feedback
Slide credit: Marcel Worring
• Try to find the boundary in feature space that best separates positive from negative examples
• In the next iteration the user will have labeled more samples, hence a better estimate of the boundary
[Figure: feature space F1 x F2, with a measure of class membership probability]
Active learning
Slide credit: Marcel Worring
• In active learning the system decides which elements to show for feedback and which not
  – Some samples the system can safely assume to be negative
  – For other samples it is relevant to know the label
[Figure: feature space F1 x F2]
Demo: ForkBrowser
de Rooij, CIVR 2008
• Learn positive and negative items from user browse behavior
Demo: Timeline navigation
http://hollandsglorieoppinkpop.nl/
The future of video retrieval?
Jonathan Wang, Carnegie Mellon University
Interface gap best addressed by
• TRECVID interactive search task
  – Interactively solve 20+ search topics (10/15 minutes)
  – Return 1,000 ranked shot-based results per topic
  – Evaluate using Average Precision
• VideOlympics showcase
Video browsing at TRECVID
• Wide performance variation
  – # concept detectors
  – Search interface
  – Expert vs novice user
• Most pronounced
  – 2003: Informedia classic
  – 2005: Large semantic lexicon
  – 2008: Active learning
[Chart: Mean Average Precision per TRECVID edition]

UvA-MediaMill browsers @ TRECVID
Snoek et al, TRECVID 04-09
• CrossBrowser and ForkBrowser
• 228 other interactive systems (traditional systems)
Criticism
• Retrieval performance cannot be the only evaluation criterion
  – Quality of detectors counts
  – Experience of searcher counts
  – Visualization of interface counts
  – Ease of use counts
  – …
Video browsing at VideOlympics
• Promote multiple facets of video search
  – Real-time interactive video search 'competition'
  – Simultaneous exposure of multiple video search engines
  – Highlight possibilities and limitations of state-of-the-art
Participants
Video trailer
http://www.VideOlympics.org
Conclusion on video browsing
• Interaction by browsing is indispensable for any practical video search engine
• System should support the user by active learning and intuitive (mobile) visualizations

Conclusion on: Interactive Video Retrieval
[Diagram: measuring features, concept detection, lexicon learning, query prediction, and video browsing, centered on video]
And there is always more…
• Recommended special issues
  – IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), November 2008
  – Proceedings of the IEEE, 96(4), April 2008
  – IEEE Transactions on Multimedia, 9(5), August 2007
• 300 references on video search
  – Snoek and Worring, Concept-Based Video Retrieval, Foundations and Trends in Information Retrieval, Vol. 4 (2), pp. 215-322, 2009.
General references I
Color Invariance. Jan-Mark Geusebroek, R. van den Boomgaard, Arnold W. M. Smeulders, H. Geerts. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 23 (12), page 1338-1350, 2001.
Distinctive Image Features from Scale-Invariant Keypoints. D. G. Lowe. Int'l Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
Large-Scale Concept Ontology for Multimedia. M. R. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. S. Kennedy, A. G. Hauptmann, and J. Curtis. IEEE MultiMedia, vol. 13, pp. 86-91, 2006.
Efficient Visual Search for Objects in Videos. J. Sivic and A. Zisserman. Proceedings of the IEEE, vol. 96, pp. 548-566, 2008.
High Level Feature Detection from Video in TRECVid: A 5-year Retrospective of Achievements. A. F. Smeaton, P. Over, and W. Kraaij. In Multimedia Content Analysis, Theory and Applications (A. Divakaran, ed.), Springer, 2008.
Visually Searching the Web for Content. J. R. Smith and S.-F. Chang. IEEE MultiMedia, vol. 4, pp. 12-20, 1997.
Content Based Image Retrieval at the End of the Early Years. Arnold W. M. Smeulders, Marcel Worring, S. Santini, A. Gupta, R. Jain. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 22 (12), page 1349-1380, 2000.
General references II
The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, Arnold W. M. Smeulders. ACM Multimedia, page 421-430, 2006.
The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, Arnold W. M. Smeulders. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 28 (10), page 1678-1689, 2006.
A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval. Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, Arnold W. M. Smeulders. IEEE Trans. Multimedia, Volume 9 (2), page 280-292, 2007.
Adding Semantics to Detectors for Video Retrieval. Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, Marcel Worring. IEEE Trans. Multimedia, Volume 9 (5), page 975-986, 2007.
The MediaMill TRECVID 2004-2009 Semantic Video Search Engine. Cees G. M. Snoek et al. Proceedings of the TRECVID Workshop, 2004-2009.
Visual-Concept Search Solved? Cees G. M. Snoek and Arnold W. M. Smeulders. IEEE Computer, Volume 43 (6) (in press), 2010.

General references III
Concept-Based Video Retrieval. Cees G. M. Snoek, Marcel Worring. Foundations and Trends in Information Retrieval, Vol. 4 (2), page 215-322, 2009.
http://www.science.uva.nl/research/publications/
Local Invariant Feature Detectors: A Survey. T. Tuytelaars and K. Mikolajczyk. Foundations and Trends in Computer Graphics and Vision, vol. 3, pp. 177-280, 2008.
Evaluating Color Descriptors for Object and Scene Recognition. Koen E. A. van de Sande, Theo Gevers, Cees G. M. Snoek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2010.
Visual Word Ambiguity. Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2009.
Real-Time Bag of Words, Approximately. Jasper R. R. Uijlings, Arnold W. M. Smeulders, R. J. H. Scha. ACM Int'l Conference on Image and Video Retrieval, 2009.
Lessons Learned from Building a Terabyte Digital Video Library. H. D. Wactlar, M. G. Christel, Y. Gong, and A. G. Hauptmann. IEEE Computer, vol. 32, pp. 66-73, 1999.
Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Int'l Journal of Computer Vision, vol. 73, pp. 213-238, 2007.
Contact info
• Cees Snoek
  http://staff.science.uva.nl/~cgmsnoek
• Arnold Smeulders
  http://staff.science.uva.nl/~smeulder

Further information
www.MediaMill.nl