CVPR 2010 Tutorial: Video Search Engines
TRANSCRIPT
Cees Snoek & Arnold Smeulders, University of Amsterdam
14‐6‐2010
Video Search Engines ‐ CVPR 2010 1
Video Search Engines
Cees G.M. Snoek & Arnold W.M. Smeulders
Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands
A brief history of television
• From broadcasting to narrowcasting (~1955, ~1985, ~2005)
• …to thincasting (~2008, ~2010)
The international business case
• Everybody with a message uses video for delivery
• Growing, unmanageable amounts of video
Example from YouTube
• YouTube users are uploading 24 hours of video every minute
[Chart: hours of video uploaded to YouTube per minute]
Example from the Netherlands
• Yearly ingest
  – 15,000 hours of video
  – 40,000 hours of radio
• Next 6 years
  – 137,200 hours of video
  – 22,510 hours of film
  – 123,900 hours of audio
  – 2,900,000 photos
• Europe's largest digitization project
  – >1 petabyte per year
Lack of metadata
Expert-driven search
http://e-culture.multimedian.nl
Crowd-given search
What others say is in the video.
Raw-driven search
www.science.uva.nl/research/isla
MultimediaN project
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
Problem 1: Variation in appearance
So many images of one thing, due to minor differences in: illumination, background, occlusion, viewpoint, …
• This is the sensory gap
Problem 2: What defines things?
[Figure: raw bit streams from multimedia archives versus the concepts humans recognize in them: suit, basketball, table, tree, US flag, building, machine, aircraft, dog, tennis, mountain, fire, …]
Problem 3: Many things in the world
• This is the model gap
Problem 4: Vocabulary problem
• Query modalities: query-by-keyword, query-by-concept, query-by-example, query-by-humming, query-by-gesture
• Any combination, any sequence?
• Example query: "Find shots of people shaking hands"
• This is the query-context gap
Problem 5: Use is open-ended
• This is the interface gap
[Figure: interface dimensions such as screen, scope, and keywords]
Conclusion on problems
• Video search is a diverse and challenge-rich research topic
  – Sensory gap
  – Semantic gap
  – Model gap
  – Query-context gap
  – Interface gap
Today's promise
• You will be acquainted with the theory and practice of the concept-based video search paradigm.
• You will be able to recall the five major scientific problems in video retrieval and explain and evaluate the present-day solutions.
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
A million appearances
There are a million appearances to one concept
Where are the patterns (of the same shoe)?
Invariance: the need for ~
Somewhere the variance must be removed. The illumination and the viewing direction are removed as soon as the image has entered.
Common transformations
• Illumination transformations: contrast, intensity and shadow, color
• Viewpoint: rotation and lateral, distance, viewing angle, projection
• Cover: specular or matte
• Occlusion & clutter, wear & tear, aging, night & day, and so on into increasingly complex transformations
More than one transformation
Features of selected points may be good enough to describe the object, iff the selection & the feature set are both invariant for scene-accidental conditions.
Gevers, TIP 2000
Design of invariants: Orbits
For a property variant under W, observations of a constant property are spread over the orbit. The purpose of an invariant is to capture all of the orbit into one value.
Example: invariance
Slide credit: Theo Gevers
The (R,G,B)-space projects to (c1, c2, c3)-space:

c1(R,G,B) = arctan( R / max{G, B} )
c2(R,G,B) = arctan( G / max{R, B} )
c3(R,G,B) = arctan( B / max{R, G} )
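The slides contain no code; as a minimal sketch (NumPy is my choice, and the `eps` guard against division by zero on black pixels is my addition), the c1, c2, c3 invariants can be computed per pixel like this:

```python
import numpy as np

def color_invariants(rgb):
    """Compute the c1, c2, c3 color invariants of the slide above.

    rgb: float array of shape (..., 3) with R, G, B in the last axis.
    Each channel is divided by the max of the other two before the
    arctan, which cancels a global intensity scaling of the pixel.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-10  # guard against division by zero on black pixels
    c1 = np.arctan(R / (np.maximum(G, B) + eps))
    c2 = np.arctan(G / (np.maximum(R, B) + eps))
    c3 = np.arctan(B / (np.maximum(R, G) + eps))
    return np.stack([c1, c2, c3], axis=-1)
```

Multiplying a pixel by a constant (a pure intensity change) leaves the three values unchanged, which is exactly the invariance the slide claims.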
Color invariance
Gevers, PRL 1999; Geusebroek, PAMI 2001

Invariance (+) or sensitivity (-) to scene-accidental conditions:

            shadows   shading   highlights   ill. intensity   ill. color
  E            -         -          -              -               -
  W            -         +          -              +               -
  C            +         +          -              +               -
  M            +         +          -              +               +
  N (σ = 1)    +         +          -              +               +
  N (σ = 2)    +         +          +              +               +
  L            +         +          +              +               -
  H            +         +          +              +               -

Retained from 1000 colours:

        σ = 1   σ = 2
  H      300     315
  N      550     900
  C      600     850
  W      900     995
  E      950     990
Local shape motivation
• Perceptual importance
• Concise data
• Robust to occlusion & clutter
Tuytelaars FTCGV 2008
Meet the Gaussians
• Taylor expansion at x
• Robust additive differentials
• For a discretely sampled signal, use the Gaussians
• Dimensions separable
• No maxima introduced
Meet the Gaussians (cont'd)
The basic video observables are local receptive fields f(x): the receptive fields up to first order, for grey value as well as opponent color sets.
Taxonomy of image structure
Slide credit: Theo Gevers
[Figure: image structures such as corner, junction, T-junction, and highlight]
Meet Gabor
The 2D Gabor function is:

h(x, y) = 1/(2π σ_x σ_y) · exp( -1/2 ( x²/σ_x² + y²/σ_y² ) ) · exp( j 2π (u x + v y) )

Tuning parameters: u, v, σ; the usual invariants follow by combination.
Manjunath and Ma on Gabor for texture, as seen in F-space.
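A minimal NumPy sketch of the equation above (my simplification: an isotropic σ instead of separate σ_x, σ_y) builds the complex Gabor kernel directly:

```python
import numpy as np

def gabor_kernel(u, v, sigma, size=21):
    """Complex 2-D Gabor filter: a Gaussian envelope times a complex
    sinusoid with spatial frequency (u, v). `size` is the kernel
    width in pixels; isotropic sigma is a simplification of the
    separate sigma_x, sigma_y on the slide.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    carrier = np.exp(2j * np.pi * (u * x + v * y))
    return envelope * carrier
```

With u = v = 0 the carrier is 1, so the kernel degenerates to a plain Gaussian that sums to roughly one, a handy check of the normalization.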
Local receptive fields F(x)
The receptive fields for (u, v) measured locally.
Grey value as well as opponent color sets.
Hoang 2003
Gabor filters: texture
Hoang, ECCV 2002
Original image K-means clustering Segmentation
Gabor filters: texture (cont'd)
Local receptive field in f(x, t)
Gaussian equivalent over x and t: zero order, first order over t
Burghouts TIP 2006
Receptive fields: overview
All observables up to first-order color, second-order spatial scales, eight frequency bands & first order in time.
Good observables > easy algorithms
Periodicity: detect periodic motion by one steered filter. Deadly simple algorithm…
Burghouts TIP 2006
Meet the Loweans
So far we paid respect to the spatial order. Now we will weakly follow the spatial order and form histograms on all directions we encounter locally, better known as (the second part of) SIFT.
Lowe IJCV 2004
Meet the Loweans (cont'd)
• 4 x 4 gradient window after thresholding
• Histogram of 4 x 4 samples per window in 8 directions
• Gaussian weighting around center (σ is 1/2 of keypoint σ)
• 4 x 4 x 8 = 128-dimensional feature vector
Image: Jonas Hurrelmann
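The 4 x 4 x 8 layout above can be sketched in NumPy. This is deliberately simplified (real SIFT adds Gaussian weighting, trilinear interpolation, and clipped renormalization at 0.2, all omitted here); it only shows how the 128 dimensions arise:

```python
import numpy as np

def sift_like_descriptor(patch):
    """Sketch of the SIFT descriptor layout: split a 16x16 patch
    into a 4x4 grid of cells and histogram gradient directions
    into 8 bins per cell, giving 4*4*8 = 128 dimensions.
    """
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)           # [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros(128)
    for cy in range(4):
        for cx in range(4):
            cell = np.s_[cy * 4:(cy + 1) * 4, cx * 4:(cx + 1) * 4]
            hist = np.bincount(bins[cell].ravel(),
                               weights=mag[cell].ravel(), minlength=8)
            desc[(cy * 4 + cx) * 8:(cy * 4 + cx) * 8 + 8] = hist
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc
```

A textureless patch yields the zero vector; any other patch yields a unit-norm 128-dimensional descriptor.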
SIFT detection
Slide credit: Jepson 2005
Enriching SIFT (in a nutshell)
Mikolajczyk, IJCV 2005; Ke, CVPR 2004; Van de Sande, PAMI 2010
• Affine SIFT
  – Choose prominent direction in SIFT
• PCA-SIFT
  – Robust and compact representation
• ColorSIFT
  – Add several invariant color descriptors
• TimeSIFT
  – Anyone?
Quality of properties
Original, blurring, JPEG, illumination direction, viewpoint, spectrum
1 in 1000 Harris patches
In the experiment, manual matching of the Harris points.
Quality of properties
Quality of properties
Good invariants are very powerful.
Tracking invariant appearance
Nguyen PAMI 2003
Where to sample in the video?
• A video shot is a set of frames representing a continuous camera action in time and space
  – Analysis typically on a single key frame per shot
Shot Key Frame
Where to sample in the frame?
Tuytelaars 2008 FTCGV
Where to sample? Context
What is the object in the middle?
No segmentation… no pixel values of the object…
Context in codebooks
[Figure: each image region is described by its similarity to K prototype textures, e.g. sky, grass, and road]
Features: Weibull textures
Interest point examples
Mikolajczyk, CVPR 2006; van de Weijer, PAMI 2006
Original image Harris Laplacian Color salient points
Dense sampling example
Fast dense descriptors
Uijlings et al, CIVR 2009
Image Patch
Reuse sub-regions: 16x speed-up
Conclusion on measuring features
• Invariance is crucial when designing features
  – More invariance means less stability…
  – …but more robustness to the sensory gap
• Effective features strike a balance between invariance and discriminatory power
  – And for video search, efficiency is helpful too…
And there is always more…
For example:
Local Invariant Feature Detectors: A Survey
Tinne Tuytelaars & Krystian Mikolajczyk
FTCGV 3:3 (177-280)
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
The semantic gap
"The semantic gap is the lack of coincidence between the information that one can extract from the sensory data and the interpretation that the same data has for a user in a given situation."
Arnold Smeulders, PAMI, 2000
The science of labeling
• To understand anything in science, things need a name that is universally recognized
• Worldwide endeavor in naming visual information: living organisms, chemical elements, the human genome, text 'categories'
How difficult is the problem?
• Human vision consumes 50% of brain power…
Van Essen, Science 1992
Semantic concept detection
• The patient approach
  – Building detectors one-at-a-time
A face detector for frontal faces
A simple face detector
Video document
Concept detection overview
[Figure: a taxonomy of semantic concepts in video, from semiotic classes (entertainment, information, communication) through genres (sport, music, documentary, commercial, cartoon, soap, sitcom, feature film, talk show, news), sub-genres (volleyball, basketball, tennis, car racing, wildlife, football, soccer, ice hockey, financial), and logical units (guest, host, anchor, report, weather, interview, dialogue, story, play, break, action) to named events (hunts, violence, car chase, highlights, walking, gathering, graphical, dramatic events, sport events)]
One PhD per detector requires too many students…
So how about these?
and the thousands of others ………
Basic concept detection
Training: feature extraction on labeled examples (e.g. aircraft, outdoor) feeds a supervised learner.
Testing: feature measurement on unseen video feeds the learned classifier, e.g. "it is an aircraft with probability 0.7; it is outdoor with probability 0.95".
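The train-then-score loop above can be sketched with scikit-learn (my choice of library; the data, the concept name "aircraft", and all parameters are toy stand-ins, not the tutorial's system):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: rows are shot feature vectors, labels say
# whether the labeled example contains the concept "aircraft".
rng = np.random.RandomState(0)
pos = rng.normal(+1.0, 1.0, size=(50, 8))   # positive examples
neg = rng.normal(-1.0, 1.0, size=(50, 8))   # negative examples
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

# Training: a supervised learner on the labeled examples.
clf = SVC(probability=True, random_state=0).fit(X, y)

# Testing: the classifier returns a concept probability per shot.
test_shot = rng.normal(+1.0, 1.0, size=(1, 8))
p_aircraft = clf.predict_proba(test_shot)[0, list(clf.classes_).index(1)]
```

The output `p_aircraft` plays the role of the "aircraft with probability 0.7" score in the slide.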
Demo: concept detection
Visualization by Jasper Schulte
Support vector machine
Vapnik, 1995
• Learns a statistical model from provided positive and negative examples
• Maximizes the margin between two classes in high-dimensional feature space
[Figure: margin separating the positive (*) and negative (+) classes]
Support vector machine (cont'd)
Vapnik, 1995
• Depends on many parameters
  – Select the best of multiple parameter combinations
  – Using cross-validation
[Diagram: feature vector → SVM → semantic concept probability; C = cost of misclassification, K(·,·) = kernel function]
Causes of poor generalization
• Over-fitting
  – Separate your data
• Curse of dimensionality
  – Information fusion helps
Feature fusion
[Diagram: shot-segmented video is synchronized; concept confidences are normalized, transformed, and concatenated into a single feature vector]
+ Only one learning phase
- Combination often ad hoc
- One feature may dominate
- Curse of dimensionality
Avoiding the dimensionality curse
Leung and Malik, IJCV 2001; Sivic and Zisserman, ICCV 2003
• Codebook, aka bag-of-words model
  – Create a codeword vocabulary
  – Discretize the image with codewords
  – Represent the image as a codebook histogram
[Figure: example codebook histogram over ~500 codewords]
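The three bullets above map directly onto a few lines of scikit-learn (my choice of library; the codebook size and data are toy values, far below the 4,000 codewords used later in the tutorial):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=50, seed=0):
    """Create a codeword vocabulary: k-means over local descriptors."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)

def bow_histogram(codebook, descriptors):
    """Discretize an image's descriptors with the codewords and
    represent the image as a normalized codebook histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so images are comparable
```

However many descriptors an image has, the result is a fixed-length vector of codebook size, which is what sidesteps the dimensionality issue.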
Emphasizing spatial configurations
Grauman, ICCV 2005; Lazebnik, CVPR 2006; Marszalek, VOC 2007
• Codebook ignores geometric correspondence
• Solution: spatial pyramid
  – Aggregate statistics of local features over fixed subregions
• For video:
  – 1x1 entire image
  – 2x2 image quarters
  – 1x3 horizontal bars
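A minimal NumPy sketch of the 1x1 + 2x2 + 1x3 pyramid above (function name and interface are my own; 1 + 4 + 3 = 8 regions give an 8k-dimensional vector for a k-word codebook):

```python
import numpy as np

def spatial_pyramid_bow(words, positions, shape, k):
    """Concatenate codeword histograms over the fixed regions used
    for video: the 1x1 whole frame, 2x2 quarters, and 1x3
    horizontal bars (8 regions -> 8*k dimensions).

    words: codeword index per sampled point
    positions: integer (row, col) per point, shape (n, 2)
    shape: frame size (height, width); k: codebook size
    """
    h, w = shape
    feats = []
    for gr, gc in [(1, 1), (2, 2), (1, 3)]:
        for r in range(gr):
            for c in range(gc):
                inside = ((positions[:, 0] * gr // h == r) &
                          (positions[:, 1] * gc // w == c))
                feats.append(np.bincount(words[inside], minlength=k))
    return np.concatenate(feats).astype(float)
```

Each point falls in exactly one cell per pyramid level, so every descriptor is counted once per level, three times in total.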
Codebook model
• A codebook consists of codewords
  – k-means clustering of descriptors
  – Commonly 4,000 codewords per codebook
[Diagram: dense sampling + OpponentSIFT → cluster → assign → feature vector (length 4,000)]
Codebook assignment
van Gemert, PAMI 2010
[Figure: codewords (●); hard assignment counts only the nearest codeword, soft assignment spreads weight over nearby codewords]
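A minimal NumPy sketch of the soft (kernel codebook) assignment contrasted above; the Gaussian weighting follows the idea of van Gemert et al., but the function name and the single `sigma` parameter are my simplifications:

```python
import numpy as np

def soft_assign(descriptors, codewords, sigma=1.0):
    """Soft codebook assignment: instead of counting only the nearest
    codeword (hard assignment), each descriptor spreads a
    Gaussian-weighted vote over all codewords.
    Returns a normalized histogram of length len(codewords)."""
    d2 = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)  # each descriptor votes with weight 1
    hist = w.sum(axis=0)
    return hist / hist.sum()
```

Descriptors midway between codewords, which hard assignment forces into one bin, here contribute to both, which is what makes the representation more robust to codeword uncertainty.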
Fast quantization
Moosman, PAMI 2008; Uijlings, CIVR 2009
• Random forests
  – Randomized process makes them very fast to build
  – Tree structure allows fast vector quantization
  – Logarithmic rather than linear projection time
• Real-time BoW (!)
  – When used with fast dense sampling
  – SURF 2x2 descriptor instead of 4x4
  – RBF kernel
GPU-empowered quantization
Van de Sande, TMM 2011
• Achieve data-parallelism by writing the Euclidean distance in vector form
[Chart: vector quantization time per image versus the number of SIFT descriptors per image, for four CPUs (Xeon 3.4 GHz, Opteron 250 2.4 GHz, Core 2 Duo 6400 2.13 GHz, Core i7 2.66 GHz) and two GPUs (GeForce 8800GTX with 128 cores, GeForce GTX260 with 216 cores); the GPUs give a 17x speed-up]
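The "vector form" of the Euclidean distance is the identity ||a - b||² = ||a||² + ||b||² - 2 a·b; a NumPy sketch (my illustration, not the paper's GPU code) shows how it turns all pairwise distances into one matrix multiply, the operation GPUs and BLAS parallelize best:

```python
import numpy as np

def pairwise_sq_dist(A, B):
    """All pairwise squared Euclidean distances via
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    so the bulk of the work is a single matrix multiply.
    A: (n, d), B: (m, d) -> result (n, m)."""
    sq_a = (A ** 2).sum(axis=1)[:, None]
    sq_b = (B ** 2).sum(axis=1)[None, :]
    d2 = sq_a + sq_b - 2.0 * (A @ B.T)
    return np.maximum(d2, 0.0)  # clamp tiny negatives from round-off
```

The result matches a naive double loop up to floating-point round-off, while replacing n·m distance evaluations with one `A @ B.T`.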
Codebook library
• A single codebook is a choice of sampling method, descriptor, construction, and assignment:

  #    Sampling method            Descriptor     Construction   Assignment
  #1   Dense                      OpponentSIFT   K-means        Soft
  #2   Harris-Laplace             SIFT           Radius-based   Soft
  #3   Dense                      rgSIFT         K-means        Hard
  …    Dense + spatial pyramid    C-SIFT         K-means        Hard

• A codebook library is a configuration of several codebooks
Codebook library (cont'd)
• Concatenate multiple codebooks
  – Spatial pyramid adds more dimensions:
    o 1x1 = 4K
    o 2x2 = 16K
    o 1x3 = 12K
  – Feature vector length easily >100K…
SVM pre-computed kernel trick
• Use the distance between feature vectors (!)

  K(x, y) = e^(-γ · dist(x, y))

• Increases efficiency significantly
  – Pre-compute the SVM kernel matrix
  – Long vectors are possible, as only two need to be in memory
  – Parameter optimization re-uses the pre-computed matrix
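The trick above maps onto scikit-learn's `kernel='precomputed'` option (my choice of library; the toy data, γ, and C are stand-in values, not the tutorial's):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

# Toy features standing in for long codebook-library vectors.
rng = np.random.RandomState(1)
X_train = np.vstack([rng.normal(0, 1, (40, 30)), rng.normal(3, 1, (40, 30))])
y_train = np.array([0] * 40 + [1] * 40)

# Pre-compute the kernel matrix once: K = exp(-gamma * dist(x, x')).
gamma = 0.01
K_train = np.exp(-gamma * euclidean_distances(X_train, X_train))

# SVC with kernel='precomputed' takes the matrix instead of vectors,
# so a parameter search (e.g. over C) re-uses the same matrix.
clf = SVC(kernel='precomputed', C=1.0).fit(K_train, y_train)

# At test time, the kernel is between test and training vectors.
X_test = np.vstack([rng.normal(0, 1, (10, 30)), rng.normal(3, 1, (10, 30))])
K_test = np.exp(-gamma * euclidean_distances(X_test, X_train))
y_pred = clf.predict(K_test)
```

Because the expensive distance computation happens once, cross-validating C afterwards costs almost nothing extra.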
GPU-empowered pre-computed kernel
Van de Sande, TMM 2011
1. Compute average distances per N² kernel sub-block
2. Compute kernel function values
[Chart: kernel computation time versus total feature vector length, for a single Core i7 (2.66 GHz) and 1, 4, 16, and 25 Opteron 250 (2.4 GHz) CPUs; distributing over 25 CPUs gives a 65x speed-up, and the GPU a further 3x speed-up]
Feature fusion (cont'd)
Van de Sande, PAMI 2010
[Diagram: a point sampling strategy (dense sampling or Harris-Laplace salient points), color feature extraction, and the codebook model turn an image into a bag-of-features; the spatial pyramid yields multiple bags-of-features per image]
+ Codebook reduces dimensionality
- Combination still ad hoc
- One codebook may dominate
Classifier fusion
Classifier fusion (cont'd)
+ Focus on feature strength
+ Fusion in semantic space
- Expensive learning effort
- Loss of feature correlation
Unsupervised fusion of classifiers
Snoek, TRECVID 2006; Wang, ACM MIR 2007
[Diagram: global, regional, and keypoint image feature extraction feed a support vector machine, a Fisher linear discriminant, and logistic regression; their concept probabilities are combined by a geometric mean]
+ Aggregation functions reduce learning effort
+ Efficient use of training examples
- Linear function unlikely to be optimal
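The geometric-mean aggregation in the diagram needs no training at all; a minimal NumPy sketch (the `eps` clip guarding against a single zero probability is my addition):

```python
import numpy as np

def geometric_mean_fusion(prob_lists):
    """Unsupervised classifier fusion: combine per-classifier concept
    probabilities by their geometric mean, an aggregation function
    that requires no extra training examples.
    prob_lists: array-like (n_classifiers, n_shots) of probabilities."""
    p = np.asarray(prob_lists, dtype=float)
    eps = 1e-12  # keep one zero score from vetoing a shot entirely
    return np.exp(np.log(np.clip(p, eps, 1.0)).mean(axis=0))
```

For two classifiers scoring a shot 0.9 and 0.4, the fused score is sqrt(0.9 * 0.4) = 0.6; a shot both classifiers doubt stays low.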
Fusing concepts
Naphade, Trans. MM 2001
• Exploitation of concept co-occurrence
  – Concepts do not occur in a vacuum
[Diagram: linked concepts 1, 2, 3; e.g. Aircraft co-occurs with Sky]
How to fuse concepts?
• Learning spatial models
• Learning temporal models
• Include ontologies
Learning spatial models - explicitly
• Using graphical models
  – Computationally complex
  – Limited scalability
Qi, TOMCCAP 2009
Learning spatial models - implicitly
• Using a support vector machine, or data mining
  – Assumes the classifier learns relations
  – Suffers from error propagation
Weng, ACM MM 2008
Learning temporal models
• Extend spatial models with a time dimension
  – Common approach is the Hidden Markov Model
  – Relatively few have actually considered time…
Ebadollahi, ICME 2006
Including knowledge
• Can ontologies help?
  – Symbolic ontologies vs uncertain detectors
Wu, ICME 2004
Concept detection pipeline
IBM 2003
[Diagram: feature fusion → classifier fusion → concept fusion]
Video diver
Wang, ACM MIR 2007
[Diagram: feature fusion → classifier fusion → concept fusion]
Semantic Pathfinder
Snoek, PAMI 2006
[Diagram: three analysis steps, each ending in a supervised learner, with the best of the three paths selected after validation.
– Content analysis step: layout, capture, and content feature extraction combine into visual features (example concepts: animal, vehicle, flag, fire).
– Style analysis step: semantic features and their combination (example concepts: sports, vehicle).
– Context analysis step: textual and context feature extraction combine into multimodal features (example concepts: entertainment, monologue, weather news, Hu Jintao).
The pipeline again instantiates feature fusion, classifier fusion, and concept fusion.]
State-of-the-Art
Snoek et al, TRECVID 2008-2009; Van Gemert et al, PAMI 2010; Van de Sande et al, PAMI 2010
[Diagram: the pipeline reduces to feature fusion and classifier fusion]
Software available for download at http://colordescriptors.com
Conclusion on: Detecting semantic concepts in video
• We started with invariance and manual labor
• We generalized with machine learning
  – …but needed several abstractions to do so appropriately
• For the moment, no one-size-fits-all solution
  – Learn the optimal machinery per concept
1. Short course outline
.0 Problem statement
.1 Measuring features
.2 Concept detection
.3 Lexicon learning
.4 Query prediction
.5 Video browsing
Problem 3: Many things in the world
• This is the model gap
Trial 1: counting dictionary words
Biederman, Psychological Review 1987
Slide credit: Li Fei-Fei
Trial 2: reverse-engineering
Hauptmann, PIEEE 2008
• Estimation by Hauptmann et al.: 5000 concepts
  – Using manually labeled queries and concepts
  – But speculative, and questionable assumptions
[Figure: an oracle combination of concepts plus noise approaches 'Google performance', while a 'realistic' combination stays below it]
How to obtain labeled examples?
• Massive amounts of examples available (3 billion)
  – …but only human experts provide good quality examples
Experts start with concept definition
• MM078 - Police/Security Personnel
  – Shots depicting law enforcement or private security agency personnel.
Expert annotation tools
Volkmer, ACM MM 2005
• Balance between:
  – Spatiotemporal level of annotation detail
  – Number of concepts
  – Number of positive and negative examples
LSCOM (Large Scale Concept Ontology for Multimedia)
Naphade, IEEE MM 2006
• Provides manual annotations for 449 concepts
  – In international broadcast TV news
• Connection to Cyc ontology
• LSCOM-Lite
  – 39 semantic concepts
http://www.lscom.org/
Verified positive examples
• ImageNet (11M images)
  – 4,000 categories
  – >100 examples each
  Deng et al, CVPR 2009
• SUN (130K images)
  – 397 scene categories
  – >100 examples each
  Xiao et al, CVPR 2010
Bridging the model gap
• Requirements
  – Generic concept detection method
  – Massive amounts of labeled examples
  – Evaluation method
  – Fair amount of computation
Model gap best treated by TRECVID
• Situation in 2000
  – Various concept definitions
  – Specific and small data sets
  – Hard to compare methodologies
• Since 2001, worldwide evaluation by NIST
  – TRECVID benchmark
NIST TRECVID benchmark
• Promote progress in video retrieval research
  – Provide common dataset
  – Challenging tasks
  – Independent evaluation protocol
  – Forum for researchers to compare results
http://trecvid.nist.gov/
Video data sets
• US TV news (`03/`04)
• International TV news (`05/`06)
• Dutch TV infotainment (`07/`08/`09)
TRECVID 2010: Internet Archive web videos
Expert annotation efforts
[Chart: number of expertly annotated concepts per TRECVID edition (17, 32, 39, 101, 374, 500), contributed by LSCOM, MediaMill - UvA, and others]
Measuring performance
• Precision: fraction of the retrieved items that are relevant
• Recall: fraction of the relevant items that are retrieved
  – The two are in an inverse relationship
[Figure: ranked results 1-5, set of retrieved items versus set of relevant items]
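As a minimal sketch (with hypothetical item IDs), the two measures can be computed over the sets of retrieved and relevant shots:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving more items can only raise recall but typically lowers precision, which is the inverse relationship the slide refers to.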
Evaluation measure
• Average Precision (AP)
  – Combines precision and recall
  – Averages precision after each relevant shot
  – Top of ranked list most important
• AP = (1/1 + 2/3 + 3/4 + …) / number of relevant documents
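The AP formula above translates directly to code; a small sketch (shot IDs are hypothetical):

```python
def average_precision(ranked, relevant):
    """AP = (1/1 + 2/3 + 3/4 + ...) / number of relevant documents:
    precision is recorded at the rank of each relevant item found,
    so hits near the top of the list weigh most."""
    relevant = set(relevant)
    hits, precision_at_hit = 0, []
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precision_at_hit.append(hits / rank)
    return sum(precision_at_hit) / len(relevant) if relevant else 0.0
```

Relevant shots that are never retrieved add zero to the sum, which is how AP also penalizes low recall.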
De facto evaluation standard
Concept examples
• Aircraft
• Beach
• Mountain
• People marching
• Police/Security
• Flower
Note the variety in visual appearance
TRECVID concept detection task results (2003-2009)
• Hard to compare
  – Different video data
  – Different concepts
• Clear top performers
  – Median skews to left
  – Learning effect
  – Plenty of variation
UvA-MediaMill @ TRECVID
Snoek et al, TRECVID 04-09
• 900 other detection systems
1,000,000 frames analyzed
Snoek, ICME 2005
• Multi-frame analysis biggest improvement in 2008 / 2009
  – We analyze up to 10 extra i-frames/shot
  – For 2009, yields 1M frames to analyze for the test set
• Need to speed up by being "smart and strong"
  – Speed-up feature extraction
  – Speed-up quantization
  – Speed-up kernel-based learning
  – Speed-up by computing
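One generic way to speed up codebook quantization (a sketch of the idea, not necessarily the authors' implementation) is to replace the naive pairwise-distance computation with the expansion ||x - c||² = ||x||² - 2x·c + ||c||², evaluated blockwise with matrix products:

```python
import numpy as np

def quantize_naive(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word (brute force).
    Materializes an (n, k, d) intermediate array, which is slow and large."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def quantize_fast(descriptors, codebook, block=1024):
    """Same assignment via the ||x - c||^2 expansion: one matrix product
    per block of descriptors, with bounded memory use."""
    c2 = (codebook ** 2).sum(axis=1)
    words = np.empty(len(descriptors), dtype=np.int64)
    for start in range(0, len(descriptors), block):
        x = descriptors[start:start + block]
        d2 = (x ** 2).sum(axis=1)[:, None] - 2.0 * (x @ codebook.T) + c2[None, :]
        words[start:start + block] = d2.argmin(axis=1)
    return words
```

The matrix-product form maps well to optimized BLAS routines and, in the same spirit, to the GPU computing mentioned on the next slide.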
Computing
• Best 2009 system much more efficient than 2008 system
  – 6x more visual data analyzed using less compute power
• Some best estimates:
  – Visual feature extraction: 8,400 Processor-Node-Hours (PNH)
  – Training concept detectors: 4,000 PNH
  – Applying concept detectors: ~1 week GPU
MediaMill Challenge
http://www.mediamill.nl/challenge/
• The Challenge provides
  – Manually annotated lexicon of 101 semantic concepts
  – Pre-computed low-level features
  – Trained classifier models
  – 5 experiments
  – Implementation + results
• The Challenge allows you to
  – Gain insight in intermediate video analysis steps
  – Foster repeatability of experiments
  – Optimize video analysis systems on a component level
  – Compare and improve
• The Challenge lowers the threshold for novice researchers
Columbia374 + VIREO374
• Baseline detectors for 374 concepts
http://www.ee.columbia.edu/ln/dvmm/columbia374/
http://www.cs.cityu.edu.hk/~yjiang/vireo374/
http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/
Community myths or facts?
• Chua et al., ACM Multimedia 2007
  – Video search is practically solved and progress has only been incremental
• Yang and Hauptmann, ACM CIVR 2008
  – Current solutions are weak and generalize poorly
We have done an experiment
• Two video search engines from 2006 and 2009
  – MediaMill Challenge 2006 system
  – MediaMill TRECVID 2009 system
• How well do they detect 36 LSCOM concepts?
Four video data set mixtures
• Training on TRECVID 2005 broadcast news or TRECVID 2007 documentary video
• Testing on TRECVID 2005 broadcast news or TRECVID 2007 documentary video
• Yields within-domain and cross-domain combinations
Performance doubled in just 3 years
Snoek & Smeulders, IEEE Computer 2010
• 36 concept detectors
  – Even when using training data of different origin
  – Vocabulary still limited
500 detectors, a closer look
The number of labeled image examples used at training time seems decisive in concept detector accuracy.
Demo time
500 detectors, a closer look
Learning social tag relevance by neighbor voting
Xirong Li, TMM 2009
• Exploit consistency in tagging behavior of different users for visually similar images
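A toy sketch of the neighbor-voting idea (the data structures and the exact prior correction are simplifications of the published algorithm): a tag receives one vote from each visually similar image that also carries the tag, and the number of votes expected by chance is subtracted:

```python
def tag_relevance(image_id, tag, visual_neighbors, tags_of):
    """Neighbor-voting score: votes from visual neighbors carrying the tag,
    minus the votes the tag would get by chance (its global prior).
    A positive score suggests the tag describes the visual content."""
    neighbors = visual_neighbors[image_id]
    votes = sum(1 for n in neighbors if tag in tags_of[n])
    prior = sum(1 for tags in tags_of.values() if tag in tags) / len(tags_of)
    return votes - len(neighbors) * prior
```

Subtracting the prior is what keeps globally frequent but visually meaningless tags from scoring high.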
Algorithm: tag-relevance learning
Why is this useful?
Image retrieval experiments
• User-tagged image database
  – 3.5 million labeled Flickr images
• Visual feature
  – A global color-texture descriptor
• Evaluation set
  – 20 concepts
• Evaluation criterion
  – Average precision
Image retrieval experiments
• A standard tag-based retrieval framework
  – Ranking function: OKAPI-BM25
• Comparison
  – Baseline: retrieval using original tags
  – Neighbor: retrieval using learned tag relevance as tag frequency
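The slide names OKAPI-BM25 as the ranking function; a generic BM25 sketch over tag lists (textbook form, not necessarily the authors' exact configuration). In the Neighbor variant, the learned tag relevance would replace the raw term frequency `tf` below:

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 over a list of documents, each a list of terms
    (here: an image's tag list). Returns one score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {}  # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            if t not in df:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            tf = d.count(t)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

The saturation in the tf term is why a tag repeated (or strongly neighbor-supported) helps, but with diminishing returns.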
Results
• 24% relative improvement
Updated tag relevance
Results suggest…
• Relevance of a tag can be predicted based on the 'wisdom' of crowds
  – Even with a light-weight visual feature
  – And a small database of 3.5M images
Conclusion on lexicon learning
• Requires
  – Invariant features
  – Concept detection
  – Many, many annotations
  – Measuring performance
  – Lots of computation
• Suffers most from
  – Weakly labeled visual data
  – Transfer across domains
1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Query prediction
1.5 Video browsing
Problem 4: Vocabulary problem
• This is the query-context gap
• Query methods: query-by-keyword, query-by-concept, query-by-example, query-by-humming, query-by-gesture
  – Any combination, any sequence? Prediction
• Example query: Find shots of people shaking hands
Traditional approaches
• Parse topic-text, reframe as query-by-keyword
  – Using speech recognition or closed captions
• Use possible images accompanying the topic for query-by-example
  – Using shot-based keyframes
A new hope?
Quote: "We are now seeing researchers starting to use the confidence values from concept detectors within the shot retrieval process and this appears to be the roadmap for future work in this area."
Alan Smeaton, Inf. Sys., 2007
Video query examples
• Find shots of a hockey rink with at least one of the nets fully visible from some point of view.
• Find shots of one or more helicopters in flight.
• Find shots of a group including at least four people dressed in suits, seated, and with at least one flag.
• Find shots of an office setting, i.e., one or more desks/tables and one or more computers and one or more people.
Typical 'oracle' results
• Find shots of a graphic map of Iraq, location of Baghdad marked - not a weather map.
  – Best / 2nd best detectors: Maps; Maps + Overlayed Text
• Find shots of George Bush entering or leaving a vehicle (e.g., car, van, airplane, helicopter, etc.) (he and vehicle both visible at the same time)
  – Selected detectors: Iyad Allawi + rocket propelled grenades
• How to select relevant detectors automatically?
Detector selection strategies
Video query: Find shots of an office setting
• Text-based
• Visual-based
• Ontology-based
Text-based selection
• Represent concept descriptions as term vectors
  – Exact matching to link queries to detectors
  – Vector space model to match queries to descriptions
  – Corpus-driven query expansion methods
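A minimal vector-space sketch (no query expansion; detector names and definitions are illustrative): queries and concept descriptions become term-count vectors, and detectors are ranked by cosine similarity:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_detectors(query, descriptions):
    """Rank concept detectors by how well their textual description
    matches the query text (descriptions: name -> definition string)."""
    q = Counter(query.lower().split())
    score = {name: cosine(q, Counter(text.lower().split()))
             for name, text in descriptions.items()}
    return sorted(score, key=score.get, reverse=True)
```

Exact matching is the degenerate case where only a perfect term overlap links a query to a detector; the cosine relaxes that.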
Recap: concept definition
• MM078 - Police/Security Personnel
  – Shots depicting law enforcement or private security agency personnel.
Visual-based selection
Rasiwasia et al, TMM 2007
• Training stage: feature extraction, then supervised learner
• Testing stage: feature extraction, then classify the query image
• Example detector confidences for a query image: face p=0.97, outdoor p=0.98, helicopter p=0.43, …
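In code, visual-based selection amounts to scoring the query image with every detector and keeping the confident ones. A sketch with stand-in detector functions (the probabilities mirror the example above; the threshold is an assumption):

```python
def select_by_visual(query_image, detectors, threshold=0.5):
    """Run each concept detector on the query example; keep concepts whose
    confidence exceeds the threshold, most confident first."""
    scored = {name: detect(query_image) for name, detect in detectors.items()}
    confident = [name for name, p in scored.items() if p >= threshold]
    return sorted(confident, key=scored.get, reverse=True)
```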
Ontology-based selection
"Find a report from the desert showing a house or car on fire."
1. Identify objects in WordNet: car, desert, house, fire
Slide credit: Bouke Huurnink
2. Identify related concept detectors: car, desert, house, fire
3. Find the most similar and specific detector using an ontology measure: car maps to vehicle, house to building, desert to desert, fire to fire
Wei, TMM 2008
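The ontology measure can be sketched with a Wu-Palmer-style similarity over a toy hypernym tree (the tree below is hypothetical; a real system would use WordNet, e.g. via NLTK, as in Wei, TMM 2008):

```python
def path_to_root(term, parent_of):
    """Hypernym chain from the ontology root down to the term."""
    path = [term]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]

def similarity(t1, t2, parent_of):
    """Wu-Palmer-style score: twice the depth of the deepest shared
    ancestor, divided by the summed depths of both terms."""
    p1, p2 = path_to_root(t1, parent_of), path_to_root(t2, parent_of)
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    return 2.0 * shared / (len(p1) + len(p2))

def best_detector(query_term, detector_terms, parent_of):
    """Pick the detector whose concept lies closest in the ontology."""
    return max(detector_terms, key=lambda d: similarity(query_term, d, parent_of))
```

Dividing by both depths favors the more specific of two equally related detectors, which is the "most similar and specific" criterion above.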
Search strategy combination
1. Parallel combination
  – All retrieval results are taken into account simultaneously
  – Weighted average of individual results
2. Sequential combination
  – Update video retrieval results in succession
  – Pseudo-relevance feedback variants
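Parallel combination reduces to a weighted sum of per-module scores; a sketch (module names, shot IDs, and weights are illustrative):

```python
def combine_parallel(module_scores, weights):
    """Fuse results from several retrieval modules by a weighted average
    of their shot scores; shots a module did not score contribute zero."""
    fused = {}
    for scores, weight in zip(module_scores, weights):
        for shot, score in scores.items():
            fused[shot] = fused.get(shot, 0.0) + weight * score
    return sorted(fused, key=fused.get, reverse=True)
```

Choosing the weights is exactly the open problem the next slide discusses.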
Parallel combination
Paul Natsev, 2005 - 2007
• Estimating weights of individual retrieval modules is the main problem
  – Predefine by expert
  – Learn from training data
  – Oracle
  – …
Sequential combination
Hsu, IEEE MM 2007
• Quality depends on ranking in first stage
Tackling the query-context gap
• Requirements
  – Several video retrieval methods (by-keywords / by-example / by-many-concepts)
  – Detector selection and combination method
  – Training data
  – Search topics
  – Evaluation method
  – Well-defined application domain?
Query-context gap is reasonably addressed at TRECVID
• Automatic search task
  – Automatically solve 20+ search topics
  – Return 1,000 ranked shot-based results per topic
  – Evaluate using Average Precision
• Drawbacks
  – Queries tend to be overly complex, limited in number, drifting away from real-world usage
  – Inclusion of recall lowers performance
  – Lack of training data
http://trecvid.nist.gov/
Query prediction at TRECVID
• Performance is humble
  – Lack of training data
  – Especially in 2007-2008
• Most pronounced
  – 2005: Large semantic lexicon
[Chart: Mean Average Precision per TRECVID edition]
Conclusion on query prediction
• Retrieval tasks (in)directly covered by concepts in the lexicon benefit from detector selection
• Retrieval tasks not covered by the lexicon result in humble performance only
• We need a better evaluation setup
1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Query prediction
1.5 Video browsing
Problem 5: Use is open-ended
• This is the interface gap
[Figure: interface dimensions such as screen, scope, and keywords]
So many choices for retrieval…
• Why not let the user decide interactively?
  – Navigate through query methods
  – Visualize video retrieval results
  – Learn from browsing behavior
Video search 1.0
Note the influence of textual meta data, such as the video title, on the search results.
Query selection
'Classic' Informedia system
Carnegie Mellon University
• First multimodal video search engine
Físchlár
Dublin City University
• Optimized for use by "real" users
IBM iMARS
IBM Research
• A web-based system
http://mp7.watson.ibm.com/
MediaMagic
FxPal
• Focus on the story level
VisionGo
NUS & ICT-CAS
• Extremely fast and efficient
CrossBrowsing through results
Snoek, TMM 2007
• Browse along rank and time dimensions
• Sphere variant
Demo: MediaMill video search engine
http://www.mediamill.nl
The RotorBrowser
de Rooij, TMM 2009
Extreme video retrieval = very demanding!
Carnegie Mellon University
• Observation
  – Correct results are retrieved, but not optimally ranked
  – If the user has time to scan results exhaustively, retrieval is a matter of watching, selecting, and sorting quickly
Learning from the user
• Two common approaches
  1. Relevance feedback
  2. Active learning
Relevance feedback
Slide credit: Marcel Worring
• Try to find the boundary in feature space that best separates positive from negative examples
• In the next iteration the user will have labeled more samples, hence a better estimate of the boundary
[Figure: feature space F1 x F2, with a measure of class membership probability]
Active learning
Slide credit: Marcel Worring
• In active learning the system decides which elements to show for feedback and which not
  – Some samples the system can safely assume to be negative
  – For other samples it is relevant to know the label
[Figure: feature space F1 x F2]
Demo: ForkBrowser
de Rooij, CIVR 2008
• Learn positive and negative items from user browse behavior
Demo: Timeline navigation
http://hollandsglorieoppinkpop.nl/
The future of video retrieval?
Jonathan Wang, Carnegie Mellon University
Interface gap best addressed by
• TRECVID interactive search task
  – Interactively solve 20+ search topics (10/15 minutes)
  – Return 1,000 ranked shot-based results per topic
  – Evaluate using Average Precision
• VideOlympics showcase
Video browsing at TRECVID
• Wide performance variation
  – # concept detectors
  – Search interface
  – Expert vs novice user
• Most pronounced
  – 2003: Informedia classic
  – 2005: Large semantic lexicon
  – 2008: Active learning
[Chart: Mean Average Precision per TRECVID edition]

UvA-MediaMill browsers @ TRECVID
Snoek et al, TRECVID 04-09
• CrossBrowser and ForkBrowser
• 228 other interactive systems (traditional systems)
Criticism
• Retrieval performance cannot be the only evaluation criterion
  – Quality of detectors counts
  – Experience of searcher counts
  – Visualization of interface counts
  – Ease of use counts
  – …
Video browsing at VideOlympics
• Promote multiple facets of video search
  – Real-time interactive video search 'competition'
  – Simultaneous exposure of multiple video search engines
  – Highlight possibilities and limitations of state-of-the-art
Participants
Video trailer
http://www.VideOlympics.org
Conclusion on video browsing
• Interaction by browsing is indispensable for any practical video search engine
• System should support the user by active learning and intuitive (mobile) visualizations

Conclusion on: Interactive Video Retrieval
[Diagram: measuring features, concept detection, lexicon learning, query prediction, and video browsing, centered on video]
And there is always more…
• Recommended special issues
  – IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), November 2008
  – Proceedings of the IEEE, 96(4), April 2008
  – IEEE Transactions on Multimedia, 9(5), August 2007
• 300 references on video search
  – Snoek and Worring, Concept-Based Video Retrieval, Foundations and Trends in Information Retrieval, Vol. 4 (2), pp. 215-322, 2009.
General references I
Color Invariance. Jan-Mark Geusebroek, R. van den Boomgaard, Arnold W. M. Smeulders, H. Geerts. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 23 (12), page 1338-1350, 2001.
Distinctive Image Features from Scale-Invariant Keypoints. D. G. Lowe. Int'l Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
Large-Scale Concept Ontology for Multimedia. M. R. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. S. Kennedy, A. G. Hauptmann, and J. Curtis. IEEE MultiMedia, vol. 13, pp. 86-91, 2006.
Efficient Visual Search for Objects in Videos. J. Sivic and A. Zisserman. Proceedings of the IEEE, vol. 96, pp. 548-566, 2008.
High Level Feature Detection from Video in TRECVid: A 5-year Retrospective of Achievements. A. F. Smeaton, P. Over, and W. Kraaij. In Multimedia Content Analysis, Theory and Applications (A. Divakaran, ed.), Springer, 2008.
Visually Searching the Web for Content. J. R. Smith and S.-F. Chang. IEEE MultiMedia, vol. 4, pp. 12-20, 1997.
Content Based Image Retrieval at the End of the Early Years. Arnold W. M. Smeulders, Marcel Worring, S. Santini, A. Gupta, R. Jain. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 22 (12), page 1349-1380, 2000.
General references II
The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, Arnold W. M. Smeulders. ACM Multimedia, page 421-430, 2006.
The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, Arnold W. M. Smeulders. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 28 (10), page 1678-1689, 2006.
A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval. Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, Arnold W. M. Smeulders. IEEE Trans. Multimedia, Volume 9 (2), page 280-292, 2007.
Adding Semantics to Detectors for Video Retrieval. Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, Marcel Worring. IEEE Trans. Multimedia, Volume 9 (5), page 975-986, 2007.
The MediaMill TRECVID 2004-2009 Semantic Video Search Engine. Cees G. M. Snoek et al. Proceedings of the TRECVID Workshop, 2004-2009.
Visual-Concept Search Solved? Cees G. M. Snoek and Arnold W. M. Smeulders. IEEE Computer, Volume 43 (6) (in press), 2010.

General references III
Concept-Based Video Retrieval. Cees G. M. Snoek, Marcel Worring. Foundations and Trends in Information Retrieval, Vol. 4 (2), page 215-322, 2009.
http://www.science.uva.nl/research/publications/
Local Invariant Feature Detectors: A Survey. T. Tuytelaars and K. Mikolajczyk. Foundations and Trends in Computer Graphics and Vision, vol. 3, pp. 177-280, 2008.
Evaluating Color Descriptors for Object and Scene Recognition. Koen E. A. van de Sande, Theo Gevers, Cees G. M. Snoek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2010.
Visual Word Ambiguity. Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2009.
Real-Time Bag of Words, Approximately. Jasper R. R. Uijlings, Arnold W. M. Smeulders, R. J. H. Scha. ACM Int'l Conference on Image and Video Retrieval, 2009.
Lessons Learned from Building a Terabyte Digital Video Library. H. D. Wactlar, M. G. Christel, Y. Gong, and A. G. Hauptmann. IEEE Computer, vol. 32, pp. 66-73, 1999.
Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Int'l Journal of Computer Vision, vol. 73, pp. 213-238, 2007.
Contact info
• Cees Snoek
  http://staff.science.uva.nl/~cgmsnoek
• Arnold Smeulders
  http://staff.science.uva.nl/~smeulder

Further information
www.MediaMill.nl