

Internet Video Search

Cees G.M. Snoek & Arnold W.M. Smeulders
Intelligent Systems Lab Amsterdam, University of Amsterdam, The Netherlands

A brief history of television
• From broadcasting to narrowcasting
• …to thincasting
~1955, ~1985, ~2005, ~2008, ~2010, …as of May 2011

The international business case
• Everybody with a message uses video for delivery
• Growing unmanageable amounts of video

Crowd-given search: what others say is in the video.
Raw-driven search

www.science.uva.nl/research/isla
MultimediaN project


1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Telling stories
1.5 Video browsing

Problem 1: Variation in appearance
So many images of one thing, due to minor differences in: illumination, background, occlusion, viewpoint, …
• This is the sensory gap
[Figure: images of a suit, a basketball, and a tree, each entering the machine as a different string of bits]

Problem 2: What defines things?
[Figure: multimedia archives reach the machine as strings of bits, while humans see concepts such as Table, US flag, Aircraft, Dog, Tennis, Mountain, Fire, Building]
• This is the semantic gap

Problem 3: Many things in the world
• This is the model gap

Problem 4: What story does a video tell?
• This is the narrative gap

Problem 5: Use is open-ended
• This is the interface gap
[Figure: the open-ended scope of use, the screen, and keywords]


Conclusion on problems
• Video search is a diverse and challenge-rich research topic
– Sensory gap
– Semantic gap
– Model gap
– Narrative gap
– Interface gap

Today’s promise
• You will be acquainted with the theory and practice of the semantic video search paradigm.
• You will be able to recall the five major scientific problems in video retrieval, and to explain and value the present-day solutions.

1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Telling stories
1.5 Video browsing

A million appearances
There are a million appearances to one concept.
Where are the patterns (of the same shoe)?
Somewhere the variance must be removed.

Invariance: the need for ~
The illumination and the viewing direction are removed as soon as the image has entered.

Common transformations
• Illumination transformations
– Contrast
– Intensity and shadow
– Color
• Viewpoint
– Rotation and lateral
– Distance
– Viewing angle
– Projection
• Cover
– Specular or matte
• Occlusion & clutter, wear & tear, aging, night & day, and so on into increasingly complex transformations.


More than one transformation
Features of selected points may be good enough to describe an object, iff the selection & the feature set are both invariant for scene-accidental conditions.
Gevers, TIP 2000

Design of invariants: orbits
For a property variant under W, observations of a constant property are spread over the orbit. The purpose of an invariant is to capture all of the orbit into one value.

Example: invariance
[Figure: invariance under projection]
Slide credit: Theo Gevers

Color invariance
Gevers, PRL 1999; Geusebroek, PAMI 2001

From $(R,G,B)$-space to $(c_1,c_2,c_3)$-space:

$c_1(R,G,B) = \arctan\left(\frac{R}{\max\{G,B\}}\right)$
$c_2(R,G,B) = \arctan\left(\frac{G}{\max\{R,B\}}\right)$
$c_3(R,G,B) = \arctan\left(\frac{B}{\max\{R,G\}}\right)$

Invariance of the color models (+ = invariant, - = not invariant):

      shadows   shading   highlights   ill. intensity   ill. color
E     -         -         -            -                -
W     -         +         -            +                -
C     +         +         -            +                -
M     +         +         -            +                +
N     +         +         -            +                +
L     +         +         +            +                -
H     +         +         +            +                -
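A minimal numpy sketch of the $(c_1,c_2,c_3)$ transform above; the epsilon guard against division by zero and the toy assertion are my additions, not part of the original formulation:

```python
import numpy as np

def c1c2c3(rgb: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map an RGB image (H x W x 3, float) into (c1, c2, c3) space.

    c1 = arctan(R / max{G, B}), c2 = arctan(G / max{R, B}),
    c3 = arctan(B / max{R, G}); eps guards against division by zero.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    c1 = np.arctan(r / (np.maximum(g, b) + eps))
    c2 = np.arctan(g / (np.maximum(r, b) + eps))
    c3 = np.arctan(b / (np.maximum(r, g) + eps))
    return np.stack([c1, c2, c3], axis=-1)

# A uniform intensity change (e.g. shading) scales all channels and
# leaves the channel ratios, and therefore c1c2c3, essentially unchanged.
img = np.random.rand(4, 4, 3)
assert np.allclose(c1c2c3(img), c1c2c3(0.5 * img), atol=1e-5)
```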

Local shape motivation
• Perceptual importance
• Concise data
• Robust to occlusion & clutter
Tuytelaars, FTCGV 2008

Meet the Gaussians
Taylor expansion at x.
For a discretely sampled signal, use the Gaussians (a sketch follows below):
• Robust additive differentials
• Dimensions separable
• No maxima introduced
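A short sketch of measuring such receptive fields on a discretely sampled image with Gaussian derivative filters; it assumes scipy is available, and the scale value is an arbitrary illustration:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(128, 128)  # stand-in for a sampled video frame
sigma = 2.0                       # observation scale (illustrative choice)

# Receptive fields up to second order: the smoothed value plus first and
# second Gaussian derivatives; the filters are separable per dimension.
f    = gaussian_filter(image, sigma)
f_x  = gaussian_filter(image, sigma, order=(0, 1))  # d/dx
f_y  = gaussian_filter(image, sigma, order=(1, 0))  # d/dy
f_xx = gaussian_filter(image, sigma, order=(0, 2))
f_xy = gaussian_filter(image, sigma, order=(1, 1))
f_yy = gaussian_filter(image, sigma, order=(2, 0))

gradient_magnitude = np.hypot(f_x, f_y)  # a robust differential observable
```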


Meet the Gaussians
The basic video observables are local receptive fields f(x):
• The receptive fields up to second order.
• Grey value as well as opponent color sets.

Taxonomy of image structure
[Figure: T-junction, highlight, corner, junction]
Slide credit: Theo Gevers

From receptive fields to meaning, with examples
Lee et al., Comm. ACM 2011

Meet Gabor
The 2D Gabor function is:

$h(x,y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}}\, e^{j 2\pi (ux+vy)}$

Tuning parameters: u, v, σ + the usual invariants by combination.
Manjunath and Ma on Gabor for texture, as seen in F-space.
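A numpy sketch of the Gabor function above as a complex-valued filter; the filter size and the (u, v, σ) values are illustrative choices, not prescribed by the slide:

```python
import numpy as np

def gabor(size: int, sigma: float, u: float, v: float) -> np.ndarray:
    """Complex 2D Gabor: a Gaussian envelope times a plane wave at (u, v).

    h(x, y) = 1/(2*pi*sigma^2) * exp(-(x^2 + y^2) / (2*sigma^2))
                               * exp(j*2*pi*(u*x + v*y))
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    carrier = np.exp(2j * np.pi * (u * x + v * y))
    return envelope * carrier

# A small bank tuned to two frequencies and four orientations.
bank = [gabor(31, sigma=4.0, u=f * np.cos(t), v=f * np.sin(t))
        for f in (0.1, 0.2) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```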

Local receptive fields F(x)
• The receptive fields for (u, v) measured locally.
• Grey value as well as opponent color sets.
Hoang 2003


Gabor filters: texture
Hoang, ECCV 2002
[Figure: original image → K-means clustering → segmentation]

Local receptive field in f(x,t)
Gaussian equivalent over x and t: zero order, first order over t.
Burghouts, TIP 2006

Receptive fields: overview
All observables up to first order color, second order spatial scales, eight frequency bands & first order in time.

Good observables > easy algorithms
• Periodicity: detect periodic motion by one steered filter
– A dead-simple algorithm…
Burghouts, TIP 2006

Meet the Loweans
So far we paid respect to the spatial order.
Now we will follow the spatial order only weakly and form histograms of all directions we encounter locally, … better known as (the second part of) SIFT.
Lowe, IJCV 2004


Meet the Loweans
• 4 x 4 gradient window after thresholding
• Histogram of 4 x 4 samples per window, in 8 directions
• Gaussian weighting around the center (σ is 1/2 of the keypoint σ)
• 4 x 4 x 8 = 128-dimensional feature vector
Image: Jonas Hurrelmann
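A compact numpy sketch of the descriptor construction above (4 x 4 spatial cells, 8 orientation bins, Gaussian weighting, 128 dimensions); keypoint detection, dominant-orientation alignment, and the interpolation details of real SIFT are omitted, and the patch-centered σ is a simplification:

```python
import numpy as np

def sift_like_descriptor(patch: np.ndarray) -> np.ndarray:
    """128-d descriptor from a 16x16 gray patch: 4x4 cells x 8 orientation bins."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)

    # Gaussian weighting centered on the patch (sigma = half the window width).
    half = patch.shape[0] / 2.0
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    weight = np.exp(-((ys - half + 0.5)**2 + (xs - half + 0.5)**2) / (2 * half**2))
    magnitude = magnitude * weight

    bins = np.floor(orientation / (2 * np.pi / 8)).astype(int) % 8
    descriptor = np.zeros((4, 4, 8))
    for cy in range(4):                      # 4 x 4 grid of cells
        for cx in range(4):
            cell = (slice(cy * 4, cy * 4 + 4), slice(cx * 4, cx * 4 + 4))
            np.add.at(descriptor[cy, cx], bins[cell].ravel(), magnitude[cell].ravel())

    descriptor = descriptor.ravel()
    descriptor /= np.linalg.norm(descriptor) + 1e-8  # L2 normalize
    descriptor = np.minimum(descriptor, 0.2)         # threshold, as in Lowe 2004
    return descriptor / (np.linalg.norm(descriptor) + 1e-8)

assert sift_like_descriptor(np.random.rand(16, 16)).shape == (128,)
```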

SIFT detection
Slide credit: Jepson 2005

Enriching SIFT (in a nutshell)
• Affine SIFT: choose the prominent direction in SIFT [Mikolajczyk, IJCV 2005]
• PCA-SIFT: robust and compact representation [Ke, CVPR 2004]
• ColorSIFT: add several invariant color descriptors [Van de Sande, PAMI 2010]
• TimeSIFT: anyone?

Illumination invariance
Invariance properties of the descriptors used [Van de Sande, PAMI 2010]:

Descriptor    | light intensity change | light intensity shift | light intensity change and shift | light color change | light color change and shift
SIFT          | +   | +   | +   | +   | +
OpponentSIFT  | +/- | +   | +/- | +/- | +/-
C-SIFT        | +   | +   | +   | +/- | +/-
rgSIFT        | +   | +   | +   | +/- | +/-
RGB-SIFT      | +   | +   | +   | +   | +

Where to sample in the video?
• A video shot is a set of frames representing a continuous camera action in time and space
– Analysis is typically on a single key frame per shot
[Figure: shot → key frame]

Where to sample in the frame?
Tuytelaars, FTCGV 2008


Interest point examples
[Figure: original image, Harris-Laplace, color salient points]
Mikolajczyk, CVPR 2006; van de Weijer, PAMI 2006

Dense sampling example
What is the object in the middle?
No segmentation … no pixel values of the object …

Fast dense descriptors
Uijlings et al., CIVR 2009
[Figure: image patch; reusing sub-regions gives a 16x speed-up]

Conclusion on measuring features
• Invariance is crucial when designing features
– More invariance means less stable…
– …but more robustness to the sensory gap
• Effective features strike a balance between invariance and discriminatory power
– And for video search, efficiency is helpful also…

And there is always more …
For example:
Local Invariant Feature Detectors: A Survey
Tinne Tuytelaars & Krystian Mikolajczyk
FTCGV 3:3 (177-280)


1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Telling stories
1.5 Video browsing

The semantic gap
Quote:
“The semantic gap is the lack of coincidence between the information that one can extract from the sensory data and the interpretation that the same data has for a user in a given situation.”
Arnold Smeulders, PAMI, 2000

The science of labeling
• To understand anything in science, things need a name that is universally recognized
• Worldwide endeavor in naming visual information
– living organisms, chemical elements, the human genome, ‘categories’ in text

Naming visual information
• Concept detection: does the image contain an airplane? (the focus of today’s lecture)
• Object localization: where is the airplane (if any)?
• Object segmentation: which pixels are part of an airplane (if any)?
Slide credit: Andrew Zisserman

How difficult is the problem?
• Human vision consumes 50% of brain power…
Van Essen, Science 1992

Semantic concept detection
• The patient approach: building detectors one-at-a-time
– e.g. a face detector for frontal faces


A simple face detector
One PhD per detector requires too many students…
So how about these?
…and the thousands of others…

Basic concept detection
[Pipeline: Training: labeled examples → feature extraction → supervised learner. Testing: video → feature measurement → classification.]
Output: it is an aircraft with probability 0.7; it is outdoor with probability 0.95.
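A minimal scikit-learn sketch of this train/test pipeline; the random numbers stand in for real visual features and labels, and the concept name follows the slide's example:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Training stage: features extracted from labeled example key frames
# (random numbers stand in for real visual features here).
train_features = rng.normal(size=(200, 128))
train_labels = rng.integers(0, 2, size=200)  # 1 = concept present, e.g. 'aircraft'
detector = SVC(kernel="rbf", probability=True).fit(train_features, train_labels)

# Testing stage: measure the same features on unseen video key frames,
# then classify to obtain a concept probability per shot.
test_features = rng.normal(size=(5, 128))
for shot, p in enumerate(detector.predict_proba(test_features)[:, 1]):
    print(f"shot {shot}: aircraft probability {p:.2f}")
```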

Demo: concept detection
Visualization by Jasper Schulte

Linear classifiers: margin
Slide credit: Cordelia Schmid

Nonlinear SVM
Slide credit: Cordelia Schmid


Breakthrough: nonlinear SVMs
Vapnik, 1995

Nonlinear SVMs
Slide credit: Cordelia Schmid

Kernels for concept detection
Slide credit: Cordelia Schmid

Comparing kernels
Zhang, IJCV 2007

Causes of poor generalization
• Over-fitting
– Separate your data
• Curse of dimensionality
– Information fusion helps

Feature fusion
[Pipeline: shot-segmented video → synchronization → normalization → transformation → concatenation → feature vector → concept confidence]
+ Only one learning phase
- Combination often ad hoc
- One feature may dominate
- Curse of dimensionality

Avoiding the dimensionality curse
• Codebook, aka bag-of-words model [Leung and Malik, IJCV 2001; Sivic and Zisserman, ICCV 2003; and many others…]
– Create a codeword vocabulary
– Discretize the image with codewords
– Count codewords (a sketch follows below)
[Figure: codeword histogram]
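A sketch of these three steps with scikit-learn's k-means; the descriptors are random stand-ins and the vocabulary size is a toy value (thousands of codewords are common in practice, as a later slide notes):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for local descriptors (e.g. SIFT) pooled from many training images.
training_descriptors = rng.normal(size=(1000, 128))

# 1) Create the codeword vocabulary by clustering descriptors.
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(training_descriptors)

def bag_of_words(image_descriptors: np.ndarray) -> np.ndarray:
    """2) Discretize descriptors by nearest codeword and 3) count codewords."""
    words = codebook.predict(image_descriptors)
    histogram = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return histogram / histogram.sum()  # normalize over the descriptor count

bow = bag_of_words(rng.normal(size=(300, 128)))
assert bow.shape == (50,)
```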

Emphasizing spatial configurations
• Codebook ignores geometric correspondence [Grauman, ICCV 2005; Lazebnik, CVPR 2006; Marszalek, VOC 2007]
• For video: …
• Solution: spatial pyramid, aggregating statistics of local features over fixed subregions (a sketch follows below)
– 1x1 entire image
– 2x2 image quarters
– 1x3 horizontal bars
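A numpy sketch of the 1x1 + 2x2 + 1x3 pyramid above: each subregion gets its own codeword histogram over the keypoints that fall inside it, and the eight region histograms are concatenated. Keypoint locations and codeword assignments are assumed to be given:

```python
import numpy as np

def spatial_pyramid(points, words, width, height, n_words):
    """Concatenate per-region codeword histograms for the 1x1, 2x2,
    and 1x3 (three horizontal bars) grids: 1 + 4 + 3 = 8 regions.

    points: (N, 2) array of (x, y) keypoint locations
    words:  (N,) codeword index per keypoint
    """
    histograms = []
    for rows, cols in [(1, 1), (2, 2), (3, 1)]:  # (3, 1): stacked horizontal bars
        for r in range(rows):
            for c in range(cols):
                inside = ((points[:, 0] >= c * width / cols) &
                          (points[:, 0] < (c + 1) * width / cols) &
                          (points[:, 1] >= r * height / rows) &
                          (points[:, 1] < (r + 1) * height / rows))
                histograms.append(np.bincount(words[inside], minlength=n_words))
    return np.concatenate(histograms).astype(float)  # length 8 * n_words

rng = np.random.default_rng(0)
points = rng.uniform([0, 0], [640, 480], size=(500, 2))
words = rng.integers(0, 50, size=500)
assert spatial_pyramid(points, words, 640, 480, 50).shape == (400,)
```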

Codebook model
• Codebook consists of codewords
– k-means clustering of descriptors
– Commonly 4,000 codewords per codebook
[Pipeline: Dense+OpponentSIFT → cluster → assign → feature vector (length 4,000)]

Codebook assignment
van Gemert, PAMI 2010
[Figure: hard assignment vs. soft assignment of descriptors to codewords]
– Soft assignment: assign to multiple clusters, weighted by distance to the center
– Single sigma (distance weighting) for all codebook elements

Extending soft assignment
• Fisher vector [Perronnin, ECCV 2010]
– Train a Gaussian mixture model, where each codebook element has its own sigma, one per dimension
– Do not store the assignment, but the differences in all descriptor dimensions
• Greatly increases complexity
– feature vector is #codewords x #descriptor dimensions


Extending soft assignment
• Super vector coding [Zhou, ECCV 2010]
– also counts the dimension-wise difference of a SIFT descriptor to a visual word
• Key insight: these methods propose many new components and algorithms, but difference coding is their main contribution

Difference coding
• Fisher vector [Perronnin, ECCV 2010]
• Super vector coding [Zhou, ECCV 2010]
• VLAD [Jegou, CVPR 2010]
(a VLAD sketch follows below)
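A minimal numpy sketch of difference coding in its simplest form, VLAD: per codeword, sum the differences between the assigned descriptors and the cluster center, then concatenate. The power and L2 normalizations at the end are common follow-ups, not part of the slide:

```python
import numpy as np

def vlad(descriptors: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """VLAD encoding: per-center sum of residuals (descriptor - center).

    Output length is n_centers * descriptor_dim, illustrating how
    difference coding blows up the feature dimensionality.
    """
    # Hard-assign each descriptor to its nearest center.
    distances = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)

    encoding = np.zeros_like(centers)
    for k in range(len(centers)):
        assigned = descriptors[assignment == k]
        if len(assigned):
            encoding[k] = (assigned - centers[k]).sum(axis=0)

    encoding = encoding.ravel()
    encoding = np.sign(encoding) * np.sqrt(np.abs(encoding))  # power normalization
    return encoding / (np.linalg.norm(encoding) + 1e-8)       # L2 normalization

rng = np.random.default_rng(0)
code = vlad(rng.normal(size=(300, 128)), rng.normal(size=(16, 128)))
assert code.shape == (16 * 128,)
```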

Fast quantization
• Random forests [Moosmann, PAMI 2008; Uijlings, CIVR 2009]
– Randomized process makes them very fast to build
– Tree structure allows fast vector quantization
– Logarithmic rather than linear projection time
• Real-time BoW (!)
– When used with fast dense sampling
– SURF 2x2 descriptor instead of 4x4
– RBF kernel

Codebooks grow big…
• Researchers concatenate multiple codebooks
– Spatial pyramid adds more dimensions:
o 1x1 = 4K
o 2x2 = 16K
o 1x3 = 12K
– Feature vector length easily >100K…

SVM pre-computed kernel trick
• Use the distance between feature vectors:

$K(f_1, f_2) = e^{-\gamma\,\mathrm{dist}(f_1, f_2)}$

• Increases efficiency significantly (a sketch follows below)
– Pre-compute the SVM kernel matrix
– Long vectors are possible, as we only need 2 in memory
– Parameter optimization re-uses the pre-computed matrix
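A sketch of the trick with scikit-learn's precomputed-kernel SVM; the χ² distance is one possible choice of dist, and the mean-distance heuristic for γ is a common convention, not prescribed by the slide:

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise chi-square distance between two sets of histograms."""
    num = (a[:, None, :] - b[None, :, :]) ** 2
    den = a[:, None, :] + b[None, :, :] + 1e-10
    return 0.5 * (num / den).sum(axis=2)

rng = np.random.default_rng(0)
train = rng.random((100, 64))
train /= train.sum(axis=1, keepdims=True)
test = rng.random((10, 64))
test /= test.sum(axis=1, keepdims=True)
labels = rng.integers(0, 2, size=100)

# Pre-compute K = exp(-gamma * dist) once; parameter searches re-use the
# same distance matrix, and it can be filled block by block so only a few
# feature vectors need to be resident in memory at a time.
dist = chi2_distance(train, train)
gamma = 1.0 / dist.mean()  # common heuristic for setting gamma
svm = SVC(kernel="precomputed").fit(np.exp(-gamma * dist), labels)

kernel_test = np.exp(-gamma * chi2_distance(test, train))  # rows: test items
scores = svm.decision_function(kernel_test)
```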

GPU-empowered pre-computed kernel
Van de Sande, TMM 2011
[Figure: kernel matrix computation time (s) vs. total feature vector length (4,000 to 128,000), comparing 1x/4x/25x/49x CPU Opteron 250 (2.4 GHz), 1x/4x CPU Core i7 920 (2.66 GHz), and 1x GPU GeForce GTX260 (27 cores)]
• 37x speed-up over a single CPU
• 10x speed-up over a quad core
• 2x speed-up over a 49-CPU cluster


Efficient classification
Maji et al., CVPR 2008
For the intersection kernel, h_i is piecewise linear and quite smooth (blue plot); we can approximate it with fewer uniformly spaced segments (red plot). Saves time & space!
HIK vs. χ²: HIK is 75 times faster, with negligible loss in average precision (a sketch follows below).
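A toy-data sketch of the idea: train an SVM with the exact intersection kernel, then tabulate the per-dimension functions h_i on a few uniformly spaced points so the decision value can be approximated by interpolation, independently of the number of support vectors. This is my reconstruction of the scheme, not the authors' code:

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(X, Y):
    """Exact HIK Gram matrix: K(x, y) = sum_i min(x_i, y_i)."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
train = rng.random((200, 32))
labels = rng.integers(0, 2, size=200)
svm = SVC(kernel=intersection_kernel).fit(train, labels)

# The HIK decision function decomposes per dimension as
# f(x) = sum_i h_i(x_i) + b, with each h_i piecewise linear and smooth,
# so tabulating h_i on a few uniformly spaced points suffices.
support = train[svm.support_]
alpha = svm.dual_coef_.ravel()
grid = np.linspace(0.0, 1.0, 30)
tables = np.array([[alpha @ np.minimum(g, support[:, i]) for g in grid]
                   for i in range(support.shape[1])])

def fast_decision(x):
    """Approximate f(x) by interpolation: cost independent of the SV count."""
    return sum(np.interp(x[i], grid, tables[i]) for i in range(len(x))) \
        + svm.intercept_[0]

x = rng.random(32)
print(svm.decision_function(x[None])[0], fast_decision(x))  # nearly identical
```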

Moving object appearance
[Figure: concept probability (0 to 1) across the frames of a shot, from shot boundary to shot boundary, with the key frame marked; the probability can be aggregated over the shot by Max or Avg. Emphasis added.]

Feature fusion
Van de Sande, PAMI 2010
[Pipeline: point sampling strategy (Harris-Laplace salient points, dense sampling) → color feature extraction → codebook model → bag-of-features (relative frequency per codebook element); with a spatial pyramid, an image yields multiple bags-of-features]
+ Codebook reduces dimensionality
- Combination still ad hoc
- One codebook may dominate

Classifier fusion
+ Focus on feature strength
+ Fusion in semantic space
- Expensive learning effort
- Loss of feature correlation


Unsupervised fusion of classifiers
Snoek, TRECVID 2006; Wang, ACM MIR 2007
[Pipeline: global, regional, and keypoint image feature extraction → support vector machine, logistic regression, and Fisher linear discriminant → geometric mean]
+ Aggregation functions reduce learning effort
+ Efficient use of training examples
- Linear function unlikely to be optimal
(a sketch follows below)
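A tiny sketch of the aggregation step: fusing the three classifiers' posterior scores with an unsupervised geometric mean. The score values are illustrative only:

```python
import numpy as np

def geometric_mean_fusion(scores, eps=1e-8):
    """Unsupervised fusion: geometric mean over classifier outputs.

    scores: (n_classifiers, n_shots) array of concept probabilities,
    e.g. one row each for the SVM, logistic regression, and Fisher
    linear discriminant in the pipeline above.
    """
    return np.exp(np.log(np.asarray(scores) + eps).mean(axis=0))

fused = geometric_mean_fusion([[0.9, 0.2, 0.6],   # SVM
                               [0.8, 0.3, 0.5],   # logistic regression
                               [0.7, 0.1, 0.4]])  # Fisher linear discriminant
```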

Fusing concepts
Naphade, Trans. MM 2001
• Exploitation of concept co-occurrence
– Concepts do not occur in a vacuum
[Figure: concepts 1, 2, 3; e.g. Aircraft co-occurs with Sky]

How to fuse concepts?
• Learning models
• Include ontologies

Learning models: explicitly
• Using graphical models [Qi, TOMCCAP 2009]
– Computationally complex
– Limited scalability

Learning models: implicitly
• Using support vector machines, or data mining [Weng, ACM MM 2008]
– Assumes the classifier learns relations
– Suffers from error propagation

Including knowledge
• Can ontologies help? [Wu, ICME 2004]
– Symbolic ontologies vs. uncertain detectors


Concept detection pipeline
IBM 2003
[Pipeline: feature fusion → classifier fusion → concept fusion]

Video diver
Wang, ACM MIR 2007
[Pipeline: feature fusion → classifier fusion → concept fusion]

Semantic Pathfinder
Snoek, PAMI 2006
[Pipeline: a content analysis step (visual and textual feature extraction → multimodal features combination → supervised learner), a style analysis step (layout, content, capture, and context feature extraction → supervised learner), and a context analysis step (semantic features combination → supervised learner); the best of the 3 paths is selected after validation. The steps realize feature fusion, classifier fusion, and concept fusion. Example concepts: Animal, Sports, Vehicle, Entertainment, Monologue, Flag, Fire, Weather news, Hu Jintao.]


State-of-the-art
Snoek et al., TRECVID 2008-2011; Van Gemert et al., PAMI 2010; Van de Sande et al., PAMI 2010
[Pipeline: feature fusion → classifier fusion]
Software available for download at http://colordescriptors.com

Demo: MediaMill video search engine
http://www.mediamill.nl

Conclusion on: detecting semantic concepts in video
• We started with invariance and manual labor
• We generalized with machine learning
– …but needed several abstractions to do so appropriately
• For the moment, no one-size-fits-all solution
– Learn the optimal machinery per concept

1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Telling stories
1.5 Video browsing


Problem 3: Many things in the world
• This is the model gap

Trial 1: counting dictionary words
Biederman, Psychological Review 1987
Slide credit: Li Fei-Fei

Trial 2: reverse-engineering
Hauptmann, PIEEE 2008
• Estimation by Hauptmann et al.: 5,000 concepts
– Using manually labeled queries and concepts
– But speculative, and questionable assumptions
[Figure: ‘Google performance’, oracle combination + noise, ‘realistic’ combination]

How to obtain labeled examples?
• Massive amounts of labeled examples are needed…
– …but only human experts provide good quality examples

Experts start with a concept definition
• MM078 Police/Security Personnel
– Shots depicting law enforcement or private security agency personnel.

Expert annotation tools
Volkmer, ACM MM 2005
• Balance between:
– Spatiotemporal level of annotation detail
– Number of concepts
– Number of positive and negative examples


LSCOM (Large Scale Concept Ontology for Multimedia)
Naphade, IEEE MM 2006
• Provides manual annotations for 449 concepts
– In international broadcast TV news
• Connection to the Cyc ontology
• LSCOM-Lite
– 39 semantic concepts
http://www.lscom.org/

Verified positive examples
• ImageNet (15M images) [Deng et al., CVPR 2009]
– 22,000 categories
– >100 examples per category
• SUN (130K images) [Xiao et al., CVPR 2010]
– 397 scene categories
– >100 examples per category

Random negatives are not necessarily informative
Xirong Li et al., ICMR 2011
[Figure: positives, negatives, and the decision boundary]
Active learning? AdaBoost? Social negative bootstrapping?

Sampling informative negatives
• Iteratively selecting the most misclassified negatives as the informative ones
[Pipeline: prediction with an airplane classifier → virtually labeled negatives → selection of the most misclassified negatives]

Social negative bootstrapping
[Figure: negative examples for ‘airplane’ drawn from socially tagged images, with tags such as aviation, station, tennis, outreach, people, sketch, vagrant, lotus, beach, life, aircraft, cow, plane, sister, street, tokyo, lithuania, aeroplane, …]
• Trade-off between effectiveness and efficiency to find the most informative negatives

Concept detection (on VOC08-val)
• Social negative bootstrapping is much better than the baselines


Informative negatives of ‘tv’
[Figure: tag cloud of informative negatives]

Bridging the model gap
• Requirements
– Generic concept detection method
– Massive amounts of labeled examples
– Evaluation method
– Fair amount of computation

Model gap best treated by TRECVID
• Situation in 2000
– Various concept definitions
– Specific and small data sets
– Hard to compare methodologies
• Since 2001, worldwide evaluation by NIST
– TRECVID benchmark

NIST TRECVID benchmark
• Promote progress in video retrieval research
– Provide a common dataset
– Challenging tasks
– Independent evaluation protocol
– Forum for researchers to compare results
http://trecvid.nist.gov/

Video data sets
• US TV news (’03/’04)
• International TV news (’05/’06)
• Dutch TV infotainment (’07/’08/’09)
• TRECVID 2010 and 2011: Internet Archive web videos


Expert annotation efforts
[Figure: number of annotated concepts per TRECVID edition: 17, 32, 39, 101, 374 (LSCOM), and 500 (MediaMill - UvA), among others]

Measuring performance
• Precision: the fraction of retrieved items that is relevant
• Recall: the fraction of relevant items that is retrieved
– inverse relationship
[Figure: set of retrieved items, set of relevant items, and their intersection; a ranked result list 1-5]

Evaluation measure
• Average Precision
– Combines precision and recall
– Averages the precision after each relevant shot
– Top of the ranked list is most important

$\mathrm{AP} = \frac{1/1 + 2/3 + 3/4 + \dots}{\text{number of relevant documents}}$

[Figure: a ranked result list 1-5]
The de facto evaluation standard (a sketch follows below)
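A sketch of the AP computation above for a ranked list of relevance judgments; note the denominator is the total number of relevant shots, so relevant shots missing from the list also lower the score:

```python
def average_precision(ranked_relevance, n_relevant=None):
    """AP over a ranked result list.

    ranked_relevance: booleans in rank order, True = relevant shot.
    n_relevant: total number of relevant shots in the collection
                (defaults to the number of relevant shots in the list).
    """
    hits, precision_sum = 0, 0.0
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / k  # precision after each relevant shot
    total = n_relevant if n_relevant is not None else hits
    return precision_sum / total if total else 0.0

# Relevant shots at ranks 1, 3 and 4: AP = (1/1 + 2/3 + 3/4) / 3 ≈ 0.81
print(average_precision([True, False, True, True, False]))
```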

Concept examples
[Figure: example shots for Aircraft, Beach, Mountain, People marching, Police/Security, Flower; note the variety in visual appearance]

TRECVID concept detection task results
[Figure: distribution of results per edition, 2003-2011]
• Hard to compare
– Different video data
– Different concepts
• Clear top performers
– Median skews to the left
– Learning effect
– Plenty of variation


UvA-MediaMill@TRECVID
Snoek et al., TRECVID 04-11
• >1000 other systems

MediaMill Challenge
http://www.mediamill.nl/challenge/
• The Challenge provides
– A manually annotated lexicon of 101 semantic concepts
– Pre-computed low-level features
– Trained classifier models
– 5 experiments
– Implementation + results
• The Challenge allows to
– Gain insight into intermediate video analysis steps
– Foster repeatability of experiments
– Optimize video analysis systems on a component level
– Compare and improve
• The Challenge lowers the threshold for novice researchers

Columbia374 + VIREO374
• Baseline detectors for 374 concepts
http://www.ee.columbia.edu/ln/dvmm/columbia374/
http://www.cs.cityu.edu.hk/~yjiang/vireo374/
http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/

Community myths or facts?
• Chua et al., ACM Multimedia 2007
– Video search is practically solved and progress has only been incremental
• Yang and Hauptmann, ACM CIVR 2008
– Current solutions are weak and generalize poorly

We have done an experiment
• Two video search engines, from 2006 and 2009
– MediaMill Challenge 2006 system
– MediaMill TRECVID 2009 system
• How well do they detect 36 LSCOM concepts?

Four video data set mixtures
• Training: broadcast news (TRECVID 2005) and documentary video (TRECVID 2007)
• Testing: broadcast news and documentary video
– Within-domain and cross-domain combinations


Performance doubled in just 3 years
Snoek & Smeulders, IEEE Computer 2010
• 36 concept detectors
– Even when using training data of different origin
– Vocabulary still limited

500 detectors, a closer look
The number of labeled image examples used at training time seems decisive for concept detector accuracy.

Demo time

Learning social tag relevance by neighbor voting
Xirong Li, TMM 2009
• Exploit the consistency in tagging behavior of different users for visually similar images (a sketch follows below)
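A sketch of the voting step, assuming the visual k-nearest neighbors and their tags have already been retrieved; following the paper's spirit, a tag's relevance is its neighbor vote count minus the votes expected from its prior frequency (the exact estimator in the paper may differ):

```python
from collections import Counter

def tag_relevance(image_tags, neighbor_tag_sets, prior):
    """Neighbor voting: count how often visually similar images of other
    users carry each tag, and subtract the tag's expected vote count so
    that generally frequent tags do not win by default.
    """
    votes = Counter(tag for tags in neighbor_tag_sets for tag in set(tags))
    k = len(neighbor_tag_sets)
    return {tag: votes[tag] - k * prior.get(tag, 0.0) for tag in image_tags}

# Toy example: 4 visual neighbors vote on the tags of one image.
neighbors = [{"airplane", "sky"}, {"airplane", "travel"},
             {"airplane"}, {"sky", "holiday"}]
prior = {"airplane": 0.05, "sky": 0.20, "holiday": 0.30}
print(tag_relevance({"airplane", "sky", "holiday"}, neighbors, prior))
# 'airplane' scores highest: many neighbors agree, and its prior is low.
```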

Image retrieval experiments
• User-tagged image database
– 3.5 million labeled Flickr images
• Visual feature
– A global color-texture descriptor
• Evaluation set
– 20 concepts
• Evaluation criteria
– Average precision


Image retrieval experiments
• A standard tag-based retrieval framework
– Ranking function: Okapi BM25
• Comparison
– Baseline: retrieval using the original tags
– Neighbor: retrieval using the learned tag relevance as tag frequency

Results
• 24% relative improvement

Updated tag relevance
• Objective tags are identified and reinforced
Based on 3.5 million images downloaded from Flickr

Tag relevance fusion
Xirong Li et al., CIVR 2010
• Note: fully unsupervised, adds 10% in performance

Failure case: airplane
• Too much consensus on the wrong label…

Results suggest…
• The relevance of a tag can be predicted based on the ‘wisdom’ of crowds
– Even with a light-weight visual feature
– And a small database of 3.5M images


Conclusion on lexicon learning
Requires:
• Invariant features
• Concept detection
• Many, many annotations
• Measuring performance
• Lots of computation
Suffers most from:
• Weakly labeled visual data
• Transfer across domains

Special issue announcement
Fall 2012

1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Telling stories
1.5 Video browsing

Storytelling
Quote:
“Storytelling is the conveying of events in words, images and sounds, often by improvisation or embellishment. Stories or narratives have been shared in every culture as a means of entertainment, education, cultural preservation and in order to instill moral values.”
Wikipedia

An event in prehistoric times
• What story did they want to tell?
Bhimbetka rock shelters, India

Arbitrary events in modern times
• …but what defines an event in video?
[Examples: attempting a board trick, flash mob gathering, changing a vehicle tire]


Definition 1
• Complex activity occurring at a specific place and time;
• Involves people interacting with other people and/or objects;
• Consists of a number of human actions, processes, and activities that are loosely or tightly organized and that have significant temporal and semantic relationships to the overarching activity;
• Is directly observable.

Definition 2
• An event happens over time
– Track semantics over time, model dynamic interactions, and identify goal-directed movement.
• An event happens in space
– Identify the most semantic regions in an image to express spatial relations and interactions.
• An event has a recount
– Must be able to provide an understandable recounting of what visual information is decisive for relevance.

An event is a bag-of-features
Jiang et al., TRECVID 2010
[Pipeline: video → feature vector → event confidence]
• Fusing multiple audio-visual features is effective
– No notion whether the event is really found

An event is a bag-of-concepts
Gkalelis et al., CBMI 2011; Merler et al., TMM, in press
[Pipeline: video → concept detectors 1…N → event confidence]
• Combine the available detector scores in a single vector
– No notion what concept (ordering) is important for events
– Horse, horse, horse? Chair, chair, chair?


An event is concepts over time
Ebadollahi, ICME 2006
• Exploit concept evidence over time
– Proof of concept; unclear how to determine automatically

The event retrieval paradox
[Figure: event detection accuracy vs. event recounting ability; bag-of-features scores good on detection but bad on recounting, concepts over time bad on detection but good on recounting, bag-of-concepts in between; the good/good corner is marked ???]

Benchmarking the paradox
• NIST Multimedia Event Detection task
• NIST Multimedia Event Recounting task

Multimedia Event Detection task
• Given an event specified by an event kit (query), which consists of:
– a definition,
– an event explication,
– an evidential description, and
– illustrative examples,
• Search video for the event.

Example Event Kit
• Event name (mnemonic): Working on a woodworking project
• Definition (textual definition): One or more people fashion an object out of wood.
• Event explication (expresses event domain-specific knowledge needed to understand the event): Woodworking is a popular hobby that involves crafting an object out of wood. Typical woodworking projects may range from creating large pieces of furniture to small decorative items or toys. The process for making objects out of wood can include cutting wood into smaller pieces…. (continues)
• Evidential description (textual listing of attributes that are often associated with the event):
– scene: often indoors in a workshop, garage, artificial lighting; occasionally outdoors
– objects/people: woodworking tools (automatic or non-automatic saws, sander, knife), paint, stains, sawhorses, toolbox, safety goggles
– activities: cutting and shaping wood, attaching pieces of wood together, smoothing/sanding wood
– audio: power tool sounds; hand tool sounds (hammer, saw, etc.); narration of the process
• Exemplars (specific clips from the “Event Kits” data set that are known to contain the event being defined): HVC334271.mp4, HVC393428.mp4, HVC875424.mp4, etc.
• Target user: an Internet information analyst or experienced Internet searcher with event-specialized knowledge.

Multimedia Event Recounting task
• An event detector will rapidly and automatically produce a textual English-language recounting of each event occurrence it finds in a video collection,
– describing the particular scene, actors, objects, and activities involved.
• Task starts in 2012


Video collections
• Pilot 2010: development and test collections; 1,723 + 1,742 clips (100 hours); clip content annotation for both sets.
• Development (DEV) 2011: split into two subsets, (1) Transparent (DEV-T) and (2) Opaque (DEV-O); 44K clips (~1,400 hours); for MED ’11, clip content annotation for the transparent subset; after MED ’11, clip content annotation for the opaque subset. In 2012-2015, (1) and (2) are merged into a single training collection.
• Progress: 2012-2015 test collection; 120K clips, 4,000 hours; no clip content annotation.
• Novel 1: 2014 test collection; 120K clips, 4,000 hours; no clip content annotation.
• Novel 2: 2015 test collection; 120K clips, 4,000 hours; no clip content annotation.

The TRECVID MED ’11 events

Training events
• Process-observed events: Attempting a board trick; Feeding an animal; Landing a fish; Working on a woodworking project
• Life events: Wedding ceremony

Testing events
• Process-observed events: Changing a vehicle tire; Getting a vehicle unstuck; Grooming an animal; Making a sandwich; Parkour; Repairing an appliance; Working on a sewing project
• Life events: Birthday party; Flash mob gathering; Parade

Conclusion on telling stories
• Not much to conclude yet
– Due to the benchmarks, much will happen in the coming years
• The field is currently obsessed with bag-of-features
– Does not solve the event retrieval paradox
– Can multimedia help?
– Can concepts help?
• The topic is wide open
– Perfect for (many) PhD’s

1. Short course outline
1.0 Problem statement
1.1 Measuring features
1.2 Concept detection
1.3 Lexicon learning
1.4 Telling stories
1.5 Video browsing

Problem 5: Use is open-ended
• This is the interface gap
[Figure: the open-ended scope of use, the screen, and keywords]

Video search 1.0
Note the influence of textual metadata, such as the video title, on the search results.


Query selection

‘Classic’ Informedia system
Carnegie Mellon University
• First multimodal video search engine

Físchlár
Dublin City University
• Optimized for use by “real” users

IBM iMARS
IBM Research
• A web-based system
http://mp7.watson.ibm.com/

MediaMagic
FxPal
• Focus on the story level

VisionGo
NUS & ICT-CAS
• Extremely fast and efficient


CrossBrowsing through results
Snoek, TMM 2007
[Figure: results laid out along a rank axis crossed with a time axis; also a sphere variant]

Extreme video retrieval
Carnegie Mellon University
• Observation
– Correct results are retrieved, but not optimally ranked
– If the user has time to scan the results exhaustively, retrieval is a matter of watching, selecting, and sorting quickly
= very demanding!

ForkBrowser
de Rooij, CIVR 2008

Crowdsourcing via timeline
Snoek et al., ACM MM 2010
• Poster by Bauke Freiburg
– Tuesday, 12:30 PM - 2:00 PM

Multimedia event browsers
• ACM Multimedia 2012?

The future of video retrieval?

    Jonathan Wang, Carnegie Mellon University


Interface gap best addressed by

• TRECVID interactive search task
– Interactively solve 20+ search topics (10/15 minutes)
– Return 1,000 ranked shot-based results per topic
– Evaluate using Average Precision (a minimal sketch follows below)

• VideOlympics showcase
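For readers unfamiliar with the metric, here is a minimal sketch of non-interpolated Average Precision as commonly used in TRECVID-style evaluation; it assumes binary relevance judgments over a ranked result list.

```python
# Non-interpolated Average Precision over a ranked list of shot ids:
# precision is accumulated at each rank where a relevant shot appears,
# then averaged over the number of relevant shots.
def average_precision(ranked_shots, relevant):
    """ranked_shots: shot ids in ranked order; relevant: set of relevant ids."""
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

# Example: average_precision(["s3", "s1", "s7"], {"s3", "s7"})
# = (1/1 + 2/3) / 2 ≈ 0.83
```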

Video browsing at TRECVID
• Wide performance variation
– # concept detectors
– Search interface
– Expert vs novice user
• Most pronounced
– 2003: Informedia classic
– 2005: Large semantic lexicon
– 2008: Online learning

[Figure: interactive search performance per TRECVID edition, 2003-2009, highlighting UvA-MediaMill (CrossBrowser) against 200+ other interactive and traditional systems]

Snoek et al. TRECVID 04-09

Criticism
• Retrieval performance cannot be the only evaluation criterion
– Quality of detectors counts
– Experience of searcher counts
– Visualization of interface counts
– Ease of use counts
– …

Video browsing at VideOlympics
• Promote multiple facets of video search
– Real-time interactive video search ‘competition’
– Simultaneous exposure of multiple video search engines
– Highlight possibilities and limitations of the state-of-the-art

Participants


Video trailer

    http://www.VideOlympics.org

Conclusion on video browsing

• Interaction by browsing is indispensable for any practical video search engine

• The system should support the user through learning and intuitive (mobile) visualizations

Internet Video Search
Conclusion on:
• measuring features
• concept detection
• lexicon learning
• telling stories
• video browsing

And there is always more …
• Content Based Multimedia Retrieval: Lessons Learned from Two Decades of Research
– Shih-Fu Chang, Columbia University
• SIGMM Technical Achievement Award
– Tomorrow: 10:15 AM - 12:30 PM

And there is always more …
• Recommended special issues
– IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), November 2008
– Proceedings of the IEEE, 96(4), April 2008
– IEEE Transactions on Multimedia, 9(5), August 2007
• 300 references on video search
– Snoek and Worring, Concept-Based Video Retrieval, Foundations and Trends in Information Retrieval, Vol. 4, No. 2, pp. 215-322, 2009.

General references I

Color Invariance. Jan-Mark Geusebroek, R. van den Boomgaard, Arnold W. M. Smeulders, H. Geerts. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 23 (12), page 1338-1350, 2001.

    Distinctive Image Features from Scale-Invariant Keypoints. D. G. Lowe. Int'l Journal of Computer Vision, vol. 60, pp. 91-110, 2004.

Large-Scale Concept Ontology for Multimedia. M. R. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. S. Kennedy, A. G. Hauptmann, and J. Curtis. IEEE MultiMedia, vol. 13, pp. 86-91, 2006.

    Efficient Visual Search for Objects in Videos. J. Sivic and A. Zisserman. Proceedings of the IEEE, vol. 96, pp. 548-566, 2008.

    High Level Feature Detection from Video in TRECVid: A 5-year Retrospective of Achievements. A. F. Smeaton, P. Over, and W. Kraaij, In Multimedia Content Analysis, Theory and Applications, (A. Divakaran, ed.), Springer, 2008.

    Visually Searching the Web for Content. J. R. Smith and S.-F. Chang. IEEE MultiMedia, vol. 4, pp. 12-20, 1997.

    Content Based Image Retrieval at the End of the Early Years. Arnold W. M. Smeulders, Marcel Worring, S. Santini, A. Gupta, R. Jain. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 22 (12), page 1349-1380, 2000.

  • Cees Snoek & Arnold SmeuldersUniversity of Amsterdam

    29‐11‐2011

    Internet Video Search 33

General references II

The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, Arnold W. M. Smeulders. ACM Multimedia, page 421-430, 2006.

The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, Arnold W. M. Smeulders. IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 28 (10), page 1678-1689, 2006.

    A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval. Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, Arnold W. M. Smeulders. IEEE Trans. Multimedia, Volume 9 (2), page 280-292, 2007.

Adding Semantics to Detectors for Video Retrieval. Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, Marcel Worring. IEEE Trans. Multimedia, Volume 9 (5), page 975-986, 2007.

    The MediaMill TRECVID 2004-2011 Semantic Video Search Engine. Cees G. M. Snoek et al. Proceedings of the TRECVID Workshop, 2004-2011.

Visual-Concept Search Solved? Cees G.M. Snoek and Arnold W.M. Smeulders. IEEE Computer, Volume 43 (6), page 76-78, 2010.

General references III

Concept-Based Video Retrieval. Cees G. M. Snoek, Marcel Worring. Foundations and Trends in Information Retrieval, Vol. 4 (2), page 215-322, 2009.

    Local Invariant Feature Detectors: A Survey. T. Tuytelaars and K. Mikolajczyk. Foundations and Trends in Computer Graphics and Vision, vol. 3, pp. 177-280, 2008.

Evaluating Color Descriptors for Object and Scene Recognition. Koen E. A. van de Sande, Theo Gevers, Cees G. M. Snoek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2010.

http://www.science.uva.nl/research/publications/

    Visual Word Ambiguity. Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2009.

    Real-Time Bag of Words, Approximately. Jasper R. R. Uijlings, Arnold W. M. Smeulders, R. J. H. Scha. ACM Int'l Conference on Image and Video Retrieval, 2009.

    Lessons Learned from Building a Terabyte Digital Video Library. H. D. Wactlar, M. G. Christel, Y. Gong, and A. G. Hauptmann. IEEE Computer, vol. 32, pp. 66-73, 1999.

    Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Int'l Journal of Computer Vision, vol. 73, pp. 213-238, 2007.

Contact info

• Cees Snoek: http://www.CeesSnoek.info
• Arnold Smeulders: http://staff.science.uva.nl/~smeulder