![Page 1: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/1.jpg)
1
Joint Visual-Text Modeling for Multimedia Retrieval
JHU CLSP Workshop 2004 – Final Presentation, August 17, 2004
![Page 2: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/2.jpg)
2
Team

Undergraduate Students
• Desislava Petkova (Mt. Holyoke)
• Matthew Krause (Georgetown)

Graduate Students
• Shaolei Feng (U. Mass)
• Brock Pytlik (JHU)
• Paola Virga (JHU)

Senior Researchers
• Pinar Duygulu, Bilkent U., Turkey
• Pavel Ircing (U. West Bohemia)
• Giri Iyengar, IBM Research
• Sanjeev Khudanpur, CLSP, JHU
• Dietrich Klakow, Uni. Saarland
• R. Manmatha, CIIR, U. Mass Amherst
• Harriet Nock, IBM Research (external participant)
![Page 3: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/3.jpg)
3
Big Picture: Multimedia Retrieval Task

Query: Find clips showing Yasser Arafat (query text "Yasser Arafat" plus a query image)

[Diagram: the multimedia retrieval system processes the query text and the query image, runs spoken document retrieval and content-based image retrieval over the video clips, and combines the scores.]

Video clip speech: “… Palestinian leader Yasser Arafat today said …”
Degraded transcript: “… Palestinian leader Yes Sir You’re Fat today said …”

Most research has addressed:
I. Text queries, text (or degraded text) documents
II. Image queries, image data

Joint Visual-Text Models!
![Page 4: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/4.jpg)
4
Joint Visual-Text Modeling

[Diagram: the query text and query image ("Yasser Arafat") are processed into a query of words and visterms; joint word-visterm retrieval matches it against video clips, e.g. “… [Yes sir, you’re fat today said] …”.]

• Query of words, document of words: retrieve documents using p(Document|Query)
• Query of words and visterms, document of words and visterms: retrieve documents using p(dw,dv | qw,qv)
![Page 5: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/5.jpg)
5
Joint Visual-Text Modeling: KEY GOAL
Show that joint visual-text modeling improves multimedia retrieval
Demonstrate and evaluate the performance of these models on the TRECVID 2003 corpus and task
![Page 6: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/6.jpg)
6
Key Steps
• Automatically annotate video with concepts (meta-data)
  • E.g. video contains a face, in a studio environment …
• Retrieve video
  • Given a query, select suitable meta-data for the query and retrieve
  • Combine with text retrieval in a unified Language Model-based IR setting
![Page 7: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/7.jpg)
7
TRECVID Corpus and Task

Corpus
• Broadcast news videos used for Hub4 evaluations (ABC, CNN, CSPAN)
• 120 hours of video

Tasks
• Shot-boundary detection
• News story segmentation (multimodal)
• Concept detection (annotation)
• Search task
![Page 8: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/8.jpg)
8
Alternate (development) Corpus
• COREL photograph database
  • 5000 high-quality photographs with captions
• Task
  • Annotation
![Page 9: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/9.jpg)
9
TRECVID Search task definition

[Diagram: statement of information need + examples → manual selection of system parameters → ranked list of video shots → NIST evaluation. Task variants: Manual, Interactive.]
![Page 10: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/10.jpg)
10
Our search task definition

[Diagram: statement of information need + examples → automatic selection of system parameters → ranked list of video shots → NIST evaluation.]

Isolate algorithmic issues from interface and user issues
![Page 11: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/11.jpg)
11
Language Model based Retrieval

[Diagram: grid of query q (words, visterms) against document d (words, visterms)]
Baseline model
Relating document visterms to query words (MT,RelevanceModel,HMMs)
Relating document words to query images (Text Classification experiments)
Visual-only retrieval models
Rank documents with p(qw,qv|dw,dv)
![Page 12: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/12.jpg)
12
Evaluation
• Concept annotation performance
  • Compare against manual ground truth
• Retrieval task performance
  • Compare against NIST relevance judgements
• Both measured using Mean Average Precision (mAP)
![Page 13: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/13.jpg)
13
Mean Average Precision (mAP)
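$$\mathrm{mAP} = \frac{1}{|T|} \sum_{t \in T} \mathrm{AP}(t)$$

$$\mathrm{AP}(t) = \frac{1}{|\mathrm{rel}(t)|} \sum_{i \in S(t)} \mathrm{precision}(i)$$

$$S(t) = \{\, i : i \text{ relevant} \,\}$$

Here T is the set of queries, rel(t) is the set of items relevant to query t, and S(t) is the set of ranks i at which a relevant item is returned for t.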
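The mAP definition on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the workshop's evaluation code; the function names and the input format (a ranked list of item ids plus a set of relevant ids per query) are assumptions made for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP(t): sum of precision at each rank holding a relevant item,
    divided by the total number of relevant items for the query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """mAP: mean of AP over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Relevant items that never appear in the ranked list contribute zero precision, so truncated result lists are penalized.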
![Page 14: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/14.jpg)
14
Experimental Setup: Corpora

TRECVID03 Corpus (120 hours, ground truth on Dev data)
• Train: 38K shots
• DevTest: 10K shots
• TRECVID03 IR Collection: 32K shots

COREL Corpus (5000 images)
• Train: 4500 images
• Test: 500 images
![Page 15: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/15.jpg)
15
Experimental Setup: Visual Features
[Figure panels: Original, L*a*b, Edge Strength, Co-occurrence]
![Page 16: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/16.jpg)
16
Interest Point Neighborhoods (Harris detector)
[Figure panels: Greyscale image, Interest points]
![Page 17: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/17.jpg)
17
Experimental Setup: Visual Feature list
• Regular partition
  • L*a*b Moments (COLOR)
  • Smoothed Edge Orientation Histogram (EDGE)
  • Grey-level Co-occurrence matrix (TEXTURE)
• Interest Point neighborhood
  • COLOR, EDGE, TEXTURE
![Page 18: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/18.jpg)
18
Presentation Outline

[Diagram: grid of query q (words, visterms) against document d (words, visterms)]

• Translation (MT) models (Paola)
• Relevance Models (Shaolei, Desislava)
• Graphical Models (Pavel, Brock)
• Text classification models (Matt)
• Integration & Summary (Dietrich)
![Page 19: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/19.jpg)
19
A Machine Translation Approach to Image Annotation
Presented by Paola Virga
![Page 20: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/20.jpg)
20
Presentation Outline

[Diagram: grid of query q (words, visterms) against document d (words, visterms)]

Translation (MT) models

$$p(q_w \mid d_V) = \sum_{c} p(q_w \mid c)\, p(c \mid d_V)$$
![Page 21: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/21.jpg)
21
Inspiration from Machine Translation

$$p(f \mid e) = \sum_{a} p(f, a \mid e)$$

$$p(c \mid v) = \sum_{a} p(c, a \mid v)$$

Direct translation model

[Figure: image regions aligned with the words "tiger" and "grass"]
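To make the analogy concrete, here is a hedged sketch of IBM Model 1 style EM training for p(c|v), with concepts playing the role of the translated words and visterms the role of the source tokens. It is a simplified illustration, not the Giza setup used in the workshop: there is no NULL token, and the function name `train_model1` and the `(visterms, concepts)` pair format are assumptions.

```python
from collections import defaultdict


def train_model1(pairs, iterations=5):
    """pairs: list of (visterms, concepts) for annotated training images.
    Returns a dict t with t[(c, v)] approximating p(c|v)."""
    concepts = {c for _, cs in pairs for c in cs}
    # Uniform initialization over the concept vocabulary.
    t = defaultdict(lambda: 1.0 / len(concepts))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (c, v) alignments
        total = defaultdict(float)  # expected count of v being aligned
        for visterms, cs in pairs:
            for c in cs:
                # E-step: spread each concept over the image's visterms.
                z = sum(t[(c, v)] for v in visterms)
                for v in visterms:
                    frac = t[(c, v)] / z
                    count[(c, v)] += frac
                    total[v] += frac
        # M-step: renormalize so that sum_c p(c|v) = 1 for each visterm.
        t = defaultdict(float, {(c, v): n / total[v]
                                for (c, v), n in count.items()})
    return t
```

On a toy corpus where "grass" co-occurs with visterm `v2` in two images and "tiger" only in one, the estimates shift probability mass for `v2` toward "grass" after a few iterations.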
![Page 22: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/22.jpg)
22
Discrete Representation of Image Regions (visterms) to create analogy to MT

• In Machine Translation: discrete tokens
• In our task: however, the features extracted from regions are continuous
• Solution: vector quantization → visterms

{fn1, fn2, …fnm} -> vk

| Annotation | Visterms | Concepts |
| --- | --- | --- |
| sun sky sea waves | v10 v22 v35 v43 | c5 c1 c38 c71 |
| tiger water grass | v20 v21 v50 v10 | c15 c21 c83 |
| water harbor sky clouds sea | v78 v78 v1 v1 | c21 c19 c1 c56 c38 |
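The quantization step {fn1, …, fnm} -> vk can be illustrated as a nearest-centroid lookup. The sketch below assumes a codebook of centroids has already been learned (for example by k-means, which is an assumption here; the slide only says vector quantization), and the name `quantize` is hypothetical.

```python
def quantize(features, codebook):
    """Map a continuous region feature vector to the index k of the
    nearest codebook centroid (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda k: sqdist(features, codebook[k]))


# Usage: every region of an image becomes one discrete visterm id.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
regions = [[0.9, 1.2], [4.0, 6.0], [0.1, -0.2]]
visterms = [quantize(r, codebook) for r in regions]
```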
![Page 23: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/23.jpg)
23
Image annotation using translation probabilities

p(c|v): probabilities obtained from direct translation

$$P_0(c \mid d_V) = \frac{1}{|d_V|} \sum_{v \in d_V} P(c \mid v)$$

Example: p(sun | image), where the image is represented by the visterms v10 v22 v35 v43
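The annotation score on this slide averages the translation probabilities p(c|v) over the visterms of an image. A minimal sketch, assuming the probabilities are stored in a dict keyed by `(concept, visterm)`; the function name `annotate` and the numbers in the usage lines are illustrative only.

```python
def annotate(visterms, p_c_given_v, concepts):
    """P0(c|d_V) = (1/|d_V|) * sum over v in d_V of p(c|v)."""
    n = len(visterms)
    return {c: sum(p_c_given_v.get((c, v), 0.0) for v in visterms) / n
            for c in concepts}


# Usage with made-up probabilities for two visterms.
table = {("sun", "v10"): 0.6, ("sky", "v10"): 0.4,
         ("sun", "v22"): 0.2, ("sky", "v22"): 0.8}
scores = annotate(["v10", "v22"], table, ["sun", "sky"])
```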
![Page 24: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/24.jpg)
24
Annotation Results (Corel set)

| Manual annotation | Predicted words (top 5) |
| --- | --- |
| field foals horses mare | tree horses foals mare field |
| flowers leaf petals stems | flowers leaf petals grass tulip |
| people pool swimmers water | swimmers pool people water sky |
| mountain sky snow water | sky mountain water clouds snow |
| jet plane sky | sky plane jet tree clouds |
| people sand sky water | sky water beach people hills |

Predicted words are the top 5 words with the highest probability. On the original slide, correct matches are shown in red.
![Page 25: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/25.jpg)
25
Feature selection
• Features: color, texture, edge
• Extracted from blocks, or around interest points

Observations
• Features extracted from blocks give better performance than features extracted around interest points
• When the features are used individually, edge features give the best performance
• Training using all features is the best
• Using Information Gain to select the visterm vocabulary didn’t help
• Integrating the number of faces increases the performance slightly

[Chart: mAP values for different features]
![Page 26: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/26.jpg)
26
Model and iteration selection

Strategies compared
• (a) IBM Model 1 p(c|v)
• (b) HMM on top of (a)
• (c) IBM Model 4 on top of (b)
-> Observation: IBM Model 1 is the best

Number of iterations in Giza training affects the performance
-> Fewer iterations give better annotation performance but cannot produce rare words

| Corel | TREC |
| --- | --- |
| 0.125 | 0.124 |
![Page 27: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/27.jpg)
27
Integrating word co-occurrences

Model 1 with word co-occurrence

$$P_1(c_i \mid d_V) = \sum_{j=1}^{|C|} P(c_i \mid c_j)\, P_0(c_j \mid d_V)$$

Integrating word co-occurrences into the model helps for Corel but not for TREC

| Model | Corel | TREC |
| --- | --- | --- |
| Model 1 | 0.125 | 0.124 |
| Model 1 + Word-CO | 0.145 | 0.124 |
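The co-occurrence step re-scores each concept by mixing in the first-pass scores of all other concepts, weighted by P(ci|cj). A small sketch under the assumption that the co-occurrence probabilities are given as a dict keyed by `(ci, cj)`; the function name and the numbers are hypothetical.

```python
def rescore_with_cooccurrence(p0, p_ci_given_cj):
    """P1(ci|d_V) = sum over cj of P(ci|cj) * P0(cj|d_V)."""
    return {ci: sum(p_ci_given_cj.get((ci, cj), 0.0) * p0[cj] for cj in p0)
            for ci in p0}


# Usage with made-up first-pass scores and co-occurrence probabilities.
p0 = {"tiger": 0.5, "grass": 0.5}
cooc = {("tiger", "tiger"): 0.7, ("grass", "tiger"): 0.3,
        ("tiger", "grass"): 0.2, ("grass", "grass"): 0.8}
p1 = rescore_with_cooccurrence(p0, cooc)
```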
![Page 28: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/28.jpg)
28
Inspiration from CLIR
• Treat Image Annotation as a Cross-lingual IR problem
  • Visual Document comprising visterms (target language) and a query comprising a concept (source language)

$$p(c \mid d_V) = \lambda \left( \sum_{v \in d_V} p(c \mid v)\, p(v \mid d_V) \right) + (1 - \lambda)\, \underbrace{p(c \mid G_C)}_{\text{same } \forall d_V}$$
![Page 29: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/29.jpg)
29
Inspiration from CLIR
• Treat Image Annotation as a Cross-lingual IR problem
  • Visual Document comprising visterms (target language) and a query comprising a concept (source language)

$$p(c \mid d_V) = \sum_{v \in d_V} p(v \mid d_V)\, p(c \mid v)$$

• Image does not provide a good estimate of p(v|dv)
  • Tried p(v) and DF(v); DF works best

$$\mathrm{score}(c \mid d_V) = \sum_{v \in d_V} \mathrm{DF}_{Train}(v)\, p(c \mid v)$$
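The DF-weighted score replaces p(v|dv) with the training-set document frequency of each visterm. The following sketch assumes `df_train` maps a visterm to the number of training images containing it; both that structure and the name `clir_score` are assumptions for illustration.

```python
def clir_score(visterms, p_c_given_v, df_train, concept):
    """score(c|d_V) = sum over v in d_V of DF_Train(v) * p(c|v)."""
    return sum(df_train.get(v, 0) * p_c_given_v.get((concept, v), 0.0)
               for v in visterms)


# Usage with made-up document frequencies and translation probabilities.
probs = {("sky", "v1"): 0.5, ("sky", "v2"): 0.25}
df = {"v1": 10, "v2": 4}
sky_score = clir_score(["v1", "v2"], probs, df, "sky")
```

The result is a ranking score, not a normalized probability, which matches the slide's use of "score" rather than "p".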
![Page 30: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/30.jpg)
30
Annotation Performance on TREC

| Model | mAP |
| --- | --- |
| Model 1 | 0.124 |
| CLIR using Model 1 | 0.126 |

Significant at p=0.04

Average Precision values for the top 10 words
For some concepts we achieved up to 0.6
![Page 31: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/31.jpg)
31
Annotation Performance on TREC
![Page 32: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/32.jpg)
32
Questions?
![Page 33: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/33.jpg)
33
Relevance Models for Image Annotation
Presented by Shaolei Feng
University of Massachusetts, Amherst
![Page 34: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/34.jpg)
34
Relevance Models as Visual Model

[Diagram: grid of query q (words, visterms) against document d (words, visterms)]

Goal: use Relevance Models to estimate the probabilities of concepts given test keyframes

$$p(q_w \mid d_v) = \sum_{c} p(q_w \mid c)\, p(c \mid d_v)$$
![Page 35: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/35.jpg)
35
Intuition
• Images are defined by spatial context.
  • Isolated pixels have no meaning.
  • Context simplifies recognition/retrieval.
• E.g. Tiger is associated with grass, tree, water, forest.
  • Less likely to be associated with computers.
![Page 36: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/36.jpg)
36
Introduction to Relevance Models
• Originally introduced for text retrieval and cross-lingual retrieval
  • Lavrenko and Croft 2001; Lavrenko, Choquette and Croft, 2002
  • A formal approach to query expansion.
• A nice way of introducing context in images
  • Without having to do this explicitly
  • Do this by computing the joint probability of images and words
![Page 37: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/37.jpg)
37
Cross Media Relevance Models (CMRM)
• Two parallel vocabularies: words and visterms
• Analogous to cross-lingual relevance models
• Estimate the joint probabilities of words and visterms from training images

[Figure: relevance model R linking an image to the words Tiger, Tree, Grass]

$$P(c, d_v) = \sum_{J \in T} P(J)\, P(c \mid J) \prod_{i=1}^{|d_v|} P(v_i \mid J)$$

J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, In Proc. SIGIR’03.
![Page 38: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/38.jpg)
38
Continuous Relevance Models (CRM)
• A continuous version of the Cross Media Relevance Model
• Estimate P(v|J) using a kernel density estimate

$$P(v \mid J) = \frac{1}{n} \sum_{i=1}^{|J|} K\!\left(\frac{v - v_{J_i}}{\beta}\right)$$

K: Gaussian kernel
β: bandwidth
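A kernel density estimate of P(v|J) can be sketched as follows. This assumes an isotropic Gaussian kernel with a single scalar bandwidth `beta` and normalizes by the number of regions in J; the exact kernel normalization used in the workshop system is not stated on the slide, so treat these choices and the name `kde_prob` as assumptions.

```python
import math


def kde_prob(v, regions, beta):
    """Gaussian kernel density estimate of P(v|J).
    v: query feature vector; regions: feature vectors of the regions of
    training image J; beta: kernel bandwidth."""
    d = len(v)
    # Normalizing constant of an isotropic d-dimensional Gaussian.
    norm = (2 * math.pi * beta ** 2) ** (d / 2)
    total = 0.0
    for r in regions:
        sq = sum((x - y) ** 2 for x, y in zip(v, r))
        total += math.exp(-sq / (2 * beta ** 2)) / norm
    return total / len(regions)
```

A larger `beta` smooths the estimate across more distant regions, while a smaller one concentrates mass near the observed region features.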
![Page 39: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/39.jpg)
39
Continuous Relevance Model
• A generative model
• Concept words wj generated by an i.i.d. sample from a multinomial
• Visterms vi generated by a multi-variate (Gaussian) density
![Page 40: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/40.jpg)
40
Normalized Continuous Relevance Models
Normalized CRM
• Pad annotations to fixed length. Then use the CRM.
• Similar to using a Bernoulli model (rather than a multinomial) for words.
• Accounts for length (similar to length of document in text retrieval).

S. L. Feng, V. Lavrenko and R. Manmatha, Multiple Bernoulli Models for Image and Video Annotation, in CVPR’04
V. Lavrenko, S. L. Feng and R. Manmatha, Statistical Models for Automatic Video Annotation and Retrieval, in ICASSP’04
![Page 41: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/41.jpg)
41
Annotation Performance

On Corel data set:

| Model | Mean Average Precision |
| --- | --- |
| CMRM | 0.14 |
| CRM | 0.23 |
| Normalized-CRM | 0.26 |

Normalized-CRM works best
![Page 42: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/42.jpg)
42
Annotation Examples (Corel set)
Sky train railroad locomotive water
Cat tiger bengal tree forest
Snow fox arctic tails water
Mountain plane jet water sky
Tree plane zebra herd water
Birds leaf nest water sky
![Page 43: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/43.jpg)
43
Results: Relevance Model on TREC Video Set
• Model: Normalized continuous relevance model
• Features: color and texture
  • Our comparison experiments show that adding the edge feature gives only a very slight improvement
• Annotation evaluated on the development dataset
• Mean average precision: 0.158
![Page 44: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/44.jpg)
44
Annotation Performance on TREC
![Page 45: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/45.jpg)
45
Proposal: Using Dynamic Information for Video Retrieval
Presented by Shaolei Feng
University of Massachusetts, Amherst
![Page 46: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/46.jpg)
46
Motivation
• Current models are based on single frames in each shot.
• But video is dynamic
  • Has motion information.
• Use dynamic (motion) information
  • Better image representations (segmentations)
  • Model events/actions
![Page 47: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/47.jpg)
47
Why Dynamic Information
• Model actions/events
  • Many TRECVID 2003 queries require motion information, e.g.
    • find shots of an airplane taking off.
    • find shots of a person diving into water.
  • Motion is an important cue for retrieving actions/events.
  • But using the optical flow over the entire image doesn’t help.
  • Use motion features from objects.
• Better image representations
  • Much easier to segment moving objects from background than to segment static images.
![Page 48: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/48.jpg)
48
Problems with still images
• Current approach
  • Retrieve videos using static frames.
• Feature representations
  • Visterms from keyframes.
  • Rectangular partition or static segmentation
    • Poorly correlated with objects.
  • Features: color, texture, edges.
• Problem: visterms not correlated well with concepts.
![Page 49: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/49.jpg)
49
Better Visterms – better results.
• Model performs well on related tasks.
• Retrieval of handwritten manuscripts.
  • Visterms: word images.
  • Features computed over word images.
  • Annotations: ASCII word, e.g. “you are to be particularly careful”
  • Segmentation of words easier.
  • Visterms better correlated with concepts.
• So can we extend the analogy to this domain…
![Page 50: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/50.jpg)
50
Segmentation Comparison
Pictures from Patrick Bouthemy’s Website, INRIA
a: Segmentation using only still image information
b: Segmentation using only motion information
![Page 51: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/51.jpg)
51
Represent Shots not Keyframes
• Shot boundary detection
  • Use standard techniques.
• Segment moving objects
  • E.g. by finding outliers from dominant (camera) motion.
• Visual features for object and background.
• Motion features for object
  • E.g. trajectory information
• Motion features for background.
  • Camera pan, zoom …
![Page 52: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/52.jpg)
52
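Models
• One approach: modify the relevance model to include motion information.
• Probabilistically annotate shots in the test set.
• Other models, e.g. HMMs, are also possible

Relevance model on keyframes:

$$P(c, d_v) = \sum_{J \in T} P(J)\, P(c \mid J) \prod_{i=1}^{|d_v|} P(v_i \mid J)$$

Relevance model with motion information:

$$P(c, (d_v, d_m)) = \sum_{S \in T} P(S)\, P(c \mid S) \prod_{i=1}^{|d|} P(v_i \mid S)\, P(m_i \mid S)$$

T: training set, S: shots in the training set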
![Page 53: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/53.jpg)
53
Estimation of P(vi|S), P(mi|S)
• If discrete visterms, use smoothed maximum likelihood estimates.
• If continuous, use kernel density estimates.
• Take advantage of repeated instances of the same object in a shot.
![Page 54: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/54.jpg)
54
Plan
• Modify models to include dynamic information
• Train on TRECVID03 development dataset
• Test on TRECVID03 test dataset
  • Annotate the test set
  • Retrieve using TRECVID 2003 queries.
  • Evaluate retrieval performance using mean average precision
![Page 55: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/55.jpg)
55
Score Normalization Experiments
Presented by Desislava Petkova
![Page 56: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/56.jpg)
56
Motivation for Score Normalization
• Score probabilities are small
• But there seems to be discriminating power
• Try to use likelihood ratios
![Page 57: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/57.jpg)
57
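Bayes Optimal Decision Rule

$$P(w \mid s) = \frac{r(s)}{1 + r(s)}$$

$$r(s) = \frac{P(w \mid s)}{P(\bar{w} \mid s)} = \frac{P(s)\,P(w \mid s)}{P(s)\,P(\bar{w} \mid s)} = \frac{P(w)\,P(s \mid w)}{P(\bar{w})\,P(s \mid \bar{w})} = \frac{p(w)\,\mathrm{pdf}_{w}(s \mid w)}{p(\bar{w})\,\mathrm{pdf}_{\bar{w}}(s \mid \bar{w})}$$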
![Page 58: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/58.jpg)
58
Estimating Class-Conditional PDFs
For each word:
• Divide training images into positive and negative examples
• Create a model to describe the score distribution of each set
  • Gamma
  • Beta
  • Normal
  • Lognormal
• Revise word probabilities
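As a rough illustration of the score normalization idea, the sketch below turns a raw annotation score into a posterior using a likelihood ratio of two class-conditional densities. It assumes Normal densities (one of the four families listed) with parameters fitted elsewhere; the function names and the `(mean, std)` parameter format are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a Normal(mean, std) distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posterior_from_score(score, prior, pos, neg):
    """P(w|s) = r / (1 + r), where r is the prior-weighted likelihood ratio
    of the score under the positive and negative score distributions.
    pos, neg: (mean, std) of scores for images with / without the word."""
    r = (prior * normal_pdf(score, *pos)) / \
        ((1.0 - prior) * normal_pdf(score, *neg))
    return r / (1.0 + r)
```

A score that is equally likely under both distributions leaves the posterior at the prior; scores typical of positive examples push it toward 1.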
![Page 59: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/59.jpg)
59
Annotation Performance
Did not improve annotation performance on Corel or TREC
![Page 60: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/60.jpg)
60
Proposal: Using Clustering to Improve Concept Annotation
Desislava Petkova
Mount Holyoke College
17 August 2004
![Page 61: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/61.jpg)
61
Automatically annotating images
Corel:
• 5000 images
  • 4500 training
  • 500 testing
• Word vocabulary
  • 374 words
• Annotations
  • 1-5 words
• Image vocabulary
  • 500 visterms
![Page 62: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/62.jpg)
62
Relevance models for annotation
• A generative language modeling approach
• For a test image I = {v1, …, vm} compute the joint distribution of each word w in the vocabulary with the visterms of I
• Compare I with training images J annotated with w

$$P(w, I) = \sum_{J \in T} P(J)\, P(w, I \mid J)$$

$$P(w, I) = \sum_{J \in T} P(J)\, P(w \mid J) \prod_{i=1}^{m} P(v_i \mid J)$$
![Page 63: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/63.jpg)
63
Estimating P(w|J) and P(v|J)
• Use maximum-likelihood estimates
• Smooth with the entire training set T

$$P(w \mid J) = (1 - a)\,\frac{c(w, J)}{|J|} + a\,\frac{c(w, T)}{|T|}$$

$$P(v \mid J) = (1 - b)\,\frac{c(v, J)}{|J|} + b\,\frac{c(v, T)}{|T|}$$
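Putting the smoothed estimates together with the relevance-model sum gives a compact sketch of the joint probability P(w, I). The code assumes a uniform prior P(J) over training images, non-empty annotations and visterm lists, and illustrative smoothing weights `a` and `b`; none of these values come from the slides.

```python
from collections import Counter


def smoothed(count_j, size_j, count_t, size_t, lam):
    """(1 - lam) * c(x, J)/|J| + lam * c(x, T)/|T|."""
    return (1 - lam) * count_j / size_j + lam * count_t / size_t


def joint_word_image(word, image_visterms, training, a=0.1, b=0.1):
    """P(w, I) = sum_J P(J) P(w|J) prod_i P(v_i|J), with uniform P(J).
    training: list of (words, visterms) for annotated training images."""
    all_words = Counter(w for ws, _ in training for w in ws)
    all_vis = Counter(v for _, vs in training for v in vs)
    n_words, n_vis = sum(all_words.values()), sum(all_vis.values())
    total = 0.0
    for words, visterms in training:
        wc, vc = Counter(words), Counter(visterms)
        p = smoothed(wc[word], len(words), all_words[word], n_words, a)
        for v in image_visterms:
            p *= smoothed(vc[v], len(visterms), all_vis[v], n_vis, b)
        total += p / len(training)  # uniform prior P(J) = 1/|T|
    return total
```

Ranking every vocabulary word by `joint_word_image` for a test image and keeping the top few yields its predicted annotation.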
![Page 64: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/64.jpg)
64
Motivation
• Estimating the relevance model of a single image is a noisy process
  • P(v|J): visterm distributions are sparse
  • P(w|J): human annotations are incomplete
• Use clustering to get better estimates
![Page 65: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/65.jpg)
65
Potential benefits of clustering
• {cat, grass, tiger, water}
• {cat, grass, tiger}, missing {water}
• {cat, grass, tiger, tree}
• {grass, tiger, water}, missing {cat}

Words marked as missing (shown in red on the original slide) are absent from the annotation
![Page 66: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/66.jpg)
66
Relevance Models with Clustering
• Cluster the training images using K-means
  • Use both visterms and annotations
• Compute the joint distribution of visterms and words in each cluster
• Use clusters instead of individual images

$$P(w, I) = \sum_{C \in T} P(C)\, P(w \mid C) \prod_{i=1}^{m} P(v_i \mid C)$$
![Page 67: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/67.jpg)
67
Preliminary results on annotation performance

| Model | mAP |
| --- | --- |
| Standard relevance model (4500 training examples) | 0.14 |
| Relevance model with clusters (100 training examples) | 0.128 |
![Page 68: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/68.jpg)
68
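Cluster-based smoothing
• Smooth maximum likelihood estimates for the training images based on the clusters they belong to

$$P(w \mid J) = (1 - a_1 - a_2)\,\frac{c(w, J)}{|J|} + a_1\,\frac{c(w, C_J)}{|C_J|} + a_2\,\frac{c(w, T)}{|T|}$$

$$P(v \mid J) = (1 - b_1 - b_2)\,\frac{c(v, J)}{|J|} + b_1\,\frac{c(v, C_J)}{|C_J|} + b_2\,\frac{c(v, T)}{|T|}$$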
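The three-way interpolation between image, cluster and collection estimates can be written as one small function. This is a sketch with hypothetical argument names; the weights used in the usage line are made up and would in practice be tuned on validation data.

```python
def cluster_smoothed(c_j, n_j, c_cluster, n_cluster, c_t, n_t, a1, a2):
    """(1 - a1 - a2) * c(x,J)/|J| + a1 * c(x,C_J)/|C_J| + a2 * c(x,T)/|T|.
    c_*: counts of the word or visterm in the image, its cluster, and the
    whole training set; n_*: the corresponding sizes."""
    return ((1 - a1 - a2) * c_j / n_j
            + a1 * c_cluster / n_cluster
            + a2 * c_t / n_t)


# Usage: a word seen once among 4 annotation words of the image, 3 times
# among 10 in its cluster, and 5 times among 100 in the training set.
p = cluster_smoothed(1, 4, 3, 10, 5, 100, a1=0.3, a2=0.1)
```

A word missing from an image's own annotation still receives mass from its cluster, which is the benefit the proposal aims for.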
![Page 69: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/69.jpg)
69
Experiments
• Optimize smoothing parameters
  • Divide training set
    • 4000 training images
    • 500 validation images
• Find the best set of clusters
• Query-dependent clusters
• Investigate soft clustering
![Page 70: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/70.jpg)
70
Evaluation plan
• Retrieval performance
  • Average precision and recall for one-word queries
• Comparison with the standard relevance model
![Page 71: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/71.jpg)
71
Hidden Markov Models for Image Annotations
Pavel Ircing
Sanjeev Khudanpur
![Page 72: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/72.jpg)
72
Presentation Outline

[Diagram: grid of query q (words, visterms) against document d (words, visterms)]

• Translation (MT) models (Paola)
• Relevance Models (Shaolei, Desislava)
• Graphical Models (Pavel, Brock)
• Text classification models (Matt)
• Integration & Summary (Dietrich)
![Page 73: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/73.jpg)
73
Model setup

[Figure: image blocks aligned with the annotation words tiger, ground, water, grass]

Training HMMs:
• separate HMM for each training image – states given by manual annotations.
• image blocks are “generated” by annotation words
• alignment between image blocks and annotation words is a hidden variable, models are trained using the EM algorithm (HTK toolkit)

Test HMM has |W| states, 2 scenarios:
(a) p(w’|w) uniform
(b) p(w’|w) from co-occurrence LM

Posterior probability from forward-backward pass used for p(w|Image)
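The test-time step, a forward-backward pass whose state posteriors feed p(w|Image), can be sketched in plain Python. This is not the HTK implementation: emission likelihoods are passed in precomputed, and averaging the block posteriors into a single per-word score is an assumption made here, since the slide does not say how the posteriors are aggregated.

```python
def state_posteriors(emission, transition, initial):
    """Forward-backward over a sequence of image blocks.
    emission[t][s] = p(block_t | word state s)
    transition[r][s] = p(s | r), initial[s] = p(s at the first block).
    Returns gamma[t][s], the posterior of state s at block t."""
    T, S = len(emission), len(initial)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T):  # forward pass, rescaled per step for stability
        for s in range(S):
            prev = initial[s] if t == 0 else sum(
                alpha[t - 1][r] * transition[r][s] for r in range(S))
            alpha[t][s] = prev * emission[t][s]
        z = sum(alpha[t])
        alpha[t] = [x / z for x in alpha[t]]
    for t in range(T - 2, -1, -1):  # backward pass, also rescaled
        for s in range(S):
            beta[t][s] = sum(transition[s][r] * emission[t + 1][r]
                             * beta[t + 1][r] for r in range(S))
        z = sum(beta[t])
        beta[t] = [x / z for x in beta[t]]
    gamma = []
    for t in range(T):
        g = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(g)
        gamma.append([x / z for x in g])
    return gamma


def word_scores(gamma):
    """Average state occupancy over blocks (assumed aggregation)."""
    return [sum(g[s] for g in gamma) / len(gamma)
            for s in range(len(gamma[0]))]
```

With uniform transitions, scenario (a), the posteriors reduce to the normalized per-block emission likelihoods; a co-occurrence LM in `transition` corresponds to scenario (b).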
![Page 74: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/74.jpg)
74
Challenges in HMM training
• Inadequate annotations
• There is no notion of order in the annotation words
  • Difficulties with automatic alignment between words and image regions
• No linear order in image blocks (assume raster-scan)
  • Additional spatial dependence between block-labels is missed
  • Partially addressed via a more complex DBN (see later)
![Page 75: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/75.jpg)
75
Inadequacy of the annotations
• Corel database
  • Annotators often mark only interesting objects
  • Example annotation: beach, palm, people, tree
• TRECVID database
  • Annotation concepts capture mostly semantics of the image and they are not very suitable for describing visual properties
  • Example annotation: car, transportation, vehicle, outdoors, non-studio setting, nature-non-vegetation, snow, man-made object
![Page 76: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/76.jpg)
76
Alignment problems
• There is no notion of order in the annotation words
• Difficulties with automatic alignment between words and image regions
![Page 77: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/77.jpg)
77
Gradual Training
• Identify a set of “background” words (sky, grass, water, ...)
• In the initial stages of HMM training
  • Allow only “background” states to have their individual emission probability distributions
  • All other objects share a single “foreground” distribution
• Run several EM iterations
• Gradually untie the “foreground” distribution and run more EM iterations
![Page 78: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/78.jpg)
78
Gradual Training Results
• Improved alignment of training images
• Annotation performance on test images did not change significantly
![Page 79: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/79.jpg)
79
Other training scenarios
• models were forced to visit every state during training
  • huge models, marginal difference in performance
• special states introduced to account for unlabelled background and unlabelled foreground, with different strategies for parameter tying
![Page 80: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/80.jpg)
80
Annotation performance - Corel

| Image features | LM | mAP |
| --- | --- | --- |
| Discrete | No | 0.120 |
| Discrete | Yes | 0.150 |
| Continuous (1 Gaussian per state) | No | 0.140 |
| Continuous (1 Gaussian per state) | Yes | 0.157 |

• Continuous features are better than discrete
• Co-occurrence language model also gives moderate improvement
![Page 81: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/81.jpg)
81
Annotation performance - TRECVID
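| Model | LM | mAP |
|---|---|---|
| 1 Gaussian per state | No | 0.094 |
| 1 Gaussian per state | Yes | X |
| 12 Gaussians per state | No | 0.145 |
| 12 Gaussians per state | Yes | X |

Continuous features only, no language model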
![Page 82: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/82.jpg)
82
Annotation Performance on TREC
![Page 83: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/83.jpg)
83
Summary: HMM-Based Annotation
Very encouraging preliminary results
  Effort started this summer, validated on Corel, and yielded competitive annotation results on TREC
Initial findings
  Proper normalization of the features is crucial for system performance: bug found and fixed on Friday!
  Simple HMMs seem to work best
    More complex training topology didn’t really help
    More complex parameter tying was only marginally helpful
Glaring gaps
  Need a good way to incorporate a language model
![Page 84: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/84.jpg)
84
Brock Pytlik
Johns Hopkins
Graphical Models for Image Annotation
+
Joint Segmentation and Labeling for Content Based Image Retrieval
![Page 85: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/85.jpg)
85
Outline
Graphical Models for Image Annotation
  Hidden Markov Models
    Preliminary Results
  Two-Dimensional HMMs
    Work in Progress
Joint Image Segmentation and Labeling
  Tree Structure Models of Image Segmentation
    Proposed Research
![Page 86: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/86.jpg)
86
Graphical Model Notation
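[Figure: HMM over three image blocks. Hidden concept states C1, C2, C3 each range over water, ground, grass, tiger; each state emits an observation O1, O2, O3 with probability p(o | c), and consecutive states are linked by transition probabilities p(c | c').]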
![Page 87: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/87.jpg)
87
Graphical Model Notation
[Figure: HMM over three image blocks with emissions p(o | c) and transitions p(c | c'). C1 is set to water; C2 and C3 still range over water, ground, grass, tiger.]
![Page 88: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/88.jpg)
88
Graphical Model Notation
[Figure: HMM over three image blocks with emissions p(o | c) and transitions p(c | c'). C1 and C2 are set to water; C3 still ranges over water, ground, grass, tiger.]
![Page 89: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/89.jpg)
89
Graphical Model Notation
[Figure: HMM over three image blocks with emissions p(o | c) and transitions p(c | c'). C1 and C2 are set to water and C3 is set to tiger.]
![Page 90: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/90.jpg)
90
Graphical Model Notation Simplified
An HMM for a 24-block Image
![Page 91: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/91.jpg)
91
Graphical Model Notation Simplified
An HMM for a 24-block Image
![Page 92: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/92.jpg)
92
Modeling Spatial Structure
An HMM for a 24-block Image
![Page 93: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/93.jpg)
93
Modeling Spatial Structure
Transition probabilities represent spatial extent of objects
An HMM for a 24-block Image
![Page 94: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/94.jpg)
94
Modeling Spatial Structure
Transition probabilities represent spatial extent of objects
A Two-Dimensional Model for a 24-block Image
![Page 95: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/95.jpg)
95
Modeling Spatial Structure
Transition probabilities represent spatial extent of objects
A Two-Dimensional Model for a 24-block Image
| Model | Training Time Per Image | Training Time Per Iteration |
|---|---|---|
| 1-D HMM | 0.5 sec | 37.5 min |
| 2-D HMM | 110 sec | 8250 min = 137.5 hr |
![Page 96: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/96.jpg)
96
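Bag-of-Annotations Training
Unlike ASR, Annotation Words are Unordered
Constraint on C_t for an image annotated with Tiger, Sky, Grass (M_t is observed to be 1):
$$p(M_t = 1) = \begin{cases} 1 & \text{if } c_t \in \{\text{tiger}, \text{grass}, \text{sky}\} \\ 0 & \text{otherwise} \end{cases}$$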
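A minimal sketch of how the bag-of-annotations constraint could enter a forward pass, assuming a 1-D HMM with integer state indices. The function name `constrained_forward` and its argument layout are hypothetical and chosen only for illustration; states outside the annotation set simply receive zero mass at every block.

```python
def constrained_forward(init, trans, emit, allowed):
    """Likelihood of the observations when only annotated states may be used.

    init:    list of initial state probabilities p(c_1 = j)
    trans:   trans[i][j] = p(c_t = j | c_{t-1} = i)
    emit:    emit[t][j] = p(o_t | c_t = j) for block t
    allowed: set of state indices named in the image annotation
    """
    n_states = len(init)
    # A state outside the annotation has p(M_t = 1) = 0, so it carries no mass.
    alpha = [init[j] * emit[0][j] if j in allowed else 0.0
             for j in range(n_states)]
    for t in range(1, len(emit)):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[t][j]
            if j in allowed else 0.0
            for j in range(n_states)
        ]
    return sum(alpha)
```

With every state allowed this reduces to the ordinary forward algorithm, which is a convenient sanity check.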
![Page 97: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/97.jpg)
97
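Bag-of-Annotations Training (II)
Forcing Annotation Words to Contribute
Only permit paths that visit every annotation word.
$$M_t^{(1)} = M_{t-1}^{(1)} \lor (C_t = \text{tiger})$$
$$M_t^{(2)} = M_{t-1}^{(2)} \lor (C_t = \text{grass})$$
$$M_t^{(3)} = M_{t-1}^{(3)} \lor (C_t = \text{sky})$$
[Figure: state C_t feeding the flag variables M_t^{(1)}, M_t^{(2)}, M_t^{(3)}]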
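The flag variables on this slide can be illustrated on a single, already decoded state path. This is only a sketch: the function name `visits_all_words` is hypothetical, and a real trainer would carry the flags inside the forward-backward recursion rather than check finished paths.

```python
def visits_all_words(path, annotation):
    """Return True if the state path visits every annotation word at least once."""
    flags = {word: False for word in annotation}  # every flag starts at 0
    for state in path:
        for word in flags:
            # Flag update: M_t = M_{t-1} OR (C_t == word)
            flags[word] = flags[word] or (state == word)
    return all(flags.values())
```

A path is permitted only when all flags are set after the last block.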
![Page 98: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/98.jpg)
98
Inference on Test Images
Forward Decoding
$$p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)}$$
![Page 99: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/99.jpg)
99
Inference on Test Images
Forward Decoding
$$p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)} = \frac{\sum_{S \ni c} \left[ \prod_{i=1}^{N} p(v_i \mid s_i) \right] p(S)}{p(d_v)}$$
![Page 100: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/100.jpg)
100
Inference on Test Images
Forward Decoding
$$p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)} = \frac{\sum_{S \ni c} \left[ \prod_{i=1}^{N} p(v_i \mid s_i) \right] p(S)}{\sum_{S} \left[ \prod_{i=1}^{N} p(v_i \mid s_i) \right] p(S)}$$
![Page 101: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/101.jpg)
101
Inference on Test Images
Forward Decoding
$$p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)} = \frac{\sum_{S \ni c} \left[ \prod_{i=1}^{N} p(v_i \mid s_i) \right] p(S)}{\sum_{S} \left[ \prod_{i=1}^{N} p(v_i \mid s_i) \right] p(S)}$$
Viterbi Decoding
  Approximate Sum over all Paths with the Best Path
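One way to evaluate the forward-decoding posterior for a 1-D HMM is to run the forward recursion twice: once over all state paths, and once with the concept's state removed. The paths that contain the concept are then the difference. This is a sketch under assumptions: the function names and the integer state encoding are illustrative and not taken from the workshop system.

```python
def forward_likelihood(init, trans, emit, banned=None):
    """Sum over all state paths of prod_i p(v_i | s_i) * p(S), skipping `banned`."""
    n = len(init)
    keep = [0.0 if j == banned else 1.0 for j in range(n)]
    alpha = [init[j] * emit[0][j] * keep[j] for j in range(n)]
    for t in range(1, len(emit)):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[t][j] * keep[j]
            for j in range(n)
        ]
    return sum(alpha)


def concept_posterior(init, trans, emit, concept):
    """p(c | d_v): share of the total path mass on paths that visit state `concept`."""
    total = forward_likelihood(init, trans, emit)
    avoiding = forward_likelihood(init, trans, emit, banned=concept)
    return (total - avoiding) / total
```

Viterbi decoding would replace each inner `sum` with `max`, trading the exact sum for the single best path.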
![Page 102: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/102.jpg)
102
Annotation Performance on Corel Data
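| Model | Image Features | mAP |
|---|---|---|
| (model diagram not recovered) | Discrete | 0.071 |
| (model diagram not recovered) | Discrete | 0.086 |
| | Continuous | 0.074 |
| (model diagram not recovered) | Discrete | Training |
| | Continuous | TBD |

Working with 2-D models needs further study
mAP not yet on par with other models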
![Page 103: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/103.jpg)
103
Future Work
Improved Training for Two-Dimensional Models
$$p(c_{i,j} \mid c_{i-1,j}, c_{i,j-1}) \propto p(c_{i,j} \mid c_{i-1,j}) \, p(c_{i,j} \mid c_{i,j-1})$$
  Permits training horizontal and vertical chains separately
  Other variations could be investigated
Next Idea
  Joint Image Segmentation and Labeling
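A small sketch of the proposed factorization for two-dimensional models: the label distribution of a block is taken proportional to the product of a vertical and a horizontal transition, each of which could be trained as a separate 1-D chain. The table names `vertical` and `horizontal` and the explicit renormalization are assumptions made for illustration.

```python
def two_d_transition(vertical, horizontal, up, left):
    """Distribution over the label of block (i, j) given its two neighbours.

    vertical[a][c]   = p(c | label of the block above is a)
    horizontal[b][c] = p(c | label of the block to the left is b)
    """
    scores = [v * h for v, h in zip(vertical[up], horizontal[left])]
    total = sum(scores)
    return [s / total for s in scores]  # renormalize the product
```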
![Page 104: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/104.jpg)
104
Joint Segmentation and Labeling
tiger, grass, sky
![Page 105: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/105.jpg)
105
Joint Segmentation and Labeling
tiger, grass, sky
![Page 106: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/106.jpg)
106
Joint Segmentation and Labeling
tiger, grass, sky
![Page 107: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/107.jpg)
107
Joint Segmentation and Labeling
tiger, grass, sky
[Figure: image segmented into regions labeled sky, tiger, and grass]
![Page 108: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/108.jpg)
108
Research Proposal
A Generative Model for Joint Segmentation and Labeling
  Tree construction by agglomerative clustering of image regions (blocks) based on visual similarity
  Segmentation = A cut across the resulting tree
  Labeling = Assigning concepts to resulting leaves
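To make "a cut across the tree" concrete, the sketch below enumerates every cut of a small clustering tree, where a cut is a set of nodes that covers each image block exactly once. The nested `(name, children)` tuple layout and the function name `enumerate_cuts` are assumptions for illustration only.

```python
from itertools import product


def enumerate_cuts(tree):
    """tree is (name, [children]); return every cut as a list of node names."""
    name, children = tree
    cuts = [[name]]  # cutting here keeps the whole subtree as one segment
    if children:
        # Otherwise cut independently inside every child subtree and combine.
        for combo in product(*(enumerate_cuts(child) for child in children)):
            cuts.append([node for part in combo for node in part])
    return cuts
```

For a root with an internal child `A` (blocks `b1`, `b2`) and a leaf child `b3`, the cuts are the root alone, `A` with `b3`, and the three blocks separately. The number of cuts grows quickly with tree size, which is why the later slides discuss uniform and greedy alternatives.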
![Page 109: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/109.jpg)
109
Model
General Model
$$p(c, d_v) = \sum_{u \in \text{cuts}(\text{tree}(d_v))} p(u) \prod_{l \in \text{leaves}(u)} p(c_l \mid u, l) \, p(\text{obs}(l) \mid c_l)$$
![Page 110: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/110.jpg)
110
Model
General Model
$$p(c, d_v) = \sum_{u \in \text{cuts}(\text{tree}(d_v))} p(u) \prod_{l \in \text{leaves}(u)} p(c_l \mid u, l) \, p(\text{obs}(l) \mid c_l)$$
Probability of Cut: $p(u)$
![Page 111: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/111.jpg)
111
Model
General Model
$$p(c, d_v) = \sum_{u \in \text{cuts}(\text{tree}(d_v))} p(u) \prod_{l \in \text{leaves}(u)} p(c_l \mid u, l) \, p(\text{obs}(l) \mid c_l)$$
Probability of Label Given Cut and Leaf: $p(c_l \mid u, l)$
![Page 112: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/112.jpg)
112
Model
General Model
$$p(c, d_v) = \sum_{u \in \text{cuts}(\text{tree}(d_v))} p(u) \prod_{l \in \text{leaves}(u)} p(c_l \mid u, l) \, p(\text{obs}(l) \mid c_l)$$
Probability of Observation Given Label: $p(\text{obs}(l) \mid c_l)$
![Page 113: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/113.jpg)
113
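Model
General Model
$$p(c, d_v) = \sum_{u \in \text{cuts}(\text{tree}(d_v))} p(u) \prod_{l \in \text{leaves}(u)} p(c_l \mid u, l) \, p(\text{obs}(l) \mid c_l)$$
Independent Generation of Observations Given Label
$$p(c, d_v) = \sum_{u \in \text{cuts}(\text{tree}(d_v))} p(u) \prod_{l \in \text{leaves}(u)} p(c_l \mid u, l) \prod_{o \in \text{child}(l)} p(o \mid c_l)$$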
![Page 114: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/114.jpg)
114
Estimating Model Parameters
Suitable independence assumptions may need to be made
  All cuts are equally likely?
  Given a cut, leaf labels have a Markov dependence
  Given a label, its image footprint is independent of neighboring image regions
Work out EM algorithm for this model
![Page 115: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/115.jpg)
115
Estimating Cuts given Topology
Uniform
  All cuts containing $|c|$ leaves or more equally likely
  Hypothesize number of segments produced
    Hypothesize which possible segmentation used
Greedy Choice
  Pick node with largest observation probability remaining that produces a valid segmentation
    Repeat until all observations accounted for
  Changes Model
    No longer distribution over cuts
    Affects valid labeling strategies
![Page 116: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/116.jpg)
116
Estimating Labels Given Cuts
Uniform
  Like HMM training with fixed concept transitions
Number of Children
  Sky often generates a large number of observations
  Canoe often generates a small number of observations
Co-occurrence Language Model
  Eliminates label independence given cut
  Could do two-pass model like MT group did (not exponential)
$$p(c \mid u, l) = \sum_{a \in C} \left[ \sum_{m \in \text{leaves}(u)} p_1(a \mid m) \right] p_2(c \mid a)$$
![Page 117: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/117.jpg)
117
Estimating Observations Given Labels
Label Generates its Observations Independently
Problem: Product of Children at least as high as Parent Score
Label Generates Composite Observation at Node
![Page 118: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/118.jpg)
118
Evaluation Plan
Evaluate on Corel Image set using mAP
TREC annotation task
![Page 119: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/119.jpg)
119
Questions?
![Page 120: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/120.jpg)
120
Predicting Visual Concepts From Text
Presented by Matthew Krause
![Page 121: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/121.jpg)
121
Presentation Outline
[Figure: matrix of query (words, visterms) against document (words, visterms)]
Translation (MT) models (Paola)
Relevance Models (Shao Lei, Desislava)
Graphical Models (Pavel, Brock)
Text classification models (Matt)
Integration & Summary (Dietrich)
![Page 122: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/122.jpg)
122
A Motivating Example
![Page 123: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/123.jpg)
123
A Motivating Example
<Word stime="177.09" dur="0.22" conf="0.727"> IT'S </Word>
<Word stime="177.31" dur="0.25" conf="0.963"> MUCH </Word>
<Word stime="177.56" dur="0.11" conf="0.976"> THE </Word>
<Word stime="177.67" dur="0.29" conf="0.977"> SAME </Word>
<Word stime="177.96" dur="0.14" conf="0.980"> IN </Word>
<Word stime="178.10" dur="0.13" conf="0.603"> THE </Word>
<Word stime="178.38" dur="0.57" conf="0.953"> SUMMERTIME </Word>
<Word stime="178.95" dur="0.50" conf="0.976"> GLACIER </Word>
<Word stime="179.45" dur="0.60" conf="0.974"> AVALANCHE </Word>
![Page 124: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/124.jpg)
124
Concepts
Assume there is a hidden variable c which generates query words from a document’s visterms.
$$p(q_v \mid d_w) = \sum_{C} p(q_v \mid d_w, c) \, p(c \mid d_w) \cong \sum_{C} p(q_v \mid c) \, p(c \mid d_w)$$
![Page 125: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/125.jpg)
125
ASR Features Example
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW YOU THE CHECHEN CAPITAL OF GROZNY
![Page 126: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/126.jpg)
126
Building Features
Insert Sentence Boundaries
Case Restoration
Noun Extraction, Named Entity Detection
WordNet Processing
Feature Set
![Page 127: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/127.jpg)
127
ASR Features Example
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW YOU THE CHECHEN CAPITAL OF GROZNY
![Page 128: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/128.jpg)
128
ASR Features Example
STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE.
OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES.
HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW.
YOU THE CHECHEN CAPITAL OF GROZNY
![Page 129: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/129.jpg)
129
ASR Features Example
Steve Fossett and his balloon Solo Spirit arsenide.
Over the Black Sea drifting slowly towards the coast of the caucuses.
His team plans if necessary to bring him down after daylight tomorrow.
you the Chechan capital of Grozny….
![Page 130: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/130.jpg)
130
ASR Features Example
Steve Fossett and his balloon Solo Spirit arsenide.
Over the Black Sea drifting slowly towards the coast of the caucuses.
His team plans if necessary to bring him down after daylight tomorrow.
you the Chechan capital of Grozny.
Named Entities
  Male Person, Location (Region)
![Page 131: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/131.jpg)
131
ASR Features Example
Steve Fossett and his balloon Solo Spirit arsenide.
Over the Black Sea drifting slowly towards the coast of the caucuses.
His team plans if necessary to bring him down after daylight tomorrow.
you the Chechan capital of Grozny.
Named Entities
  Male Person, Location (Region)
![Page 132: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/132.jpg)
132
ASR Features Example
Steve Fossett and his balloon Solo Spirit arsenide.
Over the Black Sea drifting slowly towards the coast of the caucuses.
His team plans if necessary to bring him down after daylight tomorrow.
you the Chechan capital of Grozny.
Named Entities
  Male Person, Location (Region)
Nouns
  balloon, solo, spirit, coast, caucus, team, daylight, Chechan, capital, Grozny
WordNet
  nature
![Page 133: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/133.jpg)
133
Feature Selection
Basic feature set (nouns + NEs) has ~18,000 elements/shot
  6000 elements x {previous, this, next}
Using only a subset of the possible features may affect performance.
Two strategies for feature selection:
  Remove very rare words (18,000 → 7902)
  Eliminate low-value features
![Page 134: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/134.jpg)
134
Information Gain
Measures the change in entropy given the value of a single feature
$$\text{Gain}(C, F) = H(C) - \sum_{w \in \text{Values}(F)} p(w) \, H(C \mid F = w)$$
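Information gain, as defined on this slide, can be computed directly from label and feature columns. The sketch below assumes one concept label and one feature value per shot, held in two equal-length lists; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(C) in bits, estimated from a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def information_gain(labels, feature_values):
    """Gain(C, F) = H(C) - sum over w of p(w) * H(C | F = w)."""
    n = len(labels)
    gain = entropy(labels)
    for w in set(feature_values):
        subset = [c for c, f in zip(labels, feature_values) if f == w]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```

A feature that splits the shots perfectly by concept yields a gain equal to H(C); a feature independent of the concept yields zero.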
![Page 135: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/135.jpg)
135
Information Gain Results
Basketball
1. (empty)
2. Location-city
3. (empty) (previous)
4. “game” (previous)
5. “game”
6. Person-male
7. “point” (previous)
8. “game” (next)
9. “basketball” (previous)
10. “win”
11. (empty) (next)
12. “basketball”
13. “point”
14. “title” (previous)
15. “win” (previous)
Sky
1. Person-male (previous)
2. “car” (previous)
3. Person
4. Person-male
5. “jury”
6. Person (next)
7. (empty) (next)
8. “point”
9. “report”
10. “point” (next)
11. “change” (previous)
12. “research” (next)
13. “fiber” (previous)
14. “retirement” (next)
15. “look”
![Page 136: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/136.jpg)
136
Choosing an optimal number of features
[Figure: AP (y-axis, 0.56 to 0.58) plotted against Number of Features (x-axis, 250 to 7250)]
![Page 137: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/137.jpg)
137
Classifiers
Naïve Bayes
Decision Trees
Support Vector Machines
Voted Perceptrons
Language Model
AdaBoosted Naïve Bayes & Decision Stumps
Maximum Entropy
![Page 138: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/138.jpg)
138
Naïve Bayes
Build a binary classifier (present/absent) for each concept.
$$p(c \mid d_w) = \frac{p(d_w \mid c) \, p(c)}{p(d_w)}$$
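A minimal sketch of one binary present/absent classifier built on Bayes' rule, assuming each shot is a list of text features. The multinomial word model, the add-one smoothing, and the function names are assumptions made for illustration; the slides do not state which Naïve Bayes variant was used. The sketch also assumes both classes occur in the training data.

```python
from collections import Counter
from math import exp, log


def train_nb(docs, labels):
    """docs: list of feature lists; labels: True if the concept is present."""
    vocab = {w for d in docs for w in d}
    model = {}
    for c in (True, False):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        model[c] = {
            "prior": len(class_docs) / len(docs),  # p(c)
            # p(w | c) with add-one smoothing (an assumption of this sketch)
            "word": {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab},
        }
    return model


def posterior_present(model, doc):
    """p(concept present | d_w), normalizing over the two classes."""
    log_joint = {}
    for c, params in model.items():
        score = log(params["prior"])
        for w in doc:
            if w in params["word"]:  # features unseen in training are skipped
                score += log(params["word"][w])
        log_joint[c] = score
    m = max(log_joint.values())
    norm = sum(exp(v - m) for v in log_joint.values())
    return exp(log_joint[True] - m) / norm
```

Ranking shots by this posterior gives the ordered list from which average precision is computed.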
![Page 139: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/139.jpg)
139
Language Modeling
Conceptually similar to Naïve Bayes but
  Multinomial
  Smoothed distributions
  Different feature selection
![Page 140: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/140.jpg)
140
Maximum Entropy Classification
Binary constraints
Single 75-concept model
Ranked list of concepts for each shot.
![Page 141: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/141.jpg)
141
Results on the most common concepts
[Figure: bar chart of AP (y-axis, 0 to 0.6) for the concepts text, non_studio, face, indoors, outdoors, people, person; series: Chance, Lang Model, Naïve Bayes, MaxEnt]
![Page 142: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/142.jpg)
142
Results on selected concepts
[Figure: bar chart of AP (y-axis, 0 to 0.6) for the concepts weather, basketball, face, sky, indoors, beach, vehicle, car; series: Chance, Lang Model, Naïve Bayes, MaxEnt]
![Page 143: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/143.jpg)
143
Mean Average Precision
[Figure: bar chart of mean AP (y-axis, 0 to 0.14) for Chance, Language Model, SVM, Naïve Bayes, Max Ent]
![Page 144: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/144.jpg)
144
Will this help for retrieval?
“Find shots of a person diving into some water.”
  person, water_body, non-studio_setting, nature_non-vegetation, person_action, indoors
“Find shots of the front of the White House in the daytime with the fountain running.”
  building, outdoors, sky, water_body, cityscape, house, nature_vegetation
“Find shots of Congressman Mark Souder.”
  person, face, indoors, briefing_room_setting, text_overlay
![Page 145: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/145.jpg)
145
Will this help for retrieval?
“Find shots of a person diving into some water.”
  person, water_body, non-studio_setting, nature_non-vegetation, person_action, indoors
“Find shots of the front of the White House in the daytime with the fountain running.”
  building, outdoors, sky, water_body, cityscape, house, nature_vegetation
“Find shots of Congressman Mark Souder.”
  person, face, indoors, briefing_room_setting, text_overlay
![Page 146: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/146.jpg)
146
Performance on retrieval-relevant concepts

| Concept | Importance | AP | Chance |
|---|---|---|---|
| outdoors | 0.68 | 0.434 | 0.270 |
| person | 0.48 | 0.267 | 0.227 |
| vehicle | 0.36 | 0.106 | 0.043 |
| man-made-obj. | 0.28 | 0.190 | 0.156 |
| sky | 0.40 | 0.119 | 0.061 |
| face | 0.28 | 0.582 | 0.414 |
| building | 0.24 | 0.078 | 0.042 |
| road | 0.24 | 0.055 | 0.037 |
| transportation | 0.24 | 0.151 | 0.065 |
| indoors | 0.24 | 0.459 | 0.317 |
![Page 147: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/147.jpg)
147
Summary
Predict visual concepts from ASR
Tried Naïve Bayes, SVMs, MaxEnt, Language Models, …
Expect improvements in retrieval
![Page 148: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/148.jpg)
148
Joint Visual-Text Video OCR
Proposed by: Matthew Krause
Georgetown University
![Page 149: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/149.jpg)
149
Motivation
TREC queries ask for:
  specific persons
  specific places
  specific events
  specific locations
![Page 150: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/150.jpg)
150
Motivation
“Find shots of Congressman Mark Souder”
![Page 151: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/151.jpg)
151
Motivation
“Find shots of a graphic of Dow Jones Industrial Average showing a rise for one day. The number of points risen that day must be visible.”
![Page 152: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/152.jpg)
152
Motivation
“Find shots of the Tomb of the Unknown Soldier in Arlington National Cemetery.”
![Page 153: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/153.jpg)
153
Motivation
WEIFll I1 NFWdJ TNNIF H
![Page 154: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/154.jpg)
154
Joint Visual-Text Video OCR
Goal: Improve video OCR accuracy by exploiting other information in the audio and video streams during recognition.
![Page 155: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/155.jpg)
155
Why use video OCR?
… Sources tell C.N.N. there’s evidence that links those incidents with the January bombing of a women’s health clinic in Birmingham, Alabama. Pierre Thomas joins us now from Washington. He has more on the story in this live report…
![Page 156: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/156.jpg)
156
Why use video OCR?
![Page 157: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/157.jpg)
157
Why use video OCR?
![Page 158: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/158.jpg)
158
Why use video OCR?
Those links are growing more intensive investigative focus toward fugitive Eric Rudolph who’s been charged in the Birmingham bombing which killed an off-duty policeman…
![Page 159: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/159.jpg)
159
Why use video OCR?
Text overlays provide high precision information about query-relevant concepts in the current image.
![Page 160: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/160.jpg)
160
Finding Text
Use existing tools and data from IBM/CMU.
![Page 161: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/161.jpg)
161
Image Processing
Preprocessing
  Normalize the text region’s height
Feature extraction
  Color
  Edge Strength and Orientation
![Page 162: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/162.jpg)
162
Proposal: HMM-based recognizer
c1 c2 c3 c4 c5 c6
M A I T K
![Page 163: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/163.jpg)
163
Proposal: Cache-based LMs
Augment the recognizers with an interpolation of language models
  Background language model
  Cache-based language model
    ASR or closed caption text
  “Interesting” words from the cache
    Named Entities
$$p(c_i \mid h) = p_{\text{bg}}(c_i \mid h)^{\lambda_1} \, p_{\text{cache}}(c_i \mid h)^{\lambda_2} \, p_{\text{interest}}(c_i \mid h)^{\lambda_3}$$
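A sketch of how three character-level language models might be combined multiplicatively with weights as exponents, for one fixed history h. The dictionary layout, the function name, and the explicit renormalization over the character set are assumptions of this sketch; the slide itself gives only the weighted product.

```python
def combine_language_models(p_bg, p_cache, p_interest, weights):
    """Each p_* maps a character to p(c | h) for one fixed history h."""
    l1, l2, l3 = weights
    scores = {
        c: (p_bg[c] ** l1) * (p_cache[c] ** l2) * (p_interest[c] ** l3)
        for c in p_bg
    }
    total = sum(scores.values())
    # Renormalize so the combined scores form a distribution (assumption).
    return {c: s / total for c, s in scores.items()}
```

A larger weight on the cache model pulls the recognizer toward characters that continue words recently seen in the ASR or closed-caption text.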
![Page 164: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/164.jpg)
164
Evaluation
Evaluate on TRECVID data
Character Error Rate
  Compare vs. manual transcriptions
Mean Average Precision
  NIST-provided relevance judgments
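Character error rate is conventionally the edit distance between the recognized string and the manual transcription, divided by the length of the transcription. The sketch below uses that standard definition; the function names are illustrative, and the slides do not specify the exact scoring tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER of OCR output `hyp` against a non-empty manual transcription `ref`."""
    return edit_distance(ref, hyp) / len(ref)
```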
![Page 165: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/165.jpg)
165
Summary
Information from text overlays appears to be useful for IR.
General character recognition is a hard problem.
Adding in external knowledge sources via the LMs should improve accuracy.
![Page 166: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/166.jpg)
166
Work Plan
1. Text Localization
   IBM/CMU text finders + height normalization
2. Image Processing & Feature Extraction
   Begin with color and edge features
3. HMM-based Recognizer
   Train using TREC data with hand-labeled captions
4. Language Modeling
   Background, Cache, and “Interesting Words”
![Page 167: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/167.jpg)
167
Retrieval Experiments and
Summary
Presented by Dietrich Klakow
![Page 168: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/168.jpg)
168
Presentation Outline
[Figure: matrix of query (words, visterms) against document (words, visterms)]
Translation (MT) models (Paola)
Relevance Models (Shao Lei, Desislava)
Graphical Models (Pavel, Brock)
Text classification models (Matt)
Integration & Summary (Dietrich)
![Page 169: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/169.jpg)
169
The Matrix
[Figure: 2×2 matrix with query rows (Words $q_w$, Visterms $q_v$) and document columns (Words $d_w$, Visterms $d_v$)]
$$p(q_w, q_v \mid d_w, d_v)$$
![Page 170: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/170.jpg)
170
The Matrix
| Query \ Document | Words $d_w$ | Visterms $d_v$ |
|---|---|---|
| Words $q_w$ | $p(q_w \mid d_w)$ | $p(q_w \mid d_v)$ |
| Visterms $q_v$ | $p(q_v \mid d_w)$ | $p(q_v \mid d_v)$ |
![Page 171: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/171.jpg)
171
The Matrix

| Query \ Document | Words $d_w$ | Visterms $d_v$ |
|---|---|---|
| Words $q_w$ | $p(q_w \mid d_w)$ | $p(q_w \mid d_v)$: MT, Relevance Models, HMM |
| Visterms $q_v$ | $p(q_v \mid d_w)$: Naïve Bayes, Max. Ent, LM, SVM, AdaBoost, … | $p(q_v \mid d_v)$ |
![Page 172: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/172.jpg)
172
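Retrieval Model I: p(q|d)

$$p(q_w, q_v \mid d_w, d_v) = p(q_w \mid d_w, d_v) \times p(q_v \mid d_w, d_v)$$

$$p(q_w \mid d_w, d_v) = \lambda_w \, p(q_w \mid d_w) + (1 - \lambda_w) \, p(q_w \mid d_v)$$

$p(q_w \mid d_w)$: baseline, standard text retrieval
$p(q_w \mid d_v)$: text query, image documents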
![Page 173: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/173.jpg)
173
Retrieval Model I: p(q|d)

$$p(q_w, q_v \mid d_w, d_v) = [\lambda_w \, p(q_w \mid d_w) + (1 - \lambda_w) \, p(q_w \mid d_v)] \times [\lambda_v \, p(q_v \mid d_w) + (1 - \lambda_v) \, p(q_v \mid d_v)]$$

Only minor improvements over baseline
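As a rough illustration of Retrieval Model I, the sketch below scores one query/document pair as the product of two linear interpolations: one mixing the word-query models with weight `lam_w`, one mixing the visterm-query models with weight `lam_v`. The function names and the probability values are hypothetical and chosen only for illustration; the slides do not say how the component probabilities or the weights were obtained.

```python
def interpolate(p_from_words, p_from_visterms, lam):
    """Linear interpolation: lam * p_from_words + (1 - lam) * p_from_visterms."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_from_words + (1.0 - lam) * p_from_visterms


def model_one_score(p_qw_dw, p_qw_dv, p_qv_dw, p_qv_dv, lam_w, lam_v):
    """Product of the interpolated word-query and visterm-query factors."""
    word_factor = interpolate(p_qw_dw, p_qw_dv, lam_w)
    visterm_factor = interpolate(p_qv_dw, p_qv_dv, lam_v)
    return word_factor * visterm_factor


# Hypothetical component probabilities for a single query/document pair.
score = model_one_score(0.02, 0.01, 0.05, 0.04, lam_w=0.7, lam_v=0.5)
print(score)
```

Setting `lam_w = 1.0` drops the visual evidence from the word-query factor, which reduces that factor to the text-only baseline $p(q_w \mid d_w)$.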
![Page 174: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/174.jpg)
174
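Retrieval Model II: p(q|d)
We want to estimate $p(q_w, q_v, d_w, d_v)$
Assume pairwise marginals given:

$$\sum_{q_v, d_w} p(q_w, q_v, d_w, d_v) = p(q_w, d_v)$$

Setting: Maximum Entropy problem
4 constraints
1 iteration of GIS:

$$p(q_w, q_v \mid d_w, d_v) \propto p(q_w \mid d_w)^{\lambda_1} \, p(q_w \mid d_v)^{\lambda_2} \, p(q_v \mid d_w)^{\lambda_3} \, p(q_v \mid d_v)^{\lambda_4}$$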
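The log-linear form of Retrieval Model II (a product of the four component probabilities, each raised to its own weight) can be sketched as a weighted sum of log-probabilities, which is enough for ranking because the normalizer does not depend on the document. This is a minimal sketch under assumptions: the shot names, probabilities, weights, and the `floor` guard against zero probabilities are all hypothetical, and the GIS step that would set the weights is not shown.

```python
import math


def log_linear_score(probs, weights, floor=1e-12):
    """Unnormalized log-score: sum_i weights[i] * log(probs[i])."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    # Clamp probabilities away from zero so the logarithm stays finite.
    return sum(w * math.log(max(p, floor)) for p, w in zip(probs, weights))


def rank_documents(doc_probs, weights):
    """Return document ids sorted by decreasing log-linear score."""
    scores = {doc: log_linear_score(p, weights) for doc, p in doc_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical component probabilities per shot, in the order
# p(q_w|d_w), p(q_w|d_v), p(q_v|d_w), p(q_v|d_v).
doc_probs = {
    "shot_a": (0.020, 0.010, 0.050, 0.040),
    "shot_b": (0.015, 0.030, 0.020, 0.060),
}
weights = (1.0, 0.5, 0.5, 0.25)  # hypothetical lambda_1 .. lambda_4
print(rank_documents(doc_probs, weights))
```

With weights `(1, 0, 0, 0)` the score reduces to the log of the text-only baseline, so the baseline is a special case of the combination.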
![Page 175: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/175.jpg)
175
Baseline TRECVID: Text Retrieval
Retrieval mAP: 0.131
Matrix cell used: $p(q_w \mid d_w)$ (query words $q_w$ against document words $d_w$)
Best automatic run reported in the literature: 0.16
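For readers unfamiliar with the metric, mean average precision (mAP) can be sketched as follows, assuming the usual non-interpolated definition: for each query, average the precision at every rank where a relevant item appears, divide by the number of relevant items, then average over queries. The function names and the toy ranking are hypothetical, and the sketch assumes each ranked list contains no duplicate ids; the exact evaluation settings behind the numbers on the slides are not given here.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: iterable of (ranked_ids, relevant_ids) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: relevant shots "a" and "c" retrieved at ranks 1 and 3.
print(average_precision(["a", "b", "c", "d"], {"a", "c"}))
```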
![Page 176: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/176.jpg)
176
Combination with visual model
Matrix cells used: $p(q_w \mid d_w)$ (mAP: 0.131) and $p(q_w \mid d_v)$
![Page 177: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/177.jpg)
177
Combination with visual model
Retrieval mAP: 0.139
Matrix cells used: $p(q_w \mid d_w)$ (mAP: 0.131) and $p(q_w \mid d_v)$ (MT)

Concept annotation on images, mAP on TRECVID:

| Model | mAP |
|---|---|
| MT | 0.126 |
| Relevance Models | 0.158 |
| HMM | 0.145 |

MT: Best overall performance so far
![Page 178: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/178.jpg)
178
Combination with MT and ASR
Retrieval mAP: 0.149
Matrix cells used: $p(q_w \mid d_w)$ (mAP: 0.131), $p(q_w \mid d_v)$ (MT), and $p(q_v \mid d_w)$ (concepts from ASR: mAP=0.125)

Concept annotation on images, mAP on TRECVID:

| Model | mAP |
|---|---|
| MT | 0.126 |
| Relevance Models | 0.158 |
| HMM | 0.145 |

Best results reported in literature: retrieval mAP=0.162
![Page 179: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/179.jpg)
179
Recall-Precision Curve
[Plot: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1), curves for Best and Baseline]
Improvements in high precision region
![Page 180: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/180.jpg)
180
Difficulties and Limitations we faced
Annotations are inconsistent, sometimes abstract, …
Used plain vanilla features
• Color, texture, edge on key-frames
• No time for exploration of alternatives
Uniform block segmentation of images
Upper bound for concepts from ASR
![Page 181: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/181.jpg)
181
Future Work
Model
• Incompletely labelled images
• Inconsistent annotations
Get beyond the 75-concept bottleneck
• Larger concept set (+training data)
• Direct modelling
Better model for spatial and temporal dependencies in video
Query dependent processing
• E.g. image features, combination weights, OCR-features
Desislava
Shaolei and Brock
Matt
![Page 182: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/182.jpg)
182
Overall Summary
Concepts from image
• MT: CLIR with direct translation works best
• Relevance models: best numbers on development test
• HMM: novel competitive approach for image annotation
Concepts from ASR: oh my god, it works
Fusion: adding multiple sources in log-linear combination helped
Overall: 14% improvement
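The 14% figure is consistent with a relative gain from the text-only baseline (retrieval mAP 0.131) to the combination with MT and ASR (retrieval mAP 0.149). That pairing is an assumption, since the slide does not state which two numbers are compared. A quick check:

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Retrieval mAP values taken from the earlier slides (assumed pairing).
gain = relative_improvement(0.131, 0.149)
print(f"{gain:.1%}")  # prints 13.7%
```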
![Page 183: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/183.jpg)
183
Acknowledgments
TREC for the data
BBN for NE-tagging
IBM:
• for providing the features
• Close captioning alignment (Arnon Amir)
Help with GMTK: Jeff Bilmes and Karen Livescu
CLSP for the capitalizer (WS 03 MT-team)
INRIA for the face detector
NSF, DARPA and NSA for the money
CLSP for hosting
• Laura, Sue, Chris
• Eiwe, John, Peter
• Fred
![Page 184: Joint Visual-Text Modeling for Multimedia Retrieval](https://reader031.vdocuments.site/reader031/viewer/2022012920/61c8af24c8b64f27651ecfe1/html5/thumbnails/184.jpg)
184
From: http://www.nature.ca/notebooks/english/tiger.htm