combining text/image in wikipediamm task 2009clef.isti.cnr.it/2009/clef2009-workshop-slides/... ·...

Combining text/image in WikipediaMM task 2009

Christophe Moulin, Cecile Barat, Cedric Lemaître, Mathias Gery, Christophe Ducottet, Christine Largeron

Laboratoire Hubert Curien, Saint-Etienne, France

October 1st 2009

Christophe Moulin et al. (LaHC), Combining text/image in WikipediaMM task 2009, October 1st 2009

Outline

1 Model overview
    Textual vector space model
    Visual vocabulary
    Combining text and image modalities

2 Experiments

3 Conclusion and future work


Model overview

A textual/visual model based on the bag of words approach

[Diagram: documents → indexing (bag of words approach) → combining, with weights α and (1 − α)]


Model overview Textual vector space model

Textual vocabulary creation
Main steps of the textual bag of words creation:
stop words filtering → Porter stemming → bag of words creation



Textual vector weighting
Salton-based tf.idf weighting [1]

[Diagram: bag of words → vector of tf.idf weights [2]]

$w_{i,j} = tf_{i,j} \cdot idf_j$

$tf_{i,j}$: representativeness of term $j$ in document $i$

$idf_j$: discrimination power of term $j$

[1]: Salton et al. A vector space model for automatic indexing, 1975
[2]: Robertson et al. Okapi at TREC-3, 1994
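The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes tf is the raw term count and idf_j = log(N / df_j), neither of which is specified on the slides, and the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with w_ij = tf_ij * idf_j.

    Assumptions (not stated on the slides): tf is the raw term count and
    idf_j = log(N / df_j), where df_j is the number of documents containing term j.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


# Toy documents, already tokenized (stop word filtering and stemming omitted).
docs = [
    ["blue", "flower", "blue"],
    ["red", "flower"],
    ["blue", "sky"],
]
vectors = tfidf_vectors(docs)
```

A term that occurs in every document gets idf 0 under this variant, so it contributes nothing to the vector, which matches the "discrimination power" reading of idf.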



Exploiting the text around an image
Two sources of text: metadata + text extracted from the original Wikipedia articles

metadata of the Wikipedia image, as used in ImageCLEFwiki
original Wikipedia article (n characters around the image)
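Extracting "n characters around the image" could look like the sketch below. It is only an illustration under stated assumptions: the window is taken as n characters on each side of the image reference, the reference itself is dropped, and the image is located by a literal marker string. The function name `text_around`, the marker and the sample article are all hypothetical; the slides do not describe how the extraction was done.

```python
def text_around(article, marker, n_chars):
    """Return up to n_chars characters on each side of the first marker occurrence.

    Assumptions (not from the slides): the window is counted per side and the
    marker itself is excluded from the returned text.
    """
    pos = article.find(marker)
    if pos == -1:
        return ""  # the image is not referenced in this article
    end = pos + len(marker)
    before = article[max(0, pos - n_chars):pos]
    after = article[end:end + n_chars]
    return before + " " + after


# Hypothetical article text and image reference.
article = (
    "The tower is lit every evening. [[Image:tower.jpg]] "
    "It was completed in 1889 for the World's Fair."
)
context_50 = text_around(article, "[[Image:tower.jpg]]", 50)
context_100 = text_around(article, "[[Image:tower.jpg]]", 100)
```

The two calls mirror the "50 char" and "100 char" settings that appear in the run names of the results.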


Model overview Visual vocabulary

Visual representation
Similar to the text representation, using a visual codebook [3]

Visual vocabulary creation: descriptors → visual vocabulary
Image representation: descriptors → descriptors projection → bag of visual words → vector of tf.idf weights

[3]: Jurie et al. Creating efficient codebooks for visual recognition, 2005
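The projection step of the image representation can be illustrated as follows. This sketch assumes the visual vocabulary already exists (how it is built is not shown here) and that each descriptor is mapped to its nearest visual word under Euclidean distance, which is an assumption rather than something the slides state. All names and the toy 2-dimensional data are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (assumed Euclidean)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: squared_distance(descriptor, vocabulary[k]),
    )


def bag_of_visual_words(descriptors, vocabulary):
    """Count how many descriptors of one image fall on each visual word."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(vocabulary))]


# Toy vocabulary of three visual words and four descriptors from one image.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.2), (4.8, 5.1), (1.1, 0.8)]
histogram = bag_of_visual_words(descriptors, vocabulary)
```

The resulting counts play the role of term frequencies, so the same tf.idf weighting used for text can be applied to them.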


Model overview Visual vocabulary

Visual features computation
Two different descriptors are used:
meanstd (6 dimensions: 9350 visual words)
sift1 (128 dimensions: 9303 visual words)
sift2 (128 dimensions: 9630 visual words)

Image regions:
regular partitioning: 16 × 16 cells
interest regions based on MSER detector
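As a rough illustration of a 6-dimensional descriptor on a regular partitioning, the sketch below computes the mean and standard deviation of three colour channels in each cell. This reading of "meanstd" is an assumption suggested by the name and the dimension count; the slides do not define the descriptor or the colour space. The sketch also assumes that "16 × 16 cells" means a grid of 16 by 16 cells per image. Function names and the toy image are hypothetical.

```python
from statistics import fmean, pstdev


def meanstd_descriptor(cell_pixels):
    """Mean and standard deviation of each of three channels: 6 values.

    Assumption: this is one plausible reading of the 6-dimensional 'meanstd'
    descriptor, not a definition taken from the slides.
    """
    channels = list(zip(*cell_pixels))
    return [fmean(ch) for ch in channels] + [pstdev(ch) for ch in channels]


def grid_cells(image, rows=16, cols=16):
    """Yield the pixels of each cell of a rows x cols regular partitioning.

    The image is a list of pixel rows; it must be at least rows x cols pixels.
    """
    height, width = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            yield [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]


# Toy 32 x 32 image with a single uniform colour.
image = [[(10, 20, 30)] * 32 for _ in range(32)]
descriptors = [meanstd_descriptor(cell) for cell in grid_cells(image)]
```

Each image then yields one descriptor per cell, and those descriptors are what get projected onto the visual vocabulary.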


Model overview Combining text and image modalities

Score matching
Distance computed between query and document vectors

          query    documents
score1    tf       tf.idf
score2    tf.idf   tf.idf
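The two scoring variants differ only in how the query vector is weighted. The sketch below assumes the matching function is a plain inner product between sparse vectors; the slides say only that a distance is computed between query and document vectors, so the choice of inner product, the function names and the toy weights are all assumptions.

```python
from collections import Counter


def dot(u, v):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def score1(query_terms, doc_tfidf):
    """score1: query weighted by tf, document weighted by tf.idf."""
    return dot(Counter(query_terms), doc_tfidf)


def score2(query_terms, doc_tfidf, idf):
    """score2: query and document both weighted by tf.idf."""
    query_tfidf = {
        term: tf * idf.get(term, 0.0)
        for term, tf in Counter(query_terms).items()
    }
    return dot(query_tfidf, doc_tfidf)


# Toy idf table and one document vector (hypothetical values).
idf = {"blue": 0.5, "flower": 0.5, "sky": 1.0}
doc_tfidf = {"blue": 1.0, "flower": 0.5}
query = ["blue", "sky"]
s1 = score1(query, doc_tfidf)
s2 = score2(query, doc_tfidf, idf)
```

With score2, a query term with low idf contributes less to the match, whereas score1 treats all query terms equally.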


Model overview Combining text and image modalities

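Linear combination of textual and visual scores

[Diagram: the scores of the two bag of words representations are combined with weights α and (1 − α)]

A single global α is used for all queries; its value is set on the ImageCLEFwiki 2008 collection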
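A linear combination of per-document scores can be sketched as below. The slides do not say which modality α multiplies; this sketch assumes α weights the visual score and (1 − α) the textual score, and it treats a document missing from one modality as having score 0 there. Function names, document identifiers and score values are hypothetical.

```python
def combine(text_scores, visual_scores, alpha):
    """Return {doc: alpha * visual + (1 - alpha) * textual}.

    Assumptions (not from the slides): alpha weights the visual score, and a
    document absent from one modality scores 0.0 in that modality.
    """
    docs = set(text_scores) | set(visual_scores)
    return {
        doc: alpha * visual_scores.get(doc, 0.0)
        + (1 - alpha) * text_scores.get(doc, 0.0)
        for doc in docs
    }


def rank(scores):
    """Document identifiers sorted by decreasing combined score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy scores for one query.
text_scores = {"d1": 0.8, "d2": 0.6}
visual_scores = {"d2": 0.9, "d3": 0.4}
combined = combine(text_scores, visual_scores, alpha=0.25)
ranking = rank(combined)
```

Because α is a single scalar, choosing it on a previous collection amounts to trying a few values and keeping the one with the best retrieval quality there.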


Experiments

Global results

rank  participant/score  text      image              MAP     num ret  num rel ret

1     deuceng            TXT       -                  0.2397  43052    1351

5     lahc/score2        100 char  meanstd (α=0.025)  0.2178  44993    1213
6     lahc/score2        50 char   meanstd (α=0.025)  0.2148  44993    1218

14    lahc/score2        metadata  sift2 (α=0.084)    0.1903  44993    1212
15    lahc/score2        100 char  -                  0.1890  38004    1205
16    lahc/score2        50 char   -                  0.1880  37041    1198
20    lahc/score2        metadata  meanstd (α=0.025)  0.1845  44993    1208
21    lahc/score2        metadata  sift1 (α=0.012)    0.1807  44995    1200
24    lahc/score2        metadata  meanstd (α=0.015)  0.1792  44993    1213
33    lahc/score2        metadata  -                  0.1667  35611    1192
44    lahc/score1        metadata  -                  0.1432  35611    1164
52    lahc/score2        metadata  sift2              0.0365  619      142
53    lahc/score2        metadata  meanstd            0.0338  574      76
54    lahc/score2        metadata  sift1              0.0321  637      120

57    sztaki             -         IMG                0.0068  44993    80


Experiments

Textual results

[Plot: four textual runs; horizontal axis from 0 to 1, vertical axis from 0 to 0.7]

score1 (MAP: 0.1432)
score2 (MAP: 0.1667)
score2 50 char (MAP: 0.1880)
score2 100 char (MAP: 0.1890)

Improvements provided by additional text (15%)


Experiments

Textual+visual results

[Plot: textual run and three textual+visual runs; horizontal axis from 0 to 1, vertical axis from 0 to 0.7]

score2 (MAP: 0.1667)
score2 sift1: α=0.012 (MAP: 0.1807)
score2 meanstd: α=0.025 (MAP: 0.1845)
score2 sift2: α=0.084 (MAP: 0.1903)

sift2 > meanstd > sift1


Experiments

Best results

[Plot: best textual runs with and without meanstd; horizontal axis from 0 to 1, vertical axis from 0 to 0.8]

score2 50 char (MAP: 0.1880)
score2 100 char (MAP: 0.1890)
score2 50 char + meanstd (MAP: 0.2148)
score2 100 char + meanstd (MAP: 0.2178)

Improvements provided by visual information (15%)
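The relative gains can be recomputed from the reported MAP values. The pairing of runs below is an assumption, since the slides do not say which runs each percentage compares: the visual gain is taken between "score2 100 char" with and without meanstd, and the text gain between "score2" on metadata only and "score2 100 char". Under that pairing the visual gain comes out at about 15% and the text gain at about 13%.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# MAP values as reported in the results.
map_metadata = 0.1667          # score2, metadata only
map_100_char = 0.1890          # score2, 100 char
map_100_char_meanstd = 0.2178  # score2, 100 char + meanstd

text_gain = relative_gain(map_metadata, map_100_char)
visual_gain = relative_gain(map_100_char, map_100_char_meanstd)

print(f"text gain:   {text_gain:.1%}")
print(f"visual gain: {visual_gain:.1%}")
```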


Conclusion and future work

Conclusion
Improvement of our model from last year

It works:

Text around the image in the original Wikipedia articles (+15%)

Addition of visual features (MSER + sift): color/texture complementarity

Text-image combination (+15%)


Conclusion and future work

Future work

Combining more than one visual descriptor.

Other fusion methods.

Learn α for each query.

