1
Multimedia I: Image Retrieval in Biomedicine
William Hersh, MD
Professor and Chair
Department of Medical Informatics & Clinical Epidemiology
Oregon Health & Science University
[email protected]
www.billhersh.info
2
Acknowledgements
• Funding
– NSF Grant ITR-0325160
• Collaborators
– Jeffery Jensen, Jayashree Kalpathy-Cramer, OHSU
– Henning Müller, University of Geneva, Switzerland
– Paul Clough, University of Sheffield, England
– Cross-Language Evaluation Forum (Carol Peters, ISTI-CNR, Pisa, Italy)
3
Overview of talk
• Brief review of information retrieval evaluation
• Issues in indexing and retrieval of images
• ImageCLEF medical image retrieval project
– Test collection description
– Results and analysis of experiments
• Future directions
4
Image retrieval
• Biomedical professionals increasingly use images for research, clinical care, and education, yet we know very little about how they search for them
• Most image retrieval work has focused on either text annotation retrieval or image processing, but not combining both
• Goal of this work is to increase our understanding and ability to retrieve images
5
Image retrieval issues and challenges
• Image retrieval is a “poor stepchild” to text retrieval, with less understanding of how people use systems and how well they work
• Images are not always “standalone,” e.g.,
– May be part of a series of images
– May be annotated with text
• Images are “large” relative to text
• Images may be compressed, which may result in loss of content (e.g., lossy compression)
6
Review of evaluation of IR systems
• System-oriented – how well the system performs
– Historically focused on relevance-based measures (see the code sketch below)
• Recall – # relevant retrieved / # relevant in collection
• Precision – # relevant retrieved / # retrieved by search
– When output is ranked, both can be aggregated in a measure like mean average precision (MAP)
• User-oriented – how well the user performs with the system
– e.g., task performance, user satisfaction, etc.
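These definitions translate directly into code. A minimal sketch in Python (the function and variable names are illustrative assumptions, not from the talk):

```python
def recall_and_precision(retrieved, relevant):
    """Recall and precision for one search, given binary judgments.

    retrieved -- list of item IDs returned by the system
    relevant  -- set of item IDs judged relevant for the topic
    """
    hits = sum(1 for item in retrieved if item in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```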
7
System-oriented IR evaluation
• Historically assessed with test collections, which consist of
– Content – fixed yet realistic collections of documents, images, etc.
– Topics – statements of information need that can be fashioned into queries entered into retrieval systems
– Relevance judgments – determinations by expert humans of which content items should be retrieved for which topics
• Calculate summary statistics over all topics
– Primary measure usually MAP
8
Calculating MAP in a test collection
Example ranked output for a topic with 5 relevant images in the collection:
1: REL → precision 1/1 = 1.0
2: NOT REL
3: REL → precision 2/3 = 0.67
4: NOT REL
5: NOT REL
6: REL → precision 3/6 = 0.5
7: NOT REL
Relevant images never retrieved contribute 0.
Average precision (AP) for the topic = (1.0 + 0.67 + 0.5) / 5 = 0.43
Mean average precision (MAP) is the mean of average precision over all topics in a test collection.
The result is an aggregate measure, but the number itself is only of comparative value.
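The worked example above can be checked in code. A minimal Python sketch of AP and MAP under the same binary-relevance assumptions (names are illustrative):

```python
def average_precision(ranked, relevant):
    """AP for one topic: mean of precision at each relevant rank.

    Relevant items never retrieved contribute 0, so we divide by
    the total number of relevant items, not just those retrieved.
    """
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    """MAP: mean of AP over (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

# The slide's example: relevant at ranks 1, 3, and 6; 5 relevant in all.
# average_precision(["a", "x", "b", "y", "z", "c", "w"],
#                   {"a", "b", "c", "d", "e"})
# -> (1.0 + 0.67 + 0.5) / 5 ≈ 0.43
```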
9
Some well-known system-oriented evaluation forums
• Text Retrieval Conference (TREC, trec.nist.gov; Voorhees, 2005)
– Many “tracks” of interest, such as Web searching, question answering, cross-language retrieval, etc.
– Non-medical, with the exception of the Genomics Track (Hersh, 2006)
• Cross-Language Evaluation Forum (CLEF, www.clef-campaign.org)
– Spawned from the TREC cross-language track; European-based
– One track on image retrieval (ImageCLEF), which includes medical image retrieval tasks (Hersh, 2006)
• Both operate on an annual cycle: release of document/image collection → experimental runs and submission of results → relevance judgments → analysis of results
10
Image retrieval – indexing
• Two general approaches (Müller, 2004)
– Textual or semantic – by annotation, e.g.,
• Narrative description
• Controlled terminology assignment
• Other types of textual metadata, e.g., modality, location
– Visual or content-based (a code sketch follows below)
• Identification of features, e.g., colors, texture, shape, segmentation
• Our ability to “understand” the content of images is less developed than for textual content
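To make the content-based approach concrete, here is a minimal sketch of one of the simplest visual features named above, a color histogram, compared with histogram intersection. This is an illustrative assumption on my part (it uses the Pillow library), not the feature set any ImageCLEF system actually used:

```python
from PIL import Image

def color_histogram(path, bins=8):
    """Quantize each RGB channel into `bins` levels; return a
    normalized count of pixels per quantized color."""
    img = Image.open(path).convert("RGB").resize((128, 128))
    hist = [0] * (bins ** 3)
    for r, g, b in img.getdata():
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        hist[idx] += 1
    total = sum(hist)
    return [count / total for count in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 for identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```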
11
Image retrieval – searching
• Based on type of indexing
– Textual – typically uses features of text retrieval systems, e.g.,
• Boolean queries
• Natural language queries
• Forms for metadata
– Visual – usual goal is to identify images with comparable features, i.e., “find me images similar to this one” (see the sketch below)
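Given such features, “find me images similar to this one” reduces to ranking the collection by similarity to the query image. A sketch building on the hypothetical histogram functions above:

```python
def visual_search(query_path, collection_paths, top_k=10):
    """Rank collection images by color-histogram similarity to the query."""
    query_hist = color_histogram(query_path)
    scored = [(histogram_intersection(query_hist, color_histogram(p)), p)
              for p in collection_paths]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In practice the features would be computed once and indexed rather than recomputed per query; the sketch only shows the shape of the operation.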
12
Example of visual image retrieval
13
ImageCLEF medicalimage retrieval
• Aims to simulate general searching over wide variety of medical images
• Uses standard IR approach with a test collection consisting of
– Content
– Topics
– Relevance judgments
• Has operated through three cycles of CLEF (2004-2006)
– First year used the Casimage image collection
– Second and third years used the current image collection
– Developed new topics and performed relevance judgments for each cycle
• Web site – http://ir.ohsu.edu/image/
14
ImageCLEF medical collection library organization
[Diagram: the library contains collections; each collection contains cases; each case contains images; annotations attach at both the case level and the image level]
15
ImageCLEF medical test collection
Collection | Predominant images | Cases | Images | Annotations | Size
Casimage | Mixed | 2076 | 8725 | English – 177; French – 1899 | 1.3 GB
Mallinckrodt Institute of Radiology (MIR) | Nuclear medicine | 407 | 1177 | English – 407 | 63 MB
Pathology Education Instructional Resource (PEIR) | Pathology | 32319 | 32319 | English – 32319 | 2.5 GB
PathoPIC | Pathology | 7805 | 7805 | German – 7805; English – 7805 | 879 MB
16
Example case from Casimage
ID: 4272 Description: A large hypoechoic mass is seen in the spleen. CDFI reveals it to be hypovascular, and it distorts the intrasplenic blood vessels. This lesion is consistent with a metastatic lesion. Urinary obstruction is present on the right, with pelvo-caliceal and ureteral dilatation secondary to a soft tissue lesion at the junction of the ureter and bladder. This is another secondary lesion of the malignant melanoma. Surprisingly, these lesions are not hypervascular on Doppler nor on CT. Metastases are also visible in the liver. Diagnosis: Metastasis of spleen and ureter, malignant melanoma. Clinical Presentation: Workup in a patient with malignant melanoma. Intravenous pyelography showed no excretion of contrast on the right.
[Slide shows the case annotation above alongside the case's images]
17
Annotations vary widely
• Casimage – case and radiology reports
• MIR – image reports
• PEIR – metadata based on the Health Education Assets Library (HEAL)
• PathoPIC – image descriptions, longer in German and shorter in English
18
Topics
• Each topic has
– Text in 3 languages
– Sample image(s)
– Category – judged amenable to visual, mixed, or textual retrieval methods
• 2005 – 25 topics
– 11 visual, 11 mixed, 3 textual
• 2006 – 30 topics
– 10 each of visual, mixed, and textual
19
Example topic (2005, #20)
Show me microscopic pathologies of cases with chronic myelogenous leukemia.
Zeige mir mikroskopische Pathologiebilder von chronischer Leukämie. [German: Show me microscopic pathology images of chronic leukemia.]
Montre-moi des images de la leucémie chronique myélogène. [French: Show me images of chronic myelogenous leukemia.]
20
Relevance judgments
• Done in the usual IR manner, with pooling of results from many searches on the same topic
• Pool generation – top N results from each run (see the sketch below)
– Where N = 40 (2005) or 30 (2006)
– About 900 images judged per topic
• Judgment process
– Judged by physicians in the OHSU biomedical informatics program
– Required about 3-4 hours per judge per topic
• Kappa measure of interjudge agreement = 0.6-0.7 (“good”)
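Two mechanics on this slide are easy to make concrete: pool construction (the union of the top-N results across all runs) and the kappa agreement statistic. A minimal sketch, assuming two judges and binary judgments (Cohen's kappa; the slide does not say which kappa variant was used):

```python
def build_pool(runs, n):
    """Union of the top-n results from each submitted run."""
    pool = set()
    for ranked in runs:
        pool.update(ranked[:n])
    return pool

def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two parallel lists of 0/1 relevance judgments."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a = sum(judge_a) / n  # proportion judged relevant by judge A
    p_b = sum(judge_b) / n  # proportion judged relevant by judge B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    return (observed - expected) / (1 - expected)
```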
21
ImageCLEF medical retrieval task results – 2005
• (Hersh, JAMIA, 2006)
• Each participating group submitted one or more runs, with ranked results for each of the 25 topics
• A variety of measures calculated for each topic, and the mean over all 25
– (Measures on next slide)
• Initial analysis focused on best results in different categories of runs
22
Measurement of results
• Retrieved
• Relevant retrieved
• Mean average precision (MAP, aggregate of ranked recall and precision)
• Precision at a fixed number of images retrieved (10, 30, 100) – see the sketch below
• (And a few others…)
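Precision at a fixed cutoff is the simplest of these ranked measures; a one-function sketch under the same binary-relevance convention as the earlier examples:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items judged relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k
```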
23
Categories of runs
• Query preparation
– Automatic – no human modification
– Manual – with human modification
• Query type
– Textual – searching only via textual annotations
– Visual – searching only by visual means
– Mixed – textual and visual searching
24
Retrieval task results
• Best results overall
• Best results by query type
• Comparison by topic type
• Comparison by query type
• Comparison of measures
25
Number of runs by query type(out of 134)
Query type | Automatic | Manual
Visual | 28 | 3
Textual | 14 | 1
Mixed | 86 | 2
26
Best results overall
• Institute for Infocomm Research (Singapore) and IPAL-CNRS (France) – run IPALI2R_TIan
• Used a combination of image and text processing
– The latter focused on mapping terms to semantic categories, e.g., modality, anatomy, pathology, etc.
• MAP – 0.28
• Precision at
– 10 images – 0.62 (6.2 images)
– 30 images – 0.53 (18 images)
– 100 images – 0.32 (32 images)
27
Results for top 30 runs – not much variation
[Bar chart: MAP, R-Prec, B-Pref, P10, P30, and P100 for the top 30 runs, ranked by MAP from IPALI2R_TIan down to cindiSubmission.txt; y-axis 0 to 0.7]
28
Best results (MAP) by query type

Query type | Automatic | Manual
Visual | I2Rfus.txt (0.146) | i2r-vk-avg.txt (0.092)
Textual | IPALI2R_Tn (0.208) | OHSUmanual.txt (0.212)
Mixed | IPALI2R_TIan (0.282) | OHSUmanvis.txt (0.160)

• Automatic-mixed runs best (including those not shown)
29
Best results (MAP) by topic type (for each query type)
• Visual runs clearly hampered by textual (semantic) queries
[Bar chart: best MAP by topic category (All, Visual, Mixed, Semantic) for the run categories AM, AT, AV, MM, MT, MV (automatic or manual × mixed, textual, or visual); y-axis 0.0 to 0.6]
30
Relevant and MAP by topic – great deal of variation
[Chart: number of relevant images (bars, scale 0-450) and MAP (line, scale 0.0-0.35) for each of the 25 topics, ordered as visual, mixed, then textual topics]
31
Interesting “quirk” in results from OHSU runs
• Man-Mixed starts out well but falls rapidly, with lower MAP
• The MAP measure values recall; it may not be best for this task
[Line chart: precision at cutoffs P5 through P1000 for the OHSUMan and OHSUManVis runs; y-axis 0 to 0.6]
32
Also much variation by topic in OHSU runs
[Bar chart: P30 for each of the 25 topics for the OHSUman and OHSUmanvis runs; y-axis 0.0 to 1.0]
33
ImageCLEF medical retrieval task results – 2006
• Primary measure – MAP
• Results reported in the track overview on the CLEF Web site (Müller, 2006) and in the following slides
– Runs submitted
– Best results overall
– Best results by query type
– Comparison by topic type
– Comparison by query type
– Comparison of measures
– Interesting finding from OHSU runs
34
Categories of runs
• Query type – human preparation
– Automatic – no human modification
– Manual – human modification of query
– Interactive – human modification of query after viewing output (not designated in 2005)
• System type – feature(s)
– Textual – searching only via textual annotations
– Visual – searching only by visual means
– Mixed – textual and visual searching
– (NOTE: Topic types have these category names too)
35
Runs submitted by category
Query type \ System type | Visual | Mixed | Textual | Total
Automatic | 11 | 37 | 31 | 79
Manual | 10 | 1 | 6 | 17
Interactive | 1 | 2 | 1 | 4
Total | 22 | 40 | 38 | 100
36
Best results overall
• Institute for Infocomm Research (Singapore) and IPAL-CNRS (France) (Lacoste, 2006)
• Used a combination of image and text processing
– The latter focused on mapping terms to semantic categories, e.g., modality, anatomy, pathology, etc.
• MAP – 0.3095
• Precision at
– 10 images – 0.6167 (6.2 images)
– 30 images – 0.5822 (17.4 images)
– 100 images – 0.3977 (40 images)
37
Best performing runs by system and query type
• Automated textual or mixed query runs best
[Bar chart: best MAP by system type (Visual, Mixed, Textual) for automatic, manual, and interactive query types; y-axis 0 to 0.35]
38
Results for all runs
• Variation between MAP and precision for different systems
[Bar chart: MAP, P10, P30, and P100 for all runs, ranked by MAP from IPAL-IPAL_Cpt_Im down to rwth_mi-rwth_mi; y-axis 0 to 0.8]
39
Best performing runs by topic type for each system type
• Mixed queries most robust across all topic types
• Visual queries least robust to non-visual topics
[Bar chart: best MAP by topic type (Visual, Mixed, Textual) for visual, mixed, and textual system types; y-axis 0 to 0.4]
40
Relevant and MAP by topic
[Chart: number of relevant images (bars, scale 0-700) and MAP (line, scale 0.0-0.3) for each of the 30 topics, ordered as visual, mixed, then textual topics]
• Substantial variation across all topics and topic types
41
Interesting finding from OHSU runs in 2006 similar to 2005
[Bar chart: MAP, P10, P30, and P100 for the OHSU runs (e.g., OHSUeng, OHSUeng_trans, OHSUall, OHSU_m1, OHSU_baseline_trans, OHSUger, OHSUfre, OHSU_english, OHSU_baseline_notrans, OHSU_german, OHSU_french); y-axis 0 to 0.6]
• Mixed run had higher precision despite lower MAP
• Could precision at top of output be more important for user?
42
Conclusions
• A variety of approaches are effective in image retrieval, similar to IR with other content
• Systems that use only visual retrieval are less robust than those that do only textual retrieval
– A possibly fruitful area of research might be the ability to predict which queries are amenable to which retrieval approaches
• Need a broader understanding of system use, followed by better test collections and experiments based on that understanding
– MAP might not be the best performance measure for the image retrieval task
43
Limitations
• This test collection
– Topics artificial – may not be realistic or representative
– Annotation of images may not be representative or of best practice
• Test collections generally
– Relevance is situational
– No users involved in experiments
44
Future directions
• ImageCLEF 2007
– Continue work on the annual cycle
– Funded for another year from the NSF grant
– Expanding the image collection, adding new topics
• User experiments with the OHSU image retrieval system
– Aim to better understand real-world tasks and the best evaluation measures for those tasks
• Continued analysis of 2005-2006 data
– Improved text retrieval of annotations
– Improved merging of image and text retrieval
– Look at methods for predicting which queries are amenable to different approaches