1
Multimedia I: Image Retrieval in Biomedicine
William Hersh, MD
Professor and Chair
Department of Medical Informatics & Clinical Epidemiology
Oregon Health & Science University
[email protected]
www.billhersh.info
2
Acknowledgements
• Funding
– NSF Grant ITR-0325160
• Collaborators
– Jeffery Jensen, Jayashree Kalpathy-Cramer, OHSU
– Henning Müller, University of Geneva, Switzerland
– Paul Clough, University of Sheffield, England
– Cross-Language Evaluation Forum (Carol Peters, ISTI-CNR, Pisa, Italy)
3
Overview of talk
• Brief review of information retrieval evaluation
• Issues in indexing and retrieval of images
• ImageCLEF medical image retrieval project
– Test collection description
– Results and analysis of experiments
• Future directions
4
Image retrieval
• Biomedical professionals increasingly use images for research, clinical care, and education, yet we know very little about how they search for them
• Most image retrieval work has focused on either text annotation retrieval or image processing, but not combining both
• Goal of this work is to increase our understanding and ability to retrieve images
5
Image retrieval issues and challenges
• Image retrieval is a “poor stepchild” to text retrieval, with less understanding of how people use systems and how well they work
• Images are not always “standalone,” e.g.,
– May be part of a series of images
– May be annotated with text
• Images are “large” relative to text
• Images may be compressed, which may result in loss of content (e.g., lossy compression)
6
Review of evaluation of IR systems
• System-oriented – how well the system performs
– Historically focused on relevance-based measures (see the code sketch below)
• Recall – # relevant retrieved / # relevant in collection
• Precision – # relevant retrieved / # retrieved by search
– When output is ranked, both can be aggregated in a measure like mean average precision (MAP)
• User-oriented – how well the user performs with the system
– e.g., task performance, user satisfaction, etc.
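These definitions translate directly into code. A minimal sketch in Python (the function and variable names are illustrative assumptions, not from the talk):

```python
def recall_and_precision(retrieved, relevant):
    """Recall and precision for one search, given binary judgments.

    retrieved -- list of item IDs returned by the system
    relevant  -- set of item IDs judged relevant for the topic
    """
    hits = sum(1 for item in retrieved if item in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```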
7
System-oriented IR evaluation
• Historically assessed with test collections, which consist of
– Content – fixed yet realistic collections of documents, images, etc.
– Topics – statements of information need that can be fashioned into queries entered into retrieval systems
– Relevance judgments – determinations by expert humans of which content items should be retrieved for which topics
• Calculate summary statistics over all topics
– Primary measure usually MAP
8
Calculating MAP in a test collection
Example ranked output for a topic with 5 relevant images in the collection:
1: REL → precision 1/1 = 1.0
2: NOT REL
3: REL → precision 2/3 = 0.67
4: NOT REL
5: NOT REL
6: REL → precision 3/6 = 0.5
7: NOT REL
Relevant images never retrieved contribute 0.
Average precision (AP) for the topic = (1.0 + 0.67 + 0.5) / 5 = 0.43
Mean average precision (MAP) is the mean of average precision over all topics in a test collection.
The result is an aggregate measure, but the number itself is only of comparative value.
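The worked example above can be checked in code. A minimal Python sketch of AP and MAP under the same binary-relevance assumptions (names are illustrative):

```python
def average_precision(ranked, relevant):
    """AP for one topic: mean of precision at each relevant rank.

    Relevant items never retrieved contribute 0, so we divide by
    the total number of relevant items, not just those retrieved.
    """
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    """MAP: mean of AP over (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

# The slide's example: relevant at ranks 1, 3, and 6; 5 relevant in all.
# average_precision(["a", "x", "b", "y", "z", "c", "w"],
#                   {"a", "b", "c", "d", "e"})
# -> (1.0 + 0.67 + 0.5) / 5 ≈ 0.43
```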
9
Some well-known system-oriented evaluation forums
• Text Retrieval Conference (TREC, trec.nist.gov; Voorhees, 2005)
– Many “tracks” of interest, such as Web searching, question answering, cross-language retrieval, etc.
– Non-medical, with the exception of the Genomics Track (Hersh, 2006)
• Cross-Language Evaluation Forum (CLEF, www.clef-campaign.org)
– Spawned from the TREC cross-language track; European-based
– One track on image retrieval (ImageCLEF), which includes medical image retrieval tasks (Hersh, 2006)
• Both operate on an annual cycle: release of document/image collection → experimental runs and submission of results → relevance judgments → analysis of results
10
Image retrieval – indexing
• Two general approaches (Müller, 2004)
– Textual or semantic – by annotation, e.g.,
• Narrative description
• Controlled terminology assignment
• Other types of textual metadata, e.g., modality, location
– Visual or content-based (a code sketch follows below)
• Identification of features, e.g., colors, texture, shape, segmentation
• Our ability to “understand” the content of images is less developed than for textual content
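To make the content-based approach concrete, here is a minimal sketch of one of the simplest visual features named above, a color histogram, compared with histogram intersection. This is an illustrative assumption on my part (it uses the Pillow library), not the feature set any ImageCLEF system actually used:

```python
from PIL import Image

def color_histogram(path, bins=8):
    """Quantize each RGB channel into `bins` levels; return a
    normalized count of pixels per quantized color."""
    img = Image.open(path).convert("RGB").resize((128, 128))
    hist = [0] * (bins ** 3)
    for r, g, b in img.getdata():
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        hist[idx] += 1
    total = sum(hist)
    return [count / total for count in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 for identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```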
11
Image retrieval – searching
• Based on type of indexing
– Textual – typically uses features of text retrieval systems, e.g.,
• Boolean queries
• Natural language queries
• Forms for metadata
– Visual – usual goal is to identify images with comparable features, i.e., “find me images similar to this one” (see the sketch below)
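Given such features, “find me images similar to this one” reduces to ranking the collection by similarity to the query image. A sketch building on the hypothetical histogram functions above:

```python
def visual_search(query_path, collection_paths, top_k=10):
    """Rank collection images by color-histogram similarity to the query."""
    query_hist = color_histogram(query_path)
    scored = [(histogram_intersection(query_hist, color_histogram(p)), p)
              for p in collection_paths]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In practice the features would be computed once and indexed rather than recomputed per query; the sketch only shows the shape of the operation.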
12
Example of visual image retrieval
13
ImageCLEF medicalimage retrieval
• Aims to simulate general searching over wide variety of medical images
• Uses standard IR approach with a test collection consisting of
– Content
– Topics
– Relevance judgments
• Has operated through three cycles of CLEF (2004-2006)
– First year used the Casimage image collection
– Second and third years used the current image collection
– Developed new topics and performed relevance judgments for each cycle
• Web site – http://ir.ohsu.edu/image/
14
ImageCLEF medical collection library organization
[Diagram: the library contains collections; each collection contains cases; each case contains images; annotations attach at both the case level and the image level]
15
ImageCLEF medical test collection
Collection | Predominant images | Cases | Images | Annotations | Size
Casimage | Mixed | 2076 | 8725 | English – 177; French – 1899 | 1.3 GB
Mallinckrodt Institute of Radiology (MIR) | Nuclear medicine | 407 | 1177 | English – 407 | 63 MB
Pathology Education Instructional Resource (PEIR) | Pathology | 32319 | 32319 | English – 32319 | 2.5 GB
PathoPIC | Pathology | 7805 | 7805 | German – 7805; English – 7805 | 879 MB
16
Example case from Casimage
ID: 4272 Description: A large hypoechoic mass is seen in the spleen. CDFI reveals it to be hypovascular, and it distorts the intrasplenic blood vessels. This lesion is consistent with a metastatic lesion. Urinary obstruction is present on the right, with pelvo-caliceal and ureteral dilatation secondary to a soft tissue lesion at the junction of the ureter and bladder. This is another secondary lesion of the malignant melanoma. Surprisingly, these lesions are not hypervascular on Doppler nor on CT. Metastases are also visible in the liver. Diagnosis: Metastasis of spleen and ureter, malignant melanoma. Clinical Presentation: Workup in a patient with malignant melanoma. Intravenous pyelography showed no excretion of contrast on the right.
[Slide shows the case annotation above alongside the case's images]
17
Annotations vary widely
• Casimage – case and radiology reports
• MIR – image reports
• PEIR – metadata based on the Health Education Assets Library (HEAL)
• PathoPIC – image descriptions, longer in German and shorter in English
18
Topics
• Each topic has
– Text in 3 languages
– Sample image(s)
– Category – judged amenable to visual, mixed, or textual retrieval methods
• 2005 – 25 topics
– 11 visual, 11 mixed, 3 textual
• 2006 – 30 topics
– 10 each of visual, mixed, and textual
19
Example topic (2005, #20)
Show me microscopic pathologies of cases with chronic myelogenous leukemia.
Zeige mir mikroskopische Pathologiebilder von chronischer Leukämie. [German: Show me microscopic pathology images of chronic leukemia.]
Montre-moi des images de la leucémie chronique myélogène. [French: Show me images of chronic myelogenous leukemia.]
20
Relevance judgments
• Done in the usual IR manner, with pooling of results from many searches on the same topic
• Pool generation – top N results from each run (see the sketch below)
– Where N = 40 (2005) or 30 (2006)
– About 900 images judged per topic
• Judgment process
– Judged by physicians in the OHSU biomedical informatics program
– Required about 3-4 hours per judge per topic
• Kappa measure of interjudge agreement = 0.6-0.7 (“good”)
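Two mechanics on this slide are easy to make concrete: pool construction (the union of the top-N results across all runs) and the kappa agreement statistic. A minimal sketch, assuming two judges and binary judgments (Cohen's kappa; the slide does not say which kappa variant was used):

```python
def build_pool(runs, n):
    """Union of the top-n results from each submitted run."""
    pool = set()
    for ranked in runs:
        pool.update(ranked[:n])
    return pool

def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two parallel lists of 0/1 relevance judgments."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a = sum(judge_a) / n  # proportion judged relevant by judge A
    p_b = sum(judge_b) / n  # proportion judged relevant by judge B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    return (observed - expected) / (1 - expected)
```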
21
ImageCLEF medical retrieval task results – 2005
• (Hersh, JAMIA, 2006)
• Each participating group submitted one or more runs, with ranked results for each of the 25 topics
• A variety of measures calculated for each topic, and the mean over all 25
– (Measures on next slide)
• Initial analysis focused on best results in different categories of runs
22
Measurement of results
• Retrieved
• Relevant retrieved
• Mean average precision (MAP, aggregate of ranked recall and precision)
• Precision at a fixed number of images retrieved (10, 30, 100) – see the sketch below
• (And a few others…)
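Precision at a fixed cutoff is the simplest of these ranked measures; a one-function sketch under the same binary-relevance convention as the earlier examples:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items judged relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k
```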
23
Categories of runs
• Query preparation
– Automatic – no human modification
– Manual – with human modification
• Query type
– Textual – searching only via textual annotations
– Visual – searching only by visual means
– Mixed – textual and visual searching
24
Retrieval task results
• Best results overall
• Best results by query type
• Comparison by topic type
• Comparison by query type
• Comparison of measures
25
Number of runs by query type(out of 134)
Query type | Automatic | Manual
Visual | 28 | 3
Textual | 14 | 1
Mixed | 86 | 2
26
Best results overall
• Institute for Infocomm Research (Singapore) and IPAL-CNRS (France) – run IPALI2R_TIan
• Used a combination of image and text processing
– The latter focused on mapping terms to semantic categories, e.g., modality, anatomy, pathology, etc.
• MAP – 0.28
• Precision at
– 10 images – 0.62 (6.2 images)
– 30 images – 0.53 (18 images)
– 100 images – 0.32 (32 images)
27
Results for top 30 runs – not much variation
[Bar chart: MAP, R-Prec, B-Pref, P10, P30, and P100 for the top 30 runs, ranked by MAP from IPALI2R_TIan down to cindiSubmission.txt; y-axis 0 to 0.7]
28
Best results (MAP) by query type

Query type | Automatic | Manual
Visual | I2Rfus.txt (0.146) | i2r-vk-avg.txt (0.092)
Textual | IPALI2R_Tn (0.208) | OHSUmanual.txt (0.212)
Mixed | IPALI2R_TIan (0.282) | OHSUmanvis.txt (0.160)

• Automatic-mixed runs best (including those not shown)
29
Best results (MAP) by topic type (for each query type)
• Visual runs clearly hampered by textual (semantic) queries
[Bar chart: best MAP by topic category (All, Visual, Mixed, Semantic) for the run categories AM, AT, AV, MM, MT, MV (automatic or manual × mixed, textual, or visual); y-axis 0.0 to 0.6]
30
Relevant and MAP by topic – great deal of variation
[Chart: number of relevant images (bars, scale 0-450) and MAP (line, scale 0.0-0.35) for each of the 25 topics, ordered as visual, mixed, then textual topics]
31
Interesting “quirk” in results from OHSU runs
• Man-Mixed starts out well but falls rapidly, with lower MAP
• The MAP measure values recall; it may not be best for this task
[Line chart: precision at cutoffs P5 through P1000 for the OHSUMan and OHSUManVis runs; y-axis 0 to 0.6]
32
Also much variation by topic in OHSU runs
[Bar chart: P30 for each of the 25 topics for the OHSUman and OHSUmanvis runs; y-axis 0.0 to 1.0]
33
ImageCLEF medical retrieval task results – 2006
• Primary measure – MAP
• Results reported in the track overview on the CLEF Web site (Müller, 2006) and in the following slides
– Runs submitted
– Best results overall
– Best results by query type
– Comparison by topic type
– Comparison by query type
– Comparison of measures
– Interesting finding from OHSU runs
34
Categories of runs
• Query type – human preparation
– Automatic – no human modification
– Manual – human modification of query
– Interactive – human modification of query after viewing output (not designated in 2005)
• System type – feature(s)
– Textual – searching only via textual annotations
– Visual – searching only by visual means
– Mixed – textual and visual searching
– (NOTE: Topic types have these category names too)
35
Runs submitted by category
Query type \ System type | Visual | Mixed | Textual | Total
Automatic | 11 | 37 | 31 | 79
Manual | 10 | 1 | 6 | 17
Interactive | 1 | 2 | 1 | 4
Total | 22 | 40 | 38 | 100
36
Best results overall
• Institute for Infocomm Research (Singapore) and IPAL-CNRS (France) (Lacoste, 2006)
• Used a combination of image and text processing
– The latter focused on mapping terms to semantic categories, e.g., modality, anatomy, pathology, etc.
• MAP – 0.3095
• Precision at
– 10 images – 0.6167 (6.2 images)
– 30 images – 0.5822 (17.4 images)
– 100 images – 0.3977 (40 images)
37
Best performing runs by system and query type
• Automated textual or mixed query runs best
[Bar chart: best MAP by system type (Visual, Mixed, Textual) for automatic, manual, and interactive query types; y-axis 0 to 0.35]
38
Results for all runs
• Variation between MAP and precision for different systems
[Bar chart: MAP, P10, P30, and P100 for all runs, ranked by MAP from IPAL-IPAL_Cpt_Im down to rwth_mi-rwth_mi; y-axis 0 to 0.8]
39
Best performing runs by topic type for each system type
• Mixed queries most robust across all topic types
• Visual queries least robust to non-visual topics
[Bar chart: best MAP by topic type (Visual, Mixed, Textual) for visual, mixed, and textual system types; y-axis 0 to 0.4]
40
Relevant and MAP by topic
[Chart: number of relevant images (bars, scale 0-700) and MAP (line, scale 0.0-0.3) for each of the 30 topics, ordered as visual, mixed, then textual topics]
• Substantial variation across all topics and topic types
41
Interesting finding from OHSU runs in 2006 similar to 2005
[Bar chart: MAP, P10, P30, and P100 for the OHSU runs (e.g., OHSUeng, OHSUeng_trans, OHSUall, OHSU_m1, OHSU_baseline_trans, OHSUger, OHSUfre, OHSU_english, OHSU_baseline_notrans, OHSU_german, OHSU_french); y-axis 0 to 0.6]
• Mixed run had higher precision despite lower MAP
• Could precision at top of output be more important for user?
42
Conclusions
• A variety of approaches are effective in image retrieval, similar to IR with other content
• Systems that use only visual retrieval are less robust than those that do only textual retrieval
– A possibly fruitful area of research might be the ability to predict which queries are amenable to which retrieval approaches
• Need a broader understanding of system use, followed by better test collections and experiments based on that understanding
– MAP might not be the best performance measure for the image retrieval task
43
Limitations
• This test collection
– Topics artificial – may not be realistic or representative
– Annotation of images may not be representative or of best practice
• Test collections generally
– Relevance is situational
– No users involved in experiments
44
Future directions
• ImageCLEF 2007
– Continue work on the annual cycle
– Funded for another year from the NSF grant
– Expanding the image collection, adding new topics
• User experiments with the OHSU image retrieval system
– Aim to better understand real-world tasks and the best evaluation measures for those tasks
• Continued analysis of 2005-2006 data
– Improved text retrieval of annotations
– Improved merging of image and text retrieval
– Look at methods for predicting which queries are amenable to different approaches