Home Photo Content Modeling for Personalized Event-Based Retrieval

Joo-Hwee Lim and Qi Tian, Institute for Infocomm Research, Singapore
Philippe Mulhem, Image Processing and Applications Laboratory, National Center of Scientific Research, France

Rapid advances in sensor, storage, processor, and communication technologies let consumers store large digital photo collections. Consumers need effective tools to organize and access photos in a semantically meaningful way. We address the semantic gap between feature-based indexes computed automatically and human query and retrieval preferences.




Because digital cameras are so easy to use, consumers tend to take and accumulate more and more digital photos. Hence they need effective and efficient tools to organize and access photos in a semantically meaningful way without too much manual annotation effort. We define semantically meaningful as the ability to index and search photos based on the purposes and contexts of taking the photos.

From a user study1 and a user survey that we conducted, we confirmed that users prefer to organize and access photos along semantic axes such as the event (for example, a birthday party, swimming pool trip, or park excursion), people (for example, myself, my son, Mary), time (for example, last month, this year, 1995), and place (for example, home, Disneyland, New York). However, users are reluctant to annotate all their photos manually as the process is too tedious and time consuming.

As a matter of fact, content-based image retrieval (CBIR) research in the last decade2 has focused on general CBIR approaches (for example, Corel images). As a consequence, key efforts have concentrated on using low-level features such as color, texture, and shapes to describe and compare image contents. CBIR has yet to bridge the semantic gap between feature-based indexes computed automatically and human query and retrieval preferences.

We address this semantic gap by focusing on the notion of event in home photos. In the case of people identification in home photos, we can tap the research results from the face recognition literature. We recognize that general face recognition in still images is a difficult problem when dealing with small faces (20 × 20 pixels or less), varying poses, lighting conditions, and so on. However, in most circumstances, consumers are only interested in a limited number of faces (such as family members, relatives, and friends) in their home photos, so we might achieve a more satisfactory face recognition performance for home photos.

With advances in digital cameras, we can easily recover the time stamps of photo creation. Industrial players are looking into the standardization of the file format that contains this information, such as the Exchangeable Image File Format version 2.2 (http://www.jeita.or.jp/english/index.htm). Similarly, with the advances in Global Positioning System technology, the camera can provide the location where a photo was taken (for example, the Kodak Digital Science 420 GPS camera).

Home photo event taxonomy

We define home photos as typical digital photos taken by average consumers to record their lives as digital memory, as opposed to those taken by professional photographers for commercial purposes (for example, stock photos like the Corel collection and others; see http://www.fotosearch.com). At one typical Web site where consumers upload and share their home photos, users apparently prefer occasions or activities as a broad descriptor for photos to other characteristics (like objects present in the photo or the location at which a photo was taken). In particular, the site's classification directory contains many more photos under the category "Family and Friends" (more than 9 million) than the sum of the other categories (adding up to around 5 million). Furthermore, categories such as "Scenery and Nature," "Sports," "Travel," and so on are the outcome of activities.




Although Vailaya et al.3 have presented a hierarchy of eight categories for vacation photos, they're skewed toward scenery classification. Hence an event-based taxonomy is what consumers need. The "Related Works" sidebar discusses other approaches.

We suggest a typical event taxonomy for home photos as Figure 1 shows. Because home photo collections are highly personal contents, we view our proposed event taxonomy as a classification basis open to individual customization. Our proposed computational learning framework for event models facilitates the personalization of the event taxonomy nodes. In broad terms, a typical event could be a gathering, family activity, or visit to some place during a holiday. These correspond to the purposes of meeting with someone, performing some activity, and going to some place respectively.

A gathering event could be in the form of parties for occasions such as birthdays or weddings, or simply having meals together. A family activities event would involve family members. We keep our classification simple and general by dividing this event type into indoor and outdoor family activities. Examples of indoor activities include kids playing, dancing, chatting, and so on; outdoor activities include sports, kids at a playground, picnics, and so on.

The third major type of event is visiting places. These events could be people centric or not. People centric refers to the case when a photo has family members as its focus. In non-people-centric photos, family members are either absent or not clearly visible. We divide this latter case into photos of natural (nature) and urban (man-made) scenes.


Related Works (sidebar)

Text annotation by human users is tedious, inconsistent, and erroneous due to the huge number of possible keywords. An efficient annotation system needs to limit the possible keyword choices from which a user can select. The MiAlbum system described in Liu, Sun, and Zhang1 performs automatic annotation in the context of relevance feedback.2 Essentially, the text keywords in a query are assigned to positive feedback examples (that is, retrieved images that the user who issues the query considers relevant). This would require constant user intervention (in the form of relevance feedback), and the keywords issued in a query might not necessarily correspond to what is considered relevant in the positive examples.

The works on image categorization3,4 are also related to our work in that they attempt to assign semantic class labels to images. However, none of these approaches has made a complete taxonomy for home photos. In particular, the attempts at photo classification are: indoor versus outdoor,3 natural versus man made,3,4 and categories of natural scenes.4 Furthermore, the classifications were based on low-level features such as color, edge directions, and so on. In our article, the notion of events is meant to be more complex than visual classes, though we only demonstrate our approach on visual events. Also, we focus on relevance measures of unlabeled photos to events to support event-based retrieval rather than class memberships for classification purposes.

Sidebar references

1. W. Liu, Y. Sun, and H. Zhang, "MiAlbum—A System for Home Photo Management Using the Semi-Automatic Image Annotation Approach," Proc. ACM Multimedia, ACM Press, 2000, pp. 479-480.
2. Y. Lu et al., "A Unified Framework for Semantics and Feature-Based Relevance Feedback in Image Retrieval Systems," Proc. ACM Multimedia, ACM Press, 2000, pp. 31-37.
3. B. Bradshaw, "Semantic-Based Image Retrieval: A Probabilistic Approach," Proc. ACM Multimedia, ACM Press, 2000, pp. 167-176.
4. A. Vailaya et al., "Bayesian Framework for Hierarchical Semantic Classification of Vacation Images," IEEE Trans. Image Processing, vol. 10, no. 1, 2001, pp. 117-130.

Figure 1. Event taxonomy for home photos:
Events
  Gathering: Parties (Birthday, Wedding), Meals
  Family activities: Indoor, Outdoor
  Visit places: People centric; Non-people-centric, divided into Nature (Parks, Beach, Waterside, Mountains) and Man made (Swimming pools, Street, Interior)


A nature event photo is one taken at, for example, a mountainous area, riverside or lakeside (waterside), beach, and park (also garden, field, forest, and so on). For the man-made event, we include photos taken at a swimming pool, roadside (or street), or inside a structure. The notion of event usually encompasses four aspects:

❚ who takes part in the event (for example, John and his wife),

❚ what occasion or activity is involved (for example, my birthday),

❚ where the event takes place (for example, my house), and

❚ when the event takes place (for example, last month).

Using visual content alone (that is, without using information from people, time, and place) wouldn't let us model all these four aspects of an event. For example, it's not feasible to further differentiate breakfast, lunch, and dinner for the meals event if we don't use the time information. In this article, we approximate event by visual event, defined as an event based on the visual content of photos (that is, the "what" aspect).

Modeling visual events

For each event Ei, we assume that there's an associated computational model Mi that lets us compute the relevance measure R(Mi, x) (∈ [0, 1]) of a photo x to Ei. To preserve a high level of semantics to model events in our proposed event taxonomy, we require an expressive knowledge representation of image content. To minimize the effort of manual annotation and to allow personalization of event models Mi for a user's photo collection (for example, a modern city street versus a rural roadside, an Asian wedding versus a Western wedding), we propose a computational learning approach to construct event models Mi from a small set of photos L labeled with the event Ei and to compute the relevance measures of other unlabeled photos U to the event models. Figure 2 shows a schematic diagram of this framework.

In this article, we adopt a vocabulary-based indexing methodology called visual keywords4 to automatically extract relevant semantic tokens, a conceptual graph formalism5 to model the visual events, and instance-based and graph-generalization representations to learn the event models. So, an event model Mi has two facets: the event model according to the visual keywords representation, namely Mvi, and the event model according to conceptual graphs, namely Mgi.

Visual keywords indexing

The visual keywords approach4 is a new attempt to achieve content-based image indexing and retrieval beyond feature-based (for example, QBIC6) and region-based (for example, Blobworld7) approaches. Visual keywords are intuitive and flexible visual prototypes extracted or learned from a visual content domain with relevant semantics labels.

In this article, we have designed a visual vocabulary for home photos. There are eight classes of visual keywords, each subdivided into two to five subclasses. Hence there are 26 distinct labels in total. We used a three-layer feed-forward neural network with dynamic node creation capabilities to learn these 26 visual keyword subclasses from 375 labeled image patches cropped from home photos. Color and texture features4 are computed for each training region as an input vector for the neural network. Figure 3 shows the typical appearance of the visual keywords.
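
As a rough illustration of this training step, the following Python sketch fits a small feed-forward classifier on color and texture feature vectors of labeled patches. It is not the authors' implementation: the original system used a three-layer network with dynamic node creation, whereas this sketch substitutes a standard scikit-learn MLP, and the feature dimension and the random data are placeholders.

    # Minimal sketch (not the article's implementation): map patch feature
    # vectors to the 26 visual keyword subclasses with a small MLP.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    N_SUBCLASSES = 26     # for example, water:pool, foliage:green, ...
    FEATURE_DIM = 21      # hypothetical length of the color + texture vector

    # In the article, 375 cropped patches were labeled; random data stands in.
    rng = np.random.default_rng(0)
    X = rng.random((375, FEATURE_DIM))
    y = rng.integers(0, N_SUBCLASSES, size=375)

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(X, y)

    # For a new patch, the per-subclass certainties later feed the histograms.
    certainties = clf.predict_proba(X[:1])[0]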

Once the neural network has learned a visual vocabulary, our approach subjects an image to be indexed to multiscale, view-based recognition against the 26 visual keywords. Our approach reconciles the recognition results across multiple resolutions and aggregates them according to a configurable spatial tessellation. For example, the swimming pool image in Figure 4a is indexed as five visual keyword histograms based on left, right, top, bottom, and center areas as shown in Figure 4b, which shows only the center schematic histogram. In essence, an image area is represented as a histogram of visual keywords based on local recognition certainties. For instance, the visual keyword histogram for the center area in the swimming pool image in Figure 4a has a 0.36 value for subclass water:pool, and small values for the rest of the bins. This spatial configuration is appropriate for home photos because the center area is usually the image's focus and hence we can assign a higher weight to it during similarity matching.


Figure 2. Learning of event models for personalized event-based retrieval (labeled photos L feed the learning of event models Mi; unlabeled photos U are matched against the models for event-based retrieval).


Other spatial configurations include uniform grids (for example, 4 × 4) of equal weights, and so on.
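
The sketch below illustrates, under stated assumptions, how patch-level recognition certainties could be aggregated into the five local visual keyword histograms (left, right, top, bottom, and center). The patch grid and the area boundaries are illustrative choices, not the article's exact tessellation or its multiscale reconciliation.

    # Minimal sketch: aggregate per-patch certainties into five area histograms.
    import numpy as np

    N_SUBCLASSES = 26

    def region_histograms(certainty_map):
        """certainty_map: array of shape (rows, cols, N_SUBCLASSES) holding the
        recognition certainty of each visual keyword subclass at each patch."""
        rows, cols, _ = certainty_map.shape
        r3, c3 = rows // 3, cols // 3
        areas = {
            "left":   certainty_map[:, : cols // 2],
            "right":  certainty_map[:, cols // 2 :],
            "top":    certainty_map[: rows // 2, :],
            "bottom": certainty_map[rows // 2 :, :],
            "center": certainty_map[r3 : rows - r3, c3 : cols - c3],
        }
        hists = {}
        for name, area in areas.items():
            h = area.reshape(-1, N_SUBCLASSES).sum(axis=0)
            hists[name] = h / (h.sum() + 1e-12)   # normalize to a histogram
        return hists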

We can compute similarity matching between two images as the weighted average of the similarities (for example, histogram intersection) between the images' corresponding local visual keyword histograms. When comparing the similarity between a single image with index xvk and a group of images with indexes V = {vj}, we compute the similarity matching score as

Sv(V, xvk) = maxj(S(vj, xvk)) (1)
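
A minimal sketch of this matching step follows: histogram intersection per area, a weighted average over the five areas (the weights shown are illustrative, with the center weighted higher), and the maximum over a group of images from Equation 1.

    def histogram_intersection(h1, h2):
        # Sum of bin-wise minimums of two normalized visual keyword histograms.
        return sum(min(a, b) for a, b in zip(h1, h2))

    # Illustrative area weights; the article only states that the center area
    # can receive a higher weight.
    AREA_WEIGHTS = {"left": 0.15, "right": 0.15, "top": 0.15,
                    "bottom": 0.15, "center": 0.4}

    def image_similarity(hists1, hists2):
        # Weighted average of per-area similarities between two photo indexes.
        return sum(w * histogram_intersection(hists1[a], hists2[a])
                   for a, w in AREA_WEIGHTS.items())

    def sv(group_indexes, x_index):
        # Equation 1: similarity of photo x to a group of images V = {v_j}.
        return max(image_similarity(v, x_index) for v in group_indexes)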

Although the Blobworld approach also performs similarity matching based on local regions, the visual keywords approach does not perform image segmentation, and it associates semantic labels to the local image regions. Mohan and colleagues have proposed an interesting approach to recognize objects by their components and applied it to people detection based on adaptive combination of classifiers.8 However, many objects in home photos (such as those in Figure 3) do not have a well-defined part–whole structure. Instead, we adopt a graph-theoretic representation to allow hierarchies of concepts and relations. During the indexing process, we also generate fuzzy labels to facilitate object segmentation with a clustering algorithm in the visual keyword space. At the graph-based representation level, the most probable label is kept, and inference allows keeping specificities of relationships like symmetry and transitivity.

Visual event graph

We chose the knowledge representation formalism called conceptual graphs as a framework to handle concepts, concept hierarchies, and relation hierarchies.5 Conceptual graphs are bipartite finite oriented graphs composed of concept and relation nodes. Concept nodes are composed of a concept type and a referent. A generic referent denotes the existence of a referent, while an individual referent refers to one instance of the concept type. In our case, the concept type set includes the objects of the real world present in the photos, and they're extracted as visual keywords. We organized concept types in a lattice that reflects generalization/specialization relationships. Figure 5 shows a part of the concept type hierarchy used. The relationship set includes absolute spatial (position of the center of gravity), relative spatial, and structural relationships.

The weighting scheme only considers media-dependent weights, and the inputs are the certainties of the recognition of concepts.


Figure 3. Visual keywords defined for home photos:
People: face, figure, crowd, skin
Sky: clear, cloudy, blue
Ground: floor, sand, grass
Water: pool, pond, river
Foliage: green, floral, branch
Mountain: far, rocky
Building: old, city, far
Interior: wall, wooden, china, fabric, light

Figure 4. Swimming pool image (a) to be indexed and (b) its index represented as a tessellation of visual keyword histograms.

Figure 5. Excerpt of a concept type hierarchy (Object branches into Nature, covering Sky, Ground, and Foliage, and Man made, covering Building and Interior).


A concept is described by

❚ the weight of the concept w, representing the importance of the concept in the photo (defined as the relative size of its region), and

❚ the certainty of recognition ce of the concept c (we use the certainty values that come from a labeling process).

We then represent a concept as [type:referent | w | ce]. Figure 6 shows a graph describing the same swimming pool image in Figure 4a. The image is composed of two objects: a foliage region with an importance of 0.34 and a certainty of recognition of 0.59, and a water-pool region with an importance of 0.68 and a certainty of recognition of 0.32.
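
The small sketch below shows one possible record layout for such concepts and for the arches connecting them, populated with the two concepts of Figure 6. The field and relation names are illustrative, not the authors' data structures.

    from dataclasses import dataclass

    @dataclass
    class Concept:
        ctype: str      # concept type, for example "water:pool"
        referent: str   # individual referent, for example "#waterpool1"
        w: float        # importance: relative size of the region in the photo
        ce: float       # certainty of recognition

    @dataclass
    class Arch:
        source: Concept
        relation: str   # for example "touches" or "on_top"
        target: Concept

    pool = Concept("water:pool", "#waterpool1", w=0.68, ce=0.32)
    foliage = Concept("foliage", "#foliage1", w=0.34, ce=0.59)
    # The relation and its direction are illustrative; Figure 6 links the two
    # regions with spatial and structural relations such as touches and on top.
    image_graph = {"concepts": [pool, foliage],
                   "arches": [Arch(foliage, "touches", pool)]}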

An event model graph has the same components as the image graphs, except that the concepts corresponding to the image objects might be complex. That is, they might contain more than one concept type associated with a value indicating the concept type's relevance for the model, and the relationships might also have a relevance value for the model. Consider a part of an event graph (see Figure 7) composed from two images chosen by a user to describe a swimming pool: the one in Figure 6 and another composed of a building region and a water:pool that touches the building. This event graph stores the fact that water:pool regions appear in all the selected images, and the complex concept related to the other concept denotes that 50 percent of the selected images (that is, the relevance rel of the concept type is 0.5) contain foliage and 50 percent building. Then, using the concept type hierarchy of Figure 5, 50 percent of the selected images contained a natural object and 50 percent a man-made object, and the two images (that is, a relevance rel of 1.0) contain a general object. The relevance values for the relations of the arches follow the same principle.

The matching between a graph corresponding to an event model Mgi and an image graph xcg is based on the fact that the matching value must incorporate elements coming from the concepts (that is, the elements present in the images) as well as the relationships between these elements, represented in our case by arches. So, we compute the matching value between the model graph Mgi and the document graph xcg according to the weights of the matching arches ad (a component of a graph of the form [typed1:referentd1 | wd1 | ced1] → (Relationd) → [typed2:referentd2 | wd2 | ced2]) and the weights of the matching concepts cd of xcg. The relevance status value for an event model graph Mgi and a document graph xcg is shown in Figure 8a, where πMgi(xcg) is the set of possible projections of the event model graph into the image graph. In the Figure 8a equation, the Match_concepts function is defined as shown in Figure 8b.

The matching function matchc of an image concept cd = [Typed:referentd | wd | ced] and a corresponding complex concept cp (as described in Figure 7) of the projection gp of an event model graph is based on the certainty of recognition of the concept, the weight of the concept, and the relevance value computed from the images used to define the event model. The value given by matchc is proportional to the certainty of recognition, to the weight of the concept, and to the relevance of the best concept of the event model, as shown in Figure 8c.

If we consider the part of the equation in Figure 8a related to arches, the Match_arches function is defined as shown in Figure 8d.

The matching function matcha of arches is based on the weight and certainty of recognition of each document arch ad (defined as [typed1:referentd1 | wd1 | ced1] → (Relationd) → [typed2:referentd2 | wd2 | ced2]) of the document, and on the relevance rel of the relations of the corresponding arch in the model.


Figure 6. Graph describing the swimming pool image (image #IMG0623 is composed of the concepts [Water pool: #waterpool1 | 0.68 | 0.32] and [Foliage: #foliage1 | 0.34 | 0.59], linked by composition, center-position, on-top, and touches relations).

Figure 7. Graph describing a model (a Water pool concept with relevance 1.0 is linked by On top 0.5 and Touches 1.0 relations to a complex concept containing Object 1.0, Nature 0.5, Man made 0.5, Foliage 0.5, and Building 0.5).


We use the minimum of the combination of weights and certainty of recognition of the arch concepts to reflect a notion of fuzzy conjunction between the concepts, because an arch exists only when the concepts it relates exist (see Figure 8e). Such values are computed when the event graph has at least one projection (see Sowa5) into the image graph.
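
To make the Figure 8 equations concrete, the sketch below computes matchc, matcha, and Sg over a set of projections, reusing the Concept and Arch records sketched earlier. The projection enumeration, the type and relation lattice behind is_generalization_of, and the dictionary layout of model concepts and arches are all simplifying assumptions, not the article's implementation.

    # Illustrative lattice; the article organizes concept types as in Figure 5
    # and relations in their own hierarchy.
    PARENTS = {"water:pool": "man made", "foliage": "nature",
               "nature": "object", "man made": "object"}

    def is_generalization_of(general, specific):
        # True if 'general' equals 'specific' or is reachable by walking upward.
        while specific is not None:
            if specific == general:
                return True
            specific = PARENTS.get(specific)
        return False

    def match_c(model_concept, doc_concept):
        # Figure 8c: w_d * ce_d * best relevance among model types that
        # generalize Type_d. model_concept: {"types": {type_name: rel_value}}.
        rels = [r for t, r in model_concept["types"].items()
                if is_generalization_of(t, doc_concept.ctype)]
        return doc_concept.w * doc_concept.ce * max(rels, default=0.0)

    def match_a(model_arch, doc_arch):
        # Figure 8e: fuzzy conjunction (min) over the two document arch
        # concepts, times the best relevance among model relations that
        # generalize the document relation.
        # model_arch: {"relations": {relation_name: rel_value}}.
        conj = min(doc_arch.source.w * doc_arch.source.ce,
                   doc_arch.target.w * doc_arch.target.ce)
        rels = [r for name, r in model_arch["relations"].items()
                if is_generalization_of(name, doc_arch.relation)]
        return conj * max(rels, default=0.0)

    def sg(projections):
        # Figure 8a, 8b, 8d: sum concept and arch matches per projection of the
        # event model graph into the image graph, then keep the best projection.
        scores = [sum(match_c(cp, cd) for cp, cd in proj["concepts"]) +
                  sum(match_a(ap, ad) for ap, ad in proj["arches"])
                  for proj in projections]
        return max(scores, default=0.0)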

Visual event learning

In this article, for the learning of a model Mi for an event Ei, we have two levels of representation, namely local visual keyword histograms and a graph-based abstraction. As presented previously, we adopt an instance-based approach to construct a visual keywords representation Mvi for Ei as the set of visual keyword indexes {vj} for L. Then, given an unlabeled photo x, we compute the similarity matching score Sv(Mvi, xvk) as in Equation 1. Using the conceptual graph formalism, we compute the conceptual graph representation Mgi of an event Ei from the generalization of a given set of labeled photos L = {dj} as described in the previous section. Then, given an unlabeled photo x, we can compute the similarity matching score Sg(Mgi, xcg) as in Figure 8a. Finally, we combine these relevance measures to obtain an overall relevance measure

R(Mi, x) = λ . Sv(Mvi, xvk) + (1 − λ) . Sg(Mgi, xcg) (2)

where we might determine the λ parameter a priori or with empirical tuning.
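
A minimal sketch of Equation 2 and the resulting ranking step, assuming the sv and sg functions sketched earlier and photo indexes (histograms and precomputed graph projections) whose layout is hypothetical:

    def relevance(event_model, photo, lam=0.5):
        # Equation 2: combine the visual keyword score Sv and the graph score Sg.
        s_visual = sv(event_model["labeled_histograms"], photo["histograms"])
        s_graph = sg(photo["projections"])
        return lam * s_visual + (1.0 - lam) * s_graph

    def retrieve(event_model, unlabeled_photos, lam=0.5):
        # Rank unlabeled photos by their relevance measure to the event model.
        return sorted(unlabeled_photos,
                      key=lambda p: relevance(event_model, p, lam),
                      reverse=True)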

Experimental evaluation

To evaluate the effectiveness of event-based retrieval of home photos, we conducted event model learning and event-based query experiments using 2,400 heterogeneous home photos collected over a five-year period with both indoor and outdoor settings. The images are those of the smallest resolution (that is, 256 × 384) from Kodak PhotoCDs, in both portrait and landscape layouts. After removing possibly noisy marginal pixels, the images become 240 × 360 resolution. Figure 9 displays typical photos in this collection, and Figure 10 shows some of the photos with bad quality (for example, faded, overexposed, blurred, dark, and so on). We didn't remove these bad-quality photos from our test collection, to reflect the true complexity of the original real data.


Figure 8. Equations related to graph matching:

(a) $S_g(M_{gi}, x_{cg}) = \max_{g_p \in \pi_{M_{gi}}(x_{cg})} \big( \mathrm{Match\_concepts}(g_p, x_{cg}) + \mathrm{Match\_arches}(g_p, x_{cg}) \big)$

(b) $\mathrm{Match\_concepts}(g_p, x_{cg}) = \sum_{c_p \in g_p,\; c_d \text{ concept of } x_{cg} \text{ corresponding to } c_p} \mathrm{match}_c(c_p, c_d)$

(c) $\mathrm{match}_c(c_p, c_d) = w_d \cdot ce_d \cdot \max_{t_p \in c_p \text{ and generic of } \mathit{Type}_d} rel(t_p)$

(d) $\mathrm{Match\_arches}(g_p, x_{cg}) = \sum_{a_p \in g_p,\; a_d \text{ arch of } x_{cg} \text{ corresponding to } a_p} \mathrm{match}_a(a_p, a_d)$

(e) $\mathrm{match}_a(a_p, a_d) = \min(w_{d1} \cdot ce_{d1},\, w_{d2} \cdot ce_{d2}) \cdot \max_{\mathit{Relation}_p \in \text{relations of } a_p \text{ and generic of } \mathit{Relation}_d} rel(\mathit{Relation}_p)$


For our experimental evaluation, we selected four events from our proposed home photo event taxonomy (from Figure 1). They are parks, swimming pools, waterside, and wedding. Figure 11 shows some sample photos of these events.

Event-based learning and query

For each of these events, a user constructed the list of photos considered relevant to the event from the 2,400 photos. The sizes of these ground truth lists are 306 for parks, 52 for swimming pools, 114 for waterside, and 241 for wedding.

To learn an event, we decided to use a training set of only 10 labeled photos to simulate practical situations (that is, a user only needs to label 10 photos for each event). To ensure unbiased training samples, we generated 10 different training sets from the ground truth list for each event based on uniform random distribution. The learning and retrieval of each event were thus performed 10 times, and the respective results are averages over these 10 runs. Note that for each of these runs, we removed the photos used as training from the ground truths when computing the precision and recall values.
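
The sketch below mirrors this protocol under assumptions: it draws 10 random training sets per event, excludes the training photos from the ground truth, and averages a simple precision measure (precision at the ground-truth size) over the runs. The learn and rank callables stand in for the article's model learning and retrieval steps, and the article's exact precision measure may differ.

    import random

    def average_precision_over_runs(ground_truth, collection, learn, rank,
                                    runs=10, train_size=10):
        # ground_truth: set of photo ids relevant to the event;
        # collection: all photo ids; learn(train_ids) returns an event model;
        # rank(model, candidate_ids) returns ids sorted by decreasing relevance.
        precisions = []
        for seed in range(runs):
            rng = random.Random(seed)
            train = set(rng.sample(sorted(ground_truth), train_size))
            relevant = ground_truth - train        # training photos excluded
            model = learn(train)
            candidates = [p for p in collection if p not in train]
            retrieved = rank(model, candidates)[: len(relevant)]
            hits = sum(1 for p in retrieved if p in relevant)
            precisions.append(hits / max(len(retrieved), 1))
        return sum(precisions) / runs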

To query based on event, a user issues or selects one of the four event labels. Based on the learned event models, we computed the relevance measures using Equation 2.

Comparison and analysis

We compare three methods on event-based retrieval. The first method, which we denote as hue saturation value (HSV)-global, indexed photos as global histograms of 11 key colors (red, green, blue, black, gray, white, orange, yellow, brown, pink, purple) in the HSV color space, as adopted by the original PicHunter system.9

The second method (denoted as HSV-grid) uses the same color histogram as in HSV-global but computed for each cell of a 4 × 4 grid on the photos.

The third method, visual event retrieval (VER), implemented the approach proposed in this article. In particular, the visual keywords learned are characterized using color and texture features.4 We used five local visual keyword histograms (left, right, top, bottom, and center blocks, with the center block having a higher weight). We computed the conceptual graph representation as a compound least common generalization from the training photos of an event. The relevance measure was based on Equation 2 with λ = 0.5, which produced the overall best result among all λ values (between 0 and 1 at 0.05 intervals) that we experimented with.
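
The empirical tuning of λ mentioned above can be sketched as a simple sweep; evaluate_overall is a placeholder callable that would run the four event queries at a given λ and return the overall average precision.

    def tune_lambda(evaluate_overall, step=0.05):
        # Sweep lambda over [0, 1] in the given step and keep the best value.
        candidates = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
        return max(candidates, key=evaluate_overall)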

Table 1 lists the average precisions (over 10 runs) of retrieval for each event and for all events using the different methods. Table 2 shows the average precisions (over 10 runs) among the top 20 and 30 retrieved photos for each event and for all events using the methods compared.

From our experimental results in Table 1, we observe that the overall results of our proposed VER method are encouraging. This could be attributed to the integration of orthogonal representations used in our VER method (that is, instance and abstraction based). Hence the VER approach performed much better than the compared methods.


Figure 9. Typical home photos used in our experiment.

Figure 10. Some home photos of inferior quality.


The same can be said for the individual events. In particular, the overall average precision is 61 percent and 37 percent better than that of HSV-global and HSV-grid respectively.

Furthermore, for practical needs, the overall average precisions of the top 20 and 30 retrieved photos were improved by 52 percent and 48 percent respectively when compared with the HSV-global method, and by 37 percent and 35 percent respectively when compared with the HSV-grid method (see Table 2). In concrete terms, the VER approach, on average, retrieved 13 and 19 relevant photos among the top 20 and 30 photos for any event query.

From Tables 1 and 2, we notice that the waterside event is less accurately recognized by each of the approaches considered. This comes from the fact that the waterside images are more varied than the other kinds of images considered, negatively impacting the learning quality.

Figure 12 displays the precision/recall curves for all the methods compared, averaged over all the query runs. The curves in Figure 12 clearly show the higher quality of the VER method, which has higher precision values at each of the 11 recall points, with the bigger gaps occurring at recall points smaller than 0.4.

Figure 13 shows the breakdown of precision/recall curves by event for the best method (VER). We see that the curve for the waterside event is much lower than those of the other three events, which is consistent with our previous remarks on the small precision values at 20 and 30 documents in Table 2 and the results in Table 1.


Figure 11. Sample photos of each of the four events: (a) parks, (b) swimming pools, (c) waterside, and (d) wedding.


Table 1. Average precision ratios for event-based retrieval.

Event            HSV-global   HSV-grid   VER
Parks            0.30         0.42       0.56
Swimming pools   0.12         0.22       0.29
Waterside        0.10         0.13       0.20
Wedding          0.38         0.32       0.41
Overall          0.23         0.27       0.37

Table 2. Average precision ratios at top 20 and 30 documents.

Event            HSV-global (20/30)   HSV-grid (20/30)   VER (20/30)
Parks            0.62/0.59            0.78/0.79          0.94/0.92
Swimming pools   0.26/0.21            0.32/0.27          0.52/0.43
Waterside        0.24/0.22            0.24/0.23          0.31/0.29
Wedding          0.62/0.65            0.60/0.53          0.90/0.85
Overall          0.44/0.42            0.49/0.46          0.67/0.62


Future work

We presented our event taxonomy for home photo content modeling and a computational learning framework to personalize event models from labeled sample photos and to compute relevance measures of unlabeled photos to the event models. These event models are built automatically from learning and indexing using a predefined visual vocabulary with a conceptual graph representation. The strength of our event models comes from the intermediate semantic representations of visual keywords and conceptual graphs, which are abstractions of low-level feature-based representations that bridge the semantic gap.

In actual deployment, a user annotates a small set of photos with event labels selected from a limited event vocabulary during photo import (from digital cameras to hard disk) or photo upload (to online sharing Web sites). Our approach would then build the event models from the labeled photos, based on the semantics extracted from the photos annotated with events.

In the near future, we'll experiment with more events and with other indexing cues such as people identification and time stamps. We'll also explore other computational models and learning algorithms for personalized event modeling to increase the quality of learning on events that contain a lot of variability.

References

1. K. Rodden and K. Wood, "How People Manage Their Digital Photos?," to be published in Proc. ACM Computer–Human Interaction (CHI), ACM Press, 2003, pp. 409-416.
2. A. Smeulders et al., "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, 2000, pp. 1349-1380.
3. A. Vailaya et al., "Bayesian Framework for Hierarchical Semantic Classification of Vacation Images," IEEE Trans. Image Processing, vol. 10, no. 1, 2001, pp. 117-130.
4. J.H. Lim, "Building Visual Vocabulary for Image Indexation and Query Formulation," Pattern Analysis and Applications (Special Issue on Image Indexation), vol. 4, nos. 2/3, 2001, pp. 125-139.
5. J. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, 1984.
6. M. Flickner et al., "Query by Image and Video Content: The QBIC System," Computer, vol. 28, no. 9, 1995, pp. 23-32.
7. C. Carson et al., "Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, 2002, pp. 1026-1038.
8. A. Mohan, C. Papageorgiou, and T. Poggio, "Example-Based Object Detection in Images by Components," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 4, 2001, pp. 349-361.
9. I. Cox et al., "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation and Psychophysical Experiments," IEEE Trans. Image Processing, vol. 9, no. 1, 2000, pp. 20-37.

Joo-Hwee Lim is a research scientist at the Institute for Infocomm Research of Singapore. His research interests include content-based indexing and retrieval, pattern recognition, and neural networks. Lim received BSc and MSc degrees in computer science from the National University of Singapore.


Figure 12. Overall precision/recall curves for each method (HSV-global, HSV-grid, and visual event retrieval).

Figure 13. Precision/recall curves for each event (parks, swimming pools, waterside, and wedding) by the proposed VER approach.


Qi Tian is a principal scientist at the Media Division, Institute for Infocomm Research in Singapore. His research interests include image/video analysis, indexing, browsing and retrieval, and pattern recognition. Tian received a BS and MS from Tsinghua University, China, and a PhD from the University of South Carolina, all in electronic engineering. He is a senior IEEE member and serves on the editorial boards of two professional journals and committees of international conferences on multimedia.

Philippe Mulhem is director of the Image Processing and Applications Laboratory in Singapore, a joint laboratory between the French National Center of Scientific Research, the National University of Singapore, and the Institute for Infocomm Research of Singapore. He is also a researcher in the Modeling and Multimedia Information Retrieval group of the CLIPS-IMAG laboratory, Grenoble, France. His research interests include formalization and experimentation of image, video, and multimedia document indexing and retrieval. Mulhem received a PhD and Habilitation to Manage Research from Joseph Fourier University, Grenoble.

Readers may contact Joo-Hwee Lim at the Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613; [email protected].


