socialized mobile photography: learning to photograph with social context via mobile devices

184 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

Socialized Mobile Photography: Learning to Photograph With Social Context via Mobile Devices

Wenyuan Yin, Student Member, IEEE, Tao Mei, Senior Member, IEEE, Chang Wen Chen, Fellow, IEEE, and Shipeng Li, Fellow, IEEE

Abstract: The popularity of mobile devices equipped with various cameras has revolutionized modern photography. People are able to take photos and share their experiences anytime and anywhere. However, taking a high quality photograph via a mobile device remains a challenge for mobile users. In this paper we investigate a photography model to assist mobile users in capturing high quality photos by using both the rich context available from mobile devices and crowdsourced social media on the Web. The photography model is learned from community-contributed images on the Web, and depends on the user’s social context. The context includes the user’s current geo-location, time (i.e., time of the day), and weather (e.g., clear, cloudy, foggy, etc.). Given a wide view of a scene, our socialized mobile photography system is able to suggest the optimal view enclosure (composition) and appropriate camera parameters (aperture, ISO, and exposure time). Extensive experiments have been performed for eight well-known hot spot landmark locations where sufficient context-tagged photos can be obtained. Through both objective and subjective evaluations, we show that the proposed socialized mobile photography system can indeed effectively suggest proper composition and camera parameters to help the user capture high quality photos.

Index Terms: Camera parameters, mobile photography, social context, social media, view enclosure.

I. INTRODUCTION

The recent popularity of mobile devices and the rapid development of wireless network technologies have revolutionized the way people take and share multimedia content. With the pervasiveness of mobile devices, more and more people are taking photos to share their experiences using their mobile devices anytime and anywhere. Market research indicates that more than 27% of photos were captured by smartphones in 2011, while the number was merely 17% in the previous year [1]. The booming development of built-in mobile cameras (such as the advanced eight megapixel resolution and the large aperture) has triggered a trend that may lead to mobile cameras replacing traditional handheld cameras.

Manuscript received October 09, 2012; revised February 08, 2013 and May 18, 2013; accepted June 27, 2013. Date of publication September 25, 2013; date of current version December 12, 2013. This work was supported by NSF Grant 0964797 and a Gift Funding from Kodak. Part of this work was performed when the first author visited Microsoft Research Asia as a research intern. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Vasileios Mezaris.

W. Yin and C. W. Chen are with the State University of New York at Buffalo, Buffalo, NY 14260 USA (e-mail: [email protected]).

T. Mei and S. Li are with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2013.2283468

However, mobile cameras cannot guarantee perfect photos. Although mobile cameras harness a variety of technologies to take care of many camera settings (e.g., auto-exposure) for point-and-shoot ease, capturing high quality photos is still a challenging task for amateur mobile users, not to mention those lacking photography knowledge and experience. Therefore, assisting amateur users to capture high quality photos via their mobile devices becomes a demanding task. While most existing research has predominantly focused on how to retrieve and manage photos on mobile devices [2], [3], or how to adapt media considering unique characteristics of mobile devices [39], there have been few attempts to address this topic before.

To obtain high quality photos, various types of commercial software such as Photoshop have been developed for post-processing to adjust photo quality. However, most of them are designed for desktop PCs and require intensive computation. Although there exist some mobile applications for image post-processing, they are only able to conduct simple operations, such as cropping and contrast adjustment. These post-processing tools cannot always fix poorly captured images. For example, the information lost in an over-exposed image is usually unrecoverable by any post-processing technique. Therefore, it is desirable to assist mobile users in obtaining high quality photos while the photos are being taken. For example, if we can suggest the optimal scene composition and suitable camera settings (i.e., aperture, ISO, and exposure time) based on the user’s current context (i.e., geo-location, time, and weather condition) and input scene, the user’s ability to capture high quality photos via mobile devices will be improved significantly.

On the other hand, computational aesthetics of photography has emerged as a hot research area. It aims to automatically assess or enhance image quality with computational models based on various visual features, such as lighting (e.g., light, color, and texture) and composition (e.g., the rule of thirds). A comprehensive survey on computational aesthetics can be found in [5]. However, aesthetic assessment or enhancement methods in existing works are also designed for desktop PCs, and focus on evaluating or enhancing image quality as post-processing tools. Moreover, they only apply some simple and general photography rules to the aesthetics evaluations. For example, when assessing the photo composition, they tend to place objects according to the rule of thirds or put the horizontal line lower in the frame, according to the golden ratio [6]. In many cases, instead of applying these simple rules slavishly, professional photographers usually adapt the composition and the camera exposure parameters to what is being photographed, e.g., the scene and the lighting conditions. Therefore, the existing general photo aesthetics assessment and enhancement approaches are far from enough to guide mobile users to capture high quality photos on the fly.

1520-9210 © 2013 IEEE


Based on the above analysis, it is desirable to develop an intelligent built-in tool to assist mobile users in capturing high quality photos. Since composition and exposure parameters (i.e., aperture, ISO, and exposure time) are two key factors affecting photo quality [7], the system should be able to provide suggestions to mobile users in terms of these two aspects. Therefore, we propose an intelligent mobile photography system, which can not only assist mobile users in finding good view enclosures, but also help them set correct exposure parameters.

A. Challenges and Opportunities

Despite recent developments in computational aesthetics, intelligent mobile photography still remains a challenge because no unified knowledge can always be applied to various scenes and contexts. Various composition rules and exposure principles apply adaptively and flexibly when capturing different objects or scenes with diverse perspectives and arrangements under various lighting conditions. Professional photographers take years of training to obtain sufficient photography knowledge. They usually skillfully adopt the domain knowledge to capture scenes under different conditions. Photographers are artists whose knowledge is difficult to model and represent with simple rules. The complexity of photo composition and camera settings is described in Section III. Therefore, assisting mobile users to take professional photos under various shooting situations is a challenging task.

Fortunately, the availability of rich context sensors on mobile devices and the explosion of images associated with their capture contexts, camera parameters and social information in the social media community bring us good opportunities to solve this challenging problem. On mobile devices, the GPS location and time of the camera can be obtained, and some other photography-related context information like the weather condition at the shooting time can be further inferred. On the other hand, a large volume of photos on social media websites are associated with metadata such as GPS and timestamp, as well as EXIF including the camera parameters. Moreover, in the social media community, the photo quality can be implicitly inferred from the number of views, number of favorites, and comments, or even explicitly obtained from ratings. Despite some noise in the media metadata, the aggregated photographs captured nearby containing the same scene, together with their metadata from the media community, can provide significant insight into relevant photography rules in terms of composition and exposure parameters for mobile photography assistance.

For example, as shown in Fig. 1, when capturing a photo of the “Statue of Liberty” from the perspective shown in the input wide-view image, the content and context of the input image can be used to crowdsource photos with the same content from similar perspectives, along with the associated social information that reflects their quality. By analyzing the composition and aesthetic quality of these crowdsourced images, we find that pictures with the Statue on the right third line are usually highly preferred; hence the optimal composition of the input image can be inferred. Moreover, with the input time and weather condition, we can also estimate the optimal camera exposure parameters from the crowdsourced photos with similar content and lighting conditions. Here, the first two crowdsourced similar photos with high ratings were captured in a similar season and time of the day, and their weather condition is also overcast, like the current weather condition. So by setting the camera parameters to exposure time (ET): 1/200, aperture: 8.0 and ISO: 100 as in those two images, the user should be able to obtain high quality photos with proper exposure.

Fig. 1. Illustration of how crowdsourced images help with view suggestion and exposure parameter suggestion.

Motivated by the above observations of mobile devices and social media, we propose a socialized mobile photography system leveraging both the context and crowdsourced photography knowledge to aid mobile users in acquiring high-quality photos. We predominantly focus on outdoor landscape photography on mobile devices. To achieve intelligent mobile photography, the following problems need to be solved. First, the system should be able to suggest a view enclosure with optimal composition by mining the scene-specific composition rules from the crowdsourced community-contributed photos. Second, the optimal exposure parameters have to be recommended given the suggested composition and the lighting condition.

The proposed socialized mobile photography system is shown in Fig. 2. Given the input wide-view image and the shooting context, i.e., geo-location, time, and weather, the system is able to suggest view enclosures with optimal composition, marked by the red rectangle, by mining the relevant composition rules from crowdsourced photos captured nearby. Then, the system suggests optimal exposure parameters, i.e., ISO, aperture, and exposure time (ET), shown in the upper-right corner of the screen, by mining exposure rules from the crowdsourced images with similar content and context. Therefore, the optimal view enclosure and exposure parameters can be suggested by the proposed system to capture high quality photos in the given scene and context, as long as sufficient images containing similar content are crowdsourced nearby.

Fig. 2. Demonstration of the proposed socialized mobile photography system.

To intelligently aid mobile users in capturing high quality photos considering the complex scene- and context-dependent composition and exposure principles, two fundamental components are developed in the socialized mobile photography system: 1) offline composition and exposure parameter learning, and 2) online photography suggestion.

Considering the various views in a given scene due to differences in capture location, perspective, and content, viewpoint clusters within a scope of location are discovered based on image local features as well as their geo-locations by unsupervised methods. Relevant composition knowledge is mined by view cluster ranking and view-cluster-specific composition learning. To learn the effects of various contexts and different shooting content on exposure, metric learning is carried out for exposure parameter suggestion. The time-consuming view clustering, composition and exposure learning processes are all performed offline, which makes the system applicable and practical for mobile applications. In the online process, a portion of the view enclosure candidates are discarded using the cluster rankings obtained offline, which makes the online view decision quite efficient. Then the top ranked view enclosure is selected from the remaining candidates based on the learned view-specific composition model. Finally, the optimal exposure parameters are suggested according to the content of the optimal view and the shooting context based on the learned exposure model.
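The online process described above (prune candidates using the offline cluster ranking, then pick the best remaining composition) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the `cluster_rank` mapping, the `keep_top` cutoff, and the precomputed `composition_score` values are all hypothetical stand-ins for the learned models.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """A candidate view enclosure inside the wide-view input (hypothetical type)."""
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels
    cluster_id: int                 # most similar offline view cluster
    composition_score: float        # from a view-specific model, assumed given


def suggest_view(candidates: List[Candidate],
                 cluster_rank: Dict[int, int],
                 keep_top: int = 2) -> Optional[Candidate]:
    """Discard candidates matched to low-ranked clusters, then take the best one.

    cluster_rank maps cluster id -> rank, where 1 is the most appealing cluster.
    keep_top is an assumed cutoff; candidates with an unknown cluster are dropped.
    """
    kept = [c for c in candidates
            if cluster_rank.get(c.cluster_id, keep_top + 1) <= keep_top]
    if not kept:
        return None
    return max(kept, key=lambda c: c.composition_score)


# Toy data: cluster 7 is ranked first, cluster 3 second, cluster 9 last.
ranks = {7: 1, 3: 2, 9: 3}
cands = [
    Candidate((0, 0, 640, 480), cluster_id=9, composition_score=0.95),
    Candidate((100, 50, 640, 480), cluster_id=7, composition_score=0.70),
    Candidate((200, 80, 640, 480), cluster_id=3, composition_score=0.82),
]
best = suggest_view(cands, ranks)
print(best.box)
```

In this toy run the candidate with the highest composition score is pruned because its view cluster is ranked last, which mirrors the point that some views are intrinsically less appealing regardless of object placement.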

B. Contributions

We make the following three major contributions:

• We propose a general framework for mobile photography using the rich social context from mobile devices. To the best of our knowledge, little research has been conducted on this topic before.

• We solve the problem of mobile photography by leveraging both rich context and crowdsourced images on the Web. To overcome the photography challenge due to the complex scene- and context-dependent characteristics, a view cluster discovering step followed by a view-specific composition learning and exposure parameter learning scheme is developed to suggest the optimal view and parameters based on the discovered photography rules.

• We develop a mobile photography system and evaluate it through objective and subjective evaluations.

The remainder of the paper is organized as follows. In Section II, related works are discussed. In Section III, we present the challenges of mobile photography to justify the need for the proposed mobile photography system. An overview of the proposed socialized mobile photography system and the details of offline photography learning and online mobile photography suggestion are given in Section IV. Experiment designs and evaluations are presented in Section V. Finally, we conclude this paper in Section VI.

II. RELATED WORK

Recently, considerable research efforts have been made on photo aesthetics computation. Based on some heuristic low level features which are expected to discriminate between pleasing and displeasing images, linear regression is applied to predict numerical aesthetics ratings in [4]. In [24], by determining perceptual factors that distinguish professional photos and snapshots, visual features such as spatial distribution of edges, color distribution, and hue count were adopted for photo quality assessment. Considering that professional photographers often skillfully differentiate the main subject of the photo from the background, features such as lighting difference and subject composition were adopted in [25] for photo quality classification. A visual attention model based on a saliency map is deployed for photo assessment in [26]. In [27], high level describable image attributes including compositional attributes, content attributes and sky illumination attributes are designed for photo quality prediction. In particular, query-specific ‘interestingness’ is estimated, taking into account that the usefulness of the proposed attributes varies across subjects. Instead of using various hand-crafted features or attributes intuitively or theoretically related to photo aesthetics, visual features widely used in the computer vision domain such as Bag-of-Visual-Words and Fisher Vector are introduced for photo quality classification in [28]. The classification performance improvements achieved by [27], [28] are useful evidence that photography is a complicated and scene-dependent task.

Using the same set of features as in [4], an aesthetic quality inference engine is developed in [29] to allow users to upload photographs and automatically rate their aesthetic quality based on the distance to the SVM hyperplane. Later on, they extend their work into OSCAR [30], which aims to retrieve highly rated images with similar semantics and similar composition based on low level features by classifying image composition into textual, diagonal, horizontal, centered, and vertical categories for users to imitate. Because the rich context characteristics of mobile devices are not considered, in many cases the retrieved results with different semantic objects have limited power to assist photography. In addition, for novice mobile photographers, it is still difficult to take appealing photos without explicit composition and exposure suggestions under various shooting situations.

To enhance photo quality, several schemes retarget images based on several composition guidelines [6], [8]. In [6], the object is relocated to a more aesthetically pleasing location based on the rule of thirds for single-object image enhancement, while the photo is cropped or expanded to achieve pleasing balance for landscape images without a dominant object. In [8], an optimal version of the input image is produced by crop-and-retarget based on known composition rules such as the rule of thirds, diagonal dominance and visual balance. However, these post-processing techniques for photo enhancement require intensive computation or manual manipulations. To accurately segment objects, either complicated detection and segmentation techniques or annoying user-guided manipulations are required. In addition, complicated inpainting techniques are needed for image re-composition. Moreover, many poor quality photos captured with improper exposure parameters cannot be recovered or enhanced, as described earlier.

Since an explicit rule-based composition suggestion system cannot capture all photography knowledge, several approaches have been developed in [31] and [32], attempting to discover the photography principles through learning. In [31], a view finding system is proposed to automatically generate professional view enclosures by mining the underlying knowledge of professional photographers from massively crawled professional photos. To model the general photography knowledge, an omni-range context model is developed to learn the patch spatial distributions and correlation distributions of pair-wise patches to guide the photo composition. Considering the variation of photographic styles of different photographers, a preference-aware aesthetic model for view suggestion has been proposed in [32] by constructing an aesthetic feature library with bag-of-aesthetics-preserving features in a bottom-up fashion. However, we argue that a general composition model is not sufficient to model the complicated professional composition knowledge, since in various scenes containing different objects with various perspectives, professional photographers tend to compose photos in different manners, as illustrated in Section III-A. In addition, exposure is not considered in these photography systems. Moreover, the general photography systems have not utilized the context information, as they are not designed for mobile devices.

With the availability of context sensing on mobile devices, context-based mobile applications have drawn considerable attention, such as mobile image search [2], landmark retrieval [3] and identification [33]. Following the successful use of contexts in these mobile applications, an image recommendation application is developed in [34] to help users compose a scene by utilizing the location context. Without looking into the image visual content, nearby photos with different semantics may be retrieved for framing assistance, but in that case the recommended image may have little reference value for the current photo composition. Moreover, without composition learning, the users have to be directed to the exact place of the recommended image and align the current photo with the recommended one manually.

The work towards crowd-sourced learning to photograph in [35] tried to find the optimal view enclosure. Although limited contexts and online media metadata, i.e., location and time of the day, are leveraged for crowdsourcing to mine scene-specific photography knowledge, much other necessary photography-related information is ignored, such as weather context and media EXIF metadata. In addition, the perspective difference of the same subject is not taken into account for view suggestion, which is a non-negligible aspect of photo aesthetics. Moreover, the exposure problem is not considered in the photography system, but good view finding alone cannot guarantee high quality photo acquisition, as mentioned earlier. From the technical perspective, the online contextually relevant image retrieval and aesthetics model learning processes are quite time-consuming for mobile applications.

Fig. 3. Professional photo examples with different composition for different scenes: (a), (b), and (c) are about object placement; (d), (e), and (f) are about object size determination; (g), (h), and (i) are about horizon placement.

III. CHALLENGES OF MOBILE PHOTOGRAPHY

It is quite difficult for non-professional mobile users to take high quality photos with good composition and exposure parameters due to the following challenges.

A. Complexity of Photo Composition

Photo composition is a complicated problem which is highly dependent on the scenes being shot, objects, and perspectives. Although a few typical composition rules have been introduced into existing works on computational aesthetics, e.g., the rule of thirds and golden ratio placement of horizontal lines [6], these rules are not always applied dogmatically by professional photographers. Moreover, they are non-exhaustive, meaning that they cannot cover all possible composition principles.

Fig. 3 shows some professional photos. Besides placing the object by the rule of thirds as in (b), symmetrical composition, which places a single subject right in the middle as in (c) and (i), is often applied to achieve a sense of equilibrium. Unresolved balance, placing the subject far from the center, is also used to create visual tension, as in (a). Furthermore, both extremes and all varieties of balance in between have their uses in photography. Not only the subject position but also its relative size in the photo is determined by many factors, such as the information content of the subject and the relationship between the subject and its setting. This dependence on the shooting content makes simplified subject size models as in [8] inadequate for composition optimization. In the examples in (d), (e), and (f), the photos contain different objects captured at various sizes. The same object from different views can also be captured with diverse sizes, as in (a) and (d), which makes object scale determination more complicated. Another typical example is the placement of the horizon, which is a key element in many landscape images. The simplified scheme in [6] of putting it lower at the golden ratio division, as in (g), is not a hard rule to follow in real cases. When the ground has plenty of interest, a high position of the horizon is encouraged to draw attention to the ground, as in (h). A typical photography technique when water appears on the ground is to create a reflection by placing the horizon in the middle, which gives a feeling of tranquility, as in (i).

Fig. 4. Framework of the proposed socialized mobile photography system.

Most photographers adapt the photography repertoire, a set of compositional possibilities obtained from their photography knowledge or experience, to the shooting scenes [7]. Therefore, we claim that photography view finding is a highly complex and scene-dependent task, and the introduction of several simple and general composition rules is insufficient to guide the mobile user to capture high quality photos.
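To make the "simple and general" rules discussed above concrete, the following sketch computes the rule-of-thirds lines, the golden-ratio horizon row, and how far a subject sits from the nearest thirds intersection. It is only an illustration of the generic rules that this section argues are insufficient on their own; the function names and the diagonal normalization are assumptions introduced here, not part of the proposed system.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, about 1.618


def thirds_lines(width, height):
    """Return the x positions and y positions of the rule-of-thirds lines."""
    xs = (width / 3, 2 * width / 3)
    ys = (height / 3, 2 * height / 3)
    return xs, ys


def golden_horizon_y(height):
    """Row (measured from the top) that puts the horizon lower in the frame
    at the golden-ratio division, i.e. about 61.8% of the way down."""
    return height / PHI


def thirds_offset(cx, cy, width, height):
    """Distance from a subject center to the nearest thirds intersection,
    normalized by the frame diagonal (0.0 means exactly on an intersection)."""
    xs, ys = thirds_lines(width, height)
    diag = math.hypot(width, height)
    return min(math.hypot(cx - x, cy - y) for x in xs for y in ys) / diag


# A subject on the upper-right intersection versus a centered (symmetric) one.
print(thirds_offset(600, 200, 900, 600))
print(thirds_offset(450, 300, 900, 600))
```

A centered subject gets a nonzero offset under this measure even though symmetric composition is often the preferred choice, as in Fig. 3(c) and (i), which is exactly why a single fixed rule cannot score all scenes.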

B. Complexity of Camera Settings

Exposure has a critical effect on photo lightness, color and contrast, and the human visual system is quite sensitive to all of them. Unsuitable exposure can make a poor quality photo even though the composition is successful, and quite often even post-processing cannot improve it into an appealing work. An unsuitable setting of any exposure-related camera parameter can degrade the photo quality, which is difficult for mobile users to handle.

Professional photographers have to adjust the exposure parameters adaptively considering the shooting scenes, i.e., the subject and its setting. From the image lightness and color perspective, the main subject has to be captured in an acceptable tone, and at the same time, the contrast between the shooting subject and its setting has to be taken into account, given the appearance relations between them, to create an ideal attention effect. Therefore, exposure parameters highly depend on the shooting content.

Various lighting conditions influenced by many context factors such as time and weather make the exposure parameter adjustment a more complicated problem. Even for the same shooting content, the exposure has to be varied with the lighting conditions. The exposures are significantly different between day and night, and even vary with the time of day due to the sunshine variations. In addition, in the same time period of the day at the same place, the exposure also varies with different weather conditions. Therefore, the parameters have to be adapted accordingly to obtain high quality images.
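One way to see how aperture, ISO, and exposure time interact is the standard photographic exposure value (EV). EV is not used in the text above; it is introduced here only as a hedged illustration. The sketch assumes the usual definition, EV at ISO 100 = log2(N^2 / t) - log2(ISO / 100), and evaluates it for the settings quoted in the Fig. 1 example (ET 1/200, aperture 8.0, ISO 100).

```python
import math


def exposure_value_100(aperture_f: float, exposure_time_s: float, iso: float) -> float:
    """ISO-100-equivalent exposure value: log2(N^2 / t) - log2(ISO / 100)."""
    return math.log2(aperture_f ** 2 / exposure_time_s) - math.log2(iso / 100.0)


# Settings from the Fig. 1 example: f/8, 1/200 s, ISO 100.
ev = exposure_value_100(8.0, 1 / 200, 100)
print(round(ev, 2))

# Halving the exposure time while doubling ISO targets the same light level.
ev_equiv = exposure_value_100(8.0, 1 / 400, 200)
print(round(ev_equiv, 2))
```

Different parameter triples can therefore match the same lighting condition, which is one reason a suggestion system has to consider the shooting content and context rather than a single fixed setting.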

IV. SOCIALIZED MOBILE PHOTOGRAPHY

A. Approach Overview

The framework of the proposed socialized mobile photography system is shown in Fig. 4. The system takes the to-be-taken wide-view image along with the mobile user’s geo-location as input and sends it to the cloud server. The input wide-view image can either be directly taken, or synthesized from multiple consecutive photos taken by the mobile user. By jointly utilizing the input wide-view image and its geo-location as well as the lighting-related contexts such as time, date and weather condition, which can be obtained from the Internet, the system will suggest optimal view enclosures and proper exposure parameters best fitting the shooting content and context based on photography rules learned from crowdsourced social media data and metadata nearby.

Two fundamental components are needed in the socialized mobile photography system to aid mobile users in capturing professional photographs. First, an offline photography learning process is needed to mine composition and exposure rules. Second, the input wide-view image content and context are utilized together to find relevant photography rules to recommend optimal view enclosures and exposure parameters for mobile users.

In the offline photography learning procedure, view cluster discovering is performed first in a certain scope of geo-locations by clustering based on both image visual features and their geo-locations. Because some view clusters are intrinsically more appealing than others, view cluster ranking is carried out. For example, as shown in the view cluster ranking part of Fig. 4, the cluster of the first row is significantly better than that of the fourth row. The view cluster ranking results will be utilized in the online optimal view enclosure searching step to make the searching process much more efficient. As mentioned earlier, due to the non-exhaustiveness and flexibility of the composition rules when taking pictures of different portions of scenes, composition learning is performed for each view cluster discovered. As shown by the instances in the view-specific composition learning part of Fig. 4, for the view cluster of the second row, the rule of thirds is more appropriate than symmetrical composition, while for the cluster of the third row, instead of the rule of thirds, symmetrical composition of the close-up view of the “Golden Gate Bridge” pier is preferred. Moreover, professional photographers usually adjust the camera exposure parameters according to the brightness and color of the objects and the whole setting, as well as the lighting condition. The lighting condition is influenced by a variety of factors, such as the intensity and direction of sunshine, which are affected by the season, the time of the day and the weather conditions. Metric learning is therefore carried out to model the various effects of content and context on the exposure parameters.

In the online photography suggestion stage, utilizing the visual content and geo-location of the input, relevant view clusters similar to all possible view enclosure candidates of the input image are found. Considering that some view enclosure candidates are intrinsically bad no matter how the object placement is tuned, a large portion of view enclosure candidates similar to the low ranked view clusters are discarded. Afterwards, the optimal view enclosures are selected, based on the offline learned view-specific composition principles, only from the remaining enclosure candidates. Once mobile users are provided with the optimal view enclosure, the appropriate exposure parameters, i.e., exposure time, aperture and ISO suitable for the view and lighting conditions, are suggested. With the suggested view enclosure and corresponding exposure parameters, mobile users can capture high quality photos with appealing composition and exposure.
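The offline view cluster discovering step outlined in this overview groups images using both visual content and capture location. The sketch below illustrates that idea under several assumptions: Section IV-B states that the two similarities are fused by their product and clustered with affinity propagation, but here the Gaussian kernel on distance, the `sigma` and `threshold` values, and the use of simple connected components in place of affinity propagation are all hypothetical simplifications, and the content similarities are assumed to be given rather than computed from SIFT visual words.

```python
import math
from typing import List, Sequence, Tuple


def location_similarity(p: Tuple[float, float], q: Tuple[float, float],
                        sigma: float = 50.0) -> float:
    """Map Euclidean distance (metres, local planar coordinates) into (0, 1].

    The Gaussian kernel and sigma are assumptions made for this sketch.
    """
    d = math.hypot(p[0] - q[0], p[1] - q[1])
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def group_views(content_sim: Sequence[Sequence[float]],
                locations: Sequence[Tuple[float, float]],
                threshold: float = 0.3) -> List[List[int]]:
    """Group images whose fused similarity (content * location) exceeds threshold.

    Connected components via union-find are a simple stand-in for
    affinity propagation, which also needs no preset number of clusters.
    """
    n = len(locations)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            fused = content_sim[i][j] * location_similarity(locations[i], locations[j])
            if fused > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Images 0 and 1 look alike and were shot close together; image 2 looks alike
# but was shot 500 m away; image 3 was shot nearby but shows different content.
sim = [[1.0, 0.9, 0.8, 0.0],
       [0.9, 1.0, 0.8, 0.0],
       [0.8, 0.8, 1.0, 0.1],
       [0.0, 0.0, 0.1, 1.0]]
locs = [(0.0, 0.0), (10.0, 0.0), (500.0, 0.0), (5.0, 5.0)]
print(group_views(sim, locs))
```

Because the fusion is a product, a pair needs to be similar in both content and location to be grouped, so the distant look-alike and the nearby unrelated image each stay in their own group.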

B. Offline Photography Learning

1) View Cluster Discovering: Intuitively, when photographers visit a certain scope of geo-location, they tend to capture photos from a certain number of photo-worthy viewpoints. The aggregated photographs taken in the location scope, associated with their social information from the media community, can provide significant insight into the aesthetics and composition rules of different viewpoints. To discover those repeated views from the crowdsourced photographs of the given location scope, we perform image clustering by jointly utilizing image visual features and capture locations. The target is to expose different views covering different portions of the scene from a variety of perspectives.

View Clustering. An efficient content and geo-location based clustering process is carried out to discover all typical view clusters in the location scope. As the images within the same view cluster should contain the same main objects, we use local features with geometric verification to capture image content similarity. To facilitate the clustering and online relevant view discovering process, the crowdsourced photos in the location scope are indexed by inverted files [9] based on SIFT [10] visual words. To overcome the false matching caused by the ambiguity of the visual words, geometric relationships among the visual words are also recorded into the index by spatial coding as in [11]. Hence, using the index, the image content similarity can be efficiently computed based on the matching score formulated by the number of matched visual words passing the spatial verification [11]. In addition, considering that images captured from close places usually have similar content from similar perspectives, location is also adopted in the view cluster discovering process. The image location similarity is calculated based on their GPS Euclidean distance. Then the view clusters can be discovered by a clustering process based on image similarity obtained by fusion of their content similarity and location similarity. The similarity fusion is achieved using their product.

However, it is difficult to manually specify the number of views considering the difference in content and perspectives for clustering, even after manually going through the whole dataset crawled from the location scope. Therefore, affinity propagation [12], which does not require the specification of the number of clusters, is performed to cluster the images into different views based on the fused image similarity.

View Cluster Denoising. To learn the photography rules of a given content from a certain perspective, we need to model the relevant rules for each cluster. However, the noisy images without the main objects in the view clusters discovered by the above clustering process would negatively affect the photography learning process. It is necessary to remove the images without the same content. To identify the images sharing the main objects with other images in the cluster, we first select the iconic image based on local features. The image with the maximum total content similarity score with the others within the cluster is chosen as the iconic image of the view cluster. Afterwards, the images whose content similarity score is less than a threshold are considered as noisy images without the main objects in the cluster and thus are discarded. Then, noisy clusters without representative content in the scene, such as the ones containing portraits and crowds, can be removed by discarding the clusters with quite a small number of images. Here we discard the clusters with fewer than 5 images. Finally, we can expect images with very similar content from the same viewpoints to end up in the same view cluster.

2) View Cluster Ranking: As aforementioned, some view clusters have more appealing content and composition than others; therefore, we rank the view clusters discovered. Later on, in the online view enclosure selection stage, the enclosures having relatively less appealing content can be discarded directly to facilitate the optimal view searching. We adopt the cluster size and score distribution of the images in the cluster to rank the view clusters.

View Cluster Ranking. Suppose we have the whole ranking list of the crowdsourced images based on their aesthetic scores. To rank the view clusters, the score distributions of the individual images in the cluster need to be taken into account. In the ideal case, the individual images of high ranking clusters should have high aesthetic scores in the average sense. On the other hand, the individual image scores of highly ranked clusters should not be scattered too much, since the clusters with compact scores tend to have more stable aesthetic scores. Inspired by average precision, a commonly used metric in the information retrieval field, we formulate the average score of a cluster by

(1)

where the average is taken over the images in the view cluster. For each image belonging to cluster k in the whole rank list, the averaged term is the ratio of the total score of the images within cluster k to the total score of all images at the cut-off rank of that image in the whole rank list, no matter which cluster they belong to. In addition, the appeal of the view cluster can also be reflected by the size of the cluster, since pleasing objects tend to draw more photographers’ attention and hence a large number of images are aggregated in the view cluster. Therefore, the view clusters are ranked according to the scores calculated by

(2)
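The average-precision-style cluster score described for (1) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the input layout (a list of cluster label and aesthetic score pairs, already sorted by score in descending order) are hypothetical, and the way (2) combines this score with the cluster size is not reproduced here.

```python
def cluster_average_score(ranked, cluster_id):
    """Average, over the images of one cluster, of the score ratio at each image's rank.

    ranked: list of (cluster_label, aesthetic_score) pairs sorted by score, best first.
    This layout is an assumption made for illustration only.
    """
    total_all = 0.0  # running total score of all images up to the current rank
    total_in = 0.0   # running total score of images of `cluster_id` up to the current rank
    ratios = []
    for label, score in ranked:
        total_all += score
        if label == cluster_id:
            total_in += score
            # Ratio of in-cluster score to overall score at this cut-off rank.
            ratios.append(total_in / total_all if total_all > 0 else 0.0)
    return sum(ratios) / len(ratios) if ratios else 0.0


ranked = [("a", 90), ("b", 80), ("a", 70), ("b", 10)]
score_a = cluster_average_score(ranked, "a")
score_b = cluster_average_score(ranked, "b")
```

In this toy list, cluster "a" holds the images at ranks 1 and 3 and receives a higher average score than cluster "b", whose images sit lower in the rank list.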

The larger the view cluster size, the higher the view is scored.

Image Aesthetic Score Generation. On many social media websites such as Photo.net [13] and DPChallenge [14], most photos are rated by a number of professional photographers, so the ratings largely reflect the photo aesthetics. Due to the lack of GPS information, we did not use the data from these websites. However, we can expect their context information to become available in the near future due to user demand and advanced techniques. Given the rich information of Flickr [15], we generate image aesthetic scores based on several heuristic criteria.

• Ranking of interestingness. Interestingness-enabled search is provided by the Flickr API. As in [16], the interestingness rankings are based on the quantity of user-entered metadata such as tags, comments and annotations, the number of users who assigned metadata, user access patterns, and the lapse of time of the media objects. The top photos in the interestingness rankings are usually of high quality. Hence, we utilize this important information to generate the aesthetic scores. Because the lapse of time is considered, although the interestingness based ranking can reflect the photo quality, the ranking scheme tends to rank newly uploaded photos higher than old photos. To improve the rankings of those appealing but old photos, we strengthen the influence of the number of views and the number of favorites, considering the fact that these photos usually have high values in these two terms, even though the interestingness has subsumed the two terms.

• Number of favorites. The number of favorites explicitly shows how many people liked the image. Hence it is a straightforward reflection of the photo’s degree of appeal.

• Number of views. The number of views of the image can reflect the attention drawn from the social media community. Usually, high quality images tend to be viewed by more users. We also consider this parameter in aesthetic score generation to complement the number of favorites.

Therefore, by weakening the time fading effect of the interestingness rankings via highlighting the impact of the number of views and favorites, interestingness rankings are utilized to generate photo aesthetic scores by fusing the above three factors as follows:

(3)

where the inputs are the number of views and the number of favorites of the image, the rank of the image based on interestingness, and the total number of crawled images in the location scope. In this way, aesthetic scores ranging from 0 to 100 are generated, in which high quality photos are assigned high aesthetic scores. The parameters of (3) are set through empirical analysis.

3) View Specific Composition Learning: The view clusters obtained are expected to contain different content or to be captured with different perspectives from different location tiles. As illustrated in Fig. 4, different composition rules apply to those different view clusters. Therefore, view specific composition learning has to be performed to extract different composition principles for each view cluster.

The photographs of each cluster usually contain the same objects from similar perspectives but with different positions and scales in the frame. The difference is one key factor leading to the different aesthetic scores. To characterize the composition difference in terms of the main object placement for each cluster, the camera operations relative to the cluster iconic image are utilized to represent the main object composition as shown in Fig. 5. The camera operation is defined as horizontal translation, vertical translation, zoom in/out, and rotation as in [17]. Using the matched SIFT points, the coordinates in the given image can be represented by the affine model based on the coordinates of the corresponding matched points in the cluster iconic image as follows:

(4)

The parameters of the affine model can be calculated by the least squares method based on all matched SIFT points in the given image. Based on the affine parameters, the camera operations can be obtained by

(5)

where

(6)

The four terms of the camera operations represent the camera horizontal translation, vertical translation, zoom in/out degree, and rotation, respectively [18]. As shown in Fig. 5, the object composition in terms of scale and location can be captured by the camera operations relative to the view cluster iconic image.

In addition to the modeled main object, some other salient objects in the photos can also affect the image compositions. Therefore, the spatial distribution of saliency is also utilized to capture the composition. We divide the image into a 5 × 5 grid, and the average saliency value in each grid cell is calculated to form the vector representing the saliency spatial distribution. The saliency map is computed by the spectral residual approach [19]. Hence, the camera operation vector and the saliency spatial distribution are concatenated to capture the image composition as shown in Fig. 5.

Fig. 5. Illustration of view specific composition learning.

Note that when computing the composition features, the images are normalized to the same size, i.e., 640 × 426 or 426 × 640 for horizontal and vertical images, respectively, since most medium 640 version images we crawled from Flickr have an aspect ratio of 3:2 or 2:3, and the larger side is usually 640 pixels. In our experiment, we assume the photos captured by the mobile camera also have an aspect ratio of 3:2 or 2:3. When the aspect ratio of the mobile captured photos is different, training images with the given input aspect ratio can be crowdsourced from the social media websites and the system can work in the same way. For each view cluster, the composition models of horizontal images and vertical images are learned separately, since they may follow different composition rules when camera orientations are different, even though the objects and perspectives are the same.

To learn the composition rules for a view cluster, we treat composition learning as a two-class classification problem using an RBF kernel based Support Vector Machine (SVM) classifier [20]. In the cluster, the photos with aesthetic scores higher than an upper threshold and the ones with scores lower than a lower threshold are considered as high quality and low quality photos, respectively, where the thresholds are defined with respect to the median value of the aesthetic scores in the cluster. The photos with scores between the two thresholds are not utilized in the training process to overcome the quality ambiguity issue. The SVM classifier training leads to the learned hyperplane, which is able to separate the photos with good and bad compositions in the cluster. Afterwards, the image aesthetic score can be inferred by the rescaled sigmoid function based on the distance from the hyperplane to the given image by

(7)
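A logistic function scaled to the range 0 to 100 is one mapping of this kind: it returns 50 at zero distance and saturates at 0 and 100 for large negative and positive distances. The sketch below is a minimal illustration, not the exact form of (7); the function name and the `scale` parameter are assumptions introduced here.

```python
import math


def aesthetic_score(distance, scale=1.0):
    """Map a signed distance from the SVM hyperplane to a score in [0, 100].

    `scale` is a hypothetical steepness parameter; the exact rescaling used
    in (7) is not reproduced here.
    """
    z = scale * distance
    if z >= 0:
        return 100.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form to avoid overflow in exp().
    e = math.exp(z)
    return 100.0 * e / (1.0 + e)
```

Calling `aesthetic_score(0.0)` gives the midpoint score, and larger positive distances give monotonically higher scores.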

As the distance goes from 0 to negative infinity, the aesthetic score decreases from 50 to 0, while as the distance goes from 0 to positive infinity, the aesthetic score increases from 50 to 100. We employ five-fold cross validation to search for the optimal error parameter, tube width, and kernel parameter.

4) Exposure Metric Learning: Professional photographers have to jointly consider the shooting object and the setting, as well as the lighting condition influenced by various context factors, for exposure parameter adjustment. Therefore, we have to learn the exposure parameters by jointly considering the shooting content and various lighting related contexts.

Exposure Feature. The shooting content is one primary factor for exposure learning. Different hues are perceived with different light values; the colors of the objects and settings, as well as the contrast between them in the view enclosure, have direct influence on the exposure. As the images in the same view cluster usually contain the same objects from a similar capture angle, we utilize the cluster id to represent the shooting content feature. Moreover, a series of features are extracted to capture the contextual information related to the lighting conditions. Currently, our system incorporates the following temporal and weather contextual features, since these contexts have strong and direct influence on lighting and are easy to obtain or infer.

• Time of the day. The sunshine direction and luminance vary with the time of the day. For example, two professional photographs with the same shooting content captured at sunrise and at noon have different exposure parameter settings to fit the lighting difference. Here, considering the variation of sunrise and sunset time, we quantize the time period of the day into six bins: [sunrise time - 1 hr, sunrise time + 1 hr), [sunrise time + 1 hr, 11 am), [11 am, 2 pm), [2 pm, sunset time - 1 hr), [sunset time - 1 hr, sunset time + 1 hr), and [sunset time + 1 hr, sunrise time of the next day - 1 hr). The sunrise and sunset time of every day in the given location can be obtained from many weather record websites [21]. In this system, we obtain the historical sunrise and sunset time from [22] by specifying the geo-location and date for the crowdsourced photos.

• Month. The light intensity changes with season for a given geo-location. For example, the light intensity at noon is stronger in summer than in winter. Hence, month is also an important temporal factor for exposure.

• Weather condition. Lighting is clearly influenced by the weather conditions. For example, the light is stronger on sunny days than on cloudy days, which also directly affects the exposure. We define the possible values of the weather condition as: clear, cloudy, flurries, fog, overcast, rain, and snow. The historical hourly weather information is also obtained from [22] given the date, time and geo-location. In this system, the dataset is built by GPS enabled search from Flickr, and hence all photos are tagged with their GPS coordinates. In addition, the date and time can be found from the EXIF data of the photos.
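The six time-of-day bins listed above can be computed directly from the capture time and the day's sunrise and sunset times. The following sketch is illustrative: the function name and the 0 to 5 bin indexing are assumptions, and it presumes that sunrise + 1 hr falls before 11 am and that 2 pm falls before sunset - 1 hr, which may not hold at high latitudes.

```python
from datetime import datetime, time, timedelta


def time_of_day_bin(capture, sunrise, sunset):
    """Return the index (0..5) of the time-of-day bin for a capture time.

    All three arguments are datetime objects on the same calendar day.
    Bin 5 is the night bin; it also receives captures made before
    sunrise - 1 hr, which belong to the previous day's night interval.
    """
    one_hour = timedelta(hours=1)
    eleven_am = datetime.combine(capture.date(), time(11, 0))
    two_pm = datetime.combine(capture.date(), time(14, 0))
    edges = [sunrise - one_hour, sunrise + one_hour, eleven_am, two_pm,
             sunset - one_hour, sunset + one_hour]
    for index in range(5):
        # Half-open intervals [edge_i, edge_{i+1}), as in the bin definitions.
        if edges[index] <= capture < edges[index + 1]:
            return index
    return 5


sunrise = datetime(2012, 6, 1, 6, 30)
sunset = datetime(2012, 6, 1, 19, 45)
noon_bin = time_of_day_bin(datetime(2012, 6, 1, 12, 0), sunrise, sunset)
```

Here `noon_bin` falls in the [11 am, 2 pm) interval. The resulting index can be used as a categorical feature alongside month, weather condition, and cluster id.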

Exposure Feature Space Metric Learning. The exposure parameter setting is a complicated problem due to the various shooting content and lighting conditions affected by various contextual factors. Although we may identify all possible influencing factors, it is still quite difficult to formulate their different effects. A simple instance is that when taking photos at night, time of day may be a dominant factor rather than weather conditions, while at noon, the lighting difference between cloudy and clear weather conditions may have a greater influence than time on the exposure. With these context dimensions intertwined, it is hard to determine their influences on exposure. The exposure setting problem becomes even more challenging when taking into account the diversity of exposure for different shooting content.

Therefore, the proposed system models the exposure parameters by supervised distance metric learning, which generally aims to learn a transformation of the feature space to maximize the classification performance. Here, the hope is to find the transformation of the exposure feature space that rescales the feature dimensions according to their effects on the exposure parameter selections. Ideally, the learned transformation can project the photos with similar exposure parameters into clusters as illustrated in the exposure metric learning module of Fig. 4.

To perform distance metric learning for the exposure feature space, Large Margin Nearest Neighbor (LMNN) [23], a distance metric learner designed for the k-nearest neighbor (kNN) classifier, is utilized. It aims to ensure that the k nearest neighbors always belong to the same class while the samples from different classes are separated by a large margin. The approach maximizes the kNN classification performance without the unimodal distribution assumption for each class, which is a valuable property for fitting our model. For example, the exposure parameters for the same view cluster under similar weather conditions at sunrise and sunset may be the same, but fall into different clusters after exposure feature space transformation. In the metric learning step, the photos are represented in the exposure feature space with their exposure parameter values as labels.

Exposure Compensation, Aperture and ISO Learning. As different exposure parameters have different sensitivity and functionality with respect to the content and context dimensions, we have to learn the metrics for the parameters separately. For example, at night, photographers tend to increase the ISO values to overcome the issue of weak lighting, while besides the concerns about lighting, they tend to use a large aperture to reduce the depth of field when shooting single objects. A straightforward way is to model the exposure feature space distance metrics for the three most commonly used exposure parameters, i.e., aperture, ISO and exposure time, directly. Then, the exposure value can be calculated by

(8)

where EV and ET denote the exposure value and exposure time, respectively. However, this is not reasonable because, even though the separately predicted aperture, ISO and exposure time may each be applicable, it is not guaranteed that the resulting exposure value fits the given content and context conditions when they are put together.

To suggest exposure parameters with reasonable EV, there are two possible ways. One is to learn the distance metric of the exposure feature space by using the coupled aperture, ISO and ET triples as labels, but combining the large number of possible values of the three parameters makes the number of labels grow combinatorially, and hence degrades performance. The other way is to predict EV and any two of the three parameters. However, EV determination is complicated because it is sensitive to the layout of the intensity and colors in the shooting view and to the lighting conditions. Although camera manufacturers have made efforts for decades to estimate correct exposure values by improving light meters, it is still challenging to provide the correct exposure value automatically. Hence, predicting the exposure value directly according to shooting content and context is not practical. Empirical analysis also validates that the performance of direct exposure value prediction is not satisfactory.

Although camera light meters have indeed improved, professional photographers sometimes need to adjust the camera computed exposure value by increasing or decreasing certain compensation levels to achieve the desired EV. By setting the exposure compensation (EC), the camera light meter and the photographers’ knowledge are both utilized to obtain the ideal EV. Inspired by this, we adopt EC to achieve the optimal EV. We learn the distance metrics of the exposure feature space for EC. Similarly, metric learning of the exposure feature space is also performed for aperture and ISO, respectively. Once the optimal EV, aperture and ISO are obtained, ET can be calculated according to (8).
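As an illustration of recovering ET from the other three quantities, the sketch below assumes the common ISO 100-referenced convention EV = log2(N^2 / t) - log2(S / 100), where N is the f-number, t the exposure time in seconds, and S the ISO value. This convention and the function name are assumptions made here; the exact form of (8) may differ.

```python
def exposure_time(ev100, f_number, iso):
    """Exposure time in seconds for a given EV (referenced to ISO 100).

    Solves EV = log2(N^2 / t) - log2(S / 100) for t. The convention is an
    assumption for illustration and is not taken from (8).
    """
    return (f_number ** 2) * 100.0 / ((2.0 ** ev100) * iso)


# "Sunny 16" style check: EV 15 at f/16 and ISO 100 gives 1/128 s.
sunny = exposure_time(15, 16, 100)
```

Under this convention, doubling the ISO at fixed EV and aperture halves the exposure time, which is the lever used by the parameter adjustment step in the online stage.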

C. Online Mobile Photography Suggestion

Given the input wide-view image, view enclosure candidates at various positions and scales need to be generated for optimal view selection. As mentioned in Section IV-B-3, we assume the mobile captured photos have an aspect ratio of 2:3 or 3:2. We slide a window of the given aspect ratio over the image with a fixed moving step, starting from an initial window size for horizontal or vertical input images and enlarging it by a scaling ratio until the largest possible window size is reached, to generate all possible view enclosure candidates. Then, a number of view candidates will be discarded first based on view cluster rankings. The view enclosure with the best composition will be selected from the remaining ones, utilizing the offline learned view specific composition rules.

1) Relevant View Discovering: Once the view enclosure candidates are generated, their relevant view clusters containing the same content have to be discovered for judging their composition aesthetics. Using the image index built offline, the most relevant image can be efficiently retrieved for each of the enclosure candidates. We consider the view cluster to which the relevant image belongs as the relevant view cluster of the enclosure candidate. The visual word extraction only needs to be performed once on the input image, and the visual words of each candidate can be obtained based on the enclosure coordinates accordingly.

2) Low Ranking View Enclosure Discarding: It is difficult to decide automatically which part of a given input panoramic image to capture, especially when there are other objects besides the main objects. The ranking of the relevant view cluster can help decide which objects to capture to make the photo more appealing. In addition, exhaustive searching by assessing all view candidates using the learned composition rules is computationally costly and hence not applicable for real-time mobile processing. Taking advantage of the fact that some view enclosures containing certain portions of the scene have intrinsically better compositions than others, we carry out a pre-processing step to make the optimal view searching run efficiently. Once the relevant view cluster with similar content is found for each of the view enclosure candidates, utilizing the view cluster rankings already obtained offline, we only search through the candidates belonging to the highest ranked view cluster out of all relevant clusters. In this way, a large percentage of enclosure candidates belonging to relatively low ranked view clusters can be discarded first.

3) Optimal View Enclosure Searching: Once the view enclosure candidates belonging to low ranking view clusters are discarded, we perform optimal view searching through the remaining candidates belonging to the top ranked relevant cluster. The view candidates relevant to the same view cluster contain similar objects with slight variance in arrangement and scale. To make a view suggestion, the candidate with the highest predicted aesthetic score is considered as the optimal view enclosure. Therefore, we predict the aesthetic scores of the remaining candidates belonging to the top ranked relevant cluster using the learned view specific composition rules and suggest the most highly ranked one to mobile users.

4) Exposure Parameter Suggestion: In the offline exposure metric learning module, the system learns the distance metrics of the exposure feature space to model the diverse effects of the content and context dimensions on EC, aperture and ISO, respectively. Once the optimal view enclosure is obtained, the learned metrics are applied to the current content and contexts. In the transformed exposure feature space, the photos with similar exposure parameters are projected into clusters due to the local optimization property of LMNN designed for kNN. Hence, the simple kNN classifier, which makes the classification decision based on the most frequent label surrounding the input data in the transformed feature space learned by LMNN, is utilized to predict the optimal exposure parameters according to the content and contexts. This is consistent with the intuition that photos with similar content and contexts in the dominant dimensions should have similar exposure parameters. Hence, the nearest samples in the transformed exposure feature space can be utilized effectively to predict exposure parameters.

The flowchart of exposure parameter suggestion is shown in Fig. 6. Once the optimal view enclosure is obtained, the optimal EV fitting the view and context is estimated. To take full advantage of the camera light metering results of the input image and avoid the inconvenience of asking mobile users to manually capture the suggested view, we estimate the EV of the suggested view enclosure from the camera meter as follows:

(9)

where the two EV terms are the estimated EV of the suggested view enclosure and of the input panoramic image from the camera meter, respectively, and the two intensity terms are the median intensity of the suggested view enclosure and of the input panoramic image, respectively. Here, the remaining parameter is set to 2.2.

Fig. 6. The flowchart of exposure parameter suggestion.

Utilizing the suggested view content and context as described in Section IV-B-4, EC can be predicted and thus the correct EV for the suggested view enclosure can be calculated by

(10)

where the EC term is the predicted EC for the suggested view. In addition, with the view content and contexts, the corresponding aperture and ISO can also be predicted based on their learned metrics, respectively.

One problem with exposure parameter prediction is the possible conflict between the diverse capabilities of mobile cameras and the camera parameters of the crawled photos, some of which were taken by professional cameras. For example, the predicted aperture may not be supported by some mobile devices. In that case, all the training samples with the previously predicted label are temporarily removed and the next best parameter is predicted, until the prediction is supported by the current camera.

Once the correct EV, aperture and ISO are obtained, the corresponding ET can be calculated with equation (8). At that point, if the ET is larger than a threshold, the picture taken would probably be blurred due to hand jitter. If it is smaller than the threshold, the aperture, ISO and ET are suggested; otherwise the exposure parameters have to be adjusted under the same EV.

The exposure parameter adjustment process is illustrated in Fig. 7. It is inspired by the fact that aperture priority mode is usually adopted when capturing landscape scenes, which means that photographers adjust ISO or ET while keeping the optimal aperture value to achieve the optimal EV. Therefore, if the ET of the initially predicted exposure parameter set is too long, we keep the optimal EV and first adjust the predicted ISO value, within the ISO settings allowed by the current camera, by (11), which relates the current ISO value to the updated ISO value.

(11)

Fig. 7. The exposure parameter adjustment process.

Hence, ET can be reduced while maintaining the same EV. Then, if ET is below the threshold, the parameter adjustment stops and the updated parameter set is suggested; otherwise, further adjustment is needed. When updating the ISO value, we have to check the range of ISO values allowed by the current camera. Once the maximum ISO value is reached, we are only able to decrease ET by decreasing the aperture value using (12) under the same EV.

(12)

Similarly, when updating the aperture, the updated aperture has to be supported by the current camera. Hence, we can obtain the optimal set of exposure parameters fitting the suggested view and lighting contexts considering the mobile camera capabilities.

5) Online Mobile Photography Suggestion Computation: Given an input wide-view panoramic photo, with the view specific composition model and the exposure parameter metrics learned offline, the proposed socialized photography system can efficiently suggest the optimal view and proper camera parameters. Once the input photo and the associated GPS information are sent to the cloud server, the SIFT feature and saliency feature are generated. Later on, the view enclosure candidates can be generated by sliding windows and their features can be obtained quickly by simply cutting them out from the original image features. Then, with the indices of the photos in the location, the most relevant images of the candidates can be retrieved and thus their view clusters can be found in a parallel fashion. Due to the limited number of view clusters, the highest ranked relevant cluster can be obtained quickly. The aesthetic scores of the candidates belonging to the highest ranked relevant cluster are predicted in parallel with the offline learned composition model. Hence, the optimal view candidate can be suggested efficiently. Since the current weather information can be prefetched for the given location, the offline learned exposure parameter metrics can be employed to obtain suitable parameters with a simple operation. Usually the parameter adjustment can be finished within 0–3 iterations. Hence, the optimal view and proper camera parameters can be sent to the client quickly.
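The exposure parameter adjustment process (raise ISO first at a fixed aperture, then widen the aperture once the maximum ISO is reached, always at the same EV) can be sketched as a short loop. This is a simplified illustration: it steps through the camera's supported values rather than applying the update rules (11) and (12), it assumes the ISO 100-referenced relation EV = log2(N^2 / t) - log2(S / 100), and all names are hypothetical.

```python
def adjust_exposure(ev100, f_number, iso, iso_options, aperture_options, max_et):
    """Return (f_number, iso, exposure_time) with exposure_time < max_et if possible.

    iso_options / aperture_options: values supported by the camera (assumed inputs).
    EV is held constant throughout; only ISO and then aperture are changed.
    """
    def et(n, s):
        # Exposure time implied by the fixed EV for f-number n and ISO s.
        return n * n * 100.0 / ((2.0 ** ev100) * s)

    # Step 1: keep the predicted aperture and raise ISO through supported values.
    for s in sorted(v for v in iso_options if v >= iso):
        iso = s
        if et(f_number, iso) < max_et:
            return f_number, iso, et(f_number, iso)

    # Step 2: ISO is at its maximum; decrease the f-number (wider aperture).
    for n in sorted((v for v in aperture_options if v <= f_number), reverse=True):
        f_number = n
        if et(f_number, iso) < max_et:
            return f_number, iso, et(f_number, iso)

    # No supported setting meets the threshold; return the fastest one reached.
    return f_number, iso, et(f_number, iso)


result = adjust_exposure(10, 8, 100, [100, 200, 400, 800], [2.8, 4, 5.6, 8], 1 / 60)
```

In this example the aperture stays at f/8 and only the ISO is raised, because a supported ISO value already brings the exposure time under the threshold.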

V. EXPERIMENTS AND EVALUATIONS

To validate the proposed socialized mobile photographysystem, we performed objective evaluations on the systemcomponents and subjective evaluations on the view suggestionand exposure parameter suggestion results.

A. Dataset Building

Up to now, there is no publicly available dataset containing sufficient photos with context information for photo quality assessment. To guarantee sufficient photos with the required context information for the composition learning and exposure learning processes, we build our own dataset from eight well-known hot spot landmark locations. The proposed photography learning process, including view cluster discovering, composition learning and exposure parameter learning, is performed utilizing the crowdsourced data from these eight hot spot places: “Golden Gate Bridge,” “Taj Mahal,” “Sydney Opera House,” “Portland Head Light,” “Statue of Liberty,” “Jefferson Memorial,” “Eiffel Tower,” and “United States Capitol.” In each location, we use the Flickr API method flickr.photos.search to perform interestingness based search in descending order for photos within a radius of one kilometer by specifying the geo-location of the landmark. Hence, all photos collected are tagged with latitude and longitude. The capture date and time are obtained from the photo EXIF data. The weather information of all photos can be found from [22] by providing their capture date, hour and city. For each location, about 2,000–4,000 photos are obtained due to limits of the Flickr API. The number of views ranges from 0 to 61,397, and the number of favorites ranges from 0 to 455. The photos were captured at different times of day, on dates ranging from 2001 to 2012, and under various weather conditions: clear, cloudy, flurries, fog, overcast, rain, and snow. For each location, we randomly select 50 wide-view images for system performance evaluation, i.e., view suggestion evaluation and camera parameter suggestion evaluation, and the remaining photos serve as training and validation data for composition learning and camera parameter metric learning. We split the data into ten folds, and in the system component evaluation, i.e., composition learning accuracy and exposure parameter learning accuracy, ten-fold cross validation is carried out. In the system evaluation part, we utilize the composition model and camera parameter metrics learned in one randomly selected round.
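A geo-constrained, interestingness-sorted query of this kind can be expressed as a flickr.photos.search request. The sketch below only builds the request URL and makes no network call; the helper name, the placeholder API key, the landmark coordinates, and the particular `extras` and `per_page` values are illustrative assumptions rather than the settings actually used for the dataset.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, lat, lon, radius_km=1, page=1):
    """Build a flickr.photos.search URL for photos near a landmark.

    Results are sorted by interestingness, most interesting first.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,            # placeholder; a real key is required
        "lat": lat,
        "lon": lon,
        "radius": radius_km,
        "radius_units": "km",
        "sort": "interestingness-desc",
        "has_geo": 1,
        "extras": "geo,date_taken,views",
        "per_page": 250,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


# Approximate coordinates of the Golden Gate Bridge, for illustration only.
url = build_search_url("YOUR_API_KEY", 37.8199, -122.4783)
```

Paging through the results with increasing `page` values would then collect the per-location photo set, subject to the API's result limits.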

B. View Cluster Discovering Results

We carried out the proposed view cluster discovering approach as described in Section IV-B-1. After the content and geo-location based view clustering and local feature based cluster denoising, 11–33 clusters are obtained for each location. The numbers of clusters for the eight locations are listed in Table I. Example photos of the top three ranked clusters of four hot spot locations, using the view cluster ranking method described in Section IV-B-2, are shown in Fig. 8. Despite the existence of noisy images, the proposed view discovering approach indeed found meaningful view clusters in the locations.

C. Evaluation of Composition Suggestion

TABLE I. NUMBER OF CLUSTERS FOR EIGHT LOCATIONS

Fig. 8. The top three view clusters discovered of four locations: from top to bottom are “Golden Gate Bridge,” “United States Capitol,” “Taj Mahal,” and “Portland Head Light.”

1) Composition Learning Accuracy: To validate the proposed view-specific composition learning component, we implement the composition learning approach described in Section IV-B-3 for each discovered cluster and calculate the accuracy of the predicted aesthetic scores for the photos in each hot spot place. Through ten-fold cross validation, the average Mean Squared Error (MSE) of the predicted results for each location is shown in Table II. The minimum and maximum MSE are 288 and 375 for the eight locations, respectively, which means the root-mean-square prediction error ranges from 17.0 to 19.4. The variance of the image scores and the percentage reduction of the prediction error relative to the variance are also shown in Table II for each location. The variances range from 430 to 565. We can see that the prediction error has been reduced by between 14% and 19% relative to the variance.
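The step from MSE to the quoted error range is a square root, which puts the error on the same 0 to 100 scale as the aesthetic scores. The same conversion applied to the variance range gives the corresponding standard deviations; these extremes need not come from the same locations as the MSE extremes, so they are shown only for scale:

```latex
\sqrt{288} \approx 17.0, \qquad \sqrt{375} \approx 19.4
\qquad \text{(root-mean-square error)}

\sqrt{430} \approx 20.7, \qquad \sqrt{565} \approx 23.8
\qquad \text{(standard deviation of the scores)}
```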

TABLE II. MSE OF THE PREDICTED AESTHETIC SCORES FOR EIGHT LOCATIONS

Fig. 9. Example results of suggested views, in which each image is from one hot-spot location: (a) input wide-view image, (b) optimal relevant cluster iconic image, (c) suggested view.

2) Subjective Evaluation of View Suggestion: Due to the lackof a systematic evaluation function for measuring the photocomposition in terms of the users’ level of satisfaction, we con-duct user studies to evaluate the composition of the suggestedviews. Fifteen subjects including three females and with agesranging from 20 to 44 are invited to rate the view suggestion


Fig. 10. Results of suggested views, in which input images and suggested views of each hot spot landmark location are shown in one row. (b)(d)(f)(h) are suggestedviews of input wide-view images (a)(c)(e)(g), respectively.

results and the corresponding wide view input photos for comparison. Note that the input wide view images downloaded from Flickr were usually already carefully composed by the photographers. Three subjects are professional photographers, one of whom works in a photographic studio, and the other two are students majoring in photography. All of them have more than five years of photography experience with single lens reflex cameras. Twelve subjects are amateurs, who are not majoring in photography but have at least two years of experience with single lens reflex cameras. For each of the eight locations, five photos and the corresponding results are randomly selected for the user study. Hence, 40 image pairs are rated by the 15 subjects. Due to the page limit, for each location one input wide view photo and the corresponding view suggestion result, as well as the iconic image of the optimal relevant view cluster, are shown in Fig. 9. The remaining four input photos and their suggested views for each location are shown in Fig. 10. The subjects are asked to rate each photo from the following three perspectives:

1) Is the photo visually appealing in terms of focal length?

2) Is the photo visually appealing in terms of object placement?

3) What is the overall composition rating?

The subjects are asked to provide their ratings on the three

questions for each photo using 1 (very bad), 2 (bad), 3 (neutral), 4 (good) and 5 (very good). The subjective composition evaluation and the subjective camera parameter evaluation in Section V-D-2 took 1.5–2 hours for each photographer. To avoid inaccuracy in the subjective evaluation due to fatigue or boredom, no more samples were assessed.

The average ratings of the professional and the amateur photographers on the composition in terms of question (3) for the eight locations are shown in Fig. 13(a). The blue and green solid lines show the average professional ratings for the input and output composition, respectively. The red and black dashed lines show the average amateur ratings for the input and output composition, respectively. From the figure, we find that, for most landmark locations, the suggested compositions are rated better than the input compositions by both the professionals and the amateurs. In the cases of the Statue of Liberty and the Jefferson Memorial, the ratings of the suggested compositions are slightly worse than or similar to those of the input compositions, because the detection of distinctive salient points fails when the background is noisy or the foreground objects are too small. More robust composition features will be adopted in future work to overcome this issue.

In addition, the standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of question (3) on overall composition are shown in Fig. 14(a). The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively. The standard deviations of the ratings of the 40 photo pairs from the three professional photographers and the twelve amateur photographers are smaller than 1.24 and 1.85, respectively.

The rating distributions of the three professional photographers on questions (1), (2) and (3) are shown in Fig. 11(a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are shown in (d), (e) and (f). The black bars show the ratings of the input photos and the white ones show the ratings of the corresponding suggested views. From the figure, we can observe that although the rating distributions of the professional and amateur photographers are slightly different, the composition of the suggested views tends to be significantly improved compared with the input wide view photos in terms of focal length, object placement and overall composition.

Fig. 11. The rating distributions of the three professional photographers on questions (1), (2) and (3) are shown in (a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are shown in (d), (e) and (f). The black bars show the ratings of the input photos and the white ones show the ratings of the corresponding suggested views.

TABLE III
ERROR RATE OF THE PREDICTED EXPOSURE PARAMETERS
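The per-group averages and standard deviations reported for the composition ratings (Fig. 13(a) and Fig. 14(a)) can be reproduced from raw scores along the lines of the following sketch. The function names, the dictionary layout, and the example scores are illustrative assumptions, not data or code from the study. The text does not state whether the population or the sample standard deviation was used, so the sample form is assumed here.

```python
from statistics import mean, stdev


def summarize_ratings(ratings):
    """Return (mean, sample standard deviation) for a list of 1-5 ratings."""
    return mean(ratings), stdev(ratings)


def compare_groups(scores):
    """Summarize ratings keyed by (rater group, condition).

    `scores` maps a (group, condition) pair, e.g. ("professional", "input"),
    to the list of ratings that group gave for that condition.
    """
    return {key: summarize_ratings(values) for key, values in scores.items()}


# Placeholder ratings for one photo pair; NOT taken from the user study.
example_scores = {
    ("professional", "input"): [3, 4, 3],
    ("professional", "output"): [4, 4, 5],
}

summary = compare_groups(example_scores)
for (group, condition), (avg, sd) in summary.items():
    print(f"{group:12s} {condition:6s} mean={avg:.2f} std={sd:.2f}")
```

With only three professional raters, the sample standard deviation is sensitive to a single divergent score, which is worth keeping in mind when reading the professional curves.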

D. Evaluation of Camera Parameter Suggestion

1) Exposure Parameter Learning Accuracy: To validate the proposed exposure parameter learning component, we carried out metric learning for aperture, ISO and EC as described in

Fig. 12. The rating distributions of the three professional photographers on questions (1), (2) and (3) are shown in (a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are shown in (d), (e) and (f). The black bars show the ratings of the camera parameters of the input photos and the white ones show the ratings of the corresponding suggested camera parameters.

Fig. 13. The average ratings of the three professional photographers and the twelve amateur photographers for the input and output of the eight landmark locations in terms of (a) overall composition and (b) overall exposure parameters. The blue and green solid lines are the average professional ratings for input and output, respectively. The red and black dashed lines are the average amateur ratings for input and output, respectively.

Section IV-B-4 and calculated the average error rate of the predicted parameters through ten-fold cross-validation for each hot


Fig. 14. The standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of (a) overall composition and (b) overall exposure parameters. The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively.

spot place. The average error rate of the predicted results before parameter adjustment for each location is shown in Table III.

2) Subjective Evaluation of Exposure Parameter Suggestion: To evaluate the suggested exposure parameters, the 15 photographers are also invited to rate them for the suggested views of the same set of 40 photos. The subjects are required to answer the following three questions:

1) Do the camera parameters make the photo underexposed or overexposed?

2) Does the exposure time make the photo blurred when using mobile cameras?

3) Are the camera parameters reasonable from an overall perspective?

For question (1), they are asked to answer with (under-exposed), (proper exposure), and (overexposed). For question (2), they provide binary answers (yes) or (no). For question (3), they are asked to answer with 1 (not reasonable), 2 (reasonable, but can be improved) and 3 (perfect). When evaluating the camera parameters, the lighting-related contexts, i.e., month, time of the day, and weather conditions, are also presented to the photographers. The average ratings of the professional and amateur photographers on the reasonability of the input and output exposure parameters in terms of question (3) for the eight landmark locations are shown in Fig. 13(b), in which the blue and green solid lines show the ratings of the professional photographers on the input and output, respectively, and the red and black dashed lines show the ratings of the amateur photographers on the input and output, respectively.

TABLE IV
THE PREDICTED AND SUGGESTED EXPOSURE PARAMETERS OF THE SUGGESTED VIEWS AND CORRESPONDING INPUT IMAGES OF FIG. 9

We can observe that for most locations, the suggested parameters are rated better than or similar to the input parameters by both the professionals and the amateurs. In addition, the standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of question (3) on overall exposure parameters are shown in Fig. 14(b). The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively. The standard deviations of the ratings of the 40 photo pairs from the three professional photographers and the twelve amateur photographers are smaller than 0.94 and 1.15, respectively. The rating distributions of the three professional photographers on questions (1), (2) and (3) are shown in Fig. 12(a), (b), and (c), respectively, while the rating distributions of the twelve amateur photographers are shown in (d), (e), and (f). The black bars show the ratings of the camera parameters of the original photos, while the white bars show the ratings of the suggested camera parameters. From the evaluations of both professional and amateur photographers, we find that improper lighting, i.e., under-exposure and over-exposure, as well as blur are reduced via parameter suggestion. Overall, a large portion of the unreasonable exposure parameters of the input photos are corrected. Hence, the exposure parameters suggested by the proposed system significantly improve the quality of photos captured by mobile cameras.

Assuming the mobile camera aperture range and ISO range are [2.8, 22.6] and [100, 1600], respectively, the predicted exposure parameters and the suggested ones after parameter adjustment for the suggested views, as well as the original parameters of the input photos of Fig. 9, are shown in Table IV. The EV and EC are also listed for comparison.
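To make the role of the device ranges concrete, the sketch below clamps a predicted aperture and ISO to the ranges assumed above and rescales the exposure time so that the ISO-100-normalized exposure value is preserved. Only the two ranges come from the text. The compensation rule, the function names, and the use of the standard definition EV = log2(N^2 / t) - log2(ISO / 100) are assumptions made for illustration, and may differ from the adjustment procedure and EV definition used in Section IV.

```python
import math

# Device limits quoted in the text.
APERTURE_RANGE = (2.8, 22.6)
ISO_RANGE = (100, 1600)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def ev100(aperture, exposure_time, iso):
    """Exposure value normalized to ISO 100 (standard photographic definition)."""
    return math.log2(aperture ** 2 / exposure_time) - math.log2(iso / 100)


def adjust_to_device(aperture, exposure_time, iso):
    """Clamp aperture and ISO, then rescale exposure time to keep ev100 fixed.

    Hypothetical adjustment rule for illustration only.
    """
    target = ev100(aperture, exposure_time, iso)
    new_aperture = clamp(aperture, *APERTURE_RANGE)
    new_iso = clamp(iso, *ISO_RANGE)
    # Solve target = log2(N^2 / t) - log2(ISO / 100) for t.
    new_time = new_aperture ** 2 / (2 ** target * new_iso / 100)
    return new_aperture, new_time, new_iso


# A predicted setting outside the device range (f/1.8, ISO 50).
adjusted = adjust_to_device(1.8, 1 / 100, 50)
print(adjusted)
```

In-range predictions pass through unchanged, since clamping is then a no-op and the solved exposure time equals the original one up to floating-point rounding.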


VI. CONCLUSION AND FUTURE WORK

The contradiction between the popularity of photo capturing and sharing on mobile devices and the lack of photography knowledge and skills among most mobile users serves as the primary motivation of our work. We propose a socialized mobile photography system to assist mobile users in capturing high quality photos by using both the rich context available from mobile devices and crowdsourced social media on the Web. Considering the flexible and adaptive adoption of photography principles with different content and perspectives, view clusters are discovered and view-specific composition rules are learned from community-contributed images. Leveraging the user’s social context, the proposed socialized mobile photography system is able to suggest the optimal view enclosure to achieve appealing composition. Because exposure parameters depend on complex scene content and a number of shooting-related contexts, metric learning is applied to suggest appropriate camera parameters. Currently, we aim to solve the mobile photography problem for some hot spot landmark locations. Objective and subjective evaluations on the hot spot landmark photos validate the effectiveness of the proposed socialized mobile photography system.

There are several interesting directions for future work. Photography knowledge transfer from other similar geo-locations may solve the problem of insufficient data for photography learning and thus improve the performance of the socialized mobile photography system [36]. Moreover, in the current system, we can only search for the optimal view enclosure within the input wide-view photo. However, the optimal view may sometimes go beyond the scope of the input view. It is very challenging to quantitatively evaluate the effect of constraining the search to the initial wide-view input image. Such an evaluation is beyond the scope of this paper and will be considered in future work. Finding the optimal view enclosure beyond the input scope may be possible by building 3D models of the given location from crowdsourced images. Furthermore, people tend to capture consumer photos with mobile cameras in various scenes and events, as in [37]. Hence, we may extend our system to more scene types and events by analyzing and inferring human portraits and activities. Finally, a city-scale mobile photography suggestion system can be built by integrating automatic landmark recommendation [38] into the current photography suggestion system.

REFERENCES

[1] [Online]. Available: http://tinyurl.com/c8mp7jt

[2] C. Zhu, K. Li, Q. Lv, L. Shang, and R. P. Dick, “iScope: Personalized multi-modality image search for mobile devices,” in Proc. MOBISYS, 2009, pp. 277–290.

[3] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao, “Location discriminative vocabulary coding for mobile landmark search,” Int. J. Comput. Vision, vol. 96, no. 3, pp. 290–314, 2012.

[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Studying aesthetics in photographic images using a computational approach,” in Proc. ECCV, 2006, pp. 288–301.

[5] D. Joshi, R. Datta, Q.-T. Luong, E. Fedorovskaya, J. Z. Wang, J. Li, and J. Luo, “Aesthetics and emotions in images: A computational perspective,” IEEE Signal Process. Mag., vol. 28, no. 5, pp. 94–115, 2011.

[6] S. Bhattacharya, R. Sukthankar, and M. Shah, “A framework for photo-quality assessment and enhancement based on visual aesthetics,” in Proc. ACM Multimedia, 2010, pp. 271–280.

[7] M. Freeman, The Photographer’s Eye: Composition and Design for Better Digital Photos. Lewes, U.K.: Ilex Press, 2007.

[8] L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, “Optimizing photo composition,” Comput. Graph. Forum, vol. 29, no. 2, pp. 469–478, 2010.

[9] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. ICCV, 2003, pp. 1470–1477.

[10] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.

[11] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, “Spatial coding for large scale partial-duplicate web image search,” in Proc. ACM Multimedia, 2010, pp. 511–520.

[12] B. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.

[13] Photo.net [Online]. Available: http://photo.net/

[14] DPChallenge [Online]. Available: http://www.dpchallenge.com/

[15] Flickr [Online]. Available: http://www.flickr.com/

[16] D. S. Butterfield, C. Fake, C. J. Henderson-Begg, and S. Mourachov, “Interestingness Ranking of Media Objects,” US Patent Application 20060242139, 2006.

[17] J.-G. Kim, H. S. Chang, J. Kim, and H.-M. Kim, “Efficient camera motion characterization for MPEG video indexing,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2000, pp. 1171–1174.

[18] T. Mei, X.-S. Hua, C.-Z. Zhu, H.-Q. Zhou, and S. Li, “Home video visual quality assessment with spatiotemporal factors,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 6, pp. 699–706, Jun. 2007.

[19] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.

[20] C. C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, 2011.

[21] J. Zhuang, T. Mei, S. C. H. Hoi, Y.-Q. Xu, and S. Li, “When recommendation meets mobile: Contextual and personalized recommendation on the go,” in Proc. Ubicomp, 2011, pp. 153–162.

[22] wunderground.com [Online]. Available: http://www.wunderground.com/

[23] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009.

[24] Y. Ke, X. Tang, and F. Jing, “The design of high-level features for photo quality assessment,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 419–426.

[25] Y. Luo and X. Tang, “Photo and video quality evaluation: Focusing on the subject,” in Proc. ECCV, 2008, pp. 386–399.

[26] X. Sun, H. Yao, R. Ji, and S. Liu, “Photo assessment based on computational visual attention model,” in Proc. ACM Multimedia, 2009, pp. 541–544.

[27] S. Dhar, V. Ordonez, and T. L. Berg, “High level describable attributes for predicting aesthetics and interestingness,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 1657–1664.

[28] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka, “Assessing the aesthetic quality of photographs using generic image descriptors,” in Proc. ICCV, 2011, pp. 1784–1791.

[29] R. Datta and J. Z. Wang, “ACQUINE: Aesthetic quality inference engine - Real-time automatic rating of photo aesthetics,” in Proc. Multimedia Information Retrieval, 2010, pp. 421–424.

[30] L. Yao, P. Suryanarayan, M. Qiao, J. Z. Wang, and J. Li, “OSCAR: On-site composition and aesthetics feedback through exemplars for photographers,” Int. J. Comput. Vision, vol. 96, no. 3, pp. 353–383, 2012.

[31] B. Cheng, B. Ni, S. Yan, and Q. Tian, “Learning to photograph,” in Proc. ACM Multimedia, 2010, pp. 291–300.

[32] H.-H. Su, T.-W. Chen, C.-C. Kao, W. H. Hsu, and S.-Y. Chien, “Preference-aware view recommendation system for scenic photos based on bag-of-aesthetics-preserving features,” IEEE Trans. Multimedia, vol. 14, no. 3, pp. 833–843, Jun. 2012.

[33] D. M. Chen, G. Baatz, K. Koser, S. S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, and J. Bach, “City-scale landmark identification on mobile devices,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 737–744.

[34] S. Bourke, K. McCarthy, and B. Smyth, “The social camera: A case-study in contextual image recommendation,” in Proc. IUI, 2011, pp. 13–22.

[35] W. Yin, T. Mei, and C. W. Chen, “Crowd-sourced learning to photograph via mobile devices,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2012.

[36] W. Yin, T. Mei, and C. W. Chen, “Assessing photo quality with geo-context and crowdsourced photos,” in Proc. VCIP, 2012.

[37] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali, “Cluster-based landmark and event detection for tagged photo collections,” IEEE Multimedia, vol. 18, no. 1, pp. 52–63, Jan. 2011.

[38] M. D. Choudhury, M. Feldman, S. Amer-Yahia, N. Golbandi, R. Lempel, and C. Yu, “Automatic construction of travel itineraries using social breadcrumbs,” in Proc. 21st ACM Conf. Hypertext and Hypermedia, 2010, pp. 35–44.

[39] W. Yin, J. Luo, and C. W. Chen, “Event-based semantic image adaptation for user-centric mobile display devices,” IEEE Trans. Multimedia, vol. 13, no. 3, pp. 432–442, Jun. 2011.

Wenyuan Yin (S’10) received the B.E. degree from Nanjing University of Science and Technology in 2006. She is now pursuing the Ph.D. degree in the Department of Computer Science and Engineering, State University of New York at Buffalo. Her current research interests include image and video semantic understanding, media quality assessment, mobile media adaptation, video transcoding, image processing, machine learning and computer vision. She received the Best Student Paper Award at VCIP 2012.

Tao Mei (M’07–SM’11) is a Lead Researcher with Microsoft Research Asia, Beijing, China. He received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2001 and 2006, respectively. His current research interests include multimedia information retrieval and computer vision. He has authored or co-authored over 100 papers in journals and conferences, and holds eight U.S. granted patents. He was the recipient of several best paper awards, including the Best Paper Awards at ACM Multimedia in 2007 and 2009, and the IEEE Transactions on Multimedia Prize Paper Award 2013. He is an Associate Editor of Neurocomputing and the Journal of Multimedia.

Chang Wen Chen (F’04) is a Professor of Computer Science and Engineering at the State University of New York at Buffalo, USA. Previously, he was Allen S. Henry Endowed Chair Professor at Florida Institute of Technology from 2003 to 2007, a faculty member at the University of Missouri-Columbia from 1996 to 2003 and at the University of Rochester, Rochester, NY, from 1992 to 1996. He has served as the Editor-in-Chief for IEEE Transactions on Circuits and Systems for Video Technology from January 2006 to December 2009 and an Editor for Proceedings of IEEE, IEEE T-MM, IEEE JSAC, IEEE JETCAS and IEEE Multimedia Magazine. He and his students have received six (6) Best Paper Awards and have been placed among Best Paper Award finalists many times. He is a recipient of Sigma Xi Excellence in Graduate Research Mentoring Award in 2003, Alexander von Humboldt Research Award in 2009 and SUNY-Buffalo Exceptional Scholar–Sustained Achievements Award in 2012. He is an IEEE Fellow and an SPIE Fellow.

Shipeng Li (F’10) joined Microsoft Research Asia (MSRA) in May 1999. He is now a principal researcher and research manager of the Media Computing group. He also serves as the research area manager coordinating the multimedia research activities at MSRA. From October 1996 to May 1999, he was with the Multimedia Technology Laboratory at Sarnoff Corporation (formerly David Sarnoff Research Center and RCA Laboratories) as a member of the technical staff. He has been actively involved in research and development in broad multimedia areas. He has made several major contributions adopted by MPEG-4 and H.264 standards. He invented and developed the world’s first cost-effective high-quality legacy HDTV decoder in 1998. He started P2P streaming research at MSRA as early as August 2000. He led the building of the first working scalable video streaming prototype across the Pacific Ocean in 2001. He has been an advocate of scalable coding format and is instrumental in the SVC extension of the H.264/AVC standard. He first proposed the Media 2.0 concepts that outlined the new directions of next generation internet media research (2006). He has authored and coauthored more than 200 journal and conference papers and holds 90+ US patents in image/video processing, compression and communications, digital television, multimedia, and wireless communication.
