
Sentribute: Image Sentiment Analysis from a Mid-level Perspective

Jianbo Yuan, Department of Electrical & Computer Engineering, University of Rochester, Rochester, NY 14627, [email protected]

Quanzeng You, Department of Computer Science, University of Rochester, Rochester, NY 14627, [email protected]

Sean McDonough, Department of Electrical & Computer Engineering, University of Rochester, Rochester, NY 14627, [email protected]

Jiebo Luo, Department of Computer Science, University of Rochester, Rochester, NY 14627, [email protected]

ABSTRACT

Visual content analysis has always been important yet challenging. Thanks to the popularity of social networks, images have become a convenient carrier for information diffusion among online users. To understand the diffusion patterns and different aspects of social images, we need to interpret the images first. Similar to textual content, images also carry different levels of sentiment to their viewers. However, unlike text, where sentiment analysis can use easily accessible semantic and context information, how to extract and interpret the sentiment of an image remains quite challenging. In this paper, we propose an image sentiment prediction framework that leverages the mid-level attributes of an image to predict its sentiment. This makes the sentiment classification results more interpretable than directly using the low-level features of an image. To obtain better performance on images containing faces, we introduce eigenface-based facial expression detection as an additional mid-level attribute. An empirical study of the proposed framework shows improved performance in terms of prediction accuracy. More importantly, by inspecting the prediction results, we are able to discover interesting relationships between mid-level attributes and image sentiment.

Categories and Subject Descriptors

H.2.8 [Database management]: Database Applications; H.3.1 [Information Storage and Retrieval]: Content Analysis and Retrieval; I.5.4 [Pattern Recognition]: Applications

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WISDOM '13, August 11 2013, Chicago, USA
Copyright 2013 ACM 978-1-4503-2332-1/13/08 ...$15.00.

General Terms

Algorithms, Experimentation, Application

Keywords

Image sentiment analysis, Mid-level attributes, Visual content

1. INTRODUCTION

Nowadays, social networks such as Twitter and microblogs such as Weibo have become major platforms of information exchange and communication between users, on which the common information carrier is the tweet. A recent study shows that images constitute about 36 percent of all the shared links on Twitter1, which makes visual data mining an interesting and active area to explore. As an old saying has it, an image is worth a thousand words. Much like textual content based mining approaches, extensive studies have been done regarding aesthetics and emotions in images [3, 8, 28]. In this paper, we focus on sentiment analysis based on visual information.

So far, analysis of textual information has been well developed in areas including opinion mining [18, 20], human decision making [20], brand monitoring [9], stock market prediction [1], political voting forecasts [18, 25] and intelligence gathering [31]. Figure 1 shows an example of image tweets. In contrast, analysis of visual information covers areas such as image information retrieval [4, 33] and aesthetics grading [15], and its progress is relatively behind.

Social networks such as Twitter and microblogs such as Weibo provide billions of pieces of both textual and visual information, making it possible to detect the sentiment indicated by textual and visual data respectively. However, sentiment analysis from a visual perspective is still in its infancy. With respect to sentiment analysis, much work has been done on textual information based sentiment analysis [18, 20, 29], as well as online sentiment dictionaries [5, 24].

1http://socialtimes.com/is-the-status-update-dead-36-of-tweets-are-photos-infographic b103245#.UDLhTK9rHY8.wordpress

Figure 1: Selected images crawled from Twitter showing (left column) positive sentiment and (right column) negative sentiment.

Semantics and concept learning approaches [6, 19, 16, 22] based on visual features are another way of performing sentiment analysis without employing textual information. However, semantics and concept learning approaches are hampered by the limitations of object classifier accuracy. The analysis of aesthetics [3, 15], interestingness [8] and affect or emotions [10, 14, 17, 32] of images is most closely related to sentiment analysis based on visual content. For visual content based sentiment analysis, current approaches include employing low-level features [10, 11, 12], facial expression detection [27] and user intent [7]. Sentiment analysis approaches based on low-level features have the limitation of low interpretability, which in turn makes them undesirable for high-level use. Metadata of images is another source of information for high-level feature learning [2]. However, not all images contain such data. Therefore, we propose Sentribute, an image sentiment analysis algorithm based on mid-level features.

Compared to the state-of-the-art algorithms, our main contribution to this area is two-fold: first, we propose Sentribute, an image sentiment analysis algorithm based on 102 mid-level attributes, whose results are easier to interpret and ready to use for high-level understanding. Second, we introduce eigenfaces to facial sentiment recognition as a solution for sentiment analysis on images containing people. This is simple but powerful, especially in cases of extreme facial expressions, and contributes an 18% gain in accuracy over decision making based only on mid-level attributes, and 30% over state-of-the-art methods based on low-level features.

The remainder of this paper is organized as follows: in Section 2, we present an overview of our proposed Sentribute framework. Section 3 provides details of Sentribute, including low-level feature extraction, mid-level attribute generation, image sentiment prediction, and decision correction based on facial sentiment recognition. Then in Section 4, we test our algorithm on 810 images crawled from Twitter and make a comparison with state-of-the-art methods, which make predictions based on low-level features and textual information only. Finally, we summarize our findings and possible future extensions of our current work in Section 5.

2. FRAMEWORK OVERVIEW

Figure 2 presents our proposed Sentribute framework. The idea of this algorithm is as follows: first, we extract scene descriptor low-level features from the SUN Database [21] and use these four features to train classifiers with Liblinear for generating 102 predefined mid-level attributes, and then use these attributes to predict sentiments. Meanwhile, facial sentiments are predicted using eigenfaces. This approach performs particularly well in cases of strong positive and negative sentiments, which makes it possible to combine the two predictions and generate a better result for predicting the sentiment of images with faces. To illustrate how facial sentiment helps refine the prediction based on mid-level attributes only, we present an example in Section 4 of how to correct a false positive/negative prediction based on facial sentiment recognition.
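To make the flow in Figure 2 concrete, below is a minimal Python sketch of the decision flow, with the four stages passed in as functions. The stage functions and the zero fusion threshold are illustrative assumptions, not the paper's released implementation.

# Minimal, illustrative sketch of the Sentribute decision flow (Figure 2).
# The four stage functions are hypothetical stand-ins for the components
# described in Section 3, and the zero fusion threshold is an assumption.

def sentribute_predict(image, extract_features, predict_attributes,
                       attribute_sentiment, facial_sentiment):
    feats = extract_features(image)        # low-level scene descriptors
    attrs = predict_attributes(feats)      # 102 mid-level attribute scores
    score = attribute_sentiment(attrs)     # signed confidence from the attribute-based classifier
    score += facial_sentiment(image)       # +1/0/-1 per detected face, 0 if no face
    return "positive" if score >= 0 else "negative"

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    label = sentribute_predict(
        image=None,
        extract_features=lambda img: [0.0] * 512,
        predict_attributes=lambda f: [0.0] * 102,
        attribute_sentiment=lambda a: 0.3,
        facial_sentiment=lambda img: 1)
    print(label)   # -> positive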

3. SENTRIBUTE

In this section we outline the design and construction of the proposed Sentribute, a novel image sentiment prediction method based on mid-level attributes, together with a decision refinement mechanism for images containing people. For image sentiment analysis, the procedure consists of dataset construction, low-level feature selection, mid-level attribute classifier training, and image sentiment prediction. For facial sentiment recognition, we introduce eigenfaces.

3.1 Dataset

Our proposed algorithm mainly contains three steps. The first is to generate mid-level attribute labels. For this part, we train our classifiers using the SUN Database2, the first large-scale scene attribute database, initially designed for high-level scene understanding and fine-grained scene recognition [21]. This database includes more than 800 categories and 14,340 images, as well as discriminative attributes labeled through crowd-sourced human studies. Attribute labels are presented in the form of zero to three votes, where zero votes means an image is the least correlated with an attribute and three votes means the most correlated. Due to this voting mechanism, we have the option of selecting which set of images to label as positive: images with more than one vote, referred to as the soft decision (SD), or images with more than two votes, referred to as the hard decision (HD).
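As an illustration, the following snippet converts vote counts into binary SD/HD training labels; the exact thresholds ("more than one vote" for SD, "more than two votes" for HD) follow a literal reading of the text and should be treated as an assumption.

# Illustrative conversion of crowd-sourced attribute votes (0-3) into binary
# training labels. The thresholds follow the text literally ("more than one
# vote" for SD, "more than two votes" for HD) and are an assumption.
def attribute_label(votes, mode="SD"):
    if mode == "SD":      # soft decision
        return int(votes > 1)
    if mode == "HD":      # hard decision
        return int(votes > 2)
    raise ValueError("mode must be 'SD' or 'HD'")

# Example: an image with 2 of 3 votes is positive under SD but not under HD.
assert attribute_label(2, "SD") == 1 and attribute_label(2, "HD") == 0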

The second step of our algorithm is to train sentiment prediction classifiers with images crawled from Twitter, together with their textual data, covering more than 800 images.

2http://groups.csail.mit.edu/vision/SUN/

Figure 2: The Sentribute algorithm framework. Four scene descriptors (low-level features) are extracted from the visual content and used to generate the 102 mid-level attributes; sentiment prediction classifiers are trained on these attributes with asymmetric bagging. In parallel, images containing faces go through face detection and eigenface-based facial expression detection, and the two predictions are combined by decision fusion.

Twitter is currently one of the most popular microblog platforms. Sentiment ground truth is obtained from the visual sentiment ontology3 with permission of the authors. The dataset includes 1340 positive, 223 negative and 552 neutral image tweets. For testing, we randomly select 810 images, containing only positive (660 tweets) and negative (150 tweets) examples. Figure 1 shows images chosen from our dataset as well as their sentiment labels.

The final step is facial emotion detection for the decision fusion mechanism. We chose the Karolinska Directed Emotional Faces dataset [13] mainly because the faces are all well aligned with each other and have consistent lighting, which makes generating good eigenfaces much easier. The dataset contains 70 men and women photographed over two days expressing 7 emotions (scared, anger, disgust, happy, neutral, sad, and surprised) in five different poses (front, left profile, right profile, left angle, right angle).

3.2 Feature Selection

In this section, we aim to select low-level features for generating mid-level attributes, and we choose four general scene descriptors: the gist descriptor [17], HOG 2x2, self-similarity, and geometric context color histogram features [30]. These four features were chosen because they are each individually powerful and because they describe distinct visual phenomena from a scene perspective rather than relying on specific object classifiers. These scene descriptor features suffer neither from the inconsistent performance of commonly used object detectors for high-level semantic analysis of an image, nor from the difficulty of interpreting results generated from low-level features.
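As a rough illustration of this kind of feature extraction, the sketch below computes two stand-in descriptors (HOG and a coarse RGB color histogram) with scikit-image and NumPy; it omits the gist and self-similarity descriptors and is not the feature pipeline used in the paper.

# Illustrative extraction of two stand-in scene features (HOG and a coarse
# RGB color histogram) with scikit-image and NumPy. The gist and
# self-similarity descriptors are omitted; this is not the paper's pipeline.
import numpy as np
from skimage import color, io, transform
from skimage.feature import hog

def simple_scene_features(path):
    rgb = io.imread(path)                                   # assumes an RGB image
    rgb = transform.resize(rgb, (256, 256), anti_aliasing=True)

    # HOG over the grayscale image with 2x2 cell blocks.
    gray = color.rgb2gray(rgb)
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2), feature_vector=True)

    # Coarse per-channel color histogram (8 bins per channel), L1-normalized.
    hist = np.concatenate([np.histogram(rgb[..., c], bins=8, range=(0.0, 1.0))[0]
                           for c in range(3)]).astype(float)
    hist /= hist.sum()

    return np.concatenate([hog_vec, hist])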

3.3 Generating Mid-level Attributes

Given the selected low-level features, we are then able to train our mid-level attribute classifiers on the SUN Database. We have a sample space of 14,340 images and a feature space of over 170,000 dimensions. For the classifier, Liblinear4 outperforms LibSVM5 in cases where both the number of samples and the number of feature dimensions are huge. Therefore we choose the Liblinear toolbox to implement the SVM algorithm and save training time.
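A minimal sketch of training one attribute classifier with a liblinear-backed linear SVM (here via scikit-learn's LinearSVC, with toy data standing in for the real features and SD/HD labels):

# Sketch of training one mid-level attribute classifier with a linear SVM.
# scikit-learn's LinearSVC wraps liblinear; the random data below stands in
# for the real ~170,000-dimensional features and SD/HD labels.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))             # toy feature matrix
y = (X[:, 0] > 0).astype(int)                # toy labels for one attribute

clf = LinearSVC(C=1.0)                       # liblinear-backed linear SVM
clf.fit(X, y)
attribute_scores = clf.decision_function(X)  # real-valued attribute responses
print(attribute_scores[:5])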

The selection of mid-level attributes also plays an important part in image sentiment analysis. We choose 102 predefined mid-level attributes based on the following criteria: (1) decent detection accuracy, (2) potential correlation with a sentiment label, and (3) ease of interpretation.

3. http://visual-sentiment-ontology.appspot.com/
4. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
5. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We then select four types of mid-level attributes accordingly: (1) Material, such as metal and vegetation; (2) Function, such as playing and cooking; (3) Surface property, such as rusty and glossy; and (4) Spatial Envelope [17], such as natural, man-made, and enclosed.

We conduct mutual information analysis to discover the mid-level attributes that are most correlated with sentiments. For each mid-level attribute, we compute the MI value with respect to both the positive and negative sentiment categories (Figure 4). Table 1 lists the 10 most distinguishable mid-level attributes for predicting positive and negative labels, in descending order, based on both SD and HD. Figure 5 shows the Average Precision (AP) for the 102 attributes we selected, for both SD and HD. It is not surprising to see that attributes of material (flowers, trees, ice, still water), function (hiking, gaming, competing) and spatial envelope (natural light, congregating, aged/worn) all play an important role according to the result of the mutual information analysis.

Figure 4: Mutual information values computed for each of the 102 mid-level attributes with respect to the positive and negative sentiment labels (attribute index on the x-axis, MI value on the y-axis).

3.4 Image Sentiment Prediction

In our dataset we have 660 positive samples and 140 negative samples. A classifier trained on these samples alone is likely to be biased. Therefore we introduce asymmetric bagging [23] to deal with the unbalanced dataset. Figure 6 presents the idea of asymmetric bagging.


Figure 3: The images in the table above are grouped by the number of positive labels (votes) received from AMT workers. From left to right the visual presence of each attribute increases [21].

Figure 5: AP of the 102 attributes based on SD and HD (average precision for each mid-level attribute, soft decision vs. hard decision).

Table 1: Attributes with Top 10 Mutual Information.

Rank  Soft Decision    Hard Decision
1     congregating     railing
2     flowers          hiking
3     aged/worn        gaming
4     vinyl/linoleum   competing
5     still water      trees
6     natural light    metal
7     glossy           tiles
8     open area        direct sun/sunny
9     glass            aged/worn
10    ice              constructing

Instead of building one classifier, we build several classifiers and train each of them with the same negative samples together with a different subsample of positive samples of the same size. We then combine their results to build an overall unbiased classifier.
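A possible sketch of this asymmetric bagging scheme (positive tweets as the majority class, negative tweets as the minority class; the classifier choice and margin averaging are illustrative assumptions):

# Illustrative asymmetric bagging: every bag keeps all minority-class samples
# (negative tweets) and draws a different random subsample of the majority
# class (positive tweets) of the same size; bagged margins are averaged.
import numpy as np
from sklearn.svm import LinearSVC

def asymmetric_bagging_fit(X_pos, X_neg, n_bags=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_bags):
        idx = rng.choice(len(X_pos), size=len(X_neg), replace=False)
        X = np.vstack([X_pos[idx], X_neg])
        y = np.concatenate([np.ones(len(X_neg)), np.zeros(len(X_neg))])
        models.append(LinearSVC().fit(X, y))
    return models

def asymmetric_bagging_predict(models, X):
    # Average the signed margins of all bagged classifiers.
    scores = np.mean([m.decision_function(X) for m in models], axis=0)
    return (scores > 0).astype(int)          # 1 = positive, 0 = negative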

3.5 Facial Sentiment Recognition

Our proposed algorithm, Sentribute, includes a final decision fusion step that incorporates an eigenface-based emotion detection approach. Images containing faces constitute a large portion of the whole dataset.

Figure 6: Asymmetric bagging. A group of classifiers is trained, each with the same negative samples and a different subsample of the positive samples.

In fact, 382 images from our dataset contain faces. Therefore, facial emotion detection is not only useful but important for the overall performance of our algorithm.

In order to recognize emotions from faces we use classes of eigenfaces corresponding to different emotions. Eigenfaces were one of the earliest successful approaches to face recognition [26]; we modify the algorithm to make it suitable for detecting classes of emotions. Though this method is already widely appreciated, we are the first to modify the algorithm for detecting classes of emotions, and it is simple yet surprisingly powerful for detecting facial emotions on frontal, consistently lit faces. Note that we are not trying to propose an algorithm that outperforms the state-of-the-art facial emotion detection algorithms; this is beyond the scope of this paper.

There are seven principal emotions that humans experience: scared (afraid), anger, disgust, happy, neutral, sad, and surprised. Due to the accuracy of the model and the framework for integrating the results with Sentribute, we reduce the set of emotions to positive, neutral, and negative. This is done by classifying the image as one of the seven emotions and then mapping the happy and surprised emotions to positive sentiment, neutral to neutral sentiment, and all other emotions to negative sentiment. At a high level, we compute the eigenfaces for each class of emotion; we then compare the features of these eigenfaces to the features of the target image projected onto the emotion class space.
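The emotion-to-sentiment mapping can be written out directly; the dictionary below simply restates the mapping described above.

# Mapping from the seven emotion classes to the three sentiment categories,
# as described above.
EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprised": "positive",
    "neutral": "neutral",
    "scared": "negative", "anger": "negative",
    "disgust": "negative", "sad": "negative",
}
assert EMOTION_TO_SENTIMENT["surprised"] == "positive"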

The algorithm requires a set of faces to train the classifier (more specifically, to find the features of the images). We chose to use the Karolinska Directed Emotional Faces dataset [13] for many reasons, specifically that the faces are all well aligned with each other and have consistent lighting, which makes generating good eigenfaces much easier. The dataset contains 70 men and women photographed over two days expressing 7 emotions (scared, anger, disgust, happy, neutral, sad, and surprised) in five different poses (front, left profile, right profile, left angle, right angle). We use a subset of the KDEF database for our training set, only using the 7 frontal emotions from one photographing session.

Training and extracting the eigenfaces from the images of each emotion class is accomplished using principal component analysis. We preprocess the training data by running it through fdlibmex6, a fast facial detection algorithm, to obtain the position and size of each face. We then extract the face from the overall image, scale it to a 64 × 64 grayscale array, and vectorize it into a vector of length 4096. We concatenate the individual faces from each class into an M × N array X, where M is the length of each individual image vector and N is the number of images in the class. We are then able to find the eigenfaces using principal component analysis, which converts correlated variables, in our case a set of images, into a set of uncorrelated variables via an orthogonal transform. We implement principal component analysis by first computing the covariance matrix

C = (X − µ)(X − µ)^T,    (1)

where µ is the mean face, replicated to the same size as X. The eigenvectors of C are then calculated and arranged by decreasing eigenvalue. Only the twenty eigenvectors with the largest eigenvalues are chosen for each class of facial emotions. The principal eigenfaces are simply the eigenvectors of the system that have the largest eigenvalues. We compute the features of the class as shown below:

E_C = PCA(X_C),    F_C = E_C (X_C − µ_C).    (2)
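A minimal NumPy sketch of this per-class training step; using SVD instead of explicitly forming the covariance matrix of Eq. (1) is an implementation choice for the sketch, not the paper's code.

# Sketch of the per-class eigenface training step. X_c is a 4096 x N array
# whose columns are vectorized 64x64 face crops of one emotion class. SVD is
# used here instead of explicitly forming C = (X - mu)(X - mu)^T (Eq. 1);
# its left singular vectors are the eigenvectors of C.
import numpy as np

def train_emotion_class(X_c, n_eigenfaces=20):
    mu_c = X_c.mean(axis=1, keepdims=True)      # mean face of the class
    A = X_c - mu_c                              # centered faces
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    E_c = U[:, :n_eigenfaces].T                 # top eigenfaces, one per row
    F_c = E_c @ A                               # class features, Eq. (2)
    return mu_c, E_c, F_c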

To classify a target image, we first preprocess it in the same way as the training data; we denote the preprocessed face by y. The classification of a test face is performed by comparing the distance between the features of the target face (projected onto each emotion subspace) and the features of the eigenfaces of that subspace.

6http://www.mathworks.com/matlabcentral/fileexchange/20976

We then choose the class that minimizes this distance as the predicted class, specifically

arg min_C Σ_i ‖ E_C (y − µ_C) − F_C^i ‖,    (3)

where F_C^i denotes the i-th individual feature column vector in the array [26].

We then set a threshold value, determined empirically, in order to filter out results that are weakly classified; in such cases, no result is given. Figure 7 shows examples of classified facial emotions.
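Putting Eq. (3) and the threshold together, a sketch of the classification step might look as follows; the data layout of the trained classes and the rejection behavior are assumptions for illustration.

# Sketch of the classification step, Eq. (3): project the preprocessed target
# face y onto each emotion subspace, sum its distances to that class's
# training features, pick the class with the smallest total, and reject weak
# classifications via an empirically chosen threshold.
import numpy as np

def classify_face(y, classes, threshold):
    # `classes` maps emotion name -> (mu_c, E_c, F_c) as returned by training.
    best_label, best_dist = None, np.inf
    for label, (mu_c, E_c, F_c) in classes.items():
        proj = E_c @ (y.reshape(-1, 1) - mu_c)           # E_C (y - mu_C)
        dist = np.linalg.norm(proj - F_c, axis=0).sum()  # sum over feature columns i
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist < threshold else None  # None = no decision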

Figure 7: Examples of eigenface-based emotion detection ((a) classified as negative, (b) classified as positive, (c) classified as positive).

4. EXPERIMENTS

4.1 Image Sentiment Prediction

As mentioned before, state-of-the-art sentiment analysis approaches can mainly be categorized as (1) textual information based sentiment analysis, including online sentiment dictionaries [5, 24], and (2) sentiment analysis based on low-level features. Therefore, in this section, we set three baselines: (1) a low-level feature based approach, (2) a textual content based approach [29], and (3) the online sentiment dictionary SentiStrength [24].

4.1.1 Image Sentiment Classification Performance

First we demonstrate the results of our proposed algorithm, image sentiment prediction based on 102 mid-level attributes (SD vs. HD). Both Linear SVM and Logistic Regression are employed for comparison.

As shown in Table 2, precision outperforms recall for both Linear SVM and Logistic Regression. Owing to the use of asymmetric bagging, we are now able to classify negative samples at a decent detection rate. A smaller number of false positive samples and a relatively larger number of detected true positive samples contribute to this imbalance between precision and recall.

Table 2: Image Sentiment Prediction Performance.

                          Precision   Recall   Accuracy
Linear SVM          SD    82.6%       56.8%    55.2%
                    HD    86.7%       59.1%    61.4%
Logistic Regr       SD    84.3%       54.7%    54.8%
                    HD    88.1%       58.8%    61.2%

The next thing we are interested in is the comparisonagainst baseline algorithms.

4.1.2 Low-level Feature Based and Textual Content Based Baselines

For the low-level feature based algorithm, Ji et al. employed the following visual features: a Color Histogram extracted from the RGB color space, a 512-dimensional GIST descriptor [17], a 53-dimensional Local Binary Pattern (LBP), a Bag-of-Words quantized descriptor using a 1000-word dictionary with a 2-layer spatial pyramid, and a 2659-dimensional Classemes descriptor. Both Linear SVM and Logistic Regression are used for classification. For the textual content based algorithm, we choose Contextual Polarity, a phrase-level sentiment analysis system [29], as well as the SentiStrength API7. Table 3 reports the accuracy based on low-level features, mid-level attributes and textual content.

Table 3: Accuracy of Sentiment Prediction.

(a) Comparison between the low-level based algorithm and the mid-level based algorithm.

           SVM (low)   Logistic Regr (low)   SVM (mid)
Accuracy   50%         53%                   61.4%

(b) Comparison between the mid-level visual content based algorithm and the textual content based algorithms.

           Contextual Polarity   SentiStrength   SVM (mid)
Accuracy   61.7%                 61%             61.4%

4.2 Decision Fusion

The final step of Sentribute is decision fusion. By applying eigenface-based emotion detection, we are able to improve the performance of the decision based on mid-level attributes only. We only take into account images with complete faces under reasonable lighting conditions. Therefore, among all the images with faces, we first employ a face detection process and generate a set of 153 images as the testing data set for facial emotion detection and decision fusion. For each detected face, we assign a label indicating its sentiment: 1 for positive, 0 for neutral and -1 for negative. We then compute a sentiment score for each image as a whole. For instance, if we detect three faces in an image, two of which are detected as positive and one as neutral, then the overall facial sentiment score of this image is 2. These sentiment scores can be used for decision fusion with the decision made based on mid-level attributes only, i.e., we add the facial sentiment score to the confidence score returned by our mid-level attribute classifiers to implement the decision fusion mechanism. Table 4 shows the improvement in accuracy after decision fusion.
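A minimal sketch of this fusion rule (the zero decision threshold is an assumption for illustration):

# Sketch of the decision fusion rule: add the per-image facial sentiment
# score (+1 / 0 / -1 per detected face) to the signed confidence of the
# mid-level attribute classifier. The zero threshold is an assumption.
def fuse_decision(mid_level_confidence, face_sentiments):
    face_score = sum(face_sentiments)            # e.g. [+1, +1, 0] -> 2
    fused = mid_level_confidence + face_score
    return "positive" if fused >= 0 else "negative"

# Example from the text: two positive faces and one neutral face give a
# facial sentiment score of 2, which can flip a weakly negative mid-level
# prediction to positive.
print(fuse_decision(-0.5, [+1, +1, 0]))          # -> positive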

Figure 8 presents examples of TP, FP, TN, and FN samples generated by Sentribute. The falsely classified samples show that it is hard to distinguish images containing only text between the positive and negative labels, and to distinguish images of big events or celebrations (a football game or a concert) from those of protest demonstrations; they share similar general scene descriptors, similar lighting conditions, and similar color tones. Another interesting falsely detected sample is the first image shown among the false negative samples.

7http://sentistrength.wlv.ac.uk/

The figures make frowning expressions on their faces; however, the sentiment behind this expression is positive since they were meant to be funny. This sample is initially classified as positive based on mid-level attributes only, and is then refined to negative because two strong negative facial expressions are detected by our eigenface expression detector. This kind of image shows that a better decision fusion metric would be one of our potential improvements.

Table 4: Accuracy of the Sentribute Algorithm.

                               Accuracy
Mid-level Based Prediction     64.71%
Facial Emotion Detection       73.86%
Sentribute (After Synthesis)   82.35%

5. CONCLUSION

In this paper we have presented Sentribute, a novel image sentiment prediction algorithm based on mid-level attributes. An asymmetric bagging approach is employed to deal with the unbalanced dataset. To enhance prediction performance, we introduce an eigenface-based emotion detection algorithm, which is simple but powerful, especially for detecting extreme facial expressions, to handle images containing faces, and obtain a distinct gain in accuracy over results based on mid-level attributes only. Our proposed algorithm advances current visual content based sentiment analysis by employing mid-level attributes without using textual content. We are aware that this work is just one of many steps, and several potential directions are exciting to explore. First, this mid-level attribute based analysis of visual content can be applied to aesthetics analysis as well. Also, a combination of our approach with textual content sentiment analysis might be beneficial. Additionally, further applications of our proposed work include, but are not limited to, psychology, public opinion analysis and online activity emotion detection.

Acknowledgments

We thank Professor Shih-Fu Chang's group for providing us with the Columbia University data set for image sentiment analysis.

6. REFERENCES

[1] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

[2] E. Cambria and A. Hussain. Sentic album: content-, concept-, and context-based online personal photo management system. Cognitive Computation, 4(4):477–496, 2012.

[3] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Studying aesthetics in photographic images using a computational approach. In Computer Vision–ECCV 2006, pages 288–301. Springer, 2006.

[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5, 2008.

[5] A. Esuli and F. Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of LREC, volume 6, pages 417–422, 2006.

Figure 8: Examples of sentiment detection results by Sentribute: (a) true positive samples, (b) true negative samples, (c) false positive samples, (d) false negative samples.

[6] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.

[7] A. Hanjalic, C. Kofler, and M. Larson. Intent and its discontents: the user at the wheel of the online video search engine. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1239–1248. ACM, 2012.

[8] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 145–152. IEEE, 2011.

[9] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 60(11):2169–2188, 2009.

[10] J. Jia, S. Wu, X. Wang, P. Hu, L. Cai, and J. Tang. Can we understand van Gogh's mood?: learning to infer affects from images in social networks. In Proceedings of the 20th ACM International Conference on Multimedia, pages 857–860. ACM, 2012.

[11] P. J. Lang, M. M. Bradley, and B. N. Cuthbert. International affective picture system (IAPS): Technical manual and affective ratings, 1999.

[12] B. Li, S. Feng, W. Xiong, and W. Hu. Scaring or pleasing: exploit emotional impact of an image. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1365–1366. ACM, 2012.

[13] D. Lundqvist, A. Flykt, and A. Ohman. The Karolinska Directed Emotional Faces – KDEF. CD-ROM from Department of Clinical Neuroscience, Psychology Section, Karolinska Institutet, Stockholm, Sweden. Technical report, ISBN 91-630-7164-9, 1998.

[14] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of the International Conference on Multimedia, pages 83–92. ACM, 2010.

[15] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka. Assessing the aesthetic quality of photographs using generic image descriptors. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1784–1791. IEEE, 2011.

[16] M. R. Naphade, C.-Y. Lin, J. R. Smith, B. Tseng, and S. Basu. Learning to annotate video databases. In SPIE Conference on Storage and Retrieval on Media Databases, 2002.

[17] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[18] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the International AAAI Conference on Weblogs and Social Media, pages 122–129, 2010.

[19] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011.

[20] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

[21] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2751–2758. IEEE, 2012.

[22] C. G. Snoek and M. Worring. Concept-based video retrieval. Foundations and Trends in Information Retrieval, 2(4):215–322, 2008.

[23] D. Tao, X. Tang, X. Li, and X. Wu. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(7):1088–1099, 2006.

[24] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, 2010.

[25] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pages 178–185, 2010.

[26] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR'91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.

[27] V. Vonikakis and S. Winkler. Emotion-based sequence of family photos. In Proceedings of the 20th ACM International Conference on Multimedia, MM '12, pages 1371–1372, New York, NY, USA, 2012. ACM.

[28] W. Wang and Q. He. A survey on emotional semantic image retrieval. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, pages 117–120. IEEE, 2008.

[29] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics, 2005.

[30] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485–3492. IEEE, 2010.

[31] C. C. Yang and T. D. Ng. Terrorism and crime related weblog social network: Link, content analysis and information visualization. In Intelligence and Security Informatics, 2007 IEEE, pages 55–58. IEEE, 2007.

[32] V. Yanulevskaya, J. Uijlings, E. Bruni, A. Sartori, E. Zamboni, F. Bacci, D. Melcher, and N. Sebe. In the eye of the beholder: employing statistical analysis and eye tracking for analyzing abstract paintings. In Proceedings of the 20th ACM International Conference on Multimedia, MM '12, pages 349–358, New York, NY, USA, 2012. ACM.

[33] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. Query specific fusion for image retrieval. In Computer Vision–ECCV 2012, pages 660–673. Springer, 2012.