


IEICE TRANS. INF. & SYST., VOL.E99–D, NO.6 JUNE 2016

LETTER

Food Image Recognition Using Covariance of Convolutional Layer Feature Maps

Atsushi TATSUMA†a) and Masaki AONO†b), Members

SUMMARY Recent studies have obtained superior performance in image recognition tasks by using, as an image representation, the fully connected layer activations of Convolutional Neural Networks (CNN) trained with various kinds of images. However, the CNN representation is not very suitable for fine-grained image recognition tasks such as food image recognition. To improve the performance of the CNN representation in food image recognition, we propose a novel image representation comprised of the covariances of convolutional layer feature maps. In an experiment on the ETHZ Food-101 dataset, our method achieved 58.65% averaged accuracy, outperforming previous methods such as the Bag-of-Visual-Words Histogram, the Improved Fisher Vector, and CNN-SVM.
key words: food image recognition, convolutional neural networks, covariance descriptor, pattern recognition, deep learning

1. Introduction

Food image recognition is an important research topic for many applications related to eating habits [1], [2]. Recently, Convolutional Neural Networks (CNN) have obtained impressive performance in image recognition tasks [3]. Even on the food image recognition benchmark [4], CNN outperforms conventional state-of-the-art methods such as the Bag-of-Visual-Words Histogram (BoVW) [5] and the Improved Fisher Vectors (IFV) [6].

However, to obtain strong recognition performance, CNN needs to be trained on large-scale labeled image datasets. Moreover, CNN requires a GPU-based parallel system to train the networks in a practical computation time. As a solution to these problems, recent studies [7], [8] have shown that competitive or superior performance can be obtained by training simple classifiers, such as a linear Support Vector Machine (SVM), on the fully connected layer activations of a CNN trained with various kinds of images, used as the image representation. On one hand, these CNN representations require neither a large-scale image dataset nor network training. On the other hand, in fine-grained image recognition tasks such as food image recognition, it is difficult for the CNN representations to obtain good performance, because the networks have been trained with various images unrelated to the target domain. In this case, the CNN needs retraining of the networks with target-domain images, which is called fine-tuning.

Manuscript received October 5, 2015. Manuscript revised January 19, 2016. Manuscript publicized February 23, 2016.
†The authors are with the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1587/transinf.2015EDL8212

Meanwhile, it is recognized that the lower convolutional layers of a CNN tend to capture low-level image features such as oriented edges or corners [9]. These are less affected by the domain of the training data than the fully connected layers. We assume that these convolutional layer activations, which are called feature maps, also contain useful image features. In this paper, we propose a new image representation composed of the covariances of convolutional layer feature maps. By extracting image representations from convolutional layers, we aim to improve the recognition performance of the CNN representation without fine-tuning the networks. Experimental results on the ETHZ Food-101 dataset [4] show that our method obtains superior recognition performance compared to previous methods including BoVW, IFV, and CNN-SVM.

2. Related Work

CNN has achieved superior recognition performance across various image recognition tasks [3], [4]. In comparison, IFV [6], a kind of hand-crafted image representation, has achieved performance competitive with CNN in fine-grained image recognition tasks such as food image recognition [10], [11]. IFV represents an image by the distribution of low-level image features, using higher-order statistics of local image descriptors such as SIFT and SURF. In addition, the covariance descriptor uses the statistics of local features as the image representation [12]–[14]. The covariance descriptor calculates the covariance matrix of local features such as pixel location, intensity derivatives, and edge orientation. Unlike CNN or IFV, the covariance descriptor does not require training data to extract the image representation.

CNN requires a large-scale image dataset and a GPU-based parallel system in order to train the networks. To address this problem, several studies [7], [8] take the fully connected layer activations of a CNN trained with various kinds of images as the image representation. The fully connected layer activations have sufficient representational power compared with hand-crafted features, since they capture semantic or high-level image features [7]. However, for recognition of images unrelated to the domain of the training data, the fully connected layer activations are inadequate. In this situation, the CNN needs additional training of the networks with substantial image data related to the target domain.

Copyright © 2016 The Institute of Electronics, Information and Communication Engineers


Fig. 1 An overview of our feature maps covariance descriptor. An image is represented by the covariance matrix of convolutional layer feature maps.

In contrast, since the lower convolutional layer feature maps capture low-level image features [9], they are insensitive to the domain of the training data.

In this paper, we consider extracting the covariance descriptor from the convolutional layers of a trained CNN. Our method has the advantage of being able to extract effective image representations without additional network training, even if the images are unrelated to the domain of the training data of the trained CNN.

3. Covariances of Convolutional Layer Feature Maps

The lower convolutional layers tend to capture low-level image features, which are less affected by the contents of the training data. Meanwhile, the covariance descriptor, which extracts an image representation by calculating the covariance matrix of local features, has obtained superior performance across several image recognition tasks [12]–[14]. To improve recognition performance without fine-tuning the networks, we calculate the covariance descriptor over the feature maps of the lower convolutional layers of a trained CNN. Hereinafter, we call our method the Feature Maps Covariance Descriptor (FMCD).

An overview of FMCD is shown in Fig. 1. We consider the d feature maps of size w × h output from the l-th layer to be d-dimensional local features at n = w × h points. Let F = [f_1, \ldots, f_n] \in \mathbb{R}^{d \times n} denote the set of local features. To obtain a representation of an image, we calculate the covariance matrix of the local features

C = \frac{1}{n - 1} \sum_{i=1}^{n} (f_i - m)(f_i - m)^\top,

where m is the mean of the local features. The covariance matrix C is a symmetric matrix. Its diagonal elements represent the variance of each feature, and its off-diagonal elements represent their respective correlations.
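As a concrete illustration, the following is a minimal NumPy sketch of this computation; the function name and the (d, h, w) array layout are our assumptions for illustration, not the authors' code.

```python
import numpy as np

def feature_map_covariance(feature_maps):
    # feature_maps: (d, h, w) array -- d feature maps of size h x w from one
    # convolutional layer, viewed as d-dimensional local features at
    # n = h * w spatial positions.
    d = feature_maps.shape[0]
    F = feature_maps.reshape(d, -1)        # local feature matrix, shape (d, n)
    n = F.shape[1]
    m = F.mean(axis=1, keepdims=True)      # mean local feature m
    Fc = F - m
    return Fc @ Fc.T / (n - 1)             # d x d covariance matrix C
```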

The covariance matrix does not lie in a Euclidean space, but on the Riemannian manifold of symmetric positive semi-definite matrices. Since many machine learning algorithms operate in Euclidean space, they are not directly applicable to covariance matrices. To solve this problem, Pennec et al. [15] proposed a method that maps covariance matrices onto points in a Euclidean space.

The mapping method first projects the covariance matrix onto the Euclidean space that is tangent to the Riemannian manifold at the tangency point P. The projected vector y of the covariance matrix C is given by

y = \log_P(C) = P^{1/2} \log\left(P^{-1/2} C P^{-1/2}\right) P^{1/2},

where \log(\cdot) is the matrix logarithm operator. Let A = U D U^\top be the eigenvalue decomposition of a symmetric matrix. The matrix logarithm operator is then given by

\log(A) = U \log(D) U^\top.
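In code, this amounts to an eigendecomposition followed by taking the logarithm of the eigenvalues. Below is a sketch; the epsilon floor on near-zero eigenvalues is our own safeguard (a covariance matrix may be only positive semi-definite), not part of the paper.

```python
import numpy as np

def matrix_log(A, eps=1e-12):
    # Matrix logarithm of a symmetric matrix A = U D U^T:
    # log(A) = U log(D) U^T.
    eigvals, U = np.linalg.eigh(A)         # eigendecomposition of symmetric A
    eigvals = np.maximum(eigvals, eps)     # floor tiny eigenvalues (our guard)
    return (U * np.log(eigvals)) @ U.T     # equals U @ diag(log D) @ U^T
```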

Then, the mapping method extracts the orthonormal coordinates of the projected vector, which are given by the following vector operator:

\operatorname{vec}_P(y) = \operatorname{vec}_I\left(P^{-1/2} y P^{-1/2}\right),

where I is the identity matrix, and the vector operator at the identity is defined as

\operatorname{vec}_I(y) = \left[\, y_{1,1},\ \sqrt{2}\, y_{1,2},\ \sqrt{2}\, y_{1,3},\ \ldots,\ y_{2,2},\ \sqrt{2}\, y_{2,3},\ \ldots,\ y_{d,d} \,\right]^\top.

From the viewpoint of computational complexity, the identity matrix is generally chosen for P [13], [14]. This reduces the projection step to a standard matrix logarithm calculation. As a result, the vectorized covariance matrix is given by

c = \operatorname{vec}_I(\log(C)).

We finally obtain the image representation by normalizing the vector c with signed square rooting [16] followed by ℓ2 normalization. Since the covariance matrix is of size d × d, the image representation has (d^2 + d)/2 dimensions.
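Putting the pieces together, a sketch of the full FMCD vectorization might look as follows; it reuses the matrix_log helper from the previous sketch, and the row-major upper-triangle ordering matches the definition of vec_I above.

```python
import numpy as np

def fmcd_vector(C):
    Y = matrix_log(C)                      # map C onto the tangent space at I
    d = Y.shape[0]
    rows, cols = np.triu_indices(d)        # row-major upper triangle of Y
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    c = weights * Y[rows, cols]            # vec_I: off-diagonals scaled by sqrt(2)
    c = np.sign(c) * np.sqrt(np.abs(c))    # signed square rooting [16]
    return c / np.linalg.norm(c)           # l2 normalization; length (d^2 + d)/2
```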

4. Experiments

We evaluate the recognition performance of the proposed method on a food image dataset and compare it with state-of-the-art methods.


Fig. 2 Classification accuracies for the selected 20 classes on the ETHZ Food-101 dataset.

4.1 Dataset and Setup

In the experiments, as a food image dataset, we use the ETHZ Food-101 dataset (Food-101) [4], which contains 101,000 food images classified into 101 classes. Each class includes 750 training images and 250 test images.

In our proposed method, we use OverFeat [17] as the trained CNN. OverFeat is trained on the ImageNet 2012 training dataset [18], which comprises 1.2 million images classified into 1,000 classes. OverFeat provides two kinds of network structures, called the accurate network and the fast network. In our experiments, we choose the accurate network. For the convolutional layer feature maps, we use the activations of the 1st (L1) and 2nd (L2) convolutional layers. The 1st convolutional layer extracts 96 feature maps of size 36×36 units, and the 2nd convolutional layer extracts 256 feature maps of size 15×15 units.
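With these settings, the descriptor dimensionality from Sect. 3, (d^2 + d)/2, works out to (96^2 + 96)/2 = 4,656 dimensions for FMCD-L1 and (256^2 + 256)/2 = 32,896 dimensions for FMCD-L2.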

For the multi-class classifier, we use a one-vs-rest linear SVM, choosing LIBLINEAR [19] as its implementation. We set the parameter C to 1.0. Note that we do not augment the training data by adding cropped and rotated images [8].
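As a rough sketch of this classification setup, scikit-learn's LinearSVC (which wraps the LIBLINEAR solver and uses a one-vs-rest scheme for multi-class problems) could stand in for a direct LIBLINEAR call; the random data below is purely illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 4656))  # stand-in FMCD-L1 vectors
y_train = rng.integers(0, 101, size=500)    # stand-in Food-101 class labels

# One-vs-rest linear SVM with C = 1.0, backed by LIBLINEAR.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
```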

We compare against the following methods: Bag-of-Visual-Words Histogram (BoVW) [5], Improved Fisher Vectors (IFV) [6], Mid-Level Discriminative Superpixels (MLDS) [20], Random Forest Discriminant Components (RFDC) [4], Deep Convolutional Neural Networks (DCNN) [3], and CNN-SVM [8]. BoVW and IFV are widely used as baseline methods in image classification and recognition tasks. MLDS and RFDC capture local features by decomposing an image into discriminative regions; they obtained superior performance on the Food-101 dataset among hand-crafted feature approaches. DCNN trains the network structure proposed by Krizhevsky et al. [3] on the Food-101 dataset. CNN-SVM classifies images using a linear SVM learned on the ℓ2-normalized fully connected layer activations of a trained CNN; in general, the OverFeat network is used as the trained CNN.

Table 1 Performance comparison on the ETHZ Food-101 dataset. The best performance is marked.

Methods                                  Accuracy
BoVW [5]                                 28.51%
IFV [6]                                  38.88%
MLDS [20]                                42.63%
RFDC [4]                                 50.76%
CNN-SVM [8]                              44.82%
DCNN (trained with Food-101) [3], [4]    56.40%
FMCD-L1                                  46.14%
FMCD-L2                                  49.84%
FMCD-L1+FMCD-L2                          55.41%
FMCD-L1+FUL                              54.61%
FMCD-L2+FUL                              55.57%
FMCD-L1+FMCD-L2+FUL                      58.65% (best)

We measure classification performance with averaged accuracy.

4.2 Performance Comparison

Table 1 lists the classification accuracy of each method on the Food-101 dataset. All of our proposed methods outperform the baseline methods, including BoVW and IFV. Additionally, FMCD-L1 and FMCD-L2 outperform CNN-SVM, which uses the ℓ2-normalized fully connected layer activations as the image representation. These experimental results show that the convolutional layer feature maps contain discriminative features, as do the fully connected layer activations.

Furthermore, FMCD-L1+FUL and FMCD-L2+FUL, which denote the methods combining each convolutional layer FMCD with the ℓ2-normalized fully connected layer activations (FUL) by concatenating the features, outperform MLDS and RFDC. We obtain similar performance by concatenating FMCD-L1 with FMCD-L2, denoted FMCD-L1+FMCD-L2. From these results, we find that the proposed methods obtain comparable or much better recognition performance than the hand-crafted methods.
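The paper describes these combinations only as feature concatenation; a one-line sketch, with hypothetical argument names, is:

```python
import numpy as np

def combined_representation(fmcd_l1, fmcd_l2, ful):
    # FMCD-L1+FMCD-L2+FUL: concatenate the two per-layer FMCD vectors with
    # the l2-normalized fully connected layer activations (FUL).
    return np.concatenate([fmcd_l1, fmcd_l2, ful])
```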

Lastly, FMCD-L1+FMCD-L2+FUL, which denotes the method concatenating FMCD-L1, FMCD-L2, and the ℓ2-normalized fully connected layer activations, exceeds DCNN and achieves the best performance. This result suggests that even if the networks have been trained with various kinds of images, we can obtain superior recognition performance by using not only the fully connected layer activations but also the covariances of convolutional layer feature maps as the image representation. We believe the proposed method is effective when the target domain dataset and computing environment are insufficient to train a CNN.

Fig. 3 Examples of recognition results. The bold word represents the correctly predicted label.

Figure 2 illustrates the classification accuracies for 20 classes from the Food-101 dataset, selected as the classes on which FMCD-L1 obtained high recognition performance. For many classes, the proposed methods outperform the conventional methods. Additionally, Fig. 3 shows examples of the recognition results. The experimental results indicate that the covariance of convolutional layer feature maps is an effective image representation.

5. Conclusion

In this paper, we proposed a new food image recognition method that uses the covariances of convolutional layer feature maps of a trained CNN as image representations. In experiments on the ETHZ Food-101 dataset, our method outperforms state-of-the-art methods including CNN-SVM, which classifies images using a linear SVM learned on the fully connected layer activations of a trained CNN. The results demonstrate that convolutional layer feature maps contain features as effective as the fully connected layer activations. To further verify the results, future work includes experiments with other food image datasets [2], [21] and comparison with other convolutional layer based methods [22], [23].

Acknowledgments

We would like to thank an anonymous reviewer for their insightful comments, which helped to improve the paper. This work was supported by the Kayamori Foundation of Informational Science Advancement and JSPS KAKENHI Grant Numbers 26280038, 15K12027, and 15K15992.

References

[1] K. Aizawa, "Multimedia FoodLog: Diverse applications from self-monitoring to social contributions," ITE Trans. Media Technology and Applications, vol.1, no.3, pp.214–219, 2013.

[2] Y. Kawano and K. Yanai, "FoodCam: A real-time food recognition system on a smartphone," Multimedia Tools and Applications, vol.74, no.14, pp.5263–5287, 2015.

[3] A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, NIPS'12, vol.25, pp.1097–1105, 2012.

[4] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," Proc. 13th European Conference on Computer Vision, ECCV'14, pp.446–461, 2014.

[5] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," Proc. 2006 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'06, vol.2, pp.2169–2178, 2006.

[6] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," International Journal of Computer Vision, vol.105, no.3, pp.222–245, 2013.

[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," Proc. 31st International Conference on Machine Learning, ICML'14, pp.647–655, 2014.

[8] A.S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," Proc. 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW'14, pp.512–519, 2014.

[9] M.D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Proc. 13th European Conference on Computer Vision, ECCV'14, pp.818–833, 2014.

[10] Y. Kawano and K. Yanai, "Food image recognition with deep convolutional features," Proc. 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp'14, pp.589–593, 2014.

[11] P.-H. Gosselin, N. Murray, H. Jegou, and F. Perronnin, "Revisiting the Fisher vector for fine-grained classification," Pattern Recognition Letters, vol.49, pp.92–98, Nov. 2014.

[12] O. Tuzel, F. Porikli, and P. Meer, "Region covariance: A fast descriptor for detection and classification," Proc. 9th European Conference on Computer Vision, ECCV'06, vol.3952, pp.589–600, 2006.

[13] D. Tosato, M. Spera, M. Cristani, and V. Murino, "Characterizing humans on Riemannian manifolds," IEEE Trans. Pattern Analysis and Machine Intelligence, vol.35, no.8, pp.1972–1984, Aug. 2013.

[14] G. Serra, C. Grana, M. Manfredi, and R. Cucchiara, "Covariance of covariance features for image classification," Proc. International Conference on Multimedia Retrieval, ICMR'14, pp.411–414, 2014.

[15] X. Pennec, P. Fillard, and N. Ayache, "A Riemannian framework for tensor computing," International Journal of Computer Vision, vol.66, no.1, pp.41–66, 2006.

[16] H. Jegou and O. Chum, "Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening," Proc. 12th European Conference on Computer Vision, ECCV'12, vol.7573, pp.774–787, 2012.


[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," CoRR, vol.abs/1312.6229, 2013.

[18] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," Proc. 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'09, pp.248–255, 2009.

[19] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin, "LIBLINEAR: A library for large linear classification," J. Machine Learning Research, vol.9, pp.1871–1874, June 2008.

[20] S. Singh, A. Gupta, and A.A. Efros, "Unsupervised discovery of mid-level discriminative patches," Proc. 12th European Conference on Computer Vision, ECCV'12, vol.7573, pp.73–86, 2012.

[21] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, "Recipe recognition with large multimodal food dataset," Proc. 2015 IEEE Conference on Multimedia & Expo Workshops, ICMEW'15, pp.1–6, June 2015.

[22] W. Zou, X. Wang, M. Sun, and Y. Lin, "Generic object detection with dense neural patterns and regionlets," Proc. 25th British Machine Vision Conference, BMVC'14, pp.1–11, 2014.

[23] L. Liu, C. Shen, and A. van den Hengel, "The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification," Proc. 2015 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'15, June 2015.