
Compact Descriptors for Video Analysis: the Emerging MPEG Standard

Ling-Yu Duan, Vijay Chandrasekhar, Shiqi Wang, Yihang Lou, Jie Lin, Yan Bai, Tiejun Huang, Alex Chichung Kot, Fellow, IEEE, and Wen Gao, Fellow, IEEE


Abstract—This paper provides an overview of the ongoing Compact Descriptors for Video Analysis (CDVA) standard from the ISO/IEC Moving Picture Experts Group (MPEG). MPEG-CDVA aims to define a standardized bitstream syntax that enables interoperability in the context of video analysis applications. During the development of MPEG-CDVA, a series of techniques have been proposed to reduce descriptor size and improve video representation capability. This article describes the standard under development and reports the performance of its key technical contributions.

1 INTRODUCTION

Over the past decade, there has been an exponential increase in the demand for video analysis, which refers to the capability of automatically analyzing video content for event detection, visual search, tracking, classification, etc. Generally speaking, a wide variety of applications can benefit from automatic video analysis, including mobile augmented reality (MAR), automotive, smart city, and media entertainment. For instance, MAR requires real-time object recognition and tracking for accurate virtual object registration. In automotive applications, robust object detection and recognition are highly desirable for collision and cross-traffic warning. The increasing proliferation of surveillance systems is also driving the development of object detection, classification, and visual search technologies. Moreover, a series of new challenges have emerged in media entertainment, such as interactive advertising, video indexing, and near-duplicate detection, all of which rely on robust and efficient video analysis algorithms.

Deploying video analysis functionalities in real application scenarios presents a unique set of challenges [1]. Since automatic video analysis tasks are typically performed at a central server, efficient transmission of the visual data over a bandwidth-constrained network is highly desired [2], [3]. The straightforward approach is to encode the video sequences and transmit the compressed visual data over the network, so that features can be extracted from the decoded videos for analysis. However, this may create a high volume of data due to the pixel-level representation of the video texture.

Ling-Yu Duan and Vijay Chandrasekhar are joint first authors.

Figure 1. Potential application scenarios of CDVA.

One could imagine the 470,000 closed-circuit television (CCTV) cameras deployed for video acquisition in Beijing, China. Assuming that each video stream requires 2.5 Mbps of bandwidth1 so that all streams can be simultaneously uploaded to the server side for analysis, around 1.2 Tbps of video data in total would be transmitted over the network for security and safety applications. Given such a massive CCTV camera deployment in a single city, ways to handle large-scale video data are urgently needed.
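A quick worked check of the aggregate bandwidth figure above:

$$470{,}000~\text{cameras} \times 2.5~\text{Mbps} = 1.175 \times 10^{6}~\text{Mbps} \approx 1.2~\text{Tbps}.$$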

As video analysis is performed directly on extracted features instead of the texture, shifting feature extraction and representation into a camera-integrated module is highly desirable, as it directly supports acquiring features at the client side. As such, compact feature descriptors rather than compressed texture data can be delivered, which fully satisfies the requirements of video analysis. Therefore, developing effective and efficient compact feature descriptor representation techniques with low complexity and memory cost is the key to such an “analyze then compress” infrastructure [4]. Moreover, interoperability should also be maintained to ensure that feature descriptors extracted by any device and transmitted over any network environment are fully usable at the server end.

1. 2.5 Mbps is the standard bitrate for 720p video at a standard frame rate (30 fps).


The Compact Descriptors for Visual Search (CDVS) standard [5], [6], developed by the Moving Picture Experts Group (MPEG), standardizes the descriptor bitstream syntax and the corresponding extraction operations for still images to ensure interoperability in visual search applications. It has been proven to achieve high-efficiency and low-latency mobile visual search [6], and an order-of-magnitude data reduction is realized by sending only the extracted feature descriptors to the remote server.

However, straightforward encoding of CDVS descriptors extracted frame by frame from video sequences cannot fulfill the requirements of video analysis applications. For example, with the CDVS-suggested descriptor length of 4K per frame, the feature bitrate of a typical 30 fps video is approximately 1 Mbps. Obviously, this may lead to excessive consumption of storage and bandwidth. Unlike still images, a video combines a sequence of highly correlated frames to form a moving scene. To fill the gap between the existing MPEG technologies and the emerging requirements of video feature descriptor compression, a Call for Proposals (CfP) on Compact Descriptors for Video Analysis (CDVA) [7] was issued by MPEG in 2015, aiming to enable an efficient and interoperable design of advanced tools that meet the growing demand for video analysis. It is also envisioned that CDVA can achieve significant savings in memory and bandwidth resources while providing hardware-friendly support for deployment at the application level. As such, the aforementioned video analysis applications such as MAR, automotive, surveillance, and media entertainment can be flexibly supported by CDVA [8], as illustrated in Fig. 1.
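The arithmetic behind the 1 Mbps estimate (taking the 4K operating point as 4 kilobytes per keyframe descriptor, as suggested above):

$$4~\text{KB/frame} \times 8~\text{bits/byte} \times 30~\text{fps} = 960~\text{kbps} \approx 1~\text{Mbps}.$$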

In Fig. 2, the CDVA framework is illustrated; it comprises keyframe/shot detection, video descriptor extraction, encoding, transmission, decoding, and video analysis against a large-scale database. During the development of CDVA, a series of techniques have been developed for these modules. The key technical contributions of CDVA are reviewed in this paper, including the video structure, advanced feature representation, and the video retrieval and matching pipeline. Subsequently, the developments of the emerging CDVA standard are discussed, and the performance of the key techniques is demonstrated. Finally, we discuss the relationship between CDVS and CDVA and look into the future development of CDVA.

2 THE MPEG CDVS STANDARD

MPEG-CDVS provides the standardized description of feature descriptors and of the descriptor extraction process for efficient and interoperable still-image search applications. Basically, CDVS can serve as the frame-level video feature description, which motivates inheriting CDVS features in the CDVA exploration. This section discusses the compact descriptors specified in CDVS.

2. The deep learning based features will be incorporated into the CDVA experimentation model at the 119th MPEG meeting in Jul. 2017.

These descriptors are capable of adapting to network bandwidth fluctuations, supporting scalability through the predefined descriptor lengths: 512 bytes, 1K, 2K, 4K, 8K, and 16K.

2.1 Compact Local Feature Descriptor

The extraction of local feature descriptors is required to be completed with low complexity and memory cost, which is all the more desirable for videos. The CDVS standard adopts a Laplacian of Gaussian (LoG) interest point detector, in which a low-degree polynomial (ALP) approach is employed to compute the local response of the LoG filtering. Subsequently, a relevance measure, statistically learned from several characteristics of local features (scale, peak response of the LoG, distance to the image centre, etc.), is defined to select a subset of feature descriptors. The handcrafted SIFT descriptor is adopted in CDVS as the local feature descriptor, and a compact SIFT compression scheme, achieved by a transform followed by ternary scalar quantization, is developed to reduce the feature size. This scheme is of low complexity and hardware-friendly due to its fast processing (transform, quantization, and distance calculation). In addition to the local descriptors, the location coordinates of these descriptors are also compressed for transmission. In CDVS, the location coordinates are represented as a histogram consisting of a binary histogram map and a histogram counts array, which are coded separately by a simple arithmetic coder and a sum-context-based arithmetic coder [9].
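To make the ternary quantization step concrete, below is a minimal sketch: it projects a 128-D descriptor with a fixed decorrelating transform and maps each element to {-1, 0, +1} against per-element thresholds. The transform and thresholds here are random stand-ins for the learned quantities in CDVS, not the normative values.

```python
import numpy as np

def ternarize_sift(desc, transform, thresholds):
    """Toy ternary scalar quantization of a SIFT descriptor.

    Hypothetical stand-in for the CDVS scheme: apply a fixed transform,
    then map each element to {-1, 0, +1} against a per-element threshold.
    """
    projected = transform @ desc            # decorrelating transform
    codes = np.zeros_like(projected, dtype=np.int8)
    codes[projected > thresholds] = 1       # large positive -> +1
    codes[projected < -thresholds] = -1     # large negative -> -1
    return codes                            # near-zero elements stay 0

# usage with random stand-ins for the learned transform/thresholds
rng = np.random.default_rng(0)
desc = rng.random(128).astype(np.float32)
transform = rng.standard_normal((128, 128)).astype(np.float32)
thresholds = np.full(128, 0.5, dtype=np.float32)
print(ternarize_sift(desc, transform, thresholds)[:8])
```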

2.2 Local Feature Descriptor Aggregation

CDVS adopts the scalable compressed Fisher Vector (SCFV) representation for mobile image retrieval. In particular, the selected SIFT descriptors are aggregated into a Fisher Vector (FV) by softly assigning each descriptor to multiple Gaussians. To compress the high-dimensional FV, a subset of Gaussian components in the Gaussian Mixture Model (GMM) is selected based on their rankings in terms of the standard deviation of each sub-vector. The number of selected Gaussian components depends on the available coding bits, such that descriptor scalability is achieved to adapt to the bit budget. Finally, a one-bit scalar quantizer is applied to support fast comparison with the Hamming distance.
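The following sketch illustrates the idea of SCFV under simplifying assumptions: soft-assign local descriptors to a toy GMM, keep the sub-vectors with the largest standard deviation, and binarize by sign. The normative CDVS computation differs in details (normalization, selection rule, component counts).

```python
import numpy as np

def scfv_sketch(descs, means, sigmas, weights, n_select):
    """Toy SCFV-style aggregation and one-bit binarization (illustrative)."""
    # soft assignment (posterior) of each descriptor to each Gaussian
    d2 = ((descs[:, None, :] - means[None]) / sigmas[None]) ** 2
    log_p = np.log(weights)[None] - 0.5 * d2.sum(-1)
    post = np.exp(log_p - log_p.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    # gradient sub-vector per Gaussian (w.r.t. the means)
    fv = (post[..., None] * (descs[:, None, :] - means[None]) / sigmas[None]).mean(0)
    # keep the components with the largest sub-vector standard deviation
    keep = np.argsort(fv.std(axis=1))[::-1][:n_select]
    return keep, (fv[keep] > 0).astype(np.uint8)  # 1-bit codes

rng = np.random.default_rng(0)
descs = rng.standard_normal((300, 32))     # toy "SIFT" descriptors
means = rng.standard_normal((16, 32))      # 16-component toy GMM
sigmas = np.ones((16, 32)); weights = np.full(16, 1 / 16)
keep, bits = scfv_sketch(descs, means, sigmas, weights, n_select=8)
print(keep, bits.shape)
```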

3 KEY TECHNOLOGIES IN CDVA

Driven by the success of MPEG-CDVS, which provides a fundamental groundwork for the development of CDVA, a series of technologies have been brought forward. In CDVA, the key contributions can be categorized into the video structure, video feature description, and video analysis pipeline.


Figure 2. Illustration of the CDVA framework.

The CDVA framework specifies how the video is structured and organized for feature extraction, where keyframe detection and inter-feature prediction methods are presented. Subsequently, the deep learning based feature representation is reviewed, and the design philosophy and compression methods of the deep learning models are discussed. Finally, the video analysis pipeline, which serves as the server-side processing module, is introduced.

3.1 Video Structure

A video is composed of a series of highly correlated frames, such that extracting feature descriptors for each individual frame may be redundant and lead to unnecessary computational consumption. In view of this, a straightforward approach is to perform keyframe detection, after which only the feature descriptors of the keyframes are extracted. In [10], the global descriptor SCFV of CDVS is employed to compare the distance between the current frame and the previous one: if the distance is lower than a given threshold, indicating that it is not necessary to preserve the current frame for feature extraction, the current frame is dropped. However, one drawback of this method is that the SCFV has to be extracted for every frame, which brings additional computational complexity. In [11], a color histogram is employed instead of the CDVS descriptors for the frame-level distance comparison, so that the SCFV descriptors of non-keyframes do not need to be extracted. Due to this advantage, the scheme has been adopted into the CDVA experimentation model (CXM) 0.2 [12]. In [13], [14], Bailer proposed to modify the segments produced by the color histogram: for each segment, the medoid frame is selected, and all frames within the segment that have a lower SCFV similarity than a given threshold are further chosen for feature extraction.
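A minimal sketch of color-histogram keyframe detection in the spirit of CXM 0.2 follows; the histogram granularity and threshold are illustrative values, not the normative ones.

```python
import numpy as np

def detect_keyframes(frames, n_bins=32, threshold=0.2):
    """Keep a frame only when its color histogram differs enough
    (L1 distance) from the last kept keyframe. Toy parameters."""
    keyframes, last_hist = [], None
    for idx, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=n_bins, range=(0, 256))
        hist = hist / hist.sum()                     # normalize
        if last_hist is None or np.abs(hist - last_hist).sum() > threshold:
            keyframes.append(idx)                    # keep this frame
            last_hist = hist
    return keyframes

# usage on synthetic frames: two "shots" with different intensities
rng = np.random.default_rng(0)
shot_a = rng.integers(0, 100, size=(10, 48, 64))
shot_b = rng.integers(150, 256, size=(10, 48, 64))
print(detect_keyframes(np.concatenate([shot_a, shot_b])))  # -> [0, 10]
```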

The keyframe-based feature representation effectively removes the temporal redundancy of video, resulting in low-bitrate query descriptor transmission. However, this strategy largely ignores the intermediate information between two keyframes. In [15], it is interesting to observe that densely sampled frames can bring better video matching and retrieval performance at the expense of increased descriptor size. In order to achieve a good balance between the feature bitrate and video analysis performance, inter-prediction techniques for the local and global descriptors of CDVS have been proposed [15], [16], [17]. Specifically, in [15], the intermediate frames between two keyframes are denoted as predictive frames (P-frames). In a P-frame, the local descriptors are predicted by multiple-reference-frame prediction; those local descriptors for which no corresponding references can be found are written directly into the bitstream. For the global descriptors of a P-frame, for each component selected in both the current and the previous frame, the binarized sub-vector is copied from the corresponding one in the previous frame to save coding bits. In [16], [17], it is further demonstrated that more than 50% rate reduction can be achieved by applying lossy compression to the local descriptors without significant influence on the matching performance. Moreover, it is demonstrated that the global difference descriptors can be efficiently coded using adaptive binary arithmetic coding as well.
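A toy sketch of the global-descriptor prediction idea: sub-vectors of components shared with the reference frame are copied rather than transmitted. The field names and payload structure are hypothetical, not the normative CDVA syntax.

```python
import numpy as np

def encode_pframe_global(curr_sel, curr_bits, ref_sel, ref_bits):
    """For components selected in both frames with identical bits,
    copy from the reference (transmit nothing); otherwise transmit."""
    payload = {}
    for k in curr_sel:
        if k in ref_sel and np.array_equal(curr_bits[k], ref_bits[k]):
            continue                      # copied from reference, 0 bits
        payload[k] = curr_bits[k]         # transmit this sub-vector
    return payload

rng = np.random.default_rng(0)
ref_sel = {0, 3, 7}
ref_bits = {k: rng.integers(0, 2, 8, dtype=np.uint8) for k in ref_sel}
curr_sel = {0, 3, 9}                      # component 9 is newly selected
curr_bits = {k: ref_bits.get(k, rng.integers(0, 2, 8, dtype=np.uint8))
             for k in curr_sel}
print(encode_pframe_global(curr_sel, curr_bits, ref_sel, ref_bits).keys())
```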

3.2 Deep Learning Based Video Representation

Recently, due to the remarkable success of deep learning, numerous approaches have been presented that employ Convolutional Neural Networks (CNNs) to extract deep learning features for image retrieval [18], [19]. In the development of CDVA, Nested Invariance Pooling (NIP) has been proposed to obtain discriminative deep invariant descriptors, and a significant video analysis performance improvement over traditional handcrafted features has been observed. In this subsection, we review the development of deep learning features in CDVA from the perspectives of deep learning based feature extraction, network compression, feature binarization, and the combination of deep learning based feature descriptors with handcrafted ones.

3.2.1 Deep Learning Based Feature Extraction

Robust video retrieval requires features that are invariant to scale, rotation, and translation. CNN models incorporate local translation invariance through a succession of convolution and pooling operations. In order to further encode rotation and scale invariance into the CNN, motivated by invariance theory, NIP was proposed to represent each frame with a global feature vector [20], [21], [22].


In particular, invariance theory provides a mathematically proven strategy to obtain invariant representations with CNNs. This inspires improving the geometric invariance of deep learning features by pooling the intermediate feature maps in a nested way. Specifically, a given input frame is rotated R times, and for each rotation the pool5 feature maps (W × H × C) are extracted, where W and H denote the width and height of the map and C is the number of feature channels. Based on the feature maps, multi-scale uniform region-of-interest (ROI) sampling is performed, resulting in a 5-D feature representation of dimension R × S × W′ × H′ × C, where S is the number of sampled ROIs in the multi-scale region sampling. Subsequently, NIP performs nested pooling over translations (W′ × H′), scales (S), and finally rotations (R), yielding a C-dimensional global CNN feature descriptor. The performance of NIP descriptors can be further boosted by PCA whitening [18], [19]. To evaluate the similarity between two NIP feature descriptors, the cosine similarity is adopted.
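A minimal sketch of the nested pooling stages, assuming the R × S × W′ × H′ × C feature stack has already been computed; the choice of pooling operator per stage is an assumption for illustration, as the CDVA contributions specify their own choices.

```python
import numpy as np

def nip_descriptor(stack, pool=("avg", "avg", "max")):
    """Toy Nested Invariance Pooling over a precomputed feature stack.

    `stack` has shape (R, S, W', H', C): pools translations, then
    scales, then rotations, and L2-normalizes the result.
    """
    ops = {"avg": lambda x, ax: x.mean(axis=ax),
           "max": lambda x, ax: x.max(axis=ax)}
    x = ops[pool[0]](stack, (2, 3))   # translations: (R, S, C)
    x = ops[pool[1]](x, 1)            # scales:       (R, C)
    x = ops[pool[2]](x, 0)            # rotations:    (C,)
    return x / (np.linalg.norm(x) + 1e-12)

# usage: R=4 rotations, S=3 ROI scales, 7x7 pool5 maps, C=512 channels
rng = np.random.default_rng(0)
stack = rng.random((4, 3, 7, 7, 512)).astype(np.float32)
print(nip_descriptor(stack).shape)    # -> (512,)
```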

3.2.2 Network Compression

CNN models such as AlexNet [23] and VGG-16 [24] contain millions of neurons and cost hundreds of megabytes of storage. This creates great difficulties for video analysis, especially when the CNN models are deployed at the client side for feature extraction in the “analyze then compress” framework. Therefore, efficient compression of the neural network model is urgently required for the development of CDVA. In [25], [26], both scalar and vector quantization (VQ) techniques using the Lloyd-Max algorithm are applied to compress the NIP model, and the quantized coefficients are further coded with Huffman coding. Moreover, the model is pruned to reduce its size by dropping convolutional layers. It is shown that compressed models two orders of magnitude smaller than the uncompressed ones lead to negligible loss in video analysis performance.
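The following is a minimal Lloyd-Max scalar quantization sketch for a weight tensor (in 1-D this reduces to alternating nearest-level assignment and centroid update, i.e. 1-D k-means); the level count and iteration budget are illustrative, and the level indices would then feed an entropy coder such as Huffman.

```python
import numpy as np

def lloyd_max_quantize(weights, n_levels=16, n_iter=20):
    """Toy Lloyd-Max scalar quantization: returns per-weight level
    indices (for entropy coding) and the learned codebook."""
    w = weights.ravel()
    # initialize codebook levels uniformly over the weight range
    levels = np.linspace(w.min(), w.max(), n_levels)
    for _ in range(n_iter):
        idx = np.abs(w[:, None] - levels[None]).argmin(axis=1)
        for k in range(n_levels):        # centroid update per level
            if np.any(idx == k):
                levels[k] = w[idx == k].mean()
    return idx.reshape(weights.shape), levels

rng = np.random.default_rng(0)
conv_w = rng.standard_normal((64, 3, 3, 3)).astype(np.float32)
idx, codebook = lloyd_max_quantize(conv_w)
recon = codebook[idx]                     # dequantized weights
print(np.abs(conv_w - recon).mean())      # small reconstruction error
```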

3.2.3 Feature Descriptor Compression

The deep learning based feature descriptor generated by NIP is usually in floating point, which is not efficient for the subsequent feature comparison process. As the Hamming distance can facilitate effective retrieval, especially over large video collections, NIP feature binarization has been proposed for compact feature representation [27]. In particular, a one-bit scalar quantizer is applied to binarize the NIP descriptor. As such, a much smaller memory footprint and runtime cost can be achieved with only marginally degraded performance.
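A minimal sketch of one-bit quantization and Hamming comparison (a 512-D descriptor becomes 512 bits, i.e. 64 bytes):

```python
import numpy as np

def binarize(desc):
    """One-bit scalar quantization: sign bits packed into bytes."""
    return np.packbits(desc > 0)

def hamming(a_bits, b_bits):
    """Hamming distance between two packed bit vectors."""
    return int(np.unpackbits(a_bits ^ b_bits).sum())

# usage with toy 512-D "NIP" descriptors (illustrative only)
rng = np.random.default_rng(0)
a, b = rng.standard_normal(512), rng.standard_normal(512)
print(binarize(a).nbytes, hamming(binarize(a), binarize(b)))
```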

3.2.4 Combination of Deep Learning Based and Handcrafted Features

Furthermore, in [20], [28], it is revealed that there are complementary effects between CDVS handcrafted and deep learning based features for video analysis. In particular, the deep learning based features are extracted by taking the whole frame into account, while the CDVS handcrafted descriptors sparsely sample interest points. Moreover, the handcrafted features work relatively better on richly textured blobs, while the deep learning based features are more effective at aggregating deeper and richer features for globally salient regions. Therefore, the combination of deep learning based features and CDVS handcrafted features has been further investigated in the CDVA framework [20], [22], [28], as shown in Fig. 3. Interestingly, it is validated that the combination strategy achieves promising performance and outperforms either deep learning based or CDVS handcrafted features alone.

3.3 Video Analysis Pipeline

The compact description of videos enables two typical video analysis tasks: video matching and video retrieval. In particular, video matching aims at determining whether a pair of videos shares an object or scene with similar content, while video retrieval searches for videos containing a segment similar to the one in the query video.

3.3.1 Video Matching

Given the CDVA descriptors of the keyframes in a video pair, pairwise matching can be achieved by comparing them in a coarse-to-fine strategy. Specifically, each keyframe in one video is first compared with all of the keyframes in the other video in terms of global feature similarity. If the similarity is larger than a threshold, implying a possible match between the two frames, the local descriptor comparison is further performed with geometric consistency checking. The keyframe-level similarity is subsequently calculated by multiplying the matching scores of the global and local descriptors. Finally, the video-level similarity is obtained by selecting the largest matching score among all keyframe-level similarities.
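A minimal sketch of this coarse-to-fine scheme follows; the local-descriptor comparison with geometric checking is abstracted into a callable returning a score, since a full matcher is beyond a sketch, and the global threshold is an illustrative value.

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def match_videos(frames_a, frames_b, global_thresh=0.5):
    """Toy coarse-to-fine video matching. Each frame is a pair
    (global_desc, local_match_fn) with local scores in [0, 1]."""
    best = 0.0
    for g_a, local_a in frames_a:
        for g_b, _ in frames_b:
            g_sim = cos_sim(g_a, g_b)
            if g_sim <= global_thresh:
                continue                            # coarse stage rejects pair
            best = max(best, g_sim * local_a(g_b))  # global x local score
    return best                                     # video-level similarity

rng = np.random.default_rng(0)
shared = rng.standard_normal(128)                   # object seen in both videos
frames_a = [(shared, lambda g: 0.9), (rng.standard_normal(128), lambda g: 0.9)]
frames_b = [(shared + 0.1 * rng.standard_normal(128), lambda g: 0.9)]
print(round(match_videos(frames_a, frames_b), 3))
```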

Another criterion in video matching is temporal localization, which locates the video segment containing the similar item of interest based on the recorded timestamps. In [29], a shot-level localization scheme was adopted into CXM 1.0. In particular, a shot is detected as a group of consecutive keyframes whose distance to the first keyframe of the shot is smaller than a certain threshold in terms of color histogram comparison. If the keyframe-level similarity is larger than a threshold, the shot that contains the keyframe is regarded as a matching interval. Multiple matching intervals can also be concatenated to obtain the final interval for localization.

Figure 3. Combination of handcrafted and deep learning based feature descriptors.

3.3.2 Video Retrieval

In contrast to video matching, video retrieval is performed in a one-to-N manner, implying that all videos in the database are visited and the top ones with the highest matching scores are selected. In particular, keyframe-level matching with the global descriptors is performed to extract the top Kg candidate keyframes from the database. Subsequently, these keyframes are further examined by local descriptor matching, and the keyframe candidate set is shrunk to Kl according to the rankings in terms of the combination of global and local similarities. These keyframes are then reorganized into videos, which are finally ranked by the video-level similarity following the principle of the video matching pipeline.
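A minimal sketch of this two-stage retrieval flow, with the local matcher stubbed out and illustrative Kg/Kl values:

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve(query, db, k_g=100, k_l=20):
    """Toy retrieval: shortlist top-k_g keyframes by global similarity,
    re-score with a (stubbed) local matcher, keep top-k_l, then rank
    videos by their best keyframe score."""
    scored = [(cos_sim(query, g), vid, local) for g, vid, local in db]
    shortlist = sorted(scored, key=lambda t: -t[0])[:k_g]     # global stage
    rescored = sorted(((s * local(query), vid) for s, vid, local in shortlist),
                      key=lambda t: -t[0])[:k_l]              # local stage
    videos = {}
    for score, vid in rescored:                               # keyframes -> videos
        videos[vid] = max(videos.get(vid, 0.0), score)
    return sorted(videos.items(), key=lambda kv: -kv[1])

rng = np.random.default_rng(0)
db = [(rng.standard_normal(64), f"video{i % 5}", lambda q: 0.9)  # stub local matcher
      for i in range(50)]
print(retrieve(rng.standard_normal(64), db)[:3])
```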

4 EMERGING CDVA STANDARD

4.1 Evaluation Framework

The MPEG-CDVA dataset includes 9974 query and 5127 reference videos, with durations ranging from 1 second to over 1 minute [30]. In Fig. 4, we provide some typical examples from the MPEG-CDVA dataset. In total, 796 items of interest are depicted in these videos, which can be divided into three categories: large objects (e.g., buildings, landmarks), small objects (e.g., paintings, books, CD covers, products), and scenes (e.g., interior scenes, natural scenes, multi-camera shots). Approximately 80% of the query and reference videos were embedded in irrelevant content (different from that used in the queries); the start and end embedding boundaries were used for the temporal localization in the video matching task. The remaining 20% of the query videos were subjected to 7 modifications (text/logo overlay, frame rate change, interlaced/progressive conversion, transcoding, color to monochrome and contrast change, grain addition, display content capture) to evaluate the effectiveness and robustness of the compact video descriptor representation techniques. As such, 4,693 matching pairs and 46,930 non-matching pairs are created. In addition, for the large-scale experiments, 8,476 distractor videos with a total duration of more than 1,000 hours are involved, drawn from user-generated content (UGC), broadcast archival, and education material.

The pairwise matching performance is evaluated in terms of the matching and localization accuracy. In particular, the matching accuracy is assessed by the Receiver Operating Characteristic (ROC) curve, and the True Positive Rate (TPR) at a False Positive Rate (FPR) of 1% is also reported. When a matching pair is observed, the localization accuracy is further evaluated by the Jaccard index based on the temporal location of the item of interest within the video pair. In particular, it is calculated as

|[Tstart, Tend] ∩ [T′start, T′end]| / |[Tstart, Tend] ∪ [T′start, T′end]|,

where [Tstart, Tend] denotes the ground truth and [T′start, T′end] denotes the predicted start and end timestamps. The retrieval performance is evaluated by the mean Average Precision (mAP); moreover, the precision at a given cut-off rank R for the query videos (Precision@R) is calculated, where R is set to 100. As the ultimate goal is to achieve compact feature representation, the feature bitrate consumption is also measured.
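The temporal Jaccard index is simple enough to state in a few lines; the interval values below are purely illustrative:

```python
def temporal_jaccard(gt, pred):
    """Jaccard index between a ground-truth interval gt = (T_start, T_end)
    and a predicted interval pred, both in seconds. For disjoint
    intervals the intersection is 0, so the score is 0."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = max(gt[1], pred[1]) - min(gt[0], pred[0])
    return inter / union if union > 0 else 0.0

# usage: 6 s of overlap over a 12 s union -> 0.5
print(temporal_jaccard((2.0, 10.0), (4.0, 14.0)))
```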

4.2 Timeline and Core Experiments

The Call for Proposals of MPEG-CDVA was issued at the 111th MPEG meeting in Geneva in Feb. 2015, and the responses were evaluated in Feb. 2016. Table 1 lists the timeline for the development of CDVA. At the current stage, there are six core experiments (CEs) in the exploration of the MPEG-CDVA standard. The first CE investigates temporal sampling strategies to better understand the impact of keyframe selection and density on video analysis.


Figure 4. Examples in the MPEG-CDVA dataset.

Table 1
Timeline for the development of the MPEG-CDVA standard.

When | Milestone | Comments
February, 2015 | Call for Proposals is published | Call for Proposals for Compact Descriptors for Video Analysis
February, 2016 | Evaluation of proposals and Core Experiments | None
October, 2017 | Draft International Standard | Complete specification. Only minor editorial changes are allowed after DIS.
June, 2018 | Final Draft International Standard | Finalized specification, submitted for approval and publication.

The second CE targets improving the matching and retrieval performance based on segment-level representation. CE3 exploits the temporal redundancy of feature descriptors to further reduce the feature representation bitrate. CE4 investigates strategies for combining traditional handcrafted and deep learning based feature descriptors, and CE5 develops compact representation methods for the deep learning based feature descriptors. Finally, CE6 studies approaches for deep learning model compression to reduce the runtime and memory footprint of deep learning based feature extraction.

4.3 Performance Results

In this subsection, we report the performance results of the key contributions in the development of CDVA. First, the performance comparisons across the evolution of the CXM models are presented. CXM 0.1 (released at MPEG-114) is the first version of the CDVA experimentation model and provides the baseline performance; subsequently, CXM 0.2 (MPEG-115) and CXM 1.0 (MPEG-116) were released. To flexibly adapt to different bandwidth requirements as well as application scenarios, three operating points in terms of the feature descriptor bitrate are defined: 16 KBps, 64 KBps, and 256 KBps. Besides, in the matching operation, an additional cross-mode 16-256 KBps matching (descriptors extracted at the 16 KBps operating point matched against descriptors extracted at 256 KBps) has also been considered. In Table 2, the performance comparisons from CXM 0.1 to CXM 1.0 are listed. The improvements from CXM 0.1 to CXM 0.2 are significant, with gains of more than 5% in mAP and 5% in TPR@FPR, mainly attributed to the color histogram based keyframe sampling. Comparing CXM 0.2 with CXM 1.0, the retrieval performance is identical, since the changes lie in the video matching operation, which improves the localization performance by using the video shot to identify the matching interval. This matching scheme leads to more than 10% improvement in temporal localization performance.

In Table 3, the performance comparisons between the CXM and the deep learning based methods are provided. Compared with CXM 1.0, simply using the 512-dimensional deep learning based feature descriptors without re-ranking techniques brings about 5% improvements in both mAP and TPR. It can be seen that the NIP descriptor extracted from a compressed model suffers only a negligible loss while the model size is reduced from 529.2 MB to 8.7 MB using pruning and scalar quantization. To meet the demand for large-scale fast retrieval, the performance of the binarized NIP descriptor (occupying only 512 bits) and its combination with handcrafted feature descriptors is also explored.


Table 3
Performance comparisons between handcrafted and deep learning based methods.

Method | mAP | Precision@R | TPR@FPR=0.01 | Localization Accuracy
CXM0.1 | 0.66 | 0.655 | 0.779 | 0.365
CXM0.2 | 0.721 | 0.712 | 0.836 | 0.544
CXM1.0 | 0.721 | 0.712 | 0.836 | 0.662
NIP | 0.768 | 0.736 | 0.879 | 0.725
NIP+SCFV | 0.826 | 0.803 | 0.886 | 0.723
NIP (compressed model) | 0.763 | 0.773 | 0.87 | 0.722
NIP (compressed model)+SCFV | 0.822 | 0.798 | 0.878 | 0.722
Binarized NIP | 0.71 | 0.673 | 0.86 | 0.713
Binarized NIP+SCFV | 0.799 | 0.775 | 0.872 | 0.681

Table 2
Performance comparisons across the evolution of the CXM models.

Metric | Operating Point | CXM0.1 | CXM0.2 | CXM1.0
mAP | 16KBps | 0.66 | 0.721 | 0.721
mAP | 64KBps | 0.673 | 0.727 | 0.727
mAP | 256KBps | 0.68 | 0.73 | 0.73
Precision@R | 16KBps | 0.655 | 0.712 | 0.712
Precision@R | 64KBps | 0.666 | 0.718 | 0.718
Precision@R | 256KBps | 0.674 | 0.722 | 0.722
TPR@FPR=0.01 | 16KBps | 0.779 | 0.836 | 0.836
TPR@FPR=0.01 | 64KBps | 0.79 | 0.843 | 0.843
TPR@FPR=0.01 | 256KBps | 0.8 | 0.846 | 0.846
TPR@FPR=0.01 | 16-256KBps | 0.786 | 0.838 | 0.838
Localization Accuracy | 16KBps | 0.365 | 0.544 | 0.662
Localization Accuracy | 64KBps | 0.398 | 0.567 | 0.662
Localization Accuracy | 256KBps | 0.411 | 0.579 | 0.652
Localization Accuracy | 16-256KBps | 0.382 | 0.542 | 0.652

Table 4
Runtime complexity comparisons between CXM1.0 and the deep learning based methods.

Method | Retrieval query (s/q.) | Matching pair (s/p.) | Non-matching pair (s/p.)
CXM1.0 | 38.63 | 0.37 | 0.21
NIP | 9.15 | 0.3 | 0.26
NIP+SCFV | 39.45 | 0.38 | 0.29
Binarized NIP | 2.89 | 0.29 | 0.25
Binarized NIP+SCFV | 33.24 | 0.37 | 0.28

Compared with CXM 1.0, the additional 512-bit deep learning based descriptor in the combination mode significantly boosts the mAP from 72.1% to 79.9%. It is worth noting that the results of the deep learning methods are still under cross-checking, and the Ad-hoc group plans to integrate the NIP descriptor into the CXM at the 119th MPEG meeting in Jul. 2017.

In Table 4, we list the runtime complexity of CXM 1.0 and the deep learning based methods. In the experimental setup, for each kind of feature descriptor, the database is scanned once to generate the retrieval results. CXM 1.0 adopts the SCFV descriptor to obtain an initial top-500 result list, and then local descriptor re-ranking is applied. The fastest method is the binarized NIP, which takes 2.89 seconds to serve a video retrieval request over 13,603 videos (about 1.2 million keyframes), while the NIP descriptor takes 9.15 seconds to complete this task. For the handcrafted descriptors, CXM 1.0 takes 38.63 seconds, including both the global ranking with SCFV and the re-ranking with local descriptors. It is worth mentioning that CDVA mainly focuses on improving the accuracy of matching and retrieval; regarding retrieval efficiency, techniques that have not been standardized in CDVS, such as Multi-Block Index Table (MBIT) indexing [31], which can significantly improve the retrieval speed, have not been integrated for investigation.

5 CONCLUSIONS AND OUTLOOK

The current development of CDVA treats CDVS as its groundwork, as they serve the same purpose of using compact feature descriptors for visual search and analysis. The main difference lies in that CDVS focuses on still images, while CDVA extends to video sequences. Moreover, backward compatibility with CDVS supports feature decoding of keyframes with the existing CDVS infrastructure, such that every standard-compatible CDVS decoder can reproduce the features of independently coded frames in the CDVA bitstream. This can greatly facilitate cross-modality search applications, such as using images as queries to search videos, or using videos as queries to search for corresponding images.

The remarkable technological progress in video feature representation has provided a further boost to the standardization of compact video descriptors. The keyframe representation and inter-feature prediction provide two granularity levels of video feature representation. Deep learning feature descriptors have also been intensively investigated, covering feature extraction, model compression, compact feature representation, and the combination of deep learning based features with traditional handcrafted features. The optimization of the video matching and retrieval pipelines has also been shown to bring superior video analysis performance.

Nevertheless, the standardization of CDVA still faces many challenges, and further improvements are expected.


In addition to video matching and retrieval, more video analysis tasks (such as action recognition, anomaly detection, and video tracking) need to be investigated. This requires more advanced video representation techniques to extract motion information, as well as sophisticated deep learning models with high generalization ability for feature extraction. Moreover, although the deep learning methods have achieved significant performance improvements, more work on deep feature compression and hashing is necessary to achieve compact representations. Finally, the strategies for fusing deep learning features with traditional handcrafted features pose new challenges to the standardization of CDVA and open up new space for future exploration.

REFERENCES

[1] “Compact descriptors for video analysis: Requirements for search applications,” ISO/IEC JTC1/SC29/WG11/N15040, 2014.
[2] Bernd Girod, Vijay Chandrasekhar, David M. Chen, Ngai-Man Cheung, Radek Grzeszczuk, Yuriy Reznik, Gabriel Takacs, Sam S. Tsai, and Ramakrishna Vedantham, “Mobile visual search,” IEEE Signal Processing Magazine, vol. 28, no. 4, pp. 61–76, 2011.
[3] Rongrong Ji, Ling-Yu Duan, Jie Chen, Hongxun Yao, Tiejun Huang, and Wen Gao, “Learning compact visual descriptor for low bit rate mobile landmark search,” International Joint Conference on Artificial Intelligence, vol. 22, no. 3, pp. 2456, 2011.
[4] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, “Compress-then-analyze vs. analyze-then-compress: Two paradigms for image analysis in visual sensor networks,” IEEE International Workshop on Multimedia Signal Processing, pp. 278–282, Sep. 2013.
[5] S. Paschalakis et al., “Information technology — Multimedia content description interface — Part 13: Compact descriptors for visual search,” ISO/IEC 15938-13:2015, 2015.
[6] Ling-Yu Duan, Vijay Chandrasekhar, Jie Chen, Jie Lin, Zhe Wang, Tiejun Huang, Bernd Girod, and Wen Gao, “Overview of the MPEG-CDVS standard,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 179–194, 2016.
[7] “Call for Proposals for Compact Descriptors for Video Analysis (CDVA) — Search and Retrieval,” ISO/IEC JTC1/SC29/WG11/N15339, 2015.
[8] “Compact descriptors for video analysis: Objectives, applications and use cases,” ISO/IEC JTC1/SC29/WG11/N14507, 2014.
[9] Sam Tsai, Zhe Wang, Giovanni Cordara, Ling-Yu Duan, Radek Grzeszczuk, and Bernd Girod, “CDVS Core Experiment 3: Stanford/Peking/Huawei contribution,” ISO/IEC JTC1/SC29/WG11/M25883, 2012.
[10] Massimo Balestri, Miroslaw Bober, and Werner Bailer, “CDVA experimentation model (CXM) 0.1,” ISO/IEC JTC1/SC29/WG11/N16064, 2016.
[11] Massimo Balestri, Gianluca Francini, Skjalg Lepsoy, Miroslaw Bober, and Sameed Husain, “Bridget report on CDVA core experiment 1 (CE1),” ISO/IEC JTC1/SC29/WG11/M38664, 2016.
[12] Massimo Balestri, Miroslaw Bober, and Werner Bailer, “CDVA experimentation model (CXM) 0.2,” ISO/IEC JTC1/SC29/WG11/N16274, 2016.
[13] Werner Bailer, “JRS response to CDVA core experiment 1,” ISO/IEC JTC1/SC29/WG11/M39172, 2016.
[14] Werner Bailer, “JRS response to CDVA core experiment 1,” ISO/IEC JTC1/SC29/WG11/M39918, 2017.
[15] Zhangshuai Huang, Longhui Wei, Shiqi Wang, Lingyu Duan, Jie Chen, Alex Chichung Kot, Siwei Ma, Tiejun Huang, and Wen Gao, “PKU's response to CDVA CE1,” ISO/IEC JTC1/SC29/WG11/m38625, 2016.
[16] Werner Bailer, “JRS response to CDVA core experiment 3,” ISO/IEC JTC1/SC29/WG11/M39919, 2017.
[17] Werner Bailer, “JRS response to CDVA core experiment 3,” ISO/IEC JTC1/SC29/WG11/M39173, 2016.
[18] Artem Babenko and Victor Lempitsky, “Aggregating local deep features for image retrieval,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1269–1277.
[19] Giorgos Tolias, Ronan Sicre, and Herve Jegou, “Particular object retrieval with integral max-pooling of CNN activations,” arXiv preprint arXiv:1511.05879, 2015.
[20] Yihang Lou, Yan Bai, Jie Lin, Shiqi Wang, Jie Chen, Vijay Chandrasekhar, Lingyu Duan, Tiejun Huang, Alex Kot, and Wen Gao, “Compact deep invariant descriptors for video retrieval,” in Data Compression Conference. IEEE, 2017, pp. 73–82.
[21] Yan Bai, Yihang Lou, Shiqi Wang, Jie Chen, Lingyu Duan, Siwei Ma, Tiejun Huang, Alex Kot, and Wen Gao, “Suggestions on deep learning based technologies for the future development of CDVA,” ISO/IEC JTC1/SC29/WG11/m39220, 2016.
[22] Yihang Lou, Feng Gao, Yan Bai, Jie Lin, Shiqi Wang, Jie Chen, Chunsheng Gan, Vijay Chandrasekhar, Lingyu Duan, Tiejun Huang, and Alex Kot, “Improved retrieval and matching with CNN feature for CDVA,” ISO/IEC JTC1/SC29/WG11/m39219, 2016.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[24] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[25] Yan Bai, Yihang Lou, Vijay Chandrasekhar, Jie Lin, Shiqi Wang, Jie Chen, Lingyu Duan, Tiejun Huang, Alex Kot, and Wen Gao, “PKU's response to CDVA CE4: NIP network compression,” ISO/IEC JTC1/SC29/WG11/m39853, 2016.
[26] Yan Bai, Yihang Lou, Zhaoliang Liu, Shiqi Wang, Vijay Chandrasekhar, Jie Chen, Lingyu Duan, Tiejun Huang, Alex Kot, and Wen Gao, “PKU's response to CDVA CE6: NIP network compression,” ISO/IEC JTC1/SC29/WG11/m40386, 2017.
[27] Yan Bai, Yihang Lou, Zhe Wang, Shiqi Wang, Vijay Chandrasekhar, Jie Chen, Lingyu Duan, Tiejun Huang, Alex Kot, and Wen Gao, “PKU's response to CDVA CE5: Binarization for NIP descriptor,” ISO/IEC JTC1/SC29/WG11/m40387, 2016.
[28] Yihang Lou, Yan Bai, Vijay Chandrasekhar, Jie Lin, Shiqi Wang, Jie Chen, Lingyu Duan, Tiejun Huang, Alex Kot, and Wen Gao, “PKU's response to CDVA CE4: Combination of NIP and CDVS descriptors,” ISO/IEC JTC1/SC29/WG11/m39852, 2017.
[29] Massimo Balestri, Gianluca Francini, and Skjalg Lepsoy, “Improved temporal localization for CDVA,” ISO/IEC JTC1/SC29/WG11/M39433, 2016.
[30] “Evaluation framework for compact descriptors for video analysis — search and retrieval,” ISO/IEC JTC1/SC29/WG11/N15338, 2015.
[31] Zhe Wang, Lingyu Duan, Jie Lin, Tiejun Huang, and Wen Gao, “MBIT: An indexing structure to speed up retrieval,” ISO/IEC JTC1/SC29/WG11/M28893, Korea, April 2013.