

Annotating Objects and Relations in User-Generated Videos

Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua
National University of Singapore

Figure 1: An example of recognizing objects and relations in a video. Each object is spatio-temporally localized and the relations between each pair of objects are temporally localized across video frames.

ABSTRACT
Understanding the objects and relations between them is indispensable to fine-grained video content analysis, which is widely studied in recent research works in multimedia and computer vision. However, existing works are limited to evaluating with either small datasets or indirect metrics, such as the performance over images. The underlying reason is that the construction of a large-scale video dataset with dense annotation is tricky and costly. In this paper, we address several main issues in annotating objects and relations in user-generated videos, and propose an annotation pipeline that can be executed at a modest cost. As a result, we present a new dataset, named VidOR, consisting of 10k videos (84 hours) together with dense annotations that localize 80 categories of objects and 50 categories of predicates in each video. We have made the training and validation set public and extendable for more tasks to facilitate future research on video object and relation recognition.

CCS CONCEPTS
• Information systems → Multimedia databases; Information extraction; • Computing methodologies → Scene understanding; Object recognition.

KEYWORDS
dataset; video annotation; video content analysis; object recognition; visual relation recognition

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICMR '19, June 10–13, 2019, Ottawa, ON, Canada
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6765-3/19/06...$15.00
https://doi.org/10.1145/3323873.3325056

ACM Reference Format:
Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. 2019. Annotating Objects and Relations in User-Generated Videos. In International Conference on Multimedia Retrieval (ICMR '19), June 10–13, 2019, Ottawa, ON, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3323873.3325056

1 INTRODUCTION
Fine-grained video content analysis is crucial to bridging the gap between vision and language and enhancing the explainability of modern multimedia systems, which has been extensively demonstrated by the successes in applying object-level analysis to video captioning and question answering tasks [21, 27, 30]. In specific domains, such as fashion and food recognition, the analytic granularity is even refined into landmarks [15] and ingredients [4], respectively, in order to boost retrieval performance with more discriminative representations. As the granularity becomes finer, understanding the relations between objects becomes especially important as well. By looking into the relations instead of simply considering the ensemble of objects, recent works [25, 29] have shown promise in generating robust representations for video content analysis. Still, most of these works explore the objects and relations in video through implicit modeling and evaluation, leaving it unclear whether or how well the proposed models understand the content at this granularity.

In fact, object and relation recognition in videos is very challenging because it requires visually understanding many aspects of object entities, including appearance, identity, action, and the interactions between them. For example, the appearance of the dogs in Figure 1 may vary a lot due to their active behavior, changes in illumination, and occlusion. This inevitably causes great difficulty in determining the identity of each dog across video frames, which is a necessary clue for further summarization of the video content. Moreover, the variance in action and interaction representation poses another challenge for a model to learn the underlying patterns and make robust inferences.



To investigate the problem more deeply, there is rising interest in research on video object detection [11, 22, 32] and video visual relation detection [20], which formally define and study the problem with explicit evaluation metrics. However, they are still primarily limited by the available datasets [19, 20], which are small in scale and built upon video sources under certain constraints.

This gives rise to the main purpose of this paper: we aim to construct a larger-scale video dataset with dense annotations of objects and relations. In particular, the video sources will be collected from the Web to reflect video content in real-world scenarios; the annotations will spatio-temporally localize the objects of interest by bounding-box trajectories, and temporally localize the relations of interest between pairs of objects. Compared with image datasets for similar purposes [13, 18], the constructed dataset will serve as a benchmark that can be used to develop and evaluate techniques at the video level directly. It is worth noting that the dataset will also address the limitation of image-level recognition, which cannot deterministically learn and infer temporal visual concepts (e.g., actions and dynamic relations); this can further aid the grounding of language in vision.

However, annotating the objects and relations in a large user-generated video dataset requires a tremendous amount of human labor. Supposing that the length of user-generated videos is at least 150 frames (i.e., the length of typical micro videos), the workload for annotating 10,000 videos is then equivalent to annotating at least 1,500,000 images, which reaches the scale of the prevailing image datasets. By assuming the resemblance of adjacent frames, one can possibly reduce the workload by manually annotating key-frames at a fixed time interval and interpolating the remaining frames. But this strategy only works for the annotation of movie and surveillance videos, because they usually have steady camera motion and predictable object activities, which assures the video continuity assumption. In contrast, the content of user-generated videos is totally free-form and often of low quality, so choosing a large time interval in order to save annotation cost will likely lead to poor interpolation quality at the intermediate frames, making the trade-off between quality and cost difficult for this strategy. Therefore, in order to economically scale up the annotation of user-generated videos, we need a better strategy to select the key-frames for manual labeling.

Another critical problem that needs to be addressed is how to split the whole annotation task into several manageable and feasible subtasks. If annotating the objects and relations in a video were treated as a single task, it would require annotators to not only be very familiar with the many annotation requirements, but also be highly specialized in the task. In other words, the annotators would need to spend a lot of time and energy to perform the task at a constant throughput. Thus, finding such a group of specialists is difficult and costly as well. On the other hand, video annotation should preferably be performed as a macro-task rather than a micro-task in practice [24]. This is because simple macro-tasks, such as determining the type of objects or whether a bounding-box is good, can be performed easily by most people with little cognitive overhead; whereas annotators need training and tend to make a lot of mistakes in micro-tasks such as identifying the objects to track and determining the full trajectories of objects. If the task on a video is split into too many micro-tasks and assigned to many different annotators, then

there would be insufficient context for each annotator to label the task correctly according to the overall content and consistently with other annotators. Hence, the design of subtasks greatly influences the annotation cost and quality.

In Section 3, we will address the aforementioned issues by proposing an annotation pipeline that can achieve our annotation goal at a modest cost. The pipeline is particularly effective for user-generated videos whose length varies from seconds to several minutes. Longer videos, which are generally created for special purposes, can be divided into shorter segments and fed into the pipeline. As shown in Figure 2, in the procedure for object annotation, we first temporally localize the objects of interest in videos. We then spatially localize each of the objects at the key-frames generated according to an interactive scheme. At each turn, the scheme generates a finer set of key-frames compared to that of the previous turn, so that the objects can be localized more accurately across video frames if the budget permits more turns. In the procedure for relation annotation, we conduct different annotation tasks according to the properties of different types of relations (i.e., action relations versus spatial relations). This allows us to achieve a good balance between annotation quality and cost.

Section 4 will justify the effectiveness of the proposed annotation pipeline through several aspects. Section 5 will present the resulting dataset, the Video Object Relation (VidOR) dataset, and analyze its characteristics from various aspects.

2 RELATED WORK
2.1 Video Annotation
The technique of video object annotation has been developed over a long period of time. Mihalcik and Doermann [16] propose ViPER, a prototype offline tool for annotating bounding-box trajectories in videos. Dollár et al. [5] develop an analogous tool dedicated to large-scale pedestrian annotation in driving record videos. Driven by the emergence of crowdsourced labeling, Yuen et al. [31] design an online video annotation system, LabelMe Video, which supports both bounding-box trajectory and polygonal path annotation by online users. VATIC [24], another online video annotation system proposed more recently by Vondrick et al., tries to economically scale up video object annotation based on insights into the pros and cons of different annotation strategies. As for the most important component, key-frame generation, most works adopt the strategy of either generating the key-frames at a fixed time interval or letting the annotators determine them; this limits the scalability of such approaches, especially for large datasets.

Recently, there has been a rising number of works studying the temporal annotation of actions, activities, and relations [6, 7, 9, 20]. Intuitively, scaling up temporal annotation is much easier than the spatio-temporal annotation needed for video object annotation. Current strategies in temporal annotation can be categorized into two types. One category is direct temporal localization by labeling the starting and ending frames of targets in a video [7, 9]. In order to find the boundary frames accurately, this normally requires the annotator to spend much time browsing the video. The other category is to tag the presence of targets in short partitioned video segments and then automatically merge the tagging results across the segments to achieve temporal localization [6, 20]. This approach generally



Figure 2: Illustration of the proposed annotation pipeline. The pipeline consists of temporal object localization, spatial object localization at key-frames, action relation localization (e.g., given the object pair (child, stroller), finding and temporally localizing action relations such as "push" and "watch"), and spatial relation localization (given an object pair in a 3s segment, selecting spatial relations from candidates such as "in front of", "behind", "beneath", "above", "next to", "inside", "away", and "towards").

requires less effort and time from the annotators, but can only produce coarse-grained annotation results. While most works stick to only one of these strategies, we propose to adopt both of them in our relation annotation by carefully choosing the proper one for specific types of relations, such that the workload and cost of the annotation can be significantly reduced.

2.2 Video Datasets
Video datasets can be categorized based on the annotation level. The first category of datasets are those with video-level annotation, such as CCV [10], Kinetics [3] and MSR-VTT [26]. They are typically used as benchmarks for event recognition, action recognition and video captioning. The second category is the dataset with segment-level annotation. Typical datasets include THUMOS [9], MultiTHUMOS [28], ActivityNet [2] and ActivityNet Captions [12], which are used as benchmarks for the temporal localization of actions, activities and events.

Datasets with object-level annotation are another category that initiates the research on fine-grained video content analysis. TrackingNet [17] is a large-scale benchmark dataset for object tracking, which has a single object annotated with a bounding-box trajectory in every video. ImageNet-VID [19] is the benchmark dataset for video object detection over 30 categories of objects. In the field of human-centric tasks, the dataset provided in [5] is a benchmark for pedestrian detection while AVA [6] is a benchmark for spatio-temporal action localization.

An emerging category is the dataset with relation-level annotation. Beyond object-level datasets that only have a set of independent object annotations, this type of dataset provides structured object annotations for a more comprehensive understanding of the video content. VidVRD [20] is the only such dataset to date, which is built upon a subset of ImageNet-VID. However, limited by current annotation techniques, the dataset is small, with sparse annotation in the training set.

3 ANNOTATION PIPELINE
To address the problems of complex annotation discussed above, we propose an annotation pipeline (shown in Figure 2) consisting of object annotation and relation annotation, each of which is further split into two sub-procedures. Before elaborating on the details of these procedures, we first introduce several basic terms and how the pipeline applies to a set of user-generated videos to produce the annotations in general.

An atomic task is a basic task specially defined in each of the procedures, and is completed and checked individually. Some atomic tasks are defined at the video level while others are defined at the segment level. In a procedure, although different atomic tasks can be assigned to different annotators, it is preferable to assign the contiguous atomic tasks in a video to the same annotator, because the annotator can leverage the context from the earlier atomic tasks to produce more accurate and consistent annotations for the subsequent ones. It also helps us to logically organize a large number of atomic tasks and efficiently achieve parallelism in the pipeline. Hence, we call a group of atomic tasks in a video a video task, and we always pack the atomic tasks in this way across the whole pipeline. Overall, every video will be sequentially assigned the four video tasks introduced in the following subsections.

Regarding the reward mechanism, we will describe how we pay the annotators in terms of the number of points for each of the video tasks. Since each video task has a different level of annotation difficulty, we aim to balance the rewards such that the annotators feel that they are being fairly compensated no matter which video task is assigned to them. Though it could be suboptimal, we found in practice that our reward mechanism was acceptable to the annotators, who are professional annotators that we recruited online.

3.1 Spatio-temporal Object Localization
In order to annotate an object in a video, annotators are normally required to complete three sub-tasks: 1) browsing the video and finding a new object of interest; 2) drawing the bounding-box for the object at the frames where it appears; and 3) marking the object as invisible at the frames where it is occluded or out of camera. Apparently, the second sub-task is the heaviest because it requires the annotator to draw bounding-boxes, while the other two only require choosing tags.



Hence, a smart approach for selecting an object's key-frames, rather than simply sampling at a fixed time interval, is imperative to reduce the workload. Several previous works [5, 16, 31] have proposed to let the annotators select the key-frames, as humans can decide better than an algorithm. However, this implicitly introduces additional cognitive load on the annotators for making such decisions, and possibly causes them to spend an unnecessary amount of time browsing the video and drawing bounding-boxes at unimportant frames. To address this problem, in our pipeline, we first complete the much simpler first and third sub-tasks by temporally localizing a new object over its visible portions. Then, we propose a novel strategy, inspired by the idea of divide and conquer, to interactively generate key-frames from coarse to fine, and spatially localize the object at the key-frames.

3.1.1 Temporal Object Localization. In this video task (i.e., the first and third sub-tasks defined above), the annotator browses the video to find objects belonging to the categories of interest. Once a new object is found, the annotator creates a new tag for the object and labels it with the correct category. Then, by browsing the video with respect to this object, the annotator draws a bounding-box around the object whenever it appears or disappears, so that the object can be temporally localized in the video and its identity can be specified by these drawn bounding-boxes.

In order to encourage the annotators to find more objects and temporally localize them thoroughly, they are rewarded for every bounding-box they draw as above. Specifically, we reward 6 points per such bounding-box due to the large amount of effort needed before drawing it. Yet, an upper bound of reward is set for each video to prevent the annotators from labeling excessive objects and bounding-boxes. The sub-tasks are deemed complete if all the apparent objects and their appearing durations are sufficiently labeled.

3.1.2 Spatial Object Localization at Key-frames. After the temporal localization, the appearing and disappearing frames of the annotated objects are available. We next select key-frames within each visible interval and draw bounding-boxes to spatially localize the objects more accurately. For the frames that are not selected as key-frames, the system automatically generates bounding-boxes by linear interpolation according to the adjacent key-frames and checks (with visual trackers, as described below) whether these generated bounding-boxes are sufficiently accurate in tracking the object. If not, a new key-frame is selected as an atomic task, typically at the middle of the interval, for human annotators to draw the bounding-box. The idea is that the true bounding-box trajectory in a shorter interval can be more accurately approximated by linear interpolation. So, we adopt this divide-and-conquer strategy to gradually generate more key-frames until accurate tracking is achieved.
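To make the interactive scheme concrete, the following is a minimal sketch of the divide-and-conquer key-frame selection for one visible interval. The helper names (ask_annotator, interval_ok) are assumptions introduced for illustration; interval_ok stands for the tracker-based check described in the next paragraph, and the real pipeline may organize these steps differently.

def refine_keyframes(start, end, boxes, ask_annotator, interval_ok, max_depth=4):
    """Recursively select key-frames for one visible interval of an object.

    `boxes` maps frame index -> bounding-box for frames already labeled by humans.
    `interval_ok(a, b, boxes)` decides whether frames a..b can be filled in
    automatically (by tracking / interpolation); `ask_annotator(f)` returns a
    human-drawn box at frame f. Both are assumed helpers, not a released API.
    """
    if max_depth == 0 or end - start <= 1 or interval_ok(start, end, boxes):
        return  # accept the interval; non-key frames are filled in automatically
    mid = (start + end) // 2
    boxes[mid] = ask_annotator(mid)  # new key-frame generated at the middle
    refine_keyframes(start, mid, boxes, ask_annotator, interval_ok, max_depth - 1)
    refine_keyframes(mid, end, boxes, ask_annotator, interval_ok, max_depth - 1)

Limiting max_depth corresponds to the cost control by division depth mentioned later in this section.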

To check if an interval can be approximated, we use a set of robust visual trackers $\mathcal{T}$ to automatically track from the bounding-boxes at its two ends in a forward and a backward pass, and see if any of the trackers can successfully track the interval in both directions. Specifically, suppose that $t_f$ and $t_b$ are the bounding-box trajectories produced by tracker $t$ in the forward and backward direction, respectively. We determine that an interval is accurately approximated if

$$\max_{t \in \mathcal{T}} \{\mathrm{vIoU}(t_f, t_b)\} > 0.5, \qquad (1)$$

where vIoU is the voluminal Intersection-over-Union between the two trajectories; the KCF [8] and MOSSE [1] trackers are used in our implementation. If the interval is accepted by this machine check, the bounding-box trajectories in the interval of length $L$ will be annotated with a weighted average of the trajectories $t^{*}_{f}$ and $t^{*}_{b}$:

$$t^{*}_{avg,i} = \frac{\rho^{i}\, t^{*}_{f,i} + \rho^{L-i+1}\, t^{*}_{b,i}}{\rho^{i} + \rho^{L-i+1}}, \quad i = 1, \ldots, L, \qquad (2)$$

where $t^{*}$ is the tracker output that achieves the maximum vIoU in Equation (1) and $t^{*}_{\cdot,i}$ is the $i$-th bounding-box of the trajectory. We set $\rho$ to 0.75 as a global assumption of the tracking precision. Otherwise, a new key-frame will be generated in the interval for the annotator. In case the interval is large, we will evenly generate multiple key-frames. To draw the bounding-box, the annotator is presented a short video clip starting from the previous key-frame $f_{k-1}$ and then visually tracks the bounding-box from $f_{k-1}$ to $f_k$.
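An illustrative sketch of the acceptance test in Equation (1) and the weighted averaging in Equation (2), assuming forward and backward trajectories of equal length given as (x1, y1, x2, y2) boxes over the same frames; running the actual KCF and MOSSE trackers (e.g., via OpenCV) is outside this sketch.

import numpy as np

def viou(traj_a, traj_b):
    """Voluminal IoU of two box trajectories covering the same frames:
    summed per-frame intersection area over summed per-frame union area."""
    inter, union = 0.0, 0.0
    for (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) in zip(traj_a, traj_b):
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        i = iw * ih
        a = (ax2 - ax1) * (ay2 - ay1)
        b = (bx2 - bx1) * (by2 - by1)
        inter += i
        union += a + b - i
    return inter / union if union > 0 else 0.0

def accept_interval(forward_tracks, backward_tracks, thresh=0.5):
    """Eq. (1): accept the interval if any tracker's forward and backward
    passes agree with vIoU above the threshold; return the best tracker index."""
    scores = [viou(tf, tb) for tf, tb in zip(forward_tracks, backward_tracks)]
    best = int(np.argmax(scores))
    return scores[best] > thresh, best

def blend_trajectories(t_f, t_b, rho=0.75):
    """Eq. (2): weighted average of the forward and backward trajectories,
    trusting each tracker more near the end it started from."""
    L = len(t_f)
    blended = []
    for i in range(1, L + 1):
        wf, wb = rho ** i, rho ** (L - i + 1)
        box = (wf * np.asarray(t_f[i - 1]) + wb * np.asarray(t_b[i - 1])) / (wf + wb)
        blended.append(tuple(box))
    return blended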

As the workload of this task is determined by the number of key-frames generated, we can easily control the cost by limiting the division depth. Also, the annotators are rewarded based on the total number of bounding-boxes drawn at the key-frames, once the video task passes a sampling inspection. Specifically, we reward 1.5 points per bounding-box in this task.

3.2 Temporal Relation Localization
Given the set of spatio-temporally annotated objects in a video, annotating relations amounts to temporally localizing the relations of interest for each pair of objects, because the same pair of objects may have different relations at different time intervals. Due to the huge number of combinations of possible object pairs at various time intervals, the number of atomic tasks increases drastically compared to that of object annotation. This also amplifies the difficulty of searching for interesting relations among a large number of candidates. Several tricks can be used to alleviate the burden by presenting only part of the video data to the annotator. For example, the annotator is only allowed to browse the video clip where the two objects in a pair both appear. Additionally, presenting the bounding-box trajectories of only one pair of objects at a time avoids confusing the annotator with too many irrelevant annotations. Further, splitting the task based on the type of relation, namely spatial relations and action relations, helps to reduce the annotation difficulty without significant loss of quality.

3.2.1 Spatial Relation Localization. Spatial relations are the type of relation that indicates the relative position between two objects, such as "A-in front of-B" and "A-towards-B". Usually, the spatial relation between two objects does not change frequently, and different categories of spatial relation are often mutually exclusive at any time. This allows us to annotate them at a coarse-grained level with simpler atomic tasks. Hence, in addition to the tricks mentioned above, the trimmed video clips can be further partitioned into overlapping segments of fixed temporal length, typically 3-second segments with 1 second of overlap, forming a group of simple atomic tasks. In each atomic task, the annotator is presented a single segment with a pair of objects, and asked to choose one or more categories of spatial relations that are present.



By merging the adjacent segments with the same labeled category, it is easy to automatically consolidate the segment-level results into the overall temporal localization.
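The following sketch illustrates this partition-and-merge idea under simple assumptions: segments are generated with the 3-second length and 1-second overlap used above, and adjacent or overlapping segments carrying the same spatial-relation label are merged into one interval per label (the exact merging rule in the real pipeline may differ).

def partition_segments(start, end, seg_len=3.0, overlap=1.0):
    """Split a trimmed clip [start, end) in seconds into fixed-length
    overlapping segments, as used for spatial-relation tagging."""
    step = seg_len - overlap
    t, segments = start, []
    while t < end:
        segments.append((t, min(t + seg_len, end)))
        t += step
    return segments

def merge_segments(tagged):
    """Consolidate per-segment labels into temporal intervals.

    `tagged` is a list of ((seg_start, seg_end), set_of_labels) in temporal
    order; segments that touch or overlap and share a label are merged into
    one interval for that label.
    """
    intervals = {}  # label -> list of (start, end)
    for (s, e), labels in tagged:
        for lab in labels:
            runs = intervals.setdefault(lab, [])
            if runs and s <= runs[-1][1]:   # touches or overlaps the previous run
                runs[-1] = (runs[-1][0], max(runs[-1][1], e))
            else:
                runs.append((s, e))
    return intervals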

In the video task, we use sampling inspection to check the annotation quality of the atomic tasks, and the annotators are rewarded based on the total number of atomic tasks in the video. Specifically, we reward 1 point per atomic task.

3.2.2 Action Relation Localization. Action relations include relations like "A-watch-B", "A-hold-B", "A-kick-B", etc., whose duration ranges from transient to long-lasting. Thus, annotating action relations with good quality requires temporal localization at frame-level precision, which needs an approach different from that for spatial relations. Moreover, action relations occur much less frequently than spatial relations, so there would be a large portion of segments without any action relation, undermining the benefit of working on segments.

While it is difficult to simplify the atomic task of temporally localizing action relations for a pair of objects, we can still reduce the overall workload by utilizing the object category information provided by the previous stage. Since the subjects of action relations are constrained to be humans and animals, we can automatically filter out many object pairs according to this constraint, removing a large number of atomic tasks from the annotation.
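A minimal sketch of this filtering step; the category sets below are only a partial illustration of the human/animal constraint, and the input format is an assumption.

HUMAN = {"adult", "child", "baby"}
ANIMAL = {"dog", "cat", "bird", "horse"}  # illustrative subset of the animal classes

def candidate_action_pairs(objects):
    """Keep only (subject, object) pairs whose subject can perform an action.

    `objects` is assumed to be a list of (track_id, category) tuples.
    """
    pairs = []
    for sid, scat in objects:
        if scat not in HUMAN | ANIMAL:
            continue  # only humans and animals can act as subjects
        for oid, _ in objects:
            if oid != sid:
                pairs.append((sid, oid))
    return pairs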

For quality control, since naturally only a small portion of object pairs are annotated with action relations, we mainly focus on checking the annotation quality of these object pairs to ensure high precision. To achieve considerable recall as well, we reward the annotators based on the number of relations localized in order to encourage them to find more. Specifically, we reward 1 point for viewing an atomic task and 8 points for localizing a relation.

4 JUSTIFICATIONS
In this section, we justify the effectiveness of our proposed solutions to the main issues discussed in Section 1. To this end, we show how each of the solutions reduces the workload and cost of annotating objects and relations in user-generated videos. The reported statistics in this section are based on the annotation of 10,000 videos using our proposed annotation pipeline, whose details will be further presented in Section 5.

4.1 Effectiveness of the Key-frame Generation Scheme
As shown in Table 1, the average number of manually labeled bounding-boxes is only 4.08 percent of the total number of bounding-boxes, meaning that the scheme can significantly reduce the workload and cost of bounding-box annotation. We can also infer that the average annotation frequency is every 24 frames, which is similar to that of some related works [6, 17] that generate key-frames at a fixed time interval. This suggests that the resulting bounding-box trajectories are at least of similar quality overall. However, our scheme actually produces trajectories of better quality by exploiting the valuable human labor more effectively, because it adaptively selects the key-frames where there are larger object or camera motions, or more severe illumination changes and occlusion.

Table 1: The percentage (%) of bounding-boxes generated by manual labeling, the KCF [8] tracker, and the MOSSE [1] tracker (the remainder is generated by linear interpolation) for each object category. After sorting by the percentage of manually labeled bounding-boxes, the top-5 and bottom-5 categories are shown in the 1st and 2nd groups of the table. The last row shows the statistics over all categories.

category         manual (%)   KCF (%)   MOSSE (%)
piano                  2.03     50.57       43.56
baby seat              2.35     53.41       41.82
panda                  2.50     62.41       28.89
guitar                 2.50     50.51       39.52
toilet                 2.52     44.49       44.54
ski                   13.90     33.27       24.23
bat                   14.06     30.86       14.60
surfboard             19.51     30.78       15.04
frisbee               21.28     31.86        6.35
racket                22.08     26.71       12.37
all categories         4.08     56.39       31.19

For example, Table 1 demonstrates that the number of key-frames generated for active objects (e.g., frisbee and racket) is generally larger than that for inactive objects (e.g., piano and baby seat). Even for a single object, the scheme can also generate key-frames adapted to the situations where tracking the object is difficult.

On the other hand, with the assistance of human intelligence, the trackers used in our scheme can automatically produce a large proportion (87.58%, as indicated in the last row of the table) of bounding-boxes with sufficient confidence and quality. For training object detectors, these bounding-boxes can be used as data augmentation to train robust classification models, since they provide a massive quantity of varied regions over the objects in addition to the accurately localized ones. Moreover, the quality of the automatically produced bounding-boxes can be controlled by a few parameters. In our annotation setting, we require the overlap between the two trajectories produced by forward and backward tracking to be at least 0.5 in vIoU. Hence, more accurate bounding-box trajectories can be obtained by setting the vIoU requirement higher. However, this reduces the number of bounding-boxes produced automatically by the trackers, so more human assistance is needed.

We can also see from Table 1 that the two trackers (i.e., KCF and MOSSE) compensate for each other's weaknesses to achieve good tracking results, but 8.34% of the bounding-boxes are still generated by linear interpolation due to tracking failures at those frames. This is because existing trackers still have limitations when the tracked object is small with large motion, or undergoes severe deformation. However, as our scheme interactively and selectively asks humans to label the bounding-boxes at the frames that are difficult for existing trackers, we can use these bounding-boxes as hard training samples to develop more robust visual trackers. This can further reduce the workload and cost of annotation in the future.

4.2 Effectiveness of Splitting Spatial and Action Relation Localization
As mentioned in Section 2.1, there are two approaches to temporally localizing relations in video.



Table 2: The number of trimmed video clips and their total duration for the (subject, object) pairs that need to be annotated (the 1st group) and for the relations finally annotated (the 2nd group). A trimmed video clip is a complete and continuous clip in which the pair and the relation triplet exist. We show the statistics of the (subject, object) pairs of all categories in the 1st row, while in the 2nd row we constrain the categories of the subjects to humans and animals.

                         # of trimmed clips   duration (seconds)
(all, all)                          429,064            3,667,493
(human & animal, all)               263,520            2,265,280
spatial relation                    272,432            2,630,625
action relation                     106,114              440,549

One is to partition a video clip of interest into shorter overlapping segments (3-second segments with 1 second of overlap in our annotation setting) and merge the annotation results at the segment level to form the overall temporal localization, leaving annotators with simple selection tasks over the segments. The other approach is to directly present the trimmed video clip to the annotators and let them find and localize the relations of interest, which is a harder task but produces more accurate annotation results for shorter relations. In our pipeline, we split the annotation of spatial and action relations and apply different approaches in order to exploit their respective characteristics for less annotation workload and cost. Therefore, we justify the effectiveness of this solution based on an analysis of the reward mechanism in Section 3.2 and the statistics provided in Table 2.

First, we argue that adopting the first approach rather than the second one saves much cost in spatial relation annotation. According to the total duration in the 1st row of Table 2, we can partition all the trimmed video clips into around 1.83 million 3-second segments with an overlap of 1 second. Thus, the first approach would require a cost of 1.83 million points based on the reward mechanism. However, according to the number of trimmed clips in the 1st and 3rd rows of the table, the second approach would require a cost of 0.43 + 0.27 × 8 = 2.59 million points, which is 42% more than that required by the adopted method. On the other hand, spatial relations usually have long durations, because the relative spatial state between two objects generally does not switch quickly. We can also see from the 3rd row of Table 2 that the average duration of the finally annotated spatial relations is 9.7s, which is much longer than that of action relations. So using the first approach with a 1-second granularity does not significantly decrease the accuracy of temporal localization for spatial relations.

Second, since the subjects of action relations can only be humans or animals, separating the annotation of action relations from spatial relations naturally reduces the number of trimmed video clips that need to be annotated. Comparing the first two rows of Table 2, we can see that this workload is reduced by 39%. Regarding cost, we can show that the aforementioned two approaches cost roughly the same for the annotation of action relations. For the first approach, according to the total duration in the 2nd row, we can partition all the trimmed video clips into around 1.13 million 3-second segments with an overlap of 1 second, and thus this approach costs 1.13 million points. For the second approach, according to the number of trimmed video clips in the 2nd and 4th rows, it would require a cost of 0.26 + 0.11 × 8 = 1.14 million points.
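As a sanity check, the quoted segment counts follow (approximately) from dividing the pooled durations in Table 2 by the 2-second stride of the overlapping 3-second windows, ignoring per-clip boundary effects:

\begin{align*}
\text{spatial, segment-based: } & \tfrac{3{,}667{,}493\ \text{s}}{2\ \text{s per segment}} \approx 1.83\text{M segments} \Rightarrow \approx 1.83\text{M points},\\
\text{spatial, direct: } & 0.43\text{M clips} \times 1 + 0.27\text{M relations} \times 8 \approx 2.59\text{M points},\\
\text{action, segment-based: } & \tfrac{2{,}265{,}280\ \text{s}}{2\ \text{s per segment}} \approx 1.13\text{M segments} \Rightarrow \approx 1.13\text{M points},\\
\text{action, direct: } & 0.26\text{M clips} \times 1 + 0.11\text{M relations} \times 8 \approx 1.14\text{M points}.
\end{align*}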

Figure 3: Statistics of the types of relation triplets in the train/val set. The dark area shows the number and portion of triplet types unique to the training set; the grey area shows those appearing in both the training and validation sets; and the light area shows those unique to the validation set.

While there is little saving for the first approach, it still cannot offset the loss in accuracy of temporal localization, especially for transient relations such as "A-hit-B" and "A-throw-B". Hence, the approach used in action relation annotation should be different from that used in spatial relation annotation. This verifies the effectiveness of the proposed solution that splits the annotation of spatial and action relations.

5 THE DATASET: VidOR
As one of the major contributions of this paper, we present a novel large-scale video dataset, named Video Object Relation (VidOR), comprising 10,000 user-generated videos with dense annotations of 80 categories of objects and 50 categories of predicates. Remarkably, using the annotation pipeline introduced in the previous sections, we are able to construct the dataset at a modest cost. Figure 6 shows some examples from the dataset with visualizations of the annotations. In the rest of the paper, we further analyze the characteristics of VidOR from several aspects and discuss several benchmark tasks based on it.

5.1 Video Source
We source the videos for VidOR from YFCC-100M [23], a large, publicly and freely accessible multimedia collection containing 0.8 million videos from Flickr. Most of the videos are user-generated, with content ranging from indoor to outdoor, from daily life to business occasions, and from low quality to professional quality. According to our statistics, the average length of the videos is around 30 seconds, but the actual video length ranges from several seconds to minutes. In order to avoid extreme cases that can cause huge difficulty in annotation, we filter out videos that have one of the following properties:

• extremely low resolution that cannot be properly viewed;
• heavily shaking camera motions in most parts of video;
• heavy artificial effect;
• containing only one object from the 80 categories;
• containing a crowd of objects from the 80 categories.

Finally, we selected 10,000 qualified videos from the source and used them for the annotation.

5.2 Dataset Split
To use the dataset for model development, we split it into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing.



Figure 4: Object statistics per category in the train/val set. The categories are grouped into three upper-level categories: Human (3), Animal (28) and Other (49). The number of objects in the Human categories accounts for 56.34% of the total object occurrences, while those in the Animal and Other categories account for 35.78% and 7.98%, respectively.

Figure 5: Predicate statistics per category in the train/val set. Each bar indicates the number of relations whose predicate belongs to that category. The two types of predicates (i.e., spatial (8) and action (42)) are highlighted in different colors. Their proportions are 76.77% and 23.23%, respectively.

Also, we ensure that all the annotated categories of objects and predicates appear in each of the train/val/test sets.

5.3 Object Categories
We determine the set of object categories to annotate based on the categories used in the prevailing object detection datasets [14, 19]. In order to capture the diversity of human-centric relations, we further divide the common class of human (a.k.a. person) in these datasets into adult, child and baby. This results in a set of 80 object categories, which can be found in Figure 4.

Overall, there are 38,602 objects annotated in the train/val set, meaning that the average number of objects annotated per video is 4.9. Additionally, the number of objects per category roughly follows a long-tail distribution, as shown in Figure 4. We can also see that more than half of the objects belong to the human categories.

5.4 Predicate Categories
Our selection of predicate categories to annotate is inspired by [6]. We selected 50 categories of basic predicates (shown in Figure 5), including 42 categories of atomic action predicates and 8 categories of common spatial predicates. This is different from [20], in which most of the predicate categories are generated by combining two categories from a small set of action and spatial predicate categories.

As ambiguity about the viewpoint exists in the definition of spatial relations, we normalize the viewpoint to be from the object instead of the camera, because it carries more semantic information. For example,

the predicate "car-behind-child" indicates that the car is behind the child's back regardless of the camera position. In the case that the orientation of an object is inapplicable (e.g., a ball), predicates such as "in front of" and "behind" will not be annotated.

We count the number of relations with respect to the predicate category in the train/val set, and show the statistics in Figure 5. It can be seen that the spatial relations account for a large proportion; meanwhile, the "watch" action relation has the same order of magnitude as most of the spatial relations. In total, there are 297,352 relations annotated in the train/val set. On average, there are 29.2 spatial relations and 8.8 action relations per video.

5.5 Relation Statistics
In addition to the relation statistics based on the type of predicate, we further look into the relation statistics at the triplet level. Through the combination of object pairs and predicates, there would be 320,000 possible types of relation triplets, not counting some impossible combinations. In fact, we can see from Figure 3 that there are 6,258 types of relation triplets (dark and grey areas) appearing in the training set, and 2,410 types (grey and light areas) appearing in the validation set.

Moreover, 2,115 types of relation triplets appear in both the training and validation sets, while 295 types of relation triplets only appear in the validation set. This forms a zero-shot learning scenario that requires the models to predict labels that they do not see during the training phase, which has been preliminarily studied in several works.



Figure 6: Several video examples from the VidOR dataset. The examples cover content from indoor to outdoor scenes, including events like sports, parties and weddings, etc. For each video example, we show two key-frames to reflect the changes of relations. To visualize the annotations, for example <A, watch, B>, the bounding-boxes of A and B are displayed around the corresponding objects in different colors, and the relation is displayed within A's bounding-box in the color of B's bounding-box.

In the validation set, there are 30,142 relation instances in total, among which 641 relation instances belong to the 295 unseen triplet types.
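A small sketch of how such triplet-type statistics can be computed from per-video annotation files; the JSON field names below are assumptions about the annotation layout rather than the exact released format.

import json

def triplet_types(annotation_files):
    """Collect the set of (subject category, predicate, object category) triplets.

    Assumes each file is JSON with a track-id -> category list and a list of
    relation instances; the field names are hypothetical.
    """
    types = set()
    for path in annotation_files:
        with open(path) as f:
            ann = json.load(f)
        cat = {obj["tid"]: obj["category"] for obj in ann["subject/objects"]}
        for rel in ann["relation_instances"]:
            types.add((cat[rel["subject_tid"]], rel["predicate"], cat[rel["object_tid"]]))
    return types

# train_types = triplet_types(train_files); val_types = triplet_types(val_files)
# shared = train_types & val_types       # triplet types seen in training
# zero_shot = val_types - train_types    # triplet types unseen in training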

5.6 Benchmark Tasks
We discuss and define three application scenarios in which VidOR can be readily used for benchmarking.

Video Object Detection is the first step towards relation understanding in videos. Its primary goal is to develop robust object detectors that can not only localize objects from the 80 categories with bounding-boxes in every video frame, but also link the bounding-boxes that indicate the same object entity into a trajectory. The challenges in VidOR require the detectors to understand which object entities are in a video and how their locations change over time, and to technically overcome difficulties such as free camera motion, illumination changes, and object deformation.

The task of Video Visual Relation Detection (VidVRD), as defined in [20], is well suited to VidOR. It aims to detect relations from the 80 object and 50 predicate categories, and to spatio-temporally localize them by localizing the bounding-box trajectories of the subject and object within the maximal duration of the relation. In VidOR, the detection of action relations requires the detectors to be able to recognize the object of an action, possibly by recognizing the object's response. As for spatial relations, the detectors are required to recognize the orientation of objects (if applicable) and the relative spatial configuration between the subject and object.

While VidOR is initially constructed for object and relation recognition, we note that a sufficient number of humans and 42 categories of common actions are annotated in the dataset. So the dataset can be adapted as a benchmark for the task of action detection as defined in [6], which aims to detect actions from the 42 categories and spatio-temporally localize the bounding-box trajectory of the subject within the maximal duration of the action. It typically requires the action detectors to overcome the large variation within each category of action representation and to learn the intention of the action.

6 CONCLUSION
We proposed an annotation pipeline to annotate objects and relations in user-generated videos at large scale. The pipeline addresses key issues in key-frame generation and task decomposition in order to scale up the annotation at a modest cost. To demonstrate its effectiveness, we annotated 10,000 real-world videos and analyzed the statistics from the annotation procedure. Furthermore, we presented a novel large-scale video dataset¹ constructed by the proposed annotation pipeline. With its dense annotations at the object and relation level, the dataset can serve as a benchmark for many multimedia tasks in fine-grained video analysis, and facilitate the research on bridging the gap between vision and language.

ACKNOWLEDGMENTS
This research is part of the NExT++ project, supported by the National Research Foundation, Prime Minister's Office, Singapore under its IRC@SG Funding Initiative.

¹ Available at https://lms.comp.nus.edu.sg/research/vidor.html



REFERENCES
[1] David S. Bolme, J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. 2010. Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2544–2550.
[2] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 961–970.
[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4724–4733.
[4] Jingjing Chen and Chong-Wah Ngo. 2016. Deep-based ingredient recognition for cooking recipe retrieval. In ACM International Conference on Multimedia (MM). ACM, 32–41.
[5] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. 2009. Pedestrian detection: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 304–311.
[6] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6047–6056.
[7] Fabian Caba Heilbron and Juan Carlos Niebles. 2014. Collecting and annotating human activities in web videos. In ACM International Conference on Multimedia Retrieval (ICMR). ACM, 377.
[8] João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 3 (2015), 583–596.
[9] Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2017. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding 155 (2017), 1–23.
[10] Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, and Alexander C. Loui. 2011. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ACM International Conference on Multimedia Retrieval (ICMR). ACM, 29.
[11] Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, and Xiaogang Wang. 2017. Object detection in videos with tubelet proposal networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In IEEE International Conference on Computer Vision (ICCV). 706–715.
[13] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). Springer, 740–755.
[15] Ziwei Liu, Sijie Yan, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2016. Fashion landmark detection in the wild. In European Conference on Computer Vision (ECCV). Springer, 229–245.
[16] David Mihalcik and David Doermann. 2003. The design and implementation of ViPER. University of Maryland (2003), 234–241.
[17] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. 2018. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In European Conference on Computer Vision (ECCV).
[18] Matteo Ruggero Ronchi and Pietro Perona. 2015. Describing common human visual actions in images. In British Machine Vision Conference (BMVC). BMVA Press, Article 52, 12 pages.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[20] Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. 2017. Video visual relation detection. In ACM International Conference on Multimedia (MM). ACM, 1300–1308.
[21] Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. 2017. Weakly supervised dense video captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 5159–5167.
[22] Xu Sun, Yuantian Wang, Tongwei Ren, Zhi Liu, Zheng-Jun Zha, and Gangshan Wu. 2018. Object trajectory proposal via hierarchical volume grouping. In ACM International Conference on Multimedia Retrieval (ICMR). ACM, 344–352.
[23] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Communications of the ACM 59, 2 (2016), 64–73.
[24] Carl Vondrick, Donald Patterson, and Deva Ramanan. 2013. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 101, 1 (2013), 184–204.
[25] Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, and Liang Lin. 2018. Interpretable video captioning via trajectory structured localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6829–6837.
[26] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5288–5296.
[27] Ziwei Yang, Yahong Han, and Zheng Wang. 2017. Catching the temporal regions-of-interest for video captioning. In ACM International Conference on Multimedia (MM). ACM, 146–153.
[28] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision 126, 2-4 (2018), 375–389.
[29] Huanyu Yu, Shuo Cheng, Bingbing Ni, Minsi Wang, Jian Zhang, and Xiaokang Yang. 2018. Fine-grained video captioning for sports narrative. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6006–6015.
[30] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2017. End-to-end concept word detection for video captioning, retrieval, and question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3261–3269.
[31] Jenny Yuen, Bryan Russell, Ce Liu, and Antonio Torralba. 2009. LabelMe video: Building a video database with human annotations. In IEEE International Conference on Computer Vision (ICCV). IEEE, 1451–1458.
[32] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. 2017. Flow-guided feature aggregation for video object detection. In IEEE International Conference on Computer Vision (ICCV). IEEE, 408–417.
