Crowded Scene Understanding by Deeply Learned Attributes∗
Jing Shao1 Kai Kang1 Chen Change Loy2 Xiaogang Wang1
1Department of Electronic Engineering, The Chinese University of Hong Kong    2Department of Information Engineering, The Chinese University of Hong Kong
[email protected], [email protected], [email protected], [email protected]
1. Introduction

During the last decade, the field of crowd analysis has witnessed a remarkable evolution in crowded scene understanding, including crowd behavior analysis [13, 6, 7, 10, 8, 14, 16], crowd tracking [1, 9, 17], and crowd segmentation [2, 3, 15]. Much of this progress was sparked by the creation of crowd datasets, as well as by new and robust features and models for profiling intrinsic crowd properties. Most of the above studies on crowd understanding are scene-specific: the crowd model is learned from a specific scene and thus generalizes poorly to other scenes.
Attributes are particularly effective for characterizing generic properties across scenes. In recent years, attribute-based representations of objects, faces, actions, and scenes have drawn considerable attention as an alternative or complement to categorical representations, since they characterize the target subject by several attributes rather than by discriminative assignment to a single category, which is too restrictive to describe the subject's nature. Furthermore, scientific studies have shown that different crowd systems share similar principles that can be characterized by common properties or attributes. Indeed, attributes can express more information about a crowd video: they describe it by answering "Who is in the crowd?", "Where is the crowd?", and "Why is the crowd here?", rather than merely assigning a categorical scene or event label. For instance, an attribute-based representation might describe a crowd video as the "conductor" and "choir" performing on the "stage" with the "audience" "applauding", in contrast to a categorical label like "chorus". Recently, some works [10, 16] have made efforts toward crowd attribute profiling, but the number of attributes in these works is limited, and the datasets are small in terms of scene diversity.
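The contrast between a categorical label and an attribute-based representation can be made concrete as a multi-hot vector over a fixed attribute vocabulary. The following is a toy sketch only; the attribute names are a small hypothetical subset, not the paper's full 94-attribute list:

```python
# Hypothetical mini-vocabulary for illustration (the paper defines 94 attributes).
ATTRIBUTES = ["stage", "conductor", "choir", "audience", "applaud", "street", "protestor"]

def encode(present_attrs):
    """Encode a crowd video as a multi-hot vector over the attribute vocabulary."""
    return [1 if a in present_attrs else 0 for a in ATTRIBUTES]

# Categorical view: a single exclusive label.
categorical_label = "chorus"

# Attribute view: several co-occurring Who/Where/Why properties of the same video.
chorus_video = encode({"stage", "conductor", "choir", "audience", "applaud"})
print(chorus_video)  # [1, 1, 1, 1, 1, 0, 0]
```

Unlike the single label, the vector can be compared attribute-by-attribute across otherwise unrelated scenes.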
2. Methodology and Experiment

In this paper, we introduce a new large-scale crowd video dataset with crowd attribute annotation designed to understand crowded scenes. We exploit deep models to learn the features for each attribute from the appearance and motion information of each video, and apply the learned models to recognize attributes in unseen crowd videos.

The largest crowd dataset with crowd attribute annotation. To the best of our knowledge, the Who do What at someWhere (WWW) Crowd Dataset1 is the largest crowd dataset to date. It contains 10,000 videos from 8,257 crowded scenes. The videos in the WWW Crowd Dataset are all from the real world, collected from various sources and captured by diverse kinds of cameras. We further define 94 meaningful attributes as high-level crowd scene representations, shown in Fig. 1. These attributes are guided by the tag information of the crowd videos from the Internet; they cover common crowded places, subjects, actions, and events.

Figure 1. A quick glance at the WWW Crowd Dataset with its attributes. Red represents the location (Where), green represents the subject (Who), and blue refers to the event/action (Why). The area of each word is proportional to the frequency of that attribute in the WWW dataset.

∗The long version, "Deeply Learned Attributes for Crowded Scene Understanding", is presented in the main conference as an oral presentation.
1 http://www.ee.cuhk.edu.hk/~jshao/WWWcrowd.html
Figure 2. Deep model. The appearance and motion channels are input to two separate branches with the same deep architecture. Each branch consists of multiple layers of convolution (blue), max pooling (green), and normalization (orange), plus one fully-connected layer (red). The two branches then fuse into one shared fully-connected layer (red).
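The two-branch fusion in the caption can be sketched in plain Python. This is a deliberately simplified stand-in with random, untrained weights and toy dimensions: each branch's convolution/pooling/normalization stack is abstracted into a single fully-connected layer, which is not the paper's actual architecture:

```python
import math
import random

random.seed(0)

def dense(x, out_dim):
    """Stand-in fully-connected layer with random (untrained) weights."""
    return [sum(random.uniform(-1, 1) * xi for xi in x) for _ in range(out_dim)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

# Toy inputs standing in for the appearance and motion channels of one video.
appearance = [random.random() for _ in range(32)]
motion = [random.random() for _ in range(32)]

# Each branch: its own (abstracted) conv/pool/norm stack + fully-connected layer.
app_feat = relu(dense(appearance, 8))
mot_feat = relu(dense(motion, 8))

# Fusion: concatenate the two branch outputs, then one shared fully-connected
# layer emits a score per attribute (94, matching the paper's attribute count).
scores = sigmoid(dense(app_feat + mot_feat, 94))
print(len(scores))  # 94
```

The design choice illustrated here is late fusion: each modality is transformed independently before a single shared layer jointly scores all attributes.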
Extensive experimental evaluation. Since videos possess motion information in addition to appearance, we examine deeply learned crowd features from both the appearance and motion aspects. Instead of directly inputting a single frame or multiple frames to the deep neural network, we propose motion feature channels inspired by [10] as the input of the deep model, and develop a multi-task deep model to jointly learn and combine appearance and the proposed motion features for crowded scene understanding. The network is shown in Fig. 2. In all experiments, we employ the area under the ROC curve (AUC) as the evaluation criterion.

1) Deeply learned static features (DLSF). To evaluate our DLSF from the appearance channels alone, we compare against a set of state-of-the-art hand-crafted static features (i.e., SIFT, GIST, HOG, SSIM, and LBP) that have been widely used in scene classification, denoted SFH in Table 1.

2) Deeply learned motion features (DLMF). We also report the performance of the deeply learned motion features in Table 1, compared with two baselines: the histogram of our proposed motion descriptor (MDH) and dense trajectories (DenseTrack) [12].

3) Combined appearance and motion features. The deep model combining DLSF and DLMF is compared with five baselines: two combinations of appearance and motion (i.e., SFH+MDH and SFH+DenseTrack), a hand-crafted feature extracting spatio-temporal motion patterns (STMP) [5], and two state-of-the-art deep models (i.e., Slow Fusion [4] and Two-stream [11]).
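For reference, the per-attribute AUC criterion can be computed directly from its pairwise-ranking definition. This is a minimal sketch (the paper does not specify its AUC implementation): AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking definition.

    labels: 0/1 ground truth for one attribute; scores: predicted scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-negative pairs ranked correctly; ties contribute 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0  (perfect ranking)
print(auc([1, 0, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # 0.25 (mostly inverted)
```

One such value is computed per attribute; the mean AUC over all 94 attributes gives the summary numbers reported in Table 1.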
The experimental results with the proposed deep model show that our attribute-centric crowd dataset allows us to perform better on traditional crowded scene understanding, and offers potential for cross-scene event detection and crowd video retrieval.
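One way such retrieval could work, sketched here under our own assumptions rather than as the paper's protocol, is to treat each video's predicted attribute scores as a compact descriptor and rank database videos by cosine similarity to a query:

```python
import math

def cosine(a, b):
    """Cosine similarity between two attribute-score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query, database):
    """Rank (name, scores) entries by attribute-space similarity to the query."""
    return sorted(database, key=lambda item: cosine(query, item[1]), reverse=True)

# Toy 4-attribute scores (hypothetical), e.g. [stage, choir, street, protestor].
query = [0.9, 0.8, 0.1, 0.0]                      # a chorus-like query video
db = [("parade", [0.1, 0.0, 0.9, 0.2]),
      ("concert", [0.8, 0.7, 0.2, 0.1]),
      ("protest", [0.0, 0.1, 0.8, 0.9])]
print([name for name, _ in retrieve(query, db)])  # ['concert', 'parade', 'protest']
```

Because the descriptor is scene-independent, the same comparison applies across videos from entirely different scenes, which is what makes cross-scene retrieval plausible.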
User study on the WWW dataset. Appearance and motion cues play different roles in crowded scene understanding. We further conduct a user study to measure how accurately humans can recognize crowd attributes, and with which type of data users achieve the highest accuracy. This study is essential to provide a reference evaluation for our empirical experiments. Specifically, it is interesting to see how human perception (when given different data types) correlates with the results of computational models.

Our Methods    mean AUC   Baselines            mean AUC   # wins
DLSF           0.87       SFH                  0.81       67/94
DLMF           0.68       MDH                  0.58       85/94
                          DenseTrack [12]      0.63       72/94
DLSF + DLMF    0.88       SFH+MDH              0.80       78/94
                          SFH+DenseTrack       0.82       72/94
                          STMP [5]             0.72       89/94
                          Slow Fusion [4]      0.81       74/94
                          Two-stream [11]      0.76       89/94

Table 1. Comparison of deeply learned features with baselines. The last column shows the number of attributes (out of the total of 94) on which our proposed deep features achieve higher AUC than the baselines.
References
[1] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In ECCV, 2008.
[2] A. B. Chan and N. Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. TPAMI, 30(5):909–926, 2008.
[3] K. Kang and X. Wang. Fully convolutional neural networks for crowd segmentation. arXiv preprint arXiv:1411.4464, 2014.
[4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[5] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In CVPR, 2009.
[6] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity correlation analysis. In CVPR, 2009.
[7] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In CVPR, 2010.
[8] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
[9] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert. Data-driven crowd analysis in videos. In ICCV, 2011.
[10] J. Shao, C. C. Loy, and X. Wang. Scene-independent group profiling in crowd. In CVPR, 2014.
[11] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[12] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[13] X. Wang, X. Ma, and W. E. L. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. TPAMI, 31(3):539–555, 2009.
[14] S. Yi, X. Wang, C. Lu, and J. Jia. L0 regularized stationary time estimation for crowd group analysis. In CVPR, 2014.
[15] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, 2015.
[16] B. Zhou, X. Tang, H. Zhang, and X. Wang. Measuring crowd collectiveness. TPAMI, 36(8):1586–1599, 2014.
[17] F. Zhu, X. Wang, and N. Yu. Crowd tracking with dynamic evolution of group structures. In ECCV, 2014.