Crowded Scene Understanding by Deeply Learned Attributes∗
Jing Shao1 Kai Kang1 Chen Change Loy2 Xiaogang Wang1
1Department of Electronic Engineering, The Chinese University of Hong Kong    2Department of Information Engineering, The Chinese University of Hong Kong
[email protected], [email protected], [email protected], [email protected]
1. Introduction

During the last decade, the field of crowd analysis has witnessed a remarkable evolution in crowded scene understanding, including crowd behavior analysis [13, 6, 7, 10, 8, 14, 16], crowd tracking [1, 9, 17], and crowd segmentation [2, 3, 15]. Much of this progress was sparked by the creation of crowd datasets, as well as by new and robust features and models for profiling intrinsic crowd properties. Most of the above studies on crowd understanding are scene-specific: the crowd model is learned from a specific scene and thus generalizes poorly to other scenes.
Attributes are particularly effective for characterizing generic properties across scenes. In recent years, attribute-based representations of objects, faces, actions, and scenes have drawn considerable attention as an alternative or complement to categorical representations, since they characterize the target subject by several attributes rather than by discriminative assignment to a single category, which is too restrictive to describe the subject's nature. Furthermore, scientific studies have shown that different crowd systems share similar principles that can be characterized by common properties or attributes. Indeed, attributes can express more information about a crowd video: they describe it by answering "Who is in the crowd?", "Where is the crowd?", and "Why is the crowd here?", rather than merely assigning a categorical scene or event label. For instance, an attribute-based representation might describe a crowd video as the "conductor" and "choir" performing on the "stage" with the "audience" "applauding", in contrast to a categorical label like "chorus". Recently, some works [10, 16] have made efforts toward crowd attribute profiling, but the number of attributes in these works is limited, and the datasets are small in terms of scene diversity.
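The contrast between a categorical label and an attribute-based representation can be made concrete as a multi-hot vector over a fixed attribute vocabulary. The following is a toy sketch only; the attribute names are a small hypothetical subset, not the paper's full 94-attribute list:

```python
# Hypothetical mini-vocabulary for illustration (the paper defines 94 attributes).
ATTRIBUTES = ["stage", "conductor", "choir", "audience", "applaud", "street", "protestor"]

def encode(present_attrs):
    """Encode a crowd video as a multi-hot vector over the attribute vocabulary."""
    return [1 if a in present_attrs else 0 for a in ATTRIBUTES]

# Categorical view: a single exclusive label.
categorical_label = "chorus"

# Attribute view: several co-occurring Who/Where/Why properties of the same video.
chorus_video = encode({"stage", "conductor", "choir", "audience", "applaud"})
print(chorus_video)  # [1, 1, 1, 1, 1, 0, 0]
```

Unlike the single label, the vector can be compared attribute-by-attribute across otherwise unrelated scenes.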
2. Methodology and Experiment

In this paper, we introduce a new large-scale crowd video dataset with crowd attribute annotation designed to understand crowded scenes. We exploit deep models to learn the features for each attribute from the appearance and motion information of each video, and apply the learned models to recognize attributes in unseen crowd videos.

The largest crowd dataset with crowd attribute annotation. To the best of our knowledge, the Who do What at someWhere (WWW) Crowd Dataset1 is the largest crowd dataset to date. It contains 10,000 videos from 8,257 crowded scenes. The videos in the WWW Crowd Dataset are all from the real world, collected from various sources and captured by diverse kinds of cameras. We further define 94 meaningful attributes as high-level crowd scene representations, shown in Fig. 1. These attributes are guided by the tag information of the crowd videos from the Internet; they cover common crowded places, subjects, actions, and events.

Figure 1. A quick glance at the WWW Crowd Dataset with its attributes. Red represents the location (Where), green represents the subject (Who), and blue refers to the event/action (Why). The area of each word is proportional to the frequency of that attribute in the WWW dataset.

∗The long version, "Deeply Learned Attributes for Crowded Scene Understanding", is presented in the main conference as an oral presentation.
1 http://www.ee.cuhk.edu.hk/~jshao/WWWcrowd.html
Figure 2. Deep model. The appearance and motion channels are input to two separate branches with the same deep architecture. Each branch consists of multiple layers of convolution (blue), max pooling (green), and normalization (orange), plus one fully-connected layer (red). The two branches then fuse into one shared fully-connected layer (red).
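The two-branch fusion in the caption can be sketched in plain Python. This is a deliberately simplified stand-in with random, untrained weights and toy dimensions: each branch's convolution/pooling/normalization stack is abstracted into a single fully-connected layer, which is not the paper's actual architecture:

```python
import math
import random

random.seed(0)

def dense(x, out_dim):
    """Stand-in fully-connected layer with random (untrained) weights."""
    return [sum(random.uniform(-1, 1) * xi for xi in x) for _ in range(out_dim)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

# Toy inputs standing in for the appearance and motion channels of one video.
appearance = [random.random() for _ in range(32)]
motion = [random.random() for _ in range(32)]

# Each branch: its own (abstracted) conv/pool/norm stack + fully-connected layer.
app_feat = relu(dense(appearance, 8))
mot_feat = relu(dense(motion, 8))

# Fusion: concatenate the two branch outputs, then one shared fully-connected
# layer emits a score per attribute (94, matching the paper's attribute count).
scores = sigmoid(dense(app_feat + mot_feat, 94))
print(len(scores))  # 94
```

The design choice illustrated here is late fusion: each modality is transformed independently before a single shared layer jointly scores all attributes.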
Extensive experimental evaluation. Since videos possess motion information in addition to appearance, we examine deeply learned crowd features from both the appearance and motion aspects. Instead of directly inputting a single frame or multiple frames to the deep neural network, we propose motion feature channels inspired by [10] as the input of the deep model, and develop a multi-task deep model to jointly learn and combine appearance and the proposed motion features for crowded scene understanding. The network is shown in Fig. 2. In all experiments, we employ the area under the ROC curve (AUC) as the evaluation criterion.

1) Deeply learned static features (DLSF). To evaluate our DLSF from the appearance channels alone, we compare against a set of state-of-the-art hand-crafted static features (i.e., SIFT, GIST, HOG, SSIM, and LBP) that have been widely used in scene classification, denoted SFH in Table 1.

2) Deeply learned motion features (DLMF). We also report the performance of the deeply learned motion features in Table 1, compared with two baselines: the histogram of our proposed motion descriptor (MDH) and dense trajectories (DenseTrack) [12].

3) Combined appearance and motion features. The deep model combining DLSF and DLMF is compared with five baselines: two combinations of appearance and motion (i.e., SFH+MDH and SFH+DenseTrack), a hand-crafted feature extracting spatio-temporal motion patterns (STMP) [5], and two state-of-the-art deep models (i.e., Slow Fusion [4] and Two-stream [11]).
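For reference, the per-attribute AUC criterion can be computed directly from its pairwise-ranking definition. This is a minimal sketch (the paper does not specify its AUC implementation): AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking definition.

    labels: 0/1 ground truth for one attribute; scores: predicted scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-negative pairs ranked correctly; ties contribute 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0  (perfect ranking)
print(auc([1, 0, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # 0.25 (mostly inverted)
```

One such value is computed per attribute; the mean AUC over all 94 attributes gives the summary numbers reported in Table 1.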
The experimental results with the proposed deep model show that our attribute-centric crowd dataset allows us to perform better on traditional crowded scene understanding, and offers potential for cross-scene event detection and crowd video retrieval.
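One way such retrieval could work, sketched here under our own assumptions rather than as the paper's protocol, is to treat each video's predicted attribute scores as a compact descriptor and rank database videos by cosine similarity to a query:

```python
import math

def cosine(a, b):
    """Cosine similarity between two attribute-score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query, database):
    """Rank (name, scores) entries by attribute-space similarity to the query."""
    return sorted(database, key=lambda item: cosine(query, item[1]), reverse=True)

# Toy 4-attribute scores (hypothetical), e.g. [stage, choir, street, protestor].
query = [0.9, 0.8, 0.1, 0.0]                      # a chorus-like query video
db = [("parade", [0.1, 0.0, 0.9, 0.2]),
      ("concert", [0.8, 0.7, 0.2, 0.1]),
      ("protest", [0.0, 0.1, 0.8, 0.9])]
print([name for name, _ in retrieve(query, db)])  # ['concert', 'parade', 'protest']
```

Because the descriptor is scene-independent, the same comparison applies across videos from entirely different scenes, which is what makes cross-scene retrieval plausible.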
User study on the WWW dataset. Appearance and motion cues play different roles in crowded scene understanding. We further conduct a user study to measure how accurately humans can recognize crowd attributes, and with which type of data users achieve the highest accuracy. This study is essential to provide a reference evaluation for our empirical experiments. Specifically, it is interesting to see how human perception (when given different data types) correlates with the results of computational models.

Our Methods    mean AUC   Baselines            mean AUC   # wins
DLSF           0.87       SFH                  0.81       67/94
DLMF           0.68       MDH                  0.58       85/94
                          DenseTrack [12]      0.63       72/94
DLSF + DLMF    0.88       SFH+MDH              0.80       78/94
                          SFH+DenseTrack       0.82       72/94
                          STMP [5]             0.72       89/94
                          Slow Fusion [4]      0.81       74/94
                          Two-stream [11]      0.76       89/94

Table 1. Comparison of deeply learned features with baselines. The last column shows the number of attributes (out of the total of 94) on which our proposed deep features achieve higher AUC than the baselines.
References
[1] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In ECCV, 2008.
[2] A. B. Chan and N. Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. TPAMI, 30(5):909–926, 2008.
[3] K. Kang and X. Wang. Fully convolutional neural networks for crowd segmentation. arXiv preprint arXiv:1411.4464, 2014.
[4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[5] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In CVPR, 2009.
[6] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity correlation analysis. In CVPR, 2009.
[7] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In CVPR, 2010.
[8] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
[9] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert. Data-driven crowd analysis in videos. In ICCV, 2011.
[10] J. Shao, C. C. Loy, and X. Wang. Scene-independent group profiling in crowd. In CVPR, 2014.
[11] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[12] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[13] X. Wang, X. Ma, and W. E. L. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. TPAMI, 31(3):539–555, 2009.
[14] S. Yi, X. Wang, C. Lu, and J. Jia. L0 regularized stationary time estimation for crowd group analysis. In CVPR, 2014.
[15] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, 2015.
[16] B. Zhou, X. Tang, H. Zhang, and X. Wang. Measuring crowd collectiveness. TPAMI, 36(8):1586–1599, 2014.
[17] F. Zhu, X. Wang, and N. Yu. Crowd tracking with dynamic evolution of group structures. In ECCV, 2014.