spatio-temporal texture modelling for real-time crowd anomaly …shura.shu.ac.uk/18876/3/wang...

Spatio-temporal Texture Modelling for Real-time Crowd Anomaly Detection

WANG, Jing <http://orcid.org/0000-0002-5418-0217> and XU, Zhijie

Available from Sheffield Hallam University Research Archive (SHURA) at:

http://shura.shu.ac.uk/18876/

This document is the author deposited version. You are advised to consult the publisher's version if you wish to cite from it.

Published version

WANG, Jing and XU, Zhijie (2016). Spatio-temporal Texture Modelling for Real-time Crowd Anomaly Detection. Computer Vision and Image Understanding, 144, 177-187.

Copyright and re-use policy

See http://shura.shu.ac.uk/information.html

Sheffield Hallam University Research Archivehttp://shura.shu.ac.uk

http://shura.shu.ac.uk/

http://shura.shu.ac.uk/information.html

*corresponding author

Tel: +44 (0)1484472156

Fax: +44 (0)1484422252

Email: [email protected]

Spatio-temporal Texture Modelling for Real-time Crowd Anomaly Detection

Jing Wang and Zhijie Xu*, Visualisation, Interaction and Vision Research Group,

University of Huddersfield

Abstract

With the rapidly increasing demands from surveillance and security industries, crowd behaviour

analysis has become one of the hotly pursued video event detection frontiers within the computer

vision arena in recent years. This research has investigated innovative crowd behaviour detection

approaches based on statistical crowd features extracted from video footages. In this paper, a new

crowd video anomaly detection algorithm has been developed based on analysing the extracted

spatio-temporal textures. The algorithm has been designed for real-time applications by deploying

low-level statistical features and avoiding complicated machine learning and recognition processes.

In the experiments, the system has been proved as a valid solution for detecting anomaly behaviours

without strong assumptions on the nature of crowds, for example, subjects and density. The

developed prototype shows improved adaptability and efficiency against chosen benchmark systems.

Keywords: crowd anomaly, spatio-temporal volume, spatio-temporal texture

1. Introduction The increasing demands of intelligent surveillance applications have triggered many innovative

developments in automated video event detection areas. These techniques, such as crowd anomaly

detection, can be used in monitoring and tracking emergency situations occurred in crowded scenes,

including busy motorways, high streets, sporting events, and open air concerts.

Modelling crowd scenes from video recordings present tough challenges due to severe occlusion

problems among crowd subjects. Traditional top-down approaches, which focused on accurately

tracking and identifying the so- alled a o al eha iou s of o d e tities, have been proved

inefficient and inaccurate for crowd-based anomaly detection [1]. Crowd scenes often contain

uncertainties such as changes of subject density, average subject size, shape, and boundaries of

these entities that can bring ambiguities to the definition and interpretation of the meaning and

natures of crowd anomalies. For tackling those problems, it has been widely acknowledged that the

modelling of all (or most of) the crowd behaviours through analysing their constituent subjects

characteristics in a chosen feature space is unavoidable (the so-called bottom-up process).

While the visual appearance of individual su je t’s behaviour in a crowd scene may vary, their intra-

/inter- group dynamics within certain quantifiable feature space are often statistically identical, for

example, bird flocks, fish schools, and even vehicles on motorway. The swarm behaviours are

constructed of si ila visual t ials for a studied scene. In the case of crowd anomaly analysis, such

as people’s sudde gathe i g o dispersion, complex interactions of crowd subjects changes the

appearances of the local/global image observations, which triggers this research to model the crowd

scene ha ges through analysing their overall grouping dynamics by forming a real-time (or near

real-time) anomaly warning applications.

It is worth noting that the normal a d a o al events are intrinsically ambiguous on semantic

level. Certain crowd behaviours are normal in one scenario but may become hazardous in others.

For example, crowds running in a marathon are o al , but the lot suddenly start running in a

shopping mall may present an accident and emergency scenario. Based on the nature of surveillance

applications, the occurrence of anomaly events usually counts a very small percentage of the entire

surveillance cycle and demands immediate verification and response. In this research, it is

considered reasonable to define normal crowd behaviours as dominate pattern. Instead of

composing complex event models for semantic interpretation, a normality crowd model in this

research can be learnt and self-updated by abstracting the visual features along its timeline.

The core research problem highlighted in this research is to abstract effective visual features which

can accurately describe normal crowd activities by learning marked instances along the spatial and

temporal domain, and to derive highly robust decision making algorithm in real-world settings. Due

to the application-oriented nature of this research, the system should also be easily adopted by

closed-circuit television (CCTV) systems and run in real-time with a minimum lag.

In this paper, a crowd anomaly detection algorithm has been introduced based on image textures

formulated by spatio-temporal information. The so-called spatio-temporal texture (STT) has been

proven as an effective dynamic representation media and mechanism for single human action

detection. This research explores its characteristics in maintaining the statistical consistency across

normal crowd events domain and its sensitivity to anomalies, or sudden changes. A redundancy

feature space has been built based on the STT structure through wavelet-based texture

representations, which allows a flexible multi-criteria binary decision making mechanism to be

constructed for detecting the crowd scene anomalies.

This paper is structured as follows: a literature review of the related crowd modelling techniques

have been introduced in Section 2. Section 3 focuses on the design of STT-based crowd modelling

approach. An innovative crowd anomaly detection algorithm has been developed in the research

and presented in Section 4. System evaluations against benchmarking approaches have been

highlighted in Section 5. Section 6 concludes the research with a discussion on the pros-and-cons of

the devised approach and envisaged future improvements.

2. Literature Review Crowd behaviour analysis using computer vision (CV) techniques have attracted attentions in the

research and application domains since 1990s [2].Various algorithms and techniques for assisting the

precision and speed of the processes were developed in the last two decades. In general, there are

three representative approaches: the individual feature-based; the flow field-oriented and spatio-

temporal feature driven approaches.

In the first category, crowd behaviour is often treated as an assembly of individual activities

aggregated from each crowd entity. For example, a crowd movement along a busy street can be

recognized as a group of people walking in a same direction. In general, these methods focused on

describing crowd attributes by locating, isolating and analysing each crowd member. Benefited from

recent development on machine learning theories and practices, those methods, such as tracking

and locating pedestrians through face detection [3] from crowd scenes, estimating crowd size

through head contour counting [4], have been proved as powerful individual-oriented tracking

strategies for complex video processing [2, 5]. The reported practical techniques in this category

treat the crowd as the complex dynamic background and paying more attention on the behaviours

of the individual crowd members.

Methods in the second category define the crowd scene as a dynamic flow field, which is the most

popular approach to-date, fo a al si g o d featu es. Ea l studies, su h as the Mi ko ski f a tal di e sio odel [6], and the flow- ased o d otio odel [7], had focus on the extraction of

crowd attributes from the vector fields to describe crowd density, moving directions, and boundaries.

In recent years, more attention has shifted towards application-oriented techniques to improve

crowd pattern interpretation [8-10]. In 2007, Ali [11] first introduced a crowd scene model based on

fi ite ti e L apu o e po e t field - an extension of the flow-filed model - for segmenting

extremely dense crowd scenes recorded in videos. The segmentation outputs are then been used in

the so- alled floo field odel al ulatio fo tracking specific individuals from high density human

crowds [12]. This model has also been applied in group tracking that containing multiple or

intersected crowd entities [13]. Rod iguez’s off-line dominating crowd moving direction learning

algorithm [14] has also been proven an effective flow-based tracking approach. Similar researches,

such as Crowd Kanade-Lucas-Tomasi (KLT) corners [15], multi-label optimisation [16] and Lagrangian

pa ti le t aje to ies o k-flo odel [17], developed anomaly crowd visual features from flow-

filed information. Those methods demonstrated their potentials in tracking the dynamic crowd

under extremely crowded and partial occluded conditions but are bound to pre-defined crowd

patterns (i.e., hu a crowd) and specific applications.

Different from the first category, the flow field-oriented techniques are mainly based on the so-

alled glo al o d otio s. The i pa t of the lo al a d i di idual o d e e s is ofte ignored. However, it is often observed that acts from localized entity or entity group can bring

significant changes to the crowd consistency and break the harmony in between the prior-

knowledge of the flow field and the dominant motions. For analysing crowd entity characteristics, a

more generic crowd model should be established for describing the interactions of both the local

and group entity features. This is also one of the motivations of this research.

The third significant approach defines the crowd scene videos as Spatio-temporal Volume (STV),

which combines global video dynamics into a three-dimensional feature space. For example, in 2009,

Benezeth [18] introduced a motion labelling method based on the co-occurrence of features defined

in STV. The model has been implemented as a potential function in the Markov Random Field (MRF)

process for anomaly detection. Kratz [19] introduced STV-based motion patterns in 3D volumetric

environment to highlight the spatial-temporal statistical characteristics of extremely crowded scenes.

In 2012, Bertini [20] developed a STV-based anomaly location detection approach through using

localised cuboids in an unsupervised learning framework across the entire STV domain. The method

has been proven a valid approach when modelling spatio-temporal features without parameter

settings.

Re e tl , a othe i po ta t post- p o essi g odel employing information fusion techniques has

been introduced into crowd behaviour understanding studies. Pilot research outputs, such as

Meh a ’s so ial fo e odel [21] a d its opti ised e sio s su h as i te a tio fo e [22, 23]

have been adopted in practices to serve as the preliminary assumptions for crowd behaviours and

crowd scene simulations. The model requires pre-defined conditions to be satisfied before operation,

for example, the majority of a crowd should move towards same target area or congregation

locations, which restricts its flexibility in real-world applications.

It is also worth noting that un-/self- supervised machine learning algorithms are becoming popular

for detecting anomaly crowd events. Those studies define normal crowd behaviours as dominate

pattern and anomaly events can be distinguished by learning those patterns. For example, Feng

introduced an online self-organizing map (SOM) [24] to model crowd scene, which keep updating its

patterns by using new on-line observations. Jiang [25] used unsupervised clustering algorithm to

compare individuals and its neighbours. Any different behaviours between them were recognised as

abnormalities.

In this research, an innovative anomaly crowd detection strategy based on the statistical crowd

features extracted from spatio-temporal video slices has been devised. The texture models contain

strong statistical characteristics for describing repetitiveness and randomness of recorded scenes,

which inherently combines local and global crowd entity features. It is based on the assumption that

a crowd scene and its related dynamic patterns can be encapsulated in texture models and to be

perceived by human perception and intuitions. While many similar approaches also combine texture

features with spatio-temporal information such as Mahadevan [26] and Ryan [27], the proposed

method can detect real-time anomaly events without time-consuming machine learning and

environment pre-assumptions.

3. Spatio-temporal Texture Modelling Spatio-temporal Texture (STT) model is a statistical model developed in this research. STT is sensitive

to the changes of crowd motion and can be used for monitoring crowd activates in real-time.

Figure 1 illustrates the main steps for defining STT from raw video data. In this pipeline, STT is

composed by using spatio-temporal volume and its slices. The slices contain image texture

information and can be analysed by using wavelet transforms. After transforming each slice from

spatio-temporal domain to wavelet space, the STT is then modelled by statistic and correlation

calculations on related wavelet sub-band images.

3.1. STV Texture Patterns

As illustrated in Figure 2, a Spatio-temporal Volume (STV) is defined in a 3D Cartesian space denoted

by X, Y, and T (time) axes. In this structure, the concept of an individual frame is replaced by a

continuous 3D volume section, in which its density, envelop and slices are all factors to the final

interpretation of the model.

The STV data structure transforms the video event detection process from a conventional 2D frame-

based mechanism into a 3D model analysis operation. Through this transformation, dynamic

Figure 1. STT construction pipeline

spatio-temporal

volume

wavelet

transformation

volume

slices

STTfeature

constructionvideostream

i fo atio of a o d’s o e e t a e ep ese ted the a iatio of 3D shapes, flo s o poi t clouds. Various pattern recognition, shape analysis and matching algorithms can be applied to the

volumetric natured crowd events.

As shown on the right hand side of Figure 2, a slice is generated by inserting a clipping plane at

chosen position (dash-line marked region) and going through the STV along the T axis. In this

research, the position and direction of each STV slice are controlled by the local crowd region

(shaded segments on the clipping plane), which will be explained in detail in Section 4.1.

3.2. Texture Pattern Similarity

Based on the viewpoint of human intuition, a static crowd texture contains spatially homogeneous

image regions composed of the crowd members in random locations, of varied colours, and sizes. As

illustrated in Figure 3(a), each frame marked by the dash lines has been cropped evenly into four

sub-regions displayed alongside the o igi al i age. The appea a e of ea h o d e e is different, but the sub-regions are quite similar and even visually indistinguishable. This similarity was

caused by the pre-attentive decision of the human observer and stemmed from human vision

biology and psychology. It is from this angle that this research set to investigate the spatial similarity

and of crowded scenes using extracted STV slices as pattern textures.

(a) crowd image scenes (b) crowd STV slice from Figure 2

Figure 3. Visual randomness and similarity of crowd images

The spatial attributes shown in Figure 3(a) are also applicable for modelling crowd dynamic. For

example, captured by the STV slice shown in Figure 3(b), sub-regions divided by dashed line denote

the different time sections along the video stream. Because the Marathon example used in Figure 2

does not have sudden changes in terms of crowd behaviours, although the sub-regions contain

different individual details, their compositing pattern textures are identical.

Figure 2. A STV video section and its spatio-temporal slice

In case of an anomaly crowd event shown in Figure 4, where a CCTV footage containing a sudden

gathering and dispersing of a group of people during a shop rob has been studied, the top row

displays four snapshots at some key time points (t1, t2, t3 and t4). Between time t1 and t2,

pedestrians were walking normally along the street. After t3, people start to rush into a shop. The

changed patterns of pedestrians were encapsulated in the corresponding STV slices as shown in the

bottom of the figure. The differences of the STV slices segments [t1, t2] and [t3, t4] are obvious to

human observers, hence, the visual differences in STV slices can represent changes from crowd

videos, which can be used for modelling a detection model of anomaly crowd events.

3.3. STT feature extraction

Visual similarity is an intuitive concept based on the image appearance. Specifically, this visually

undistinguished image contains both randomness and similarity. One of the classic mathematic

models for describing this relationship from finite lattice is called Homogeneous Random Field (HRF)

which was first introduced by Julesz [28] and then formalized by Zhu et al. [29] in 2000. It had since

been widely adopted in nature image understanding [30] and texture feature modelling based on

the statistical principles theories. In this research, HRF has been used for composing the STT feature

space.

3.3.1. Steerable pyramid wavelet transformation

The fine-to-coarse-based image description schemes have been widely used in image recognition

and analysis. Since a crowded scene contains rich information in both local and global feature levels

over the entire STV space, an efficient translation- and rotation-invariant wavelet scheme based on

the so- alled stee a le p a id [31] has been used in this research.

The input image for each transformation is a sub-region of STV slice captured by a sliding window

along the time (T) axis. Since the low pass band is subsampled by a factor of two along both axes, the

size of the image need to be normalised to � × �.

Figure 4. An anomaly crowd event and one of its texture patterns of the STV slices

Figure 5 illustrates the core of a steerable pyramid where an image is treated as a linear combination

of wavelet sub-bands from each layer of the pyramid. After splitting the input image into the high-

and low-pass bands, the low-pass band is further decomposed into a group of oriented sub-bands

and henceforth. The filers of this wavelet transformation are polar-separable in the Fourier domain,

which is a complex pair of even- and odd-symmetric filters corresponded by the real and imaginary

parts of the filtering results.

The wavelet filter can be written as

� , � = { cos � log 4�� , �4 < < �, �4 , � (1)

�� , � = � � ∈ [ ,� − ] (2)

where radial and angular parts are

= { cos � log �� , �4 < < �, �4 , � (3)

� � = { �− �− !√�[ �− ]! [cos � − �� ]�− , |� − �� | < �, ℎ (4)

In the Equation, , � are polar frequency coordinates, � , � is low-pass band filter, and �� , �

denotes the oriented filter with N directions. The initial value of the low- /high- pass filter can be

defined by:

� = � �2,�= � , � (5)

Figure 5. Steerable pyramid implementation strategy

For example, as shown in Figure 6, the initial input of the process pipeline is the grayscale subarea of

the STV slices from marathon video clip. In this 3-scale and 4-orientation wavelet sub-bands, the

magnitudes hold important structural information of the pattern images.

3.3.2. HRF principles and STT features

Based on the wavelet transformation, HRF highlights three groups of visual features. The crowd

model can be constructed based on HRF for texture modelling, which has been summarised as:

Fundamental low-level features

The grayscale distributions extracted from each low-pass band and the down-sampled image of the

steerable pyramid. The measurement is based on calculation of means, variance, skewness, kurtosis

minimum and maximum values of every input STV slice sub-region, variance of the high-pass band,

and skewness and kurtosis of the every low pass image at each scale.

Coefficient features

The coefficient features are the local auto-correlations of the wavelet sub-bands. The features have

been used for evaluating the periodical and long range correlations of the image distributions. Since

the steerable pyramid can be an over-sampled linear representation introducing redundant feature

dimensions, this research deployed an improved coefficient feature scheme based on auto-

correlation at each low-pass band only for creating the scale-invariant model. Specifically, for

measuring the characters of the texture frequencies and regularities, raw auto-coefficient

correlations on each low pass band need to be measured.

Magnitude features

As shown in Figure 6, the large magnitudes appear seemingly at the same locations in each image

s ale, hi h ep ese ts the edges , o e s a d a s i the su -bands. Using texture analysis

te h i ues, su h as se o d-o de te tu e featu es [32], the correlation of magnitudes from image

sub-bands have been integrated into the design. This type of features is calculated by using cross-

correlation of the pairs at adjacent positions, orientations and scales. Central samples of the auto-

correlation of magnitude of each sub-band, cross-correlation of each sub-band magnitudes with

those of other orientations at the same scale and coarser scales are recorded. The edge characters

Figure 6. Real parts of band pass images from 3-scales and 4-orientations example steerable pyramid

based on cross-correlation of the real part of coefficients with both the real and imaginary part of

the phase-dou led oeffi ie ts at all o ie tatio s at the pa e t’s s ales a e also al ulated.

During system testing, the total number of feature points is 710 on a 4-scales and 4-orientations

wavelet transforms. In this paper, a redundant STT feature space has been designed. It is

emphasised that the redundancy is caused by overlapped calculation of HRF components. For

example, the variation of low pass image is also included in the autocorrelation. During the test, it is

dis o e ed that the o e lappi g a tuall a t as a dou le he k e ha is hi h a sig ifi a tl improve robustness of the decision making algorithm.

4. System implication of real-time crowd anomaly detection As illustrated in Figure 7, the system starts from building up a video buffer only containing certain

number of video frames before STV construction for real-time purpose. In the experiment, the

buffer has been setup less than 90 frames allowance because crowd anomaly is usually occurred

within 3 seconds by using 30 frames per second (fps) video settings.

The system contains two modules: learning and detection. Both modules share the similar STT

construction algorithm illustrated in Figure 1. In this research, a normality crowd model has been

lea t a st a ti g the o al i sta es’ “TT featu es. A o d e e ts diffe e t f o o ality

should be alarmed.

4.1. Crowd region detection

The video footages contain not just rich dynamic data, but also signal noises and unwanted

background information. It is essential to rapidly locate the crowded region and filtering out the

noises. As shown in the Figure 7, the learning module detects the moving crowd regions before

composing the STV slices, and shares the information for detection. This operation allows more

dynamic information rather than static background and noises to be recorded on STV slices.

Figure 7. Real-time crowd anomaly detection pipeline

During the development, so-called average flow field has been used in the prototype to evaluate

the dynamic level of image scenes. The average flow field is composed by a group of binary

calculations on optical flow field. Specifically, given a video clip containing �� frames, the average

flow field , ∈ ℝ can be defined by

U = ∑ uii , and (6)

= { , | | � | |, ℎ , (7)

where � • :ℝ → ℝ calculate the average magnitude value of each flow filed. denotes the

Horn-Schunck optical flow field [33]calculated between the ith

and the i+1th

frame. During the

learning, the set of each flow field output can be denoted as = [ , , … ��− ]. Figure 8(b) shows an example of calculated by using video clips of Figure 8(a). In the average flow

field, higher values denote more dynamic changes across the timeline which is mainly caused by the

crowd movement. Lower values, on the other hand, are usually caused by noise and insignificant

changes. In the experiment, locations where , 5 have been ignored based on experience for

further processing.

(a) image scene (b) average flow field

(c) boundary of average flow filed (d) Slices on XT and YT direction guided by the boundary

Figure 8: Location and direction calculation of STV slices based on average flow filed

Figure 8(c) highlights the boundary of , which has been detected by using a group of morphological

ope atio s su h as ope , a d o e hull . Those ou da ies li ited the idth of “TV sli es alo g the time line. Based on the definition of STV slices introduced in Section 3.1, a group of STV slices

need to be sampled inside the region of . For simplification, only XT (horizontal) slices and YT

(vertical) slices are used. As illustrated in the Figure 8(d), each sampled slice has been marked by

lines across the XY (the frame) field. For keeping the detection accuracy and efficiency, it is not

necessary to sample each slice per pixel, the distance between each slices is set between 10 and 50

depending on the image size and resolution.

4.2. Normality crowd modelling

The average flow field provides the size, locations and directions for a group of STV slices. During the

ideo uffe i g, those sli es’ dist i utio i fo atio is set up as onstants. The slices are renewed

10

20

30

40

50

60

70

80

90

100

� = �� − �� + times for the whole learning process, where �� is the length of the video and ��

denotes the length of video buffer.

Given a group of STV slices for learning, the statistical texture features can be abstracted for

establishing the model of crowd activities. Each STV slice instance has its own STT feature space = [ , ,… �] , ( = , ,… , �; = , ,… , ), where , = , ,… , � denotes the

elements from STT feature space summarised in Section 3.3 and denotes the total number of slices

used in the video. Those operations generate × � STT features in total for calculating the statistical

distributions for the learning.

During the experiment, it has been discovered that with fixed , values, , , … , L

approximately obey Gaussian distribution � �, � . This empirical approximation works well on

many testing videos for anomaly detection (see Section 5), and has been used for modelling the

normal crowd activities in this research. For each learning video, � , � is defined as

� = �∑ , (8)

� = √�∑ ( − � ) . (9)

4.3. Online crowd anomaly detection

C o d a o al dete tio is a i a de isio aki g task that the s ste should la el o alit o a o alit to the ideo sa ples th ough o pa i g the dete ted “TT featu es ith o ality

crowd model. For online purpose, the decision is made for each buffered video clips during the video

playing. Same as learning progress, the STT features are extracted from STV slices located at

positions by average flow field.

Denoting the STT feature for crowd anomaly detection as ̃ = [ ̃ ], ( = , ,… , ; = , ,… , �).

In this research, the binary decision for each STT element is simply judged by whether the element

obeys 3-sigma rule of Gaussian distribution, which is

= { ̃ ∈ [� − � , � + � ] � ℎ . (10)

This operation has composed � sub-de isio s fo o e “TV sli e. Fo aki g a fi al de isio , � , a

oti g e ha is has ee i t odu ed:

� = { �∑ > ℎ . (11)

Equation 11 starts from calculating a positive rate for all sub-decisions. A threshold, , is then

compared with the positive rate for making a final decision for the STV slice. In the voting

mechanism, the threshold can be recognised as a pass-rate for the decision. Higher pass-rate means

that the final positive decision for a slice requires more votes from its positive voters ( = ).

Since STV slices are independently distributed inside the average flow filed, � can be recognised as

the local decision for a crowd image scene. Each � can mark the normality or abnormality crowd

event of its local area.

The designed prototype is an effective decision making system. The time consumption of the

algorithm is much lower than the time used for video buffering and playing (see details in Section

5.1). By using parallel programming strategy, the anomaly crowed can be detected before clearing

current video segment from buffer, which guarantees the real-time performance of decision making

during the video play.

5. System evaluations In this research, a prototype system has been implemented to test the devised anomaly crowd

detection model. The prototype has been run on a host PC with a 64bit Core i7 CPU (2X3.07GHz) and

4GB RAM.

During the evaluation, this work has been compared with many benchmarking approaches such as

Spatio-temporal Compositions (STC) [1] and Inference by Composition (IBC) [34], The STC highlights

its real-time performance and the IBC has been considered as one of the most accurate method for

anomaly detection. All the methods have been tested under same hardware and software settings.

2 popular online video databases, UMN [35] and UCSD [20], have been used for the system tests.

The UMN Dataset has been adopted for testing the system design under a controlled environment.

It contains 11 scenarios subjecting to 3 different indoor and outdoor backgrounds (UMN1, UMN2,

and UMN3). Each video records a group of people wondering in the scene and then escaping. The

UCSD dataset contains two video scenes (Ped1 and Ped2) of pedestrians walking along the road. The

anomaly events have been defined as some cars or bicycles quickly go through those pedestrians

which could build up hazard road situation.

During the experiments, the video buffer has been set up for holding 3 seconds video clips for all the

tests. All the video frames have been resized into 320×240 pixels. Only grayscale channel have been

used. For extracting STT features, 3-scales and 4-orientations steerable pyramid wavelet transforms

have been applied. The size of input STV slices has also been normalised into 256×256 pixels through

Bicubic interpolation.

5.1. Algorithm real-time performance

The designed feature extraction and decision making algorithm is an effective solution for anomaly

crowd detection. This test is used for evaluating the time consumption of each step of the detection

algorithm.

Figure 9. Time consumption of algorithm

As illustrated in Figure 9, the time consumption is calculated by measuring and averaging elapsed

time of each step 50 turns based on different video footages. For a buffered video clip, the algorithm

takes averagely 1658ms (18.4ms/frame based on 30fps video clip) for normal/abnormal event

detection. The break-down time consumption of anomaly detection has also been illustrated in the

figure by using different colour labels.

It is also worth noting that although some time consuming steps such as STV construction, wavelet

transforms and STT construction take 3.3ms/frame to 5.6ms/frame for their calculations, the whole

time used by detection is still less than the video buffering. During the experiment, an optimised

prototype has been developed by running video buffering and anomaly crowd detection as two

parallel processes. The video can play without any delaying caused by the detection process, which

is suitable for applications of real-time surveillance system.

The time consumption of this algorithm has also been compared with the popular approaches

illustrated in Table 1.

Dataset Method

STT STC IBC

UMN1 15 17 1818

UMN2 18 19 2200

UMN3 16 15 2712

UCSD-Ped1 18 19 2100

UCSD-Ped2 18 22 2900

Table 1. Efficiency tests on different databases (unit: ms/frame)

In the table, the STT feature-based approach introduced in this research performs faster than all the

other three benchmarking approaches. During the test, it also takes fewer memories for the data

processing and storage, which is important advantage for many intelligent surveillance systems.

5.2. Accuracy performance on public database

To test the accuracy and robustness of developed anomaly crowd event detection system, receiver

operating characteristic (ROC) curve is deployed during the test. The points on ROC curve are

defined by true- and false-positive rate of the detection system. Firstly, each video frame has been

hand-marked by labels i.e. o alit a d a o alit as ground truths. The true-positive is

then counted when a normality ground truth is marked correctly by the detection system. Otherwise,

0

500

1000

1500

2000

2500

3000

3500

video buffer anomaly detection

Av

era

ge

tim

e c

on

sum

pti

on

(m

s)

detection

comparison

STT feature construction

wavelet transform

STV slices construction

STV construction

video buffer

the false-positive will be recorded. For making a ROC curve, threshold T used as voting pass-rate (see

Section 4.3) should be increased from 0% to 100% with 10% steps, which generates 11 points for a

ROC curve.

UMN database

Figure 10 shows some video snapshots of UMN dataset. 5 seconds video sampled from beginning of

each clip has been used for learning the normal crowd behaviours. Those 5 seconds records only

contain a group of people wondering in the scene. During the detection, the system should show

alarms when those group of people are escaping.

UMN1

UMN2

UMN3

Figure 10 UMN examples

Figure 11 shows the OCR curves for detecting anomaly crowd events from UMN dataset in the scene

1, 2 and 3. In the figure, STT feature-based approach developed in this research shows much better

performance in all three scenes compared with STC and IBC. Because the anomaly crowd event in

UMN datasets are occurred in the entire image scene, the better accuracy and robustness

performances are contributed by the nature of STT model that both local randomness and global

similarity can be described together along the spatio-temporal domain.

UMN1 UMN2 UMN3

Figure 11. UMN OCR tests and comparison

UCSD database

As illustrated in Figure 12, the a o al e e t defi ed i UC“D data ase is o e lo all tha UMN.

In the UCSD, only localised image visual features are changed. During the test, the crowd video

samples only contain pedestrians have been used for training. The system should show several

alarms when road hazards are occurred such as cars, bicycles, and skateboarders passing through

slowly moving pedestrians.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

STT

STCIBC

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

STT

STCIBC

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

STT

STCIBC

UCSD-

Ped1

UCSD-

Ped2

Figure 12. UCSD examples and detection results

Figure 12 also illustrates some detection results, the horizon and vertical lines marked on the video

f a es a e the lo atio s of XT a d YT sli es hi h ha e ee dete ted as a o al . “i e ea h “TV slice is defined independently, the marks can located the local areas containing crowd anomaly.

UCSD-Ped1 UCSD-Ped2

Figure 13. UCSD OCR tests and comparison

Same evaluation strategy has been used for evaluating the system performance on UCSD dataset.

The test results have been represented by the OCR curves shown in Figure 13, the proposed STT

method shows comparable results of STC and IBC, also uses less time and system memory resources,

which is contributed by the simple and effective decision making algorithm introduced in Section 4.

5.3. STT feasibility test

The STT features are designed by using STV slices based on HRF texture features. Actually, for

describing the visually undistinguished image, many texture models have been developed in recent

years. In this test, many other texture models such as textons [36], and multivariate image analysis

(MIA) [37] are compared with proposed STT model. Those texture models are used to represent STV

slices by using N-dimensional feature vectors. Same strategies introduced in Section 4.2 and 4.3 have

been applied for evaluating their performance.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

STT

STCIBC

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

STT

STCIBC

Figure 14. OCR tests based on different texture models

As shown in the Figure 14, the detection accuracy performance evaluated by using OCR curves on

UCSD video datasets. Compared with other texture model, the devised approach and algorithms in

this research have shown promising characteristics and for detecting crowd anomaly. It has been

proved that the wavelet-based texture model is a superior tool for representing local randomness

and global similarity. In addition, because the STT features contains redundant feature sets, many

other texture models can be recognised as a subsets of this feature space, which cannot

comprehensively describe the visual appearance of visually undistinguished images.

6. Conclusions and future work In this research, an innovative spatio-temporal texture based crowd anomaly detection method has

been developed. The prototype system starts from transforming video footages into an STV

structure. After applying benchmarking average flow field templates extracted from live video feeds,

the highly dynamic areas from the crowd scenes can be quickly identified, which then guides the

auto-selection of sizes, locations and directions of the sampling STV slices for further study. For

disti guishi g o d o alities f o a o al eha iou s, the sa pled “TT te tu e patte s a e then analysed based on a Gaussian approximation model on the normal and abnormal crowd

behaviour ratio. A weighted multi-binary evaluation algorithm has also been devised for online

decision making based on the STT comparison output. The prototype system has shown satisfactory

real-time performance during the tests and some promising characteristics for future intelligent

CCTV surveillance applications.

The statistical STT features encapsulate both the local variations as well as global similarities of the

established video volumes. Comparing with the state-of-the-art techniques, the devised method has

shown increased efficiency, accuracy and flexibility. Based on the rotation- and translation- invariant

wavelet transforms, STT models can be composed for more detailed analysis based on the similarity

statistics between two crowd scenes along the timeline. The low-level texture feature-based

thresholding algorithm in this project ensures the real-time performance of the system.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

STT

textonsMIA

Although the current approach and system strategy are well-suited for high- and medium-dense

crowd scenes of highly homogeneous STT feature patterns, for low-dense crowd scenes where

individual o d ele e t’s eha iou a ha e sig ifi a tl la ge i pa t o the “TT sli e, the devised algorithm will gradually lose its performance superiority and robustness over other

conventional image/frame-based approaches. Future work will see a crowd density estimation

algorithm being investigated and embedded to the system for adaptive feature selection and

pattern recognition, which can open up new revenues for exploring more intelligent and adaptive

crowd monitoring and early warning systems for real-life applications.

References 1. Roshtkhari, M.J. and M.D. Levine, An on-line, real-time learning method for detecting

anomalies in videos using spatio-temporal compositions. Computer Vision and Image

Understanding, 2013. 117(10): p. 1436-1452.

2. Zhan, B., et al., Crowd analysis: a survey. Machine Vision and Applications, 2008. 19(5-6): p.

345-357.

3. Swets, D.L., B. Punch, and J. Weng. Genetic algorithms for object recognition in a complex

scene. in Image Processing, 1995. Proceedings., International Conference on. 1995. IEEE.

4. Ryan, D., et al. Crowd counting using group tracking and local features. in Advanced Video

and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on. 2010.

IEEE.

5. Chan, A.B., Z.-S. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting

people without people models or tracking. in Computer Vision and Pattern Recognition, 2008.

CVPR 2008. IEEE Conference on. 2008. IEEE.

6. Marana, A.N., et al. Estimating crowd density with Minkowski fractal dimension. in Acoustics,

Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on.

1999. IEEE.

7. Boghossian, B. and S. Velastin. Motion-based machine vision techniques for the management

of large crowds. in Electronics, Circuits and Systems, 1999. Proceedings of ICECS'99. The 6th

IEEE International Conference on. 1999. IEEE.

8. Saxena, S., et al. Crowd behavior recognition for video surveillance. in Advanced Concepts for

Intelligent Vision Systems. 2008. Springer.

9. Ihaddadene, N. and C. Djeraba. Real-time crowd motion analysis. in Pattern Recognition,

2008. ICPR 2008. 19th International Conference on. 2008. IEEE.

10. Garate, C., P. Bilinsky, and F. Bremond. Crowd event recognition using hog tracker. in

Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE

International Workshop on. 2009. IEEE.

11. Ali, S. and M. Shah. A lagrangian particle dynamics approach for crowd flow segmentation

and stability analysis. in Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE

Conference on. 2007. IEEE.

12. Ali, S. and M. Shah, Floor fields for tracking in high density crowd scenes, in Computer Vision–ECCV 20082008, Springer. p. 1-14.

13. Rodriguez, M., S. Ali, and T. Kanade. Tracking in unstructured crowded scenes. in Computer

Vision, 2009 IEEE 12th International Conference on. 2009. IEEE.

14. Rodriguez, M., et al. Data-driven crowd analysis in videos. in Computer Vision (ICCV), 2011

IEEE International Conference on. 2011. IEEE.

15. Wang, S. and Z. Miao. Anomaly detection in crowd scene. in Signal Processing (ICSP), 2010

IEEE 10th International Conference on. 2010. IEEE.

16. Ullah, H. and N. Conci. Crowd motion segmentation and anomaly detection via multi-label

optimization. in ICPR workshop on Pattern Recognition and Crowd Analysis. 2012.

17. Wu, S., B.E. Moore, and M. Shah. Chaotic invariants of lagrangian particle trajectories for

anomaly detection in crowded scenes. in Computer Vision and Pattern Recognition (CVPR),

2010 IEEE Conference on. 2010. IEEE.

18. Benezeth, Y., et al. Abnormal events detection based on spatio-temporal co-occurences. in

Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. 2009. IEEE.

19. Kratz, L. and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-

temporal motion pattern models. in Computer Vision and Pattern Recognition, 2009. CVPR

2009. IEEE Conference on. 2009. IEEE.

20. Bertini, M., A. Del Bimbo, and L. Seidenari, Multi-scale and real-time non-parametric

approach for anomaly detection and localization. Computer Vision and Image Understanding,

2012. 116(3): p. 320-329.

21. Mehran, R., A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force

model. in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.

2009. IEEE.

22. Raghavendra, R., et al., Abnormal crowd behavior detection by social force optimization, in

Human Behavior Understanding2011, Springer. p. 134-145.

23. Raghavendra, R., et al. Optimizing interaction force for global anomaly detection in crowded

scenes. in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference

on. 2011. IEEE.

24. Feng, J., C. Zhang, and P. Hao. Online Learning with Self-Organizing Maps for Anomaly

Detection in Crowd Scenes. in ICPR. 2010.

25. Jiang, F., Y. Wu, and A.K. Katsaggelos. Detecting contextual anomalies of crowd motion in

surveillance video. in Image Processing (ICIP), 2009 16th IEEE International Conference on.

2009. IEEE.

26. Mahadevan, V., et al. Anomaly detection in crowded scenes. in Computer Vision and Pattern

Recognition (CVPR), 2010 IEEE Conference on. 2010. IEEE.

27. Ryan, D., et al. Textures of optical flow for real-time anomaly detection in crowds. in

Advanced Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International

Conference on. 2011. IEEE.

28. Julesz, B., Visual pattern discrimination. Information Theory, IRE Transactions on, 1962. 8(2):

p. 84-92.

29. Zhu, S.C., X.W. Liu, and Y.N. Wu, Exploring texture ensembles by efficient markov chain

monte carlo-toward a trichromac theor of te ture. Pattern Analysis and Machine

Intelligence, IEEE Transactions on, 2000. 22(6): p. 554-569.

30. Hyvärinen, A., J. Hurri, and P.O. Hoyer, Natural Image Statistics: A Probabilistic Approach to

Early Computational Vision. Vol. 39. 2009: Springer.

31. Simoncelli, E.P. and W.T. Freeman. The steerable pyramid: A flexible architecture for multi-

scale derivative computation. in Image Processing, International Conference on. 1995. IEEE

Computer Society.

32. Stickgold, R., L. James, and J.A. Hobson, Visual discrimination learning requires sleep after

training. Nature neuroscience, 2000. 3(12): p. 1237-1238.

33. Brox, T., et al., High accuracy optical flow estimation based on a theory for warping, in

Computer Vision-ECCV 20042004, Springer. p. 25-36.

34. Boiman, O. and M. Irani, Detecting irregularities in images and in video. International Journal

of Computer Vision, 2007. 74(1): p. 17-31.

35. "Unusual crowd activity dataset of University of Minnesota," http://mha.cs.umn.edu.

36. Malik, J., et al., Contour and texture analysis for image segmentation. International Journal

of Computer Vision, 2001. 43(1): p. 7-27.

37. Esbensen, K. and P. Geladi, Strategy of multivariate image analysis (MIA). Chemometrics and

Intelligent Laboratory Systems, 1989. 7(1): p. 67-86.

http://mha.cs.umn.edu/

spatio-temporal texture modelling for real-time crowd anomaly …shura.shu.ac.uk/18876/3/wang...

Documents