
A Framework For An Event Driven Video Surveillance System

Declan Kieran, Jonathan Weir, WeiQi Yan

Speech, Image and Vision Systems, Institute of ECIT, School of EEECS

Queen’s University Belfast, BT7 1NN
Email: {dkieran03, jweir07, w.yan}@qub.ac.uk

Abstract—In this paper we present an event driven surveillance system that uses multiple cameras. The purpose of this system is to enable thorough exploration of surveillance events. The system uses a client-server web architecture, as this provides scalability for further development of the system infrastructure. The system is designed to be accessed by surveillance operators, who can review and comment on events generated by our event detection processing modules. We do not focus solely on event detection, but are working towards its optimization. A multiple camera network system that tracks a moving object (or person) and decides whether this constitutes an event of interest is also examined. Dynamic switching of the cameras is implemented to aid human monitoring of the network: the camera displayed in the main view should be the one in which the most interesting activity is occurring. Unusual activity is defined as activity that deviates from the norm, where normal activity is everyday, repeated activity. Further thought will be given to extending this system into a distributed system that would effectively create an event web. Our contributions are the development of automated real-time switching of camera views, to aid camera operators in effective video surveillance, and the detection of events of interest within a surveillance environment, with appropriate alerting and storage of these events. To the best of our knowledge, this system provides a novel approach to the technological surveillance paradigm.

I. INTRODUCTION

With the coming of the 2012 London Olympics, surveillance will play a much more crucial role than ever before in the UK. The problems encapsulating surveillance include the capture of surveillance media, event detection and analysis, huge storage overheads, and search and retrieval from archived surveillance media. From our research into surveillance systems, we have found that a fully event driven approach to surveillance has yet to be defined.

Video surveillance commonly describes observations made from a distance by means of electronic sensors. Surveillance using electronic sensors can function both as a deterrent to help prevent crime and as an investigative tool to aid in identifying an interloper. Surveillance networks record important evidence in a timely manner, and intelligent surveillance can provide video and audio analysis that may generate automatic alarms, improving a human operator's response time and efficiency while greatly reducing laborious manual work. With decreasing camera, microphone and computer costs, surveillance networks are now widespread, and huge stores of surveillance media are collected by these networks. Our system is working towards a sustainable and extensible framework that manages this media using the event driven paradigm. The problem is how to bridge the gap between raw media data and semantic annotation and description; the semantic concept of an event bridges this gap. Other issues within video surveillance include the deployment of sensors, as this affects the efficiency and effectiveness of any event detection techniques applied in practice. The media being captured should be sufficient for overall analysis of a given environment.

The term event describes the content of a piece of surveillance media as a semantic unit; it bridges the gap between the physical world and semantic cyberspace. Without intelligent management, the analysis of the huge volume of daily archived surveillance media is impossible, and events of interest, or the secrets they reveal, may be missed, as manual searching is not efficient enough. Our system uses the novel approach of event driven surveillance to help manage large stores of surveillance media [1].

Video surveillance chiefly deals with spatiotemporal data that have semantic attributes and coherences; the video is not recorded in a vacuum, but exists in an ambient context alongside other data such as audio. In order to fully utilize these attributes, our contribution is to automatically add the events detected in the collected surveillance media to the stores in the web architecture, to facilitate further review.

Currently, search in commercial surveillance systems is low-level, for example over a given region of a video frame. Usually, low-level features such as colour, histogram and texture comparisons are employed to search for the relevant regions. The search results are object based and only show the relevant pictures. This ignores the coherence of the video frames and discards the semantics of image and video understanding. Therefore, the goal of our research is to present a novel way of tackling video surveillance at a semantic level using events: we detect events from surveillance videos and store them in our web system, providing an extensible system that searches intelligently for high-attention events.

JOURNAL OF MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY 2011
© 2011 ACADEMY PUBLISHER, doi:10.4304/jmm.6.1.3-13


Video surveillance is becoming more important as its uses in tackling security issues become more prevalent. An important application of video surveillance is the monitoring of densely populated areas where crime and other emergencies are commonplace. The major metropolitan cities of the world are constantly growing in population, and as such video surveillance can be of great help in the reduction of crime, be it as a deterrent or as evidence. The response to emergencies such as road accidents can also benefit from video surveillance, in that alerts can be given about situations that require appropriate and timely action. Furthermore, with the steady reduction in the cost of camera and computer equipment, surveillance networks are more accessible to everyday businesses.

Video surveillance networks can have tens to thousands of cameras, and the camera-to-operator ratio can be many cameras to a single operator. This means that full scrutiny of all cameras for events of interest is very likely to be extremely difficult. Assistance, such as intelligent switching to the camera of most importance and the detection and storage of events of semantic interest, would be useful to an operator, as human operators cannot be expected to maintain perfect concentration for any reasonably long period of time [2]. A great deal of research has been carried out into how and why new systems should be created; an extensive review of third generation surveillance systems can be found in [3], where several future directions are considered for achieving more robust distributed systems.

This paper discusses dynamic switching of camera views and then how events of interest may be detected. The system dynamically switches the camera view that is considered to be of most interest to a main view on a single monitor interface. Each camera is labelled with a unique number, and the main view is labelled with a time stamp and the corresponding camera number, to help the operator keep track of which camera has the most activity within it.

A more useful aid to an operator is the detection of unusual events that may or may not be of interest. An event describes the content of a video as a semantic unit; it bridges the gap between the physical world and semantic cyberspace. A picture is said to be worth a thousand words, and an event can comprise thousands of pictures; a video therefore encapsulates much more than an image, and a surveillance video may convey invaluable security information.

The next step in implementing a system that recognises different events of interest is to consider the structure of both temporal and spatial characteristics. An event can be described [4] as a long-term temporal object that extends over tens or hundreds of image frames. Spatiotemporal objects share the trait of having multiple spatial scales and multiple temporal scales. These characteristics hold true for all events that would be of interest in a surveillance system. A person falling over and a bag being left alone have quite different spatial and temporal characteristics, and as such should be distinguishable. These types of events are of special interest in many surveillance systems. Considering a bus setting fitted with closed circuit television (CCTV) cameras, a bag being left alone or a person falling over could potentially be unsafe events and would need to be recognised. An intelligent surveillance system must first be trained to recognise events, and then use a variance in the learned values to decide whether or not the movement in the video comprises a given event. The variance in the exact training value sets is needed because people and bags are usually of different shapes and sizes.

In the rest of this paper, Section II introduces the state of the art within the field of motion and event detection. Section III introduces the overall design of the system. Section IV presents the system's implementation. Section V describes our contribution and development within the surveillance field. The results are presented in Section VI, and in Section VII we give our conclusions.

II. RELATED WORK

Although surveillance technologies have been widely studied in the new century, there are still emerging problems to be resolved. One of the prominent problems is the use of an event as a semantic unit; others include achieving event detection from surveillance videos, and event search and retrieval using a database of events. In our research, we use events as a semantic unit to improve the intelligent surveillance system.

In event detection, events are detected and analyzed semantically. Event detection is a procedure that matches previously well-defined patterns of high occurrence against new incoming patterns. The patterns are random in that they contain noise; therefore, the key point of event detection is to remove the noise and find the kernels. Event detectors sense changes in the attributes of the objects which participate in an event, ultimately causing a change in event status.

Databases are employed to store events. Sunrise [5] is an industrial-strength database system for real-time event processing and aggregation for telecommunication applications. It is a main-memory database platform that supports scalability and parallel processing through its service authoring environment. Our system aims to develop this idea to provide an intuitive system that can be used by average computer operators. The instantly indexed multimedia database system [6], as the name suggests, performs real-time indexing of real-world events as they take place. Called LucentVision, it has a rich set of indices derived from disparate sources and allows domain-specific retrieval and visualization of multimedia data. This particular system concentrates on media captured from a sporting context, whereas our system focuses on a more general solution.

The IBM Smart Surveillance System (S3) has an open and extensible architecture for video analysis and data management (http://www.research.ibm.com/peoplevision/). Its role in video analysis is to encode the camera streams


and send them to the video or streaming database, and also to analyze the camera streams for events and send the resulting metadata, in XML format, to the metadata database. Its role in data management is in providing a human-interface layer for queries, alerts, events and real-time event statistics. We believe our system will provide a much more intuitive interface for users. By following the Gmail and YouTube models, we also believe that event history and refinement will be further improved: firstly through ease of use, and secondly because the YouTube model provides a proven, intuitive interface for recording and reviewing comments made on video media.

Event detection from surveillance and monitoring is applied in many disparate circumstances. Typical vehicle-related events in unmanned airborne vehicle surveillance [7] include: approaching a checkpoint, stopping short before arriving, going through a checkpoint, avoiding a checkpoint, moving inside, and leaving. Another example is theft at a phone booth, with events including: bringing an object, attacking a person, using the phone, taking away the object, passing by, etc. An approach for event detection in live sports, based on the analysis and alignment of web-casting text and broadcast sports video, is presented in [8]. The system detects live sports events using only partial content captured from the web and TV. Detailed event semantics are extracted and exact event boundaries are defined. It further creates a personalized summary related to a certain event, player, or team, according to a user's preference. Three modules are provided: live text/video capturing, live text/video analysis, and live text/video alignment. Events involving two-person interactions have also been the subject of research [9]. Two-person interactions are a combination of single-person actions, which are themselves composed of human body-part gestures. Each gesture is an elementary motion event and is composed of a sequence of instantaneous poses at each frame.

Being able to represent an event and its constituent parts as a usable unit of information is still an issue in event detection. The Video Event Representation Language (VERL), presented in [10], uses an ontology framework for representing and annotating video events. This work tries to formalise the semantic concept of an event to make it a workable unit of information. Human actions are represented by multiple triplets aligned according to spatiotemporal constraints between actions. Zhou et al. [11] detect unusual events via multiple camera mining; their unusual event detection adopts a two-stage training scheme to bootstrap a probabilistic model for common events, and an event not classified as common is considered unusual. Zhang et al. [12] proposed a semi-supervised adaptive HMM framework, in which common event models are initially learned from a large data set, whilst unusual event models are learned by Bayesian adaptation. In [13], crowd behaviour is characterized by observing the crowd flow, with unsupervised feature extraction to encode normal crowd behaviour.

Up to now, most of the work on event computing in video surveillance has been primarily concerned with event detection from archived video, or with capturing events from live streams within a multimodal multimedia environment [14]. It has also tended to concentrate on particular cases of surveillance, such as sports, vehicle or pedestrian surveillance, among many others. One reason for this has been the lack of event collections; consequently there has been little opportunity to concentrate on other event operations such as classification, exploration, reasoning, mining, prevention and prediction. In contrast to the current pervasive research work, we have captured and stored sufficient events in our database using our developed system, and have developed a system architecture that is conducive to the general case of surveillance.

We have developed such an event driven intelligent video surveillance system. In this real-time system, we detect unusual events using Finite State Machines; the events are the human activities between two stable states. Adopting the Youtube.com structure, the captured event and its video are promptly uploaded to our web server. We use H.264 to compress the surveillance video streams, which makes the streams easier to transmit to mobile devices. The events are archived in our event database and the relevant metadata is saved for analysis. Based on our database system, we can conduct event search and retrieval.

Motion detection in video has been researched thoroughly and many techniques have been explored to achieve reliability. Reliability is of great importance in the field of security, and when intelligent video surveillance is to be employed, the robustness of the surveillance system needs to be considered heavily. Real-time intelligent video surveillance requires that the processing is not so intensive that the analysis falls behind what is currently happening. Many techniques for robustness and efficiency have been researched and some are described here.

Video surveillance is moving away from the mere collection of data, with human operators manually observing it, towards intelligent analysis [15]–[17] of events and human actions of some semantic significance. Chang et al. describe feature extraction and fusion for video surveillance [18], [19]. Jain et al. have realised the concept of multiple perspective interactive video [20], [21]; their approach is said to be the early precursor of experiential computing [22]. In [23], the experiential computing paradigm is elaborated. In an experiential [24], [25] computing environment, users may directly apply all of their senses to observe the data and information of interest related to an event. Furthermore, a user may explore the data of particular interest in the context of that event.

Motion detection in video is concerned with detecting the movement of foreground objects in a sequence of images, while identifying the static background of the image. Motion detection techniques range from simple to complex. The simple difference (SD) method [26] detects change in a video sequence by computing the difference of pixel intensities of a given image pair for


each time instant t. The SD method is very sensitive to noise and to illumination changes, which can be seen as structural changes. The Local Intensity Gradient (LIG) method described by Parker [27] is a more robust method for change detection; it is capable of thresholding images produced under varying illumination. Background updating based on the Kalman filter is described by Foresti [28] and is used to minimize errors in background change detection caused by sensor noise effects.
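The simple difference method can be sketched as follows. This is a generic illustration rather than the exact algorithm of [26]; the threshold value is a hypothetical choice.

```python
import numpy as np

def simple_difference(frame_prev, frame_curr, threshold=25):
    """SD method sketch: flag pixels whose grayscale intensity changed
    by more than `threshold` between two consecutive frames."""
    # Widen to a signed type before subtracting to avoid uint8 wrap-around.
    diff = np.abs(frame_curr.astype(np.int16) - frame_prev.astype(np.int16))
    return diff > threshold          # boolean motion mask

# Two synthetic 4x4 grayscale frames; one pixel changes substantially.
prev_f = np.zeros((4, 4), dtype=np.uint8)
curr_f = prev_f.copy()
curr_f[1, 2] = 200                   # simulated moving object
mask = simple_difference(prev_f, curr_f)
print(mask.sum())                    # 1: only the changed pixel fires
```

As noted above, a fixed threshold makes this sensitive to noise and illumination change, which is exactly what LIG thresholding and Kalman background updating are designed to mitigate.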

Nonmonotonicity is discussed in [29], where a system may accept that a “belief” is correct unless something abnormal happens. This is a very powerful concept, as it would allow the system to reason about the events under consideration. Bayesian analysis is used in [30], where an algorithm based on a probabilistic finite-state automaton (PFA) builds an event recognition system; this is a variant of a structured Hidden Markov Model (HMM) [31]. In [32], a semi-supervised adapted HMM for detecting unusual events is discussed. A dynamic Bayesian network is presented in [33], where two audio inputs and one visual input are modelled as individual streams corresponding to a multi-stream HMM. The visual stream is connected to a Kalman filter which uses information from all streams to smooth the image data. Their experiments show that this technique performs better than a single-stream HMM.

Stringa et al. [34] describe a multimedia surveillance system used to alert the surveillance operator when an abandoned object is detected in the waiting room of an unattended railway station. This event would be made up of many constituent events, such as a person walking in or setting a bag down. A discriminative-based system is described in [35], where the constituent events of a larger event are recognised using Discriminative Analysis and Statistical Pattern Recognition techniques. In [36], the Sampling Importance Resampling (SIR) method, used for modeling the evolution of distributions, is described; Gordon et al. [37] developed its dynamic aspects. A general importance sampling framework [38] elegantly unifies many of the methods that have been developed, and a special case of this framework is used for visual tracking [39].

The previous event detection techniques are all very computationally expensive, and this is where our research takes a different approach. We use the surface of a 3D histogram that represents a temporal unit of video. This can be computed in Matlab with reasonably good speed, and returns unique results using our technique. We wish to discuss ways in which minimal processing can achieve maximal results.
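One plausible reading of this construction, sketched here under the assumption that the 3D histogram stacks one per-frame intensity histogram per time step (the paper does not publish its exact construction), is:

```python
import numpy as np

def temporal_histogram_surface(frames, bins=32):
    """Stack one intensity histogram per frame into a (time x bins)
    array. Viewed as a height field over (time, intensity bin), this
    is a histogram 'surface' for a temporal unit of video.
    Illustrative interpretation only, not the paper's exact method."""
    return np.stack(
        [np.histogram(f, bins=bins, range=(0, 256))[0] for f in frames]
    )

# Five synthetic 8x8 grayscale frames.
frames = [np.random.default_rng(i).integers(0, 256, (8, 8)) for i in range(5)]
surf = temporal_histogram_surface(frames)
print(surf.shape)    # (5, 32): 5 frames x 32 intensity bins
```

Computing a handful of histograms per temporal unit is far cheaper than the HMM and particle-filter techniques surveyed above, which is consistent with the stated goal of minimal processing.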

In this paper, we work towards the development of extensible event detection and understanding. The events that have been detected using the detection modules are very helpful in terms of learning, training, analyzing, predicting and influencing further actions. These events could be given to a learning algorithm, such as a support vector machine, in order to further process the information and deduce other facts that may have been missed.

These events, along with the sensor and alarm data, will be presented on web pages with the relevant video information, using coherences we have explored. The event, sensor and alarm web pages will be linked together to form an EventWeb [40]. Our system is designed to be a general system that may incorporate many different surveillance environments. To the best of our knowledge, this system provides a novel approach to the issue of surveillance.

III. SYSTEM DESIGN

Currently we use the system we have developed to record and collect everyday events. This differs from other existing works, as we do not focus solely on event detection, but are working towards its optimization. Our system is designed to refine the raw events detected, the motivation being to fully utilize the events archived in our database to achieve a better understanding of the event paradigm. With our current understanding of events that have happened within a given environment, our system will find other relevant events that can help us to understand the target event more fully. The optimization of event detection is achieved by using the ontology of sensor fusion; this sensor fusion combines all available information sources in order to give an accurate representation of what is happening in the surrounding environment.

A. Event Detection

The primary focus of this system is to provide an intuitive system that helps in the management of large stores of surveillance media, and facilitates the refinement of the events archived in our database. We detect events using a Finite State Machine (FSM), with events regarded as state changes. Finite State Machines [41] contain a finite number of states and produce outputs on state transitions after receiving inputs. The machines have the attributes of equivalence, isomorphism and minimization. An FSM is represented by a state transition diagram: a directed graph whose vertices correspond to the states of the machine and whose edges correspond to the state transitions, each edge labelled with the input and output associated with the transition (http://en.wikipedia.org/wiki/Finite_state_machine). We assume that an event detected within an environment by a single sensor reflects only one aspect of the event; the description of the event is not full and complete, and some entities are lost. When events are detected from the same environment using all the sensors, the event description is refined and becomes much richer in detail, and therefore more accurate. The events are then stored in the database system we have developed, which includes storage for events, sensors, media metadata, user information and presentation information. The database system secures the surveillance events, and relevant information such as user IP, user comments, click times, etc. can be found.


Fig. 1: Events under consideration.

To test the system, three distinct events are considered. The events considered in the development of this system are shown in the state transition diagram in Figure 1.

The events consist of the movement of a person within a single room. In Figure 1, label (a) shows the event of a person walking into a room and sitting down on a chair. Label (b) shows the event of a person standing up and walking out of a room. Label (c) denotes the event made up of the two previous events. The events under consideration contain two or more states, but other derived events may contain more or fewer states; for example, a smaller sitting-in-a-chair event (one state) could clearly be derived.
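The FSM treatment of these three events can be sketched as a transition table. State and input names here are illustrative assumptions, not the paper's exact labels from Figure 1.

```python
# Hypothetical transition table for the Figure 1 events: (state, input) -> state.
TRANSITIONS = {
    ("outside", "enters_room"): "walking",
    ("walking", "reaches_chair"): "sitting",       # completes event (a)
    ("sitting", "stands_up"): "walking_out",
    ("walking_out", "exits_room"): "outside",      # completes event (b)
}

def run_fsm(start, inputs):
    """Feed a sequence of detected inputs through the transition table,
    returning the visited states; an event is a recognised state change."""
    state, trace = start, [start]
    for symbol in inputs:
        state = TRANSITIONS.get((state, symbol), state)  # ignore unknown inputs
        trace.append(state)
    return trace

# Event (c): a person enters, sits, stands and leaves.
trace = run_fsm("outside",
                ["enters_room", "reaches_chair", "stands_up", "exits_room"])
print(trace)   # ['outside', 'walking', 'sitting', 'walking_out', 'outside']
```

Composing events (a) and (b) into (c) then amounts to recognising the concatenated state sequence, which is why events with two or more states fit naturally into this representation.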

B. System Architecture

The architecture of the web system facilitates extensibility, scalability, easy event review, and search and retrieval of events by surveillance operators. The media streams are fed into the system and the associated metadata is extracted for future review, reasoning and mining. The event detection modules detect events in the media streams, and these events are fed through a rating generator module. If an event has a sufficiently severe rating, an alarm is issued to the appropriate operator. The event information is stored in the event information database. The high level architecture of the system is described in Figure 2.
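The event-to-alarm portion of this pipeline can be sketched as follows. The rating values and the alarm threshold are illustrative assumptions; the paper does not specify how the rating generator scores events.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: int
    description: str
    rating: float            # severity assigned by the rating generator

ALARM_THRESHOLD = 0.8        # hypothetical severity cutoff

def issue_alarms(events):
    """Pass rated events through the alarm stage: events whose rating
    meets the threshold are escalated to the operator."""
    return [e for e in events if e.rating >= ALARM_THRESHOLD]

events = [Event(1, "person sits down", 0.2),
          Event(2, "bag left unattended", 0.9)]
alarms = issue_alarms(events)
print([e.event_id for e in alarms])   # [2]
```

In the full system, both the alarmed and non-alarmed events would be written to the event information database for later review.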

To enter the web system, a user must log in through our secure Java applet login. On successful login, the presentation details of the website are extracted from the presentation database and, depending on the user's level of access, they are presented with the event information and the appropriate system controls.

C. Data Model

The system uses a relational database for the storage and retrieval of events. The schema of the database system is given in Figure 3.

Fig. 4: EventWeb sensor control page.

In addition to event information and event-related media data, such as media metadata, other important information relating to the functionality of the system is also stored. This includes data on the sensors connected to the system, data on the detection modules currently implemented, and alarm data relating to alarms that may have been issued as a result of an event having a sufficiently high rating. The database also contains presentation data, which dictates how the system will be presented to a given user.
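A minimal relational sketch of this data model follows. The table and column names are assumptions: the paper lists the stored entities (events, sensors, media metadata, alarms, user and presentation data) but not the exact schema of Figure 3.

```python
import sqlite3

# Illustrative subset of the Figure 3 schema (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor (id INTEGER PRIMARY KEY, location TEXT);
CREATE TABLE event  (id INTEGER PRIMARY KEY,
                     sensor_id INTEGER REFERENCES sensor(id),
                     start_time TEXT, duration_s REAL,
                     description TEXT, rating REAL);
CREATE TABLE alarm  (id INTEGER PRIMARY KEY,
                     event_id INTEGER REFERENCES event(id), issued_at TEXT);
CREATE TABLE comment(id INTEGER PRIMARY KEY,
                     event_id INTEGER REFERENCES event(id),
                     user_ip TEXT, body TEXT);
""")
conn.execute("INSERT INTO sensor VALUES (1, 'waiting room')")
conn.execute("INSERT INTO event VALUES (1, 1, '2011-02-01T09:00', 12.5, "
             "'bag left unattended', 0.9)")
row = conn.execute(
    "SELECT description FROM event WHERE rating > 0.8").fetchone()
print(row[0])
```

Keeping comments in their own table, keyed by event, is what makes the YouTube-style reply threads described below straightforward to support.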

Following the Youtube.com model, the operator can then add comments relating to the events, and these comments can be replied to. This provides a mechanism for constant feedback on the accuracy of the event detection. The system can scale up to whatever size the sphere of surveillance requires, and would therefore be able to gather very large amounts of event information.

IV. SYSTEM IMPLEMENTATION

The video surveillance web system implements a secure Java applet login that makes use of Sun Microsystems APIs for HTTPS over SSL. The main page allows the user to view the sensors currently connected to the system. Other functions available on this page are the ability to register, modify or remove a sensor from the system. There is also an option to pause the sensor currently displayed in the main view, and to use automatic switching of the main sensor view. Figure 4 shows an example of the page that provides this functionality.

The automatic switching of the sensors uses an algorithm that checks each sensor's level of activity. The sensor with the greatest level of activity is considered the most interesting and is therefore set in the main view. The user may also manually select a different sensor from the menu on the right, which lists all sensors connected to the system. Below the current sensor display, the latest event information is shown: the event id, the start time, the duration, the description of the event and the rating assigned to the event.

JOURNAL OF MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY 2011 7

© 2011 ACADEMY PUBLISHER


Fig. 2: EventWeb system architecture.

A user can navigate the system using the navigation buttons shown on the left of Figure 4. If the user selects the events page they are presented with an inbox similar to that of an email account. Figure 5 gives an example of this.

The user may search for events using an ordinary query, or use mining or reasoning. Each numbered event provides a link to a more detailed display page for that event. Events highlighted green have been commented on; those highlighted red have not.

When the user selects a link they are directed to the main display page of that event. An example of this page is shown in Figure 6.

Fig. 5: EventWeb event inbox.

The display and functionality of this page closely follows the Youtube.com model. The user is presented with previous comments made by other users and has the option to create a new comment or reply to a previously made comment. Events that have some coherence with the displayed event are shown in the menu on the left of the screen, again in a similar way to Youtube.com. The user may also select the sensor profile to review more information about the sensor that captured the event. This is an intuitive interface for operators to use, as it is a generally well known and widely used format. It is also a very useful design, as it allows for a consistent




Fig. 3: EventWeb database schema.

Fig. 6: EventWeb event display page.

history of events and event observations to be recorded and displayed.

The database has been successfully implemented using MySQL, and the web pages, as shown previously, have been written in PHP. The system was developed using the free Apache distribution XAMPP (http://www.apachefriends.org/en/xampp.htm).

V. OUR CONTRIBUTIONS

In this section, the research we have carried out is discussed: first the process by which the system incorporates camera switching, then event detection.

A. Mechanics of Event Driven Camera Switching

Figure 7 shows the logical flow of the camera switching system.

The system reads and stores a single image from each camera, then waits a small fraction of a second before reading another image from each camera. Noise reduction is then carried out on each image to prepare the images for subtraction. A rotationally symmetric Gaussian lowpass filter of size 3 with (positive) standard deviation σ = 1.3, simulating a Gaussian blur, is applied to each image before subtraction. Equation (1) gives the 2D Gaussian blur G, where the variables x and y relate to image coordinates.

G = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))    (1)
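As a sketch, the filter of equation (1) can be built directly. The following is an illustrative Python/NumPy reimplementation, not the paper's Matlab code; normalising the taps to sum to one follows the usual lowpass-filter convention (as in Matlab's fspecial('gaussian')).

```python
import numpy as np

def gaussian_kernel(size=3, sigma=1.3):
    """Rotationally symmetric Gaussian lowpass filter of equation (1),
    normalised so the taps sum to 1 before the images are convolved."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()

# usage: the 3x3, sigma = 1.3 kernel described in the text
kernel = gaussian_kernel()
```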




Fig. 7: Flowchart of camera automatic switching.

This provides blurring, and hence removal of noise, without losing useful detail. It requires minimal processing in Matlab and therefore does not noticeably affect the real-time performance of the system. Modules written in Matlab code can be compiled to standalone executables that real-time systems can use to perform this type of image processing.

In the subtraction stage, the image pair from each individual camera is subtracted. These differences are stored and checked to find which is the greatest, and the main camera view is set to the camera showing the greatest difference. This is effective because whichever camera the moving object is closest to will exhibit the greatest image difference. As this does not require a great deal of processing, it lends itself well to a real-time video surveillance system. When there is no activity the system can be considered idle; if further activity occurs, the previous process is repeated.
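The subtraction stage above can be sketched as follows. This is an illustrative Python/NumPy version rather than the paper's Matlab implementation, with the sum of absolute pixel differences standing in for the level of activity.

```python
import numpy as np

def switch_camera(prev_frames, curr_frames):
    """Pick the camera whose two consecutive (blurred) frames differ most.

    Frames are grayscale arrays keyed by camera id; the sum of absolute
    pixel differences stands in for the 'level of activity'.
    """
    diffs = {cam: np.abs(curr_frames[cam].astype(int)
                         - prev_frames[cam].astype(int)).sum()
             for cam in curr_frames}
    return max(diffs, key=diffs.get)

# usage: only camera 42 sees movement between the two reads,
# so it is selected for the main view
still = np.zeros((4, 4), dtype=np.uint8)
moved = still.copy()
moved[1, 1] = 255
main = switch_camera({36: still, 42: still}, {36: still, 42: moved})
```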

B. Event Detection

Fig. 8: Flow of event detection.

An event within video consists of a sequence of images in which some semantically important object appears in a different position in each image. This moving object can be distinguished from the static background, and different events can be distinguished by their spatial and temporal characteristics. Events with larger temporal characteristics may be comprised of many smaller events, so it may be the case that many events share the same starting spatial and temporal characteristics. This necessitates a logical framework to distinguish between the smaller and larger events. Figure 8 shows a possible simple logic structure for recognising different events as they happen. This structure is not exhaustive and could be extended as far as needed to recognise composite and overall events of interest; to cover every imaginable event, it could be extended indefinitely. For clarity, Figure 8 considers a very small subset of possible events that may happen within a room with a chair, with the flow chart showing where other events may be added.

Events considered in this system are: a person entering a room, sitting on a chair and then leaving the room; a person entering a room and sitting on a chair; a person standing up from a chair and leaving the room; and a person appearing to draw graffiti on a wall. The system uses training sets of each event to determine its spatial and temporal characteristics, derived from the histogram surface of the sequence of images. For each frame the standard deviation of its histogram is taken; the standard deviation of all of these values is then taken to determine a classifier for the event detection system. Equation (2) describes the process of deriving the classifier, where c is the classifier, std denotes the standard deviation, h denotes the histogram of its input, f_i denotes the i-th video frame, and N is the number of frames.

c = std( { std(h(f_i)) : i = 1, …, N } )    (2)
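Equation (2) can be sketched in a few lines. This Python/NumPy fragment is illustrative (the paper's implementation is in Matlab); the 32-bin histogram matches the setting described in Section VI, and the 0–256 grayscale range is an assumption.

```python
import numpy as np

def event_classifier_value(frames, bins=32):
    """Classifier value of equation (2): the standard deviation, over all
    frames, of each frame's histogram standard deviation."""
    per_frame = [np.histogram(f, bins=bins, range=(0, 256))[0].std()
                 for f in frames]
    return float(np.std(per_frame))
```

A clip's value can then be compared against the per-event statistics of Table II to decide which event it most resembles.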

VI. RESULTS

The results of the experimental system are provided here. First the results of the camera switching system




Fig. 9: Camera deployment in surveillance room.

Fig. 10: Image sequence captured using auto cameraswitching.

are provided, followed by the results of the event detection techniques.

The cameras in the room were calibrated so that all areas of the room were in view of one or more cameras. Figure 9 gives the layout of the camera room with width and height distances shown; the room is 9 metres long by 5 metres wide. The cameras are labelled with the last two digits of their corresponding IP addresses, and their angles of view are shown by differently coloured lines emanating from each camera. The figure shows one path that covers all areas of the room.

To test the camera switching mechanism, many paths were taken around the room to check whether the camera with the closest view of the person walking around the room was switched to the main view. Figure 10 shows the switching process in action. The person shown took the path given in the camera deployment figure, so that the most pertinent positions in the room were covered.

These tests were repeated to check the robustness of the switching system. The average rate of correctly switching to the ideal camera is 84%. The main issue that has arisen is the large overlap of camera 36, which can see much more of the room than the other cameras. Table

TABLE I: Ideal camera within deployment.

      0         1         2         3         4         5
0  Camera46  Camera46  Camera31  Camera32  Camera32  Camera32
1  Camera46  Camera46  Camera46  Camera32  Camera32  Camera32
2  Camera42  Camera42  Camera36  Camera42  Camera36  Camera51
3  Camera42  Camera42  Camera36  Camera36  Camera51  Camera51
4  Camera42  Camera42  Camera51  Camera51  Camera51  Camera51
5  Camera42  Camera42  Camera51  Camera51  Camera51  Camera51

Fig. 11: Sample histograms for each event to be considered.

I gives the ideal cameras that the switch should set the main view to, where the coordinates 0 to 5 match those shown in Figure 9.

The implementation uses the simple difference (SD) method on each image pair after noise reduction has been carried out. The differences are then checked to find the greatest, and the camera showing the greatest change is displayed in the main view. This has proven reasonably robust.

A. Event Detection Implementation

The 3D histogram surface is used to capture the characteristics of a given event. Figure 1 shows the state transition diagram of the activities involved in these experiments. The events under consideration may contain one state or more, and smaller events could clearly be derived. Figure 11 shows sample histograms for each of the events mentioned; each histogram frame has 32 bins, and the labels correspond to the labelled events in Figure 1.

Each histogram can be seen to be reasonably distinct from the others, which shows that this technique has good classification properties.

To train the system, each event is repeated and the process described in equation (2) is used to derive classification values for each event. Figure 12 shows the results for each set of events in our test of the system. The C axis gives the values obtained from the aforementioned process, and the Sample axis shows the video sample number. For this experiment, 100 samples of each event were examined by the system.

Using simple conditional checks on thresholds derived from the mean value of a set of points yields very good results, with the correct event being detected 85% of the




Fig. 12: Classification values for each event sample.

TABLE II: Mean and variance of event classification values.

          (a)       (b)       (c)       (d)
Mean      22407     20529     17337     19686
Variance  992177    1369937   247993    2664415

time. Table II gives the mean and variance values for each set of events and illustrates that there is a significant, testable difference between the events. The mean and variance values, obtained from the 3D arrays that hold the histogram data, are then used by the system to classify each of the events.
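A minimal sketch of such a conditional check, assuming a nearest-mean rule over the Table II means (the paper states only that thresholds are derived from the mean values, not the exact checks used):

```python
# Assumed nearest-mean rule over the Table II statistics; illustrative only.
EVENT_MEANS = {"(a)": 22407, "(b)": 20529, "(c)": 17337, "(d)": 19686}

def classify(c):
    """Label a clip's equation-(2) value c with the event whose mean is closest."""
    return min(EVENT_MEANS, key=lambda e: abs(EVENT_MEANS[e] - c))

# usage: a value of 22000 lies nearest the mean for event (a)
assert classify(22000) == "(a)"
```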

VII. CONCLUSIONS

We believe we have presented a novel, general and extensible system that can be applied to the issue of scalable, intuitive video surveillance and to event exploration and understanding. Using our system, with the specialised architecture and spatiotemporal approaches we have presented, appropriate mining and reasoning techniques will deliver a wealth of useful event data that can be used to further extract hidden surveillance events.

With a progressive understanding of the events that occur within a given environment, we can learn more about event detection. Our system delivers this by exploiting current web technologies and tried-and-tested intuitive interfaces within which users can review detected events. The data model provides a good base for sensor, media, event and user information.

As event detection is a stochastic process, techniques that robustly detect events are invariably computationally expensive. Computer equipment continues to become less expensive, but barriers are being reached in single-core processor speeds. For real-time video surveillance, computationally inexpensive algorithms will therefore greatly help the development of this field. The technique described in this paper is reasonably computationally inexpensive and has been shown to work. However, accuracy is a key concern in security surveillance, and more work is needed to advance this field.

Without intelligent management of the huge volume of daily archived surveillance video, it will be impossible to analyse the video content efficiently for events of interest. Video surveillance chiefly deals with spatiotemporal data that has semantic attributes and coherences; the video is not recorded in a vacuum, but exists in an ambient context with other data. To fully utilize these attributes, our future work will be to automatically explore the hidden events in collected surveillance videos. The outcomes will be to provide correct alarms in a timely manner, to explore the implicit coherences among events, and to prevent unwanted (or antisocial) events from happening.

An event recognition module will receive the input information and classify an event as standard or hazardous on the basis of pre-defined object motion models. Based on statistical pattern recognition, an event is regarded as a stochastic temporal process, and two events are considered similar if they could have been generated by the same stochastic process.


