


Information Assimilation Framework for Event Detection in Multimedia Surveillance Systems

Pradeep Kumar Atrey
School of Computing
National University of Singapore
Republic of Singapore
[email protected]

Mohan S. Kankanhalli
School of Computing
National University of Singapore
Republic of Singapore
[email protected]

Ramesh Jain
School of Information and Computer Sciences
University of California
Irvine, CA, USA
[email protected]

ABSTRACT
Most multimedia surveillance and monitoring systems nowadays utilize multiple types of sensors to detect events of interest as and when they occur in the environment. However, due to the asynchrony among and diversity of sensors, information assimilation - how to combine the information obtained from asynchronous and multifarious sources - is an important and challenging research problem. In this paper, we propose a framework for information assimilation that addresses the issues of "when", "what" and "how" to assimilate the information obtained from different media sources in order to detect events in multimedia surveillance systems. The proposed framework adopts a hierarchical probabilistic assimilation approach to detect atomic and compound events. To detect an event, our framework uses not only the media streams available at the current instant but also two important properties of those streams: first, the accumulated past history of whether they have been providing concurring or contradictory evidence, and second, the system designer's confidence in them. The experimental results show the utility of the proposed framework.

Categories and Subject Descriptors
H.5.1 [Multimedia Information Systems]

General Terms
Security

Keywords
Information assimilation, Multimedia surveillance, Agreement coefficient, Confidence fusion, Event detection, Compound and atomic events

1. INTRODUCTION
In recent times, it has become increasingly accepted that most surveillance and monitoring tasks can be better performed by using multiple types of sensors rather than a single type. Therefore, most surveillance systems nowadays utilize multiple types of sensors, such as microphones, motion detectors and RFID readers, in addition to video cameras. However, different sensors usually provide the sensed data in different formats and at different rates. For example, a video may be captured at a frame rate that differs from the rate at which audio samples are obtained, and even two video sources can have different frame rates. Moreover, the processing time of different types of data also differs. Due to the asynchrony and diversity among streams, the assimilation of information in order to accomplish an analysis task is a challenging research problem. Information assimilation refers to the process of combining the sensory and non-sensory information obtained from asynchronous multifarious sources using the context and past experience.

Event detection is one of the fundamental analysis tasks in multimedia surveillance and monitoring systems. In this paper, we propose an information assimilation framework for event detection in multimedia surveillance and monitoring systems.

Events are usually not impulse phenomena in the real world; they occur over an interval of time. Based on different granularity levels in time, location, number of objects and their activities, an event can be a "compound-event" or simply an "atomic-event". We define compound-events and atomic-events as follows -

Definition 1. Event is a physical reality that consists of one or more living or non-living real-world objects (who) having one or more attributes (of type) being involved in one or more activities (what) at a location (where) over a period of time (when).

Definition 2. Atomic-event is an event in which exactly one object having one or more attributes is involved in exactly one activity.

Definition 3. Compound-event is the composition of two or more different atomic-events.

A compound-event, for example, “a person is running and


shouting in the corridor" can be decomposed into its constituent atomic-events - "a person is running in the corridor" and "a person is shouting in the corridor". The atomic-events in a compound event can occur simultaneously, as in the example given above, or they may occur one after another. For example, the compound-event "A person walked through the corridor, stood near the meeting room, and then ran to the other side of the corridor" consists of three atomic-events: "a person walked through the corridor", followed by "person stood near the meeting room", and then "person ran to the other side of the corridor".

The different atomic-events to be detected may require different types of sensors. For example, "walking" and "running" events can be detected based on video and audio streams, a "standing" event can be detected using video but not audio streams, and a "shouting" event can be better detected using audio streams. Different atomic-events also require different minimum time-periods over which they can be confirmed. This minimum time-period depends upon the time in which the amount of data sufficient to reliably detect an event can be obtained and processed. Even the same atomic-event can be confirmed in different time periods using different data streams. For example, the minimum video data required to detect a walking event could be two seconds' worth; however, the same event can be detected based on one second of audio data.

The media streams in a multimedia system are often correlated. We assume that the system designer has a confidence level in the decision obtained based on each of the media streams, and that there is a cost of obtaining these decisions, which usually includes the cost of the sensor, its installation and maintenance cost, the cost of energy to operate it, and the processing cost of the stream. We also assume that each stream in a multimedia system partially helps in accomplishing the analysis task (e.g. event detection). The various research issues in the assimilation of information in such systems are -

1. When to assimilate? Events occur over a timeline [Chieu and Lee 2004]. A timeline refers to a measurable span of time with information denoted at designated points. Timeline-based event detection in multimedia surveillance systems requires identification of the designated points along a timeline at which assimilation of information should take place. Identification of these designated points is challenging because of the asynchrony and diversity among streams, and also because different events have different granularity levels in time.

2. What to assimilate? The fact that, at any instant, all of the employed media streams do not necessarily contribute towards accomplishing the analysis task brings up the issue of finding the most informative subset of streams. From the available set of streams,

• What is the optimal number of streams required to detect an event under the specified constraints?

• Which subset of the streams is the optimal one?

• In case the most suitable subset is unavailable, can one use alternate streams without much loss of cost-effectiveness and confidence?

• How frequently should this optimal subset be computed so that the overall cost of the system is minimized?

3. How to assimilate? In combining different data sources,

• How to utilize the correlation among streams?

• How to integrate the contextual information (such as environment information) and the past experience?

The framework for information assimilation which we propose essentially addresses the above-mentioned issues. Note that the solution to issue (2) has been described with detailed results and analysis in our other work [Atrey et al. 2006]. In this paper, we focus on issues (1) and (3) and present our framework for information assimilation with detailed analysis and results1.

The proposed framework for information assimilation hasthe following distinct characteristics -

• The detection of events based on individual streams is usually not accomplished with certainty. To obtain a binary decision, early thresholding of uncertain information about an event may lead to error. For example, let an event detector find the probabilities of the occurrence of an event based on three media streams M1, M2 and M3 to be 0.60, 0.62 and 0.70, respectively. If the threshold is 0.65, then these probabilistic decisions are converted into the binary decisions 0, 0 and 1, respectively; which implies that the event is found to occur based on stream M3 but not based on streams M1 and M2. Since two decisions are in favor of the non-occurrence of the event compared to one decision in favor of its occurrence, adopting a simple voting strategy would yield the overall decision that the event did not occur. It is important to note that early thresholding can introduce errors in the overall decision. In contrast, the proposed framework advocates late thresholding: first assimilating the probabilistic decisions obtained based on individual streams, and then thresholding the overall probability of occurrence based on all the streams (which is usually higher than the individual probabilities, e.g. 0.85 in this case), which is less erroneous.
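The arithmetic of this example can be sketched as follows. The normalized-product fusion below is a simplified stand-in for the paper's full assimilation model (which also weighs confidence and agreement), but it reproduces the numbers in the text:

```python
# Sketch of early vs. late thresholding on the M1/M2/M3 example above.
from math import prod

def early_threshold(probs, th):
    """Threshold each stream's decision first, then majority-vote."""
    votes = [1 if p >= th else 0 for p in probs]
    return 1 if sum(votes) > len(votes) / 2 else 0

def late_threshold(probs, th):
    """Fuse the probabilistic decisions first, threshold once at the end."""
    p_occ = prod(probs)                    # evidence for occurrence
    p_not = prod(1 - p for p in probs)     # evidence against occurrence
    fused = p_occ / (p_occ + p_not)        # normalized product
    return (1 if fused >= th else 0), fused

probs = [0.60, 0.62, 0.70]   # streams M1, M2, M3 from the text
print(early_threshold(probs, 0.65))   # 0 -- voting rejects the event
decision, fused = late_threshold(probs, 0.65)
print(decision, round(fused, 2))      # 1 0.85 -- late thresholding accepts it
```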

• The sensors capturing the same environment usually provide concurring or contradictory evidences about what is happening in the environment. The proposed framework utilizes this agreement/disagreement information among the media streams to strengthen the overall decision about the events happening in the environment. For example, if two sensors have been providing concurring evidences in the past, it makes sense

1 An earlier version of some of the results found here was published in [Atrey et al. 2005].


to give more weight to their current combined evidence than if they had provided contradictory evidences in the past [Siegel and Wu 2004]. The agreement/disagreement information (we call it the "agreement coefficient") among media streams is computed based on how they have been agreeing or disagreeing in their decisions in the past. We also propose a method for fusing the agreement coefficients among the media streams.

• The designer of a multimedia analysis system can have different confidence levels in different media streams for accomplishing different tasks. The proposed framework utilizes the confidence information by assigning a higher weight to the media stream which has a higher confidence level. The confidence in each stream is computed based on how accurate it has been in the past. Integrating confidence information in the assimilation process also requires the computation of the overall confidence in a group of streams, a method for which is also proposed.

• Information assimilation is different from information fusion in that the former brings the notion of integrating context and past experience into the fusion process. The context is accessory information that helps in the correct interpretation of the observed data. We use the geometry of the monitored space, along with the location, orientation and coverage area of the employed sensors, as the spatial contextual information. We integrate the past experience by modelling the agreement/disagreement information among the media streams based on the accumulated past history of their agreement or disagreement.

Our contributions in this paper are as follows. We have identified various research issues which are important and challenging in assimilating the information for event detection in multimedia surveillance systems, and proposed a framework that adopts a hierarchical probabilistic approach to address these issues. The proposed framework introduces the notion of compound and atomic events, which helps in describing events over a timeline. Our probabilistic framework not only utilizes the agreement/disagreement information among the media streams, but also integrates their confidence information into the assimilation process, which helps in improving the overall accuracy of event detection. We have formulated the computation and fusion of the agreement coefficients among the streams and have also proposed a method for confidence fusion.

The rest of this paper is organized as follows. In section 2, we discuss the related work. We present our framework in section 3. The experimental results are reported in section 4. Finally, we conclude the paper with a discussion on future work in section 5.

2. RELATED WORK
Researchers have used early fusion as well as late fusion strategies in solving diverse problems. For example, feature-level (early) fusion of video and audio has been proposed for the problems of speech processing [Hershey et al. 2004] and recognition [Nefian et al. 2002], tracking [Checka et al. 2004],

and monologue detection [Nock et al. 2002], by using the mutual information among the video and audio features under the assumption that audio and video signals are individually and jointly Gaussian random variables. On the other hand, late fusion strategies have also been used in sensor fusion applications [Rao and Whyte 1993], [Chair and Varshney 1986], [Kam et al. 1992]. In a late fusion strategy, a global decision is made by fusing the local decisions obtained from each data source. [Rao and Whyte 1993] presented a sensor fusion algorithm for identification of tracked targets in a decentralized environment. [Chair and Varshney 1986] established an optimal fusion rule under the assumption that each local sensor made a predetermined decision and each observation was independent. [Kam et al. 1992] generalized their solution for fusing correlated local decisions.

Similar to [Wu et al. 2004], we employ an early (feature-level) assimilation as well as a late (decision-level) assimilation strategy. Since each media stream provides various features (such as a blob's location and area in the case of a video stream), their assimilation is performed locally for each media stream to obtain a local decision. Once all the local decisions are available, a global decision is derived by assimilating the local decisions incorporating their agreement and confidence information. The late assimilation strategy has an advantage over early assimilation in that the former offers scalability (i.e. graceful upgradation or degradation) in terms of the media streams used in the assimilation process [Atrey et al. 2006]. Note that, in late assimilation, we consider the media streams to be "decision-wise correlated". Decision-wise correlation refers to how the decisions obtained based on different media streams co-vary with each other.

Our work is different from the works cited above in the following aspects. We explicitly compute and utilize the correlation information (we call it the "agreement coefficient") among the streams. The agreement coefficient among streams is computed based on how concurring or contradictory the evidences they provide are. Intuitively, the higher the agreement among the streams, the more would be the confidence in the global decision, and vice versa [Siegel and Wu 2004]. The various forms of correlation coefficients that have been used for diverse applications are based on content-wise dependency between the sources, and hence are not suitable in our case. Pearson's correlation coefficient, Lin's concordance correlation coefficient [Lin 1989] and the Kappa coefficient [Bloch and Kraemer 1989] cannot be used in our case since they evaluate to zero when the covariance among the observations is zero. Therefore, the proposed framework models the agreement coefficient and its evolution based on the accumulated past history of how agreeing or disagreeing the media streams have been in their decisions.
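A minimal illustration (ours, not from the paper) of why covariance-based coefficients are unsuitable here: two decision sequences that agree at every instant, but are constant, have zero covariance, so Pearson/Lin/Kappa-style measures see no relationship at all, while a simple agreement fraction correctly reports full agreement:

```python
# Constant, fully agreeing decision sequences defeat covariance-based measures.
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def agreement_fraction(xs, ys):
    """Share of instants at which the two decision sequences agree."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

a = [1, 1, 1, 1]   # stream 1 always detects the event
b = [1, 1, 1, 1]   # stream 2 concurs at every instant
print(covariance(a, b))          # 0.0 -- correlation-based measures see nothing
print(agreement_fraction(a, b))  # 1.0 -- full agreement
```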

Also, the past works in the multimodal fusion literature do not consider the notion of having confidences in the different modalities. We incorporate the streams' confidence information. Recently, [Siegel and Wu 2004] have also pointed out the importance of considering confidence in sensor fusion. The authors used the Dempster-Shafer (D-S) 'theory of evidence' to fuse the confidences. In contrast, we propose a model for confidence fusion using a Bayesian formulation, because it is both simple and computationally efficient [Rao and Whyte 1993].


3. PROPOSED FRAMEWORK
3.1 Overview
The proposed information assimilation framework adopts a hierarchical probabilistic approach in order to detect an event in a surveillance and monitoring environment, and performs assimilation of information at three different hierarchical levels - the media-stream level, the atomic-event level and the compound-event level. The work flow of the framework is depicted in figure 1. Let a surveillance and monitoring system consist of n heterogeneous sensors that capture data from the environment. We employ n Media Stream Processors (MSP1 to MSPn), where each MSPi, 1 ≤ i ≤ n, is a set of media processing tools that extracts features from the media stream Mi; for example, a blob detector extracts blobs from a video stream. The features extracted from each media stream are stored in their respective databases.

Let the system detect Na atomic-events. The total number of sets containing two or more atomic-events in which the atomic-events can occur together is given by $\sum_{r=2}^{N_a} \binom{N_a}{r}$. Any kth compound event Ek can be expressed as Ek = 〈e1, e2, . . . , er〉, where 2 ≤ r ≤ Na and 1 ≤ k ≤ Nc, Nc being the number of compound events which can be detected by using the system. The total number NE of events (atomic events as well as compound events) is given by NE = Na + Nc.
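The combinatorial count above can be checked directly; a small sketch (note that in practice the system designer retains only the physically feasible subsets, so Nc is much smaller than this upper bound):

```python
# Upper bound on the number of compound-event sets: sum_{r=2}^{Na} C(Na, r).
import math

def num_compound_sets(na: int) -> int:
    """Number of sets of two or more atomic events out of na."""
    return sum(math.comb(na, r) for r in range(2, na + 1))

print(num_compound_sets(6))  # 57 combinatorially possible sets for Na = 6;
                             # Example 1 below keeps only Nc = 9 feasible ones
```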

A compound-event Ek, which comprises two or more atomic-events occurring together, is detected hierarchically in a bottom-up manner. First, the atomic-events ej, 1 ≤ j ≤ r, are detected using the relevant media streams, and then these decisions are assimilated hierarchically to obtain an overall decision for the compound event Ek, as described in the subsequent subsections.

From the total number Na of atomic events that the system can detect, the proposed framework identifies -

• The atomic events (e.g. person's standing/walking/running and person's talking/shouting) that cannot occur simultaneously.

• The atomic events (e.g. person's walking) that can occur individually as well as together with some other atomic event (e.g. with person's shouting).

• The atomic events (such as person's shouting) that cannot occur individually and must occur together with some other atomic event (such as with person's standing/walking/running).

To further illustrate this, we provide the following example.
Example 1: Let us consider a surveillance system that uses two types of sensors - video and audio - with the goal of detecting Na = 6 atomic events, namely person's "standing", "walking", "running", "talking", "shouting" and "door knocking". In this case, as shown in Table 1, there could be Nc = 9 compound events in which r ≥ 2 atomic events could occur together. In total, there could be NE = 12 events. Next, we also identify the types of data sources which can be used to detect each of the atomic events. For instance, the atomic events shown in Example 1 can be detected as

Table 1: All possible events in Example 1

Event number | Constituent atomic events
1            | Standing
2            | Walking
3            | Running
4            | Standing, Talking
5            | Standing, Shouting
6            | Standing, Door knocking
7            | Walking, Talking
8            | Running, Talking
9            | Walking, Shouting
10           | Running, Shouting
11           | Standing, Talking, Door knocking
12           | Standing, Shouting, Door knocking

follows - standing (V), walking (AV), running (AV), talking (A), shouting (A), door knocking (A); where (A), (V) and (AV) denote audio, video and audio-video streams, respectively.
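The stream-requirement mapping of Example 1 can be encoded, for instance, as follows; the dictionary and the `streams_for` helper are our own hypothetical illustration of the A/V/AV labels above:

```python
# Hypothetical encoding of Example 1's per-event stream requirements.
required = {
    "standing": {"V"}, "walking": {"A", "V"}, "running": {"A", "V"},
    "talking": {"A"}, "shouting": {"A"}, "door knocking": {"A"},
}

def streams_for(compound):
    """Union of the stream types needed to detect a compound event."""
    return set().union(*(required[e] for e in compound))

print(sorted(streams_for(["standing", "talking"])))       # ['A', 'V']
print(sorted(streams_for(["talking", "door knocking"])))  # ['A']
```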

3.2 Timeline-based event detection
As discussed earlier, events occur over a timeline. There are various issues related to timeline-based event detection, such as -

• To mark the start and end of an event over a timeline, there is a need to obtain and process the data streams at certain time intervals. This time interval, which is basically the minimum amount of time needed to confirm an event, could be different for different atomic/compound events when detected using different data streams. Determining the minimum time period (say tw) to confirm different events is a research issue which is out of the scope of this paper and will be explored in future work. In this paper, we assume this minimum time period tw to be the same for all the atomic/compound events.

• Determining the minimum time period tw for a specific atomic event is also critical. Ideally, tw should be just large enough to capture the data needed to confirm an event, since a small value of tw allows detecting events at a finer granularity in time. We learn a suitable value of tw through experiments.

• Since the information from different sources becomes available at different time instants, when it should be assimilated is another research issue. There could be several strategies to resolve this issue. We assimilate the information at fixed time intervals tw. This time interval is determined by choosing the maximum of all the minimum time periods in which the various atomic events can be confirmed. Although this strategy may not be the best, it is computationally less expensive. Again, exploring other strategies is an issue which will be considered in the future.
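The fixed-interval strategy above amounts to a one-line maximum over the per-(event, stream) minimum confirmation periods; the numbers below are illustrative, not from the paper:

```python
# Illustrative minimum confirmation periods (seconds); t_w is their maximum,
# so every atomic event can be confirmed within one assimilation interval.
min_period = {
    ("walking", "video"): 2.0,   # e.g. two seconds of video, as in Sec. 1
    ("walking", "audio"): 1.0,   # one second of audio suffices
    ("shouting", "audio"): 0.5,
}
t_w = max(min_period.values())
print(t_w)  # 2.0
```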

3.3 Hierarchical probabilistic assimilation
The proposed framework adopts a hierarchical probabilistic assimilation approach and performs assimilation of the information obtained from diverse data sources at three different levels - the media-stream level, the atomic-event level and the compound-event level. The details are as follows.



Figure 1: A schematic overview of the information assimilation framework for event detection

3.3.1 Media-stream level assimilation
The Event Detectors (EDji, 1 ≤ j ≤ r and 1 ≤ i ≤ n) are employed to independently detect each atomic-event ej based on the respective features obtained from the media streams Mi, 1 ≤ i ≤ n. At the media-stream level, all the available features from a media stream are combined. The event detectors make the decision about an atomic event based on the combined features. Whenever required, they also utilize environment information such as the geometry of the monitored space and the location, orientation and coverage space of the sensors. The event detectors provide their decisions as probabilities pj,i, 1 ≤ j ≤ r and 1 ≤ i ≤ n (Figure 1), where pj,i is the probability of the occurrence of atomic-event ej based on media stream Mi.

3.3.2 Atomic-event level assimilation
At the next level, since the decisions about an atomic-event ej that are obtained based on all the relevant media streams may be similar or contradictory, these decisions are assimilated using a Bayesian approach incorporating the streams' agreement/disagreement and confidence information. For the atomic-events ej, 1 ≤ j ≤ r, we follow these steps -

1. At any particular instant, we group all the streams into two subsets S1 and S2. S1 and S2 contain the streams based on which the event detectors provide decisions in favor of and against the occurrence of the atomic-event, respectively.

2. Using the streams in the two subsets S1 and S2, we compute the overall probabilities P(ej|S1) and P(ej|S2) of occurrence and non-occurrence of the atomic-event ej, respectively. The overall probabilities are computed using a Bayesian assimilation approach which will be described shortly.

3. If P(ej|S1) ≥ P(ej|S2), it is concluded that the atomic-event ej occurred with probability Pej = P(ej|S1); else it did not occur, with probability Pej = P(ej|S2).
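A minimal sketch of steps 1-3, using a normalized product of per-stream decisions as a stand-in for the Bayesian assimilation rule described below (the paper's full rule additionally folds in confidence and agreement information):

```python
# Steps 1-3: split streams into supporters (S1) and objectors (S2),
# fuse each side, then compare the two fused probabilities.
from math import prod

def fuse(ps):
    """Normalized product of independent probabilistic decisions."""
    occ = prod(ps)
    non = prod(1 - p for p in ps)
    return occ / (occ + non)

def atomic_decision(probs, th=0.5):
    """probs: per-stream probabilities that atomic event e_j occurred."""
    s1 = [p for p in probs if p >= th]        # streams voting "occurred"
    s2 = [p for p in probs if p < th]         # streams voting "did not occur"
    p_occ = fuse(s1) if s1 else 0.0           # P(e_j | S1)
    p_non = fuse([1 - p for p in s2]) if s2 else 0.0   # P(not e_j | S2)
    if p_occ >= p_non:
        return True, p_occ
    return False, p_non

occurred, p_e = atomic_decision([0.8, 0.7, 0.3])
print(occurred, round(p_e, 2))   # True 0.9
```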

We assume the media streams to be "content-wise" independent. This assumption is reasonable since media streams may be of different types, and may have different data formats and representations. However, since the decision about the same atomic-event is obtained based on all the streams, we can assume them to be "decision-wise" correlated.

We describe in the following paragraphs how the assimilation of decision-wise correlated media streams takes place, and also how the agreement coefficient and confidence information about them are modelled.

A. Assimilation of correlated media streams
Let a surveillance and monitoring system utilize a set M^n = {M1, M2, . . . , Mn} of n media streams. The system outputs local decisions P(ej|Mi), 1 ≤ i ≤ n, 1 ≤ j ≤ r, about an atomic-event ej. Along a timeline, as these probabilistic decisions become available, we iteratively integrate all the media streams using a Bayesian approach. The proposed approach allows for incremental and iterative addition of new streams. Let $P(e_{j_t}|\mathbf{M}^{i-1}_t)$ denote the probability of the occurrence of atomic-event ej at time t based on media streams M1, M2, . . . , Mi−1. The updated probability $P(e_{j_t}|\mathbf{M}^i_t)$ (i.e. the overall probability after assimilating the new stream Mi,t at time instant t) can be iteratively computed as -

$$P(e_{j_t}|\mathbf{M}^i_t) = \frac{P(M_{i,t}|e_{j_t})\, P(e_{j_t}|\mathbf{M}^{i-1}_t)}{P(M_{i,t}|\mathbf{M}^{i-1}_t)}$$

$$P(e_{j_t}|\mathbf{M}^i_t) = \alpha_i\, P(e_{j_t}|\mathbf{M}^{i-1}_t)\, P(M_{i,t}|e_{j_t}) \qquad (1)$$

where αi is a normalization factor.

Equation (1) shows the assimilation using the Bayesian approach under the assumption that all the media streams have equal confidence levels and zero agreement coefficient. In what follows, we relax this assumption and integrate the agreement/disagreement and confidence information of the media streams into their assimilation.
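Equation (1) can be sketched as an iterative update. Here the normalization factor αi is realized by normalizing over occurrence versus non-occurrence, and the likelihood values are illustrative, not from the paper:

```python
# One iterative Bayesian update per stream, as in Equation (1).
def bayes_update(p_prev, lik_occ, lik_non):
    """p_prev: P(e|M^{i-1}); lik_occ, lik_non: P(M_i|e), P(M_i|not e).
    Returns P(e|M^i) with alpha_i chosen to normalize the two hypotheses."""
    num = p_prev * lik_occ
    return num / (num + (1 - p_prev) * lik_non)

p = 0.5   # uninformative prior before any stream is assimilated
for lik_occ, lik_non in [(0.9, 0.2), (0.7, 0.4)]:   # illustrative likelihoods
    p = bayes_update(p, lik_occ, lik_non)
print(round(p, 3))   # 0.887 -- both streams favor occurrence
```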


The confidence in each media stream is computed by experimentally determining its accuracy. To integrate the confidence into the assimilation process, we use consensus theory. Consensus theory provides a notion of combining single probability distributions based on their weights [Benediktsson and Kanellopoulos 1999]. In our case, we essentially do the same by assigning weights to different media streams based on their confidence information. If we have more confidence in a media stream, a higher weight is given to it. Several consensus rules have been proposed; however, the most commonly used consensus rules are the linear opinion pool (LOP) and the logarithmic opinion pool (LOGP). In the linear opinion pool, non-negative weights are associated with the sources to quantitatively express the "goodness" of each source. The rule is formulated as:

$$T_c(p_1, p_2, \ldots, p_n) = \sum_{i=1}^{n} w_i p_i$$

where pi, 1 ≤ i ≤ n, are the individual probabilistic decisions, and wi, 1 ≤ i ≤ n, are their corresponding weights, whose sum is equal to 1, i.e. $\sum_{i=1}^{n} w_i = 1$. We use the logarithmic opinion pool since it satisfies the assumption of conditional (content-wise) independence among media streams, which is essential to assimilation. The rule is described as [Genest and Zidek 1986] -

$$\log[T_c(p_1, p_2, \ldots, p_n)] = \sum_{i=1}^{n} w_i \log(p_i) \qquad (2)$$

or

$$T_c(p_1, p_2, \ldots, p_n) = \prod_{i=1}^{n} p_i^{w_i} \qquad (3)$$

where pi, 1 ≤ i ≤ n, are the individual probabilistic decisions and $\sum_{i=1}^{n} w_i = 1$. We normalize it over the two aspects of an event - the occurrence and non-occurrence of the event. The formulation is -

$$T_c(p_1, p_2, \ldots, p_n) = \frac{\prod_{i=1}^{n} p_i^{w_i}}{E\left(\prod_{i=1}^{n} p_i^{w_i}\right)} \qquad (4)$$

We use this formulation to develop the assimilation model, which will be described shortly.
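As a quick sketch (our own illustration, not code from the paper), the two pools can be written as follows; `math.prod` computes the products in equations (3) and (4), and the normalization is taken over the occurrence and non-occurrence of the event:

```python
import math

def linear_opinion_pool(probs, weights):
    """LOP: weighted arithmetic mean of the individual decisions."""
    return sum(w * p for w, p in zip(weights, probs))

def log_opinion_pool(probs, weights):
    """LOGP, equations (3)-(4): weighted geometric mean of the decisions,
    normalized over event occurrence and non-occurrence."""
    occ = math.prod(p ** w for p, w in zip(probs, weights))
    non = math.prod((1.0 - p) ** w for p, w in zip(probs, weights))
    return occ / (occ + non)
```

With equal weights and identical inputs both pools return the common value; they differ only when the sources disagree.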

The agreement coefficient between two media streams is used as a scaling factor for the overall probability of occurrence of an event. The idea is that the higher the agreement coefficient between the two media streams, the higher the overall probability. We use this notion in the proposed assimilation model.

The assimilation model that combines the probabilistic decisions based on two sources M^{i-1} (i.e. a group of i − 1 streams) and M_i (i.e. an individual ith stream) is given as follows:

P_i = (P_{i-1})^{F_{i-1}} (p_i)^{f_i} e^{γ_i} / [ (P_{i-1})^{F_{i-1}} (p_i)^{f_i} e^{γ_i} + (1 − P_{i-1})^{F_{i-1}} (1 − p_i)^{f_i} e^{−γ_i} ]    (5)

where P_i = P(e_t^j | M_t^i) and P_{i-1} = P(e_t^j | M_t^{i-1}) are the probabilities of occurrence of atomic-event e_j using M^i and M^{i-1}, respectively, at time instant t, and p_i = P(e_t^j | M_{i,t}) is the probability of occurrence of atomic-event e_j based on only the ith stream at time instant t. Similarly, F_{i-1} and f_i (such that F_{i-1} + f_i = 1) are the confidences in M^{i-1} and M_i, respectively. The computation of confidence for a group of media streams will be described shortly. The γ_i ∈ [−1, 1] is the agreement coefficient between the two sources M^{i-1} and M_i; the limits −1 and 1 represent full disagreement and full agreement, respectively, between the two sources. The modelling of γ_i is described in subsequent paragraphs.
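A direct transcription of equation (5), as a sketch under our own naming (not code from the paper):

```python
import math

def assimilate(P_prev, F_prev, p_i, f_i, gamma_i):
    """Equation (5): combine the group decision P_{i-1} (confidence F_{i-1})
    with the new stream's decision p_i (confidence f_i), scaled by the
    agreement coefficient gamma_i in [-1, 1]."""
    occ = (P_prev ** F_prev) * (p_i ** f_i) * math.exp(gamma_i)
    non = ((1.0 - P_prev) ** F_prev) * ((1.0 - p_i) ** f_i) * math.exp(-gamma_i)
    return occ / (occ + non)
```

With γ_i = 0 and equal confidences F_{i-1} = f_i = 0.5, two identical decisions are simply reproduced; a positive γ_i pushes the fused probability towards occurrence, a negative one away from it.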

B. Modelling of the agreement coefficient

The correlation among the media streams refers to the measure of their agreement or disagreement with each other. We call this measure of agreement the "Agreement Coefficient" among the streams. Let the measure of agreement among the media streams at time t be represented by a set Γ(t), expressed as:

Γ(t) = {γ_ik(t)}    (6)

where the term −1 ≤ γ_ik(t) ≤ 1 is the agreement coefficient between the media streams M_i and M_k at time instant t.

The agreement coefficient γ_ik(t) between the media streams M_i and M_k at time instant t is computed by iteratively averaging the past agreement coefficients with the current observation. The γ_ik(t) is precisely computed as:

γ_ik(t) = (1/2) [ (1 − 2 × abs(p_i(t) − p_k(t))) + γ_ik(t − 1) ]    (7)

where p_i(t) = P(e_t^j | M_i) and p_k(t) = P(e_t^j | M_k) are the individual probabilities of occurrence of atomic-event e_j based on media streams M_i and M_k, respectively, at time t ≥ 1; and γ_ik(0) = 1 − 2 × abs(p_i(0) − p_k(0)). These probabilities represent decisions about the atomic-events. Exactly the same probabilities would imply full agreement (γ_ik = 1), whereas totally dissimilar probabilities would mean that the two streams fully contradict each other (γ_ik = −1). Note that any three media streams, in agreeing/disagreeing with each other, follow the commutativity rule.

The agreement coefficient between the two sources M^{i-1} and M_i is modelled as:

γ_i = (1/(i − 1)) ∑_{s=1}^{i-1} γ_si    (8)

where γ_si, for 1 ≤ s ≤ i − 1 and 1 < i ≤ n, are the agreement coefficients between the sth and ith media streams. The agreement fusion model given in equation (8) is based on average-link clustering, in which the distance between one cluster and another is taken to be the average distance from any member of one cluster to any member of the other. In our case, the group M^{i-1} of i − 1 media streams is one cluster, and we find the average distance of the new ith media stream from this cluster. The fused agreement coefficient γ_i is used for combining M_i with M^{i-1}, as described before in equation (5).
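Equations (7) and (8) can be sketched as follows (illustrative code with our own naming):

```python
def update_agreement(gamma_prev, p_i, p_k):
    """Equation (7): average the instantaneous agreement 1 - 2*|p_i - p_k|
    with the previous agreement coefficient gamma_ik(t-1)."""
    return 0.5 * ((1.0 - 2.0 * abs(p_i - p_k)) + gamma_prev)

def fuse_agreement(pairwise_gammas):
    """Equation (8): average-link fusion of the agreement coefficients
    between the new i-th stream and each of the i-1 streams in the group."""
    return sum(pairwise_gammas) / len(pairwise_gammas)
```

Two streams that keep emitting identical probabilities stay at γ_ik = 1, while persistently opposite decisions drive γ_ik towards −1.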

C. Confidence fusion

In the context of streams, the confidence in a stream is related to its accuracy: the higher the accuracy of a stream, the higher the confidence we have in it. We compute the accuracy of a stream by determining how many times an event is correctly detected based on it out of the total number of tries. Note that, in our case, the accuracy of a stream includes the measurement accuracy of the sensor as well as the accuracy of the algorithm used for processing the stream.

The confidence fusion refers to the process of finding theoverall confidence in a group of media streams where the


individual media streams have their own confidence levels. If two streams M_i and M_k have confidence levels f_i and f_k, respectively, what would our confidence be in a group which contains both streams? The intuitive answer is that our overall confidence should increase as the number of streams increases. Considering the confidence values as probabilities, we propose a Bayesian method to fuse the confidence levels of the individual streams. The overall confidence f_ik in a group of two media streams M_i and M_k is computed as follows:

f_ik = (f_i × f_k) / (f_i × f_k + (1 − f_i) × (1 − f_k))    (9)

In the above formulation, we make two assumptions. First, we assume that the system designer's confidence level in each of the media streams is more than 0.5. This assumption is reasonable since there is no point in employing a sensor which is found to be inaccurate more than half of the time. Second, although the media streams are correlated in their decisions, we assume that they are mutually independent in terms of their confidence levels.

For n media streams, the overall confidence is iteratively computed. Let F_{i-1} be the overall confidence in a group of i − 1 streams. By fusing the confidence f_i of the ith stream with F_{i-1}, the overall confidence F_i in a group of i streams is computed as:

F_i = (F_{i-1} × f_i) / (F_{i-1} × f_i + (1 − F_{i-1}) × (1 − f_i))    (10)
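A short sketch of equations (9) and (10); the per-stream confidences below are hypothetical, chosen above 0.5 as the assumptions require:

```python
def fuse_confidence(F_prev, f_i):
    """Equations (9)-(10): Bayesian fusion of the group confidence F_{i-1}
    with the new stream's confidence f_i."""
    return (F_prev * f_i) / (F_prev * f_i + (1.0 - F_prev) * (1.0 - f_i))

# Hypothetical per-stream confidences, each above 0.5.
confidences = [0.6, 0.7, 0.65]
F = confidences[0]
for f in confidences[1:]:
    F = fuse_confidence(F, f)  # overall confidence grows with each stream
```

As long as every f_i > 0.5, each fusion step increases the overall confidence, matching the intuition stated above that adding streams should raise the group confidence.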

3.3.3 Compound-event level assimilation

At the compound-event level, the overall probability p_E of the occurrence of compound-event E is estimated by assimilating the probabilistic decisions p_{e_j}, 1 ≤ j ≤ r, about the r atomic-events using the following assimilation model:

p_E = ∏_{j=1}^{r} p_{e_j} / [ ∏_{j=1}^{r} p_{e_j} + ∏_{j=1}^{r} (1 − p_{e_j}) ]    (11)

If p_E is greater than the threshold Th, the system decides in favor of the occurrence of compound-event E; otherwise it decides against it.

Since the atomic-events are independent, the agreement coefficients among them are considered to be zero and hence are not integrated into equation (11). For example, the atomic-events e1 = "A person is walking in the corridor" and e2 = "A person is shouting in the corridor" are essentially independent, since a person's walking is completely independent of the person's shouting. The confidence information is also not integrated into this assimilation model because confidence is usually associated with media streams and not with the atomic-events.
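Equation (11) and the threshold test reduce to a few lines (a sketch with our own names; Th = 0.70 matches the value used in the experiments below):

```python
import math

def compound_probability(atomic_probs):
    """Equation (11): assimilate the decisions p_{e_j} about r independent
    atomic-events into the overall probability p_E of compound-event E."""
    occ = math.prod(atomic_probs)
    non = math.prod(1.0 - p for p in atomic_probs)
    return occ / (occ + non)

def detect_compound_event(atomic_probs, Th=0.70):
    """Binary decision: compound-event E occurred iff p_E exceeds Th."""
    return compound_probability(atomic_probs) > Th
```

Two moderately confident atomic decisions reinforce each other: p_E for [0.8, 0.8] is about 0.94, well above either input.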

4. EXPERIMENTAL RESULTS

To demonstrate the utility of our proposed framework, we present experimental results in a surveillance and monitoring scenario. The surveillance environment is the corridor of our school building, and the system goal is to detect the events described in Example 1 (in section 3.1), i.e. a human running, walking, standing, talking, shouting and knocking on a door in the corridor. The environment layout is shown in figure 2. We use two video sensors (cameras M1 and M2) to record video from the two opposite sides of the corridor, and two audio sensors (microphones M3 and M4) to capture the ambient sound. A snapshot of the multimedia surveillance system which we have developed is shown in figure 3. The system is implemented using Visual C++ on the MS-Windows platform. MS-Access is used as the database to store the features and the events.

Figure 2: The layout of the corridor under surveillance and monitoring. [Legend: M1 - Camera 1, faces towards side '2' and captures the whole corridor; M2 - Camera 2, faces towards side '1' and captures the whole corridor; M3 & M4 - microphones placed in the corridor.]

Figure 3: Multimedia Surveillance System

4.1 Data set

For our experiments, we have used more than twelve hours of data recorded in the corridor of our school building using the system consisting of two video cameras (Canon VC-C50i) and two USB microphones. Over this period, a total of 92 events occurred, spanning 1268 seconds. The details of the various events and their time durations are given in Table 2. Graduate students from our lab volunteered to perform these activities.

4.2 Performance evaluation

The evaluation of the proposed framework is based on two tasks: event detection and event classification. The evaluation of the event detection task is characterized by two metrics, False Rejection Rate (FRR) and False Acceptance Rate (FAR). FRR is the ratio of the number of events not detected to the total number of events, and FAR is the ratio of the number of non-events detected to the total number of non-events. An event here refers to the observation made over a tw period of time (refer to section 3.2).

The event classification task is evaluated based on the accuracy (ACC) in classification. The metric ACC is defined as the ratio of the number of events correctly classified to the total number of events detected as valid events. Again, an event refers to the observation made over a tw time period.

Table 2: The data set

Events                  Time duration (in seconds)
Standing                139
Walking                 798
Running                 142
Standing, Talking       30
Standing, Shouting      11
Standing, Knocking      59
Walking, Talking        80
Walking, Shouting       9

Figure 4: Determining the optimal value of tw. [Plot of ACC and FRR (in %) against tw (in seconds, from 0 to 4), with the optimal tw marked.]

As described in section 3.2, it is critical to determine the value of tw. Through experiments, we have determined a suitable value of tw to be 1 second. As can be seen from figure 4, at tw = 1 second we obtain the maximum accuracy (ACC) and the minimum FRR.

4.3 Preprocessing steps

4.3.1 Event detection in video streams

The video is processed to detect human motion (running, walking and standing). Video processing involves two major steps: background modeling and blob detection. The background is modeled using an adaptive Gaussian method [Stauffer and Grimson 1999]. Blob detection is performed by first segmenting the foreground from the background using simple 'matching' on the three RGB color channels, and then applying morphological operations (erosion and dilation) to obtain connected components (i.e. blobs). Matching is defined as a pixel value lying within 2.5 standard deviations of the distribution. A summary of the video features used for various classification tasks is provided in Table 3(a). We assume that a blob of area greater than a threshold corresponds to a human. The detected blob and its bounding rectangle are shown in figure 5. Once we compute the bounding rectangle (x, y, w, h) for each blob, where (x, y) denotes the top-left coordinate, w is the width and h is the height, we map the point (x + w/2, h) (i.e. approximating the human's feet) in the image to a point (Ex, Ey) in the 3-D world (i.e. on the corridor's floor), as shown in figure 6. To achieve this mapping, we calibrate the cameras and obtain a transformation matrix that maps image points to points on the corridor's floor. This provides the exact ground location of the human in the corridor at a particular time instant.

Figure 5: Blob detection in Camera 1 and Camera 2: (a)-(b) Bounding rectangle, (c)-(d) Detected blobs

Figure 6: The process of finding, from a video frame, the location of a person on the corridor ground in the 3-D world
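The feet-point mapping can be sketched with a planar homography, assuming (x, y) is the blob's top-left corner so the feet sit at the bottom-center (x + w/2, y + h). The 3×3 matrix H below is an identity placeholder; in practice it would come from the camera calibration step described above.

```python
def feet_point(x, y, w, h):
    """Bottom-center of the bounding rectangle, approximating the feet."""
    return (x + w / 2.0, y + h)

def to_ground(H, u, v):
    """Map image point (u, v) to floor coordinates (Ex, Ey) using a 3x3
    planar homography H: multiply in homogeneous coordinates, then divide."""
    X = H[0][0] * u + H[0][1] * v + H[0][2]
    Y = H[1][0] * u + H[1][1] * v + H[1][2]
    W = H[2][0] * u + H[2][1] * v + H[2][2]
    return (X / W, Y / W)

# Identity stand-in for the real calibration matrix.
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Ex, Ey = to_ground(H, *feet_point(100, 50, 40, 160))
```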

The system identifies the start and end of an event in video streams as follows. If a person moves towards the camera, the start of the event is marked when the blob's area becomes greater than a threshold, and the event ends when the blob intersects the image plane. However, if the person walks away from the camera, the start and end of the event are reversed. Event detection is performed at regular time intervals of tw = 1 second. Using the actual location of the person on the corridor's ground at the end of each time interval tw, we compute the average distance travelled by the person on the ground. Based on this average distance, a Bayes classifier is first trained and then used to classify an atomic-event into one of the classes: standing, walking and running.
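The distance-based classification step can be sketched with per-class Gaussians over the average distance travelled in one tw = 1 s window. The class means and standard deviations below are invented for illustration; the paper trains them from data.

```python
import math

# Hypothetical (mean, std) of average distance (meters per t_w window).
CLASSES = {"standing": (0.05, 0.05), "walking": (1.2, 0.4), "running": (3.0, 0.8)}

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def classify_motion(avg_distance):
    """Bayes classifier with equal priors: posterior proportional to the
    class-conditional likelihood of the observed average distance."""
    likes = {c: gaussian_pdf(avg_distance, m, s) for c, (m, s) in CLASSES.items()}
    total = sum(likes.values())
    posteriors = {c: v / total for c, v in likes.items()}
    return max(posteriors, key=posteriors.get), posteriors
```

The returned posterior dictionary is exactly the kind of probabilistic decision p_i = P(e_j | M_i) that the assimilation model of section 3.3 consumes.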

4.3.2 Event detection in audio streams

Using the audio streams, the system detects events such as footsteps, talking, shouting and door knocking. The audio (sampled at 44.1 kHz) is divided into "audio frames" of 50 ms each. The frame size was chosen by experimentally observing that 50 ms is the minimum period during which an event such as a footstep can be represented. We adopted a hierarchical (top-down) approach to model these events using a Gaussian mixture model (GMM); the top-down event modelling approach works better than the single-level multi-class modelling approach. We performed a separate study to find the suitability of features for detecting these audio events [Atrey et al. 2006]. Table 3(b) summarizes the audio features used for foreground/background segmentation and for classification of events at different levels. The Log Frequency Cepstral Coefficients (LFCCs), with 10 coefficients and 20 filters, worked well for foreground/background segmentation and for distinguishing between vocal/nonvocal and footsteps/knocking events. The LFCCs are computed using a logarithmic filter bank in the frequency domain [Maddage 2006]. The Linear Predictor Coefficients (LPC), which have been widely used in the speech processing community, worked well for demarcating between talking and shouting events.

Figure 7: Audio event classification. [Hierarchy: level 0 - audio frame classified as events (foreground) vs. no event (background); level 1 - vocal vs. nonvocal events; level 2 - talking/shouting (vocal) and footsteps/knocking (nonvocal); level 3 - footsteps classified as walking or running.]

Table 3: A summary of the features used for various classification tasks in video and audio streams

(a) Video
Classification task             Feature used
Foreground/Background           RGB channels
Running/Walking/Standing        Blob's displacement

(b) Audio
Classification task             Feature used
Foreground/Background           LFCC
Vocal/Nonvocal                  LFCC
Talk/Shout                      LPC
Footsteps/Door knocking         LFCC

The Gaussian Mixture Model (GMM) classifier is employed to classify every audio frame (of 50 ms) into the audio events at different levels, as shown in figure 7. At the top level (0), each input audio frame is classified as foreground or background. The background is the environment noise, which represents 'no event' and is ignored. The foreground, which represents the events, is further categorized into two classes, vocal and nonvocal (level 1). At the next level (2), vocal and nonvocal events are further classified into "talking/shouting" and "footsteps/door knocking" events, respectively. Finally, at the last level (3), the footstep sequences are classified as "walking" or "running" based on the frequency of their occurrence in a specified time interval.

Similar to the video, the system makes a probabilistic decision about the events based on the audio streams after every tw = 1 second. Note that, in 1 second, we obtain 20 audio frames of 50 ms each. The audio event classification for the audio data of a tw time period is performed as follows. First, the system learns via training the number of audio frames corresponding to an event in the audio data of a tw time period. Then, a Bayesian classifier is employed to estimate the probability of occurrence of an audio event at a regular time interval tw.

Figure 8: Audio data captured by (a) microphone 1 and (b) microphone 2 corresponding to the event Ek, with the door knocking segments marked

4.4 Example of an event

In this section, we describe with an example how the proposed framework works to detect an event over a timeline. Let us consider a compound event Ek, "A person is walking, knocks on the door and then continues walking in the corridor". This event consists of atomic events occurring in two different ways. First, it consists of two atomic events occurring together, i.e. the "standing" and "door knocking" events. Second, it also consists of atomic events occurring one after another, i.e. a "walking" event followed by a "standing/door knocking" event and then followed by a "walking" event. The audio data captured using microphone 1 and microphone 2 is shown in figure 8. Some of the video frames captured by camera 1 and camera 2 corresponding to the event Ek, and the bounding rectangles of the detected blobs in them, are shown in figure 9.

The system detects the walking event using both audio and video streams, while the standing and knocking events are detected based on the video and audio streams, respectively. The probabilistic decisions about these atomic events are obtained from the respective streams at every tw = 1 second. The overall decision for the compound event is obtained along the timeline by assimilating the probabilistic decisions for the atomic events, as shown in figure 10. Note that in figure 10, the legends denote the events as follows: '◦' - "standing", '□' - "walking", '▽' - "running" and '∗' - "door knocking".

Figures 10a-10d show the timeline-based probabilistic decisions based on individual streams. Figures 10e-10h show the combined decision about the event at a regular time interval with and without using the streams' agreement/disagreement and confidence information.

It is interesting to note from figure 10 that although using the agreement coefficient improves the accuracy of computing the probability of occurrence of an event, it is also important to use the confidence information to avoid incorrect results. For instance, using the streams' confidence information helps in obtaining correct results at time instants 3 and 4 in figures 10g-10h, compared to the results at the same time instants in figures 10e-10f, where confidence information is not used and a "walking" event is detected as "running". Note that the correct sequence of events is as follows: time instants 1-9 "walking", 10-20 "standing/door knocking" and 21-27 "walking".

Figure 9: Some of the video frames captured by (a) camera 1 and (b) camera 2 corresponding to the event E

Figure 10: Timeline-based assimilation of probabilistic decisions about the event E. The plots show the probabilistic decisions (y-axis: probability, 0 to 1; x-axis: timeline in seconds) based on (a) Video stream 1 (b) Video stream 2 (c) Audio stream 1 (d) Audio stream 2 (e) All the streams (without agreement coefficient and confidence information) (f) All the streams (with agreement coefficient but without confidence information) (g) All the streams (with confidence information but without agreement coefficient) (h) All the streams (with both agreement coefficient and confidence information)

Table 4: Results: Using individual streams with Th = 0.70

Stream           FRR    FAR    ACC
Video stream 1   0.12   0.01   0.60
Video stream 2   0.10   0.03   0.60
Audio stream 1   0.07   0.19   0.55
Audio stream 2   0.06   0.27   0.51

4.5 Overall performance analysis

4.5.1 Using individual streams

First, we performed event detection and classification using individual streams. The probability threshold Th for determining the occurrence of an event was set to 0.70; Th is a threshold to convert a probabilistic decision into a binary decision (refer to section 3.3.3). By comparing with the ground truth, we found the results shown in Table 4. As can be seen from Table 4, the FRR of the video streams is higher than that of the audio streams. This is because the video cameras were placed in such a way that they could not cover the whole corridor, and hence could not detect events outside their coverage area. On the other hand, since the microphones could capture the ambient sound even beyond the corridor area, they also detected events that did not occur in the corridor region. Therefore, the microphones are found to have a higher FAR than the video streams.

Using our whole set of events, we computed the accuracies (ACC) of event classification for all four streams. We found the accuracy of individual streams to be moderate; however, the accuracy of event classification based on the video streams was slightly better than that based on the audio streams. We used these accuracy values to assign the confidences in all four streams. Note that the overall accuracies of the video streams are based on three types of events ("standing", "walking" and "running"), while the audio streams' overall accuracies are determined based on five types of events ("walking", "running", "talking", "shouting" and "door knocking").

4.5.2 Assimilation of all streams

We performed assimilation of the probabilistic decisions obtained from individual streams in four different ways, based on whether or not the agreement/disagreement information and the confidence information about them are used. The results are shown in Table 5. Note that these results are obtained by setting the probability threshold Th to 0.70 and the minimum time period tw to 1 second.

Overall observations from Table 5 are as follows -

Table 5: Results: Using all the streams with Th = 0.70

Option   Agreement coefficient   Confidence information   FRR     ACC
1        No                      No                       0.011   0.72
2        Yes                     No                       0.011   0.78
3        No                      Yes                      0.010   0.76
4        Yes                     Yes                      0.012   0.80

• Using multiple streams together provides better overall accuracy (ACC) and a reduced False Rejection Rate (FRR), as can be seen in option 1 in Table 5. FAR is not evaluated in the case of assimilating all the streams, since in the assimilation process only the evidences of occurrence of the events are used, and therefore it does not affect FAR.

• The results (Table 5) imply that using the agreement/disagreement information among the streams is advantageous in obtaining more accurate results; using the confidence information along with it can further improve the overall accuracy of event detection and classification. Again, note that the overall accuracies reported in Table 5 are for all the events listed in Table 2.

4.5.3 Early vs. late thresholding

We also observed the accuracy of event classification by varying the probability threshold Th from 0.50 to 0.99. The results are shown in figure 11, which shows how accuracy (ACC) decreases as the probability threshold Th increases, for individual streams and for all streams assimilated with the four different options based on whether or not the agreement coefficient and confidence information are used. It can be clearly seen from figure 11 that the assimilation of all streams provides better accuracy even with a higher threshold, while individual streams fail in this respect. The accuracy decreases more slowly for the combined evidences than for the individual evidences. This implies that using the agreement/disagreement among and the confidence information of the streams in the assimilation process not only improves the overall accuracy, it also improves the accuracy of computing the probability of occurrence of the events. It also shows that early thresholding of the probabilistic decisions obtained from individual streams leads to lower accuracy; for example, in figure 11, at a probability threshold of 0.80 we obtain higher accuracies (68, 71, 61 and 74 in figures 11e-11h, respectively) after the assimilation of all streams, compared to the accuracies (33, 35, 36 and 33 in figures 11a-11d, respectively) obtained using individual streams.

5. CONCLUSIONS

In this paper, we have presented a novel framework for assimilation of information in order to detect events in surveillance and monitoring systems that utilize multifarious sensors. The experimental results have shown that the use of the agreement coefficient among and the confidence information of media streams helps in obtaining more accurate and credible decisions about the events. The results have also shown that the False Rejection Rate for event detection can be significantly reduced by using all the streams together. In future work, there are many other issues which need to be explored: first, how to determine the minimum time period to confirm different events; second, it would be interesting to see how the framework will work when the information from different sources is made available at different time



x-axis: Probability Threshold (Th), y-axis: Accuracy (ACC)

Figure 11: Plots: Probability Threshold vs. Accuracy. (a) Video stream 1 (b) Video stream 2 (c) Audio stream 1 (d) Audio stream 2 (e)-(h) All streams after assimilation with the four options given in Table 5

instances; what would be the ideal sampling rate for event detection and information assimilation; and finally, how the confidence information about a newly added stream can be computed over time using its agreement/disagreement with the other streams whose confidence information is known, and how it would evolve over time with changes in the environment.

6. REFERENCES

Atrey, P. K., Kankanhalli, M. S., and Jain, R. 2005. Timeline-based information assimilation in multimedia surveillance and monitoring systems. In The ACM International Workshop on Video Surveillance and Sensor Networks. Singapore, 103–112.

Atrey, P. K., Kankanhalli, M. S., and Oommen, J. B. 2006. Goal-oriented optimal subset selection of correlated multimedia streams. ACM Transactions on Multimedia Computing, Communications and Applications. (To appear).

Atrey, P. K., Maddage, N. C., and Kankanhalli, M. S. 2006. Audio based event detection for multimedia surveillance. In IEEE International Conference on Acoustics, Speech, and Signal Processing. Toulouse, France, V813–816.

Benediktsson, J. A. and Kanellopoulos, I. 1999. Classification of multisource and hyperspectral data based on decision fusion. IEEE Trans. on GeoScience and Remote Sensing 37, 3 (May), 1367–1377.

Bloch, D. A. and Kraemer, H. C. 1989. 2 × 2 Kappa coefficients: Measures of agreement or association. Journal of Biometrics 45, 1, 269–287.

Chair, Z. and Varshney, P. R. 1986. Optimal data fusion in multiple sensor detection systems. IEEE Transactions on Aerospace and Electronic Systems 22, 98–101.

Checka, N., Wilson, K. W., Siracusa, M. R., and Darrell, T. 2004. Multiple person and speaker activity tracking with a particle filter. In International Conference on Acoustics Speech and Signal Processing.

Chieu, H. L. and Lee, Y. K. 2004. Query based event extraction along a timeline. In International ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, UK, 425–432.

Genest, C. and Zidek, J. V. 1986. Combining probability distributions: A critique and annotated bibliography. Journal of Statistical Science 1, 1, 114–118.

Hershey, J., Attias, H., Jojic, N., and Krisjianson, T. 2004. Audio visual graphical models for speech processing. In IEEE International Conference on Speech, Acoustics, and Signal Processing. Montreal, Canada, V:649–652.

Kam, M., Zhu, Q., and Gray, W. S. 1992. Optimal data fusion of correlated local decisions in multiple sensor detection systems. IEEE Transactions on Aerospace and Electronic Systems 28, 3 (July), 916–920.

Lin, L. I.-K. 1989. A concordance correlation coefficient to evaluate reproducibility. Journal of Biometrics 45, 1, 255–268.

Maddage, N. C. 2006. Content based music structure analysis. Ph.D. thesis, School of Computing, National University of Singapore.

Nefian, A. V., Liang, L., Pi, X., Liu, X., and Murphy, K. 2002. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Applied Signal Processing 11, 1–15.

Nock, H. J., Iyengar, G., and Neti, C. 2002. Assessing face and speech consistency for monologue detection in video. In ACM International Conference on Multimedia.

Rao, B. S. and Whyte, H. D. 1993. A decentralized Bayesian algorithm for identification of tracked objects. IEEE Transactions on Systems, Man and Cybernetics 23, 1683–1698.

Siegel, M. and Wu, H. 2004. Confidence fusion. In IEEE International Workshop on Robot Sensing. 96–99.

Stauffer, C. and Grimson, W. E. L. 1999. Adaptive background mixture models for real-time tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Vol. 2. Ft. Collins, CO, USA, 252–258.

Wu, Y., Chang, E. Y., Chang, K. C.-C., and Smith, J. R. 2004. Optimal multimodal fusion for multimedia data analysis. In ACM International Conference on Multimedia. New York, USA, 572–579.