
Multimodal Event Detection in User Generated Videos

Francesco Cricri1, Kostadin Dabov1, Igor D.D. Curcio2, Sujeet Mate2, Moncef Gabbouj1

1Department of Signal Processing, Tampere University of Technology, Tampere, Finland, {francesco.cricri, kostadin.dabov, moncef.gabbouj}@tut.fi

2Nokia Research Center, Tampere, Finland, {igor.curcio, sujeet.mate}@nokia.com

Abstract— Nowadays most camera-enabled electronic devices contain various auxiliary sensors such as accelerometers, gyroscopes, compasses, GPS receivers, etc. These sensors are often used during the media acquisition to limit camera degradations such as shake and also to provide some basic tagging information such as the location used in geo-tagging. Surprisingly, exploiting the sensor-recordings modality for high-level event detection has been a subject of rather limited research, further constrained to highly specialized acquisition setups. In this work, we show how these sensor modalities, alone or in combination with content-based analysis, allow inferring information about the video content. In addition, we consider a multi-camera scenario, where multiple user generated recordings of a common scene (e.g., music concerts, public events) are available. In order to understand some higher-level semantics of the recorded media, we jointly analyze the individual video recordings and sensor measurements of the multiple users. The detected semantics include generic interesting events and some more specific events. The detection exploits correlations in the camera motion and in the audio content of multiple users. We show that the proposed multimodal analysis methods perform well on various recordings obtained in real live music performances.

Keywords— Multimodal; indexing; analysis; event; motion; video

I. INTRODUCTION

In the past decade there has been a tremendous growth in the amount of user generated multimedia content (images and video). This was enabled by rapid advances in the multimedia recording capabilities of mobile devices. It is therefore becoming increasingly important to extract rich semantic information about the media content in order to enable its effective storage, retrieval, and usage. A preliminary step towards this goal is indexing the multimedia content. A typical retrieval use case is searching for particular objects and events within a video, e.g., a player scoring a goal in a football match. A significant amount of work has already been done on extracting low- and high-level descriptors used for multimedia indexing. In the following we discuss some of these content-based methods and in particular concentrate on multimodal approaches.

A set of relatively low-level descriptors is made available as part of the MPEG-7 content description standard [1]. In [2], events are extracted from news videos in a multimodal, unsupervised fashion. This method combines information from audio, visual appearance, faces, and mid-level semantic concepts by applying coherence rules. In [3] the authors exploit the correlation between text and image components of multimedia content. The method is applied to cross-modal retrieval, i.e., to retrieving text documents in response to query images and vice-versa. It is shown that the cross-modal model outperforms existing uni-modal image retrieval systems.

Multimedia indexing techniques play a key role in various multimedia applications. In automatic video summarization, interesting portions of the video are extracted and included in the video summary. In [4], the authors consider user generated videos. In particular, the camera motion is regarded as an indicator of the camera person's interest in the scene, and is analyzed by means of content analysis for identifying key-frames and regions of interest within a key-frame. In [5] a multimodal analysis for automated event detection in broadcast sport videos is proposed. The co-occurrence or temporal sequence of low-level, mid-level and high-level features from the audio and visual content (mainly color and motion) is analyzed for detecting events such as “Audience”, “Field”, “Goal”, “Close-up of player”, “Tennis point” and “Soccer goal”.

Automatic or semi-automatic video mash-up generation is another application which relies on indexed high-level events. A video mash-up consists of video segments recorded by different cameras during approximately the same time span (for example during a live music show). The ultimate goal of these methods is to produce a video mash-up that is similar to what a human could produce. For example, this would require knowledge about the presence of an interesting event captured by a certain person, in order to switch view to that camera when the event is taking place. In [6] a study on requirements for creating video mash-ups of concerts is presented. One example of the obtained requirements is to include faces of the audience while cheering. In [7], a system for organization of user-contributed content from live music shows is presented. The authors consider as interesting events those time-intervals for which a higher number of captured videos is available. A method for detecting spatio-temporal events of interest in single and multi-camera systems is described in [8], in which low-level events are used as building blocks and combined by


several available operators for detecting semantically higher-level events. In [9], interesting events are detected when users behave in a similar way, for example in terms of view similarity and acoustic-ambience fluctuation. They also mention the use of the compass to detect what they call group rotation, indicated by a large number of people rotating towards the event.

Nowadays a large portion of camera-enabled electronic devices contains various auxiliary sensors such as accelerometers, gyroscopes, magnetometers (compasses), GPS receivers, light sensors, etc. These sensors are often used during the media acquisition, for example to provide some basic tagging information such as the location used in geo-tagging. In [10], the authors study the use of geo-location, image content and compass direction for annotating images. The use of the compass (for group rotation) and of the accelerometer (for shake detection) was mentioned in [9]. By capturing and recording the sensor data while the device is recording a video, it is possible to obtain additional information about the content being recorded [11]. As also proposed in [12], by inferring the context of the recording process (for example, how the users move their cameras) it would be possible to reduce the distance between low-level and higher-level analysis (the “semantic gap”). In this paper we propose a set of analysis methods which exploit auxiliary sensor data from multiple cameras for detecting the following generic interesting events in the individual video recordings: correlated camera panning and correlated camera tilting. Furthermore, we provide an example of how the detected generic events can be used to discover the occurrence of more specific interesting events, by combining the sensor data analysis with audio content analysis. We focus on the particular use case of recording live music shows.

The results of the proposed analysis are used for indexing the multimedia content. An illustration of the proposed event detection is given in Fig. 1. The paper is organized as follows: in Section I.A we introduce the auxiliary sensors that we use in this work. Section II describes the proposed multimodal analysis methods. Experimental results are given in Section III. Section IV contains a discussion of the developed methods and of the obtained results. Section V concludes the paper.

A. Auxiliary Sensors in Mobile Camera-Enabled Devices

The auxiliary sensors that we use for analyzing video content are:
• Tri-axial accelerometer,
• Compass (tri-axial magnetometer).

A tri-axial accelerometer records acceleration across three mutually perpendicular axes. One very important characteristic of this sensor is that, in the absence of other accelerations, it senses the static acceleration of 1 g (approximately 9.81 m/s² at sea level) in the direction of Earth's center of mass. This relatively strong static acceleration allows identifying the tilt of the camera with respect to the horizontal plane, i.e., the plane that is perpendicular to the gravitation force. We fix the camera orientation with respect to the three perpendicular accelerometer measurement axes as shown in Fig. 2.

Figure 2. Accelerometer data axes alignment with respect to the camera.

The electronic compasses that we consider are realized from tri-axial magnetometers. These sensors output the instantaneous horizontal orientation of the device embedding them with respect to the magnetic north. By using a tri-axial magnetometer, the sensed orientation is correct even in the presence of tilt (with respect to the horizontal plane).

Accelerometers and compasses provide mutually complementary information about the attitude of the device in which they are embedded. Accelerometers provide the angle of tilt with respect to the horizontal plane whereas compasses provide information about the angle of rotation within the horizontal plane (with respect to magnetic north). In case of a camera embedding these sensors, an accelerometer can provide information about tilt movements, whereas a compass can provide information about panning movements. We assume that the sensor readings are sampled at a fixed sampling rate. Also, we assume that the sampling timestamps are available and they are aligned with the start of video recording. The recorded sensor data can be regarded as a separate data stream. In the results section, we show that these assumptions are reasonable and can be readily satisfied without specialized hardware setup.

Figure 1. Overview of the proposed event detection exploiting multiple recordings and multiple modalities (audio and auxiliary sensors).


II. MULTIMODAL ANALYSIS FOR EVENT DETECTION

In this section we describe a set of methods for detecting interesting events for the purpose of indexing multimedia content. We consider a recording scenario in which multiple people are simultaneously recording the same happening, for example, a live music show. The most important aspects of this work are a joint analysis of the multiple recordings and exploitation of auxiliary sensors. In Section II.A we propose methods for detecting individual camera movements (camera tilting and panning) solely by exploiting sensor data. These are the lowest level events considered in this work and they are the basis for the proposed detection of interesting events. We develop methods for detecting generic interesting events in Section II.B by jointly analyzing multimodal data from multiple cameras. In Section II.C we provide examples of combining the sensor data analysis with audio content analysis to detect more specific interesting events, for the particular use case of live music shows.

A. Detecting Individual Camera Moves

The basic foundation of the multi-camera event detection is the detection of individual camera movements. We consider tilting and panning movements. To obtain information about these camera movements, there are mainly two approaches: content-based techniques and motion-sensor-based techniques. Content-based techniques (e.g., [4]) analyze the video content data for determining if and how the device has moved. Unfortunately, this approach is typically computationally demanding; furthermore, the performance could be impaired by the motion of objects present in the recorded scene. One way to overcome both of these issues is to analyze the data output by the motion sensors considered in Section I.A, which directly describe the movement of the recording device.

For detecting camera tilt, we exploit the ability of the accelerometer to detect both:

• the steady-state camera tilt angle (i.e., when no tilting is being performed by the cameraman),

• the speed and duration of a camera tilting.

We assume that the camera tilting involves marginal translational acceleration, especially when compared to the Earth's gravitational component of 1 g. Actually, this assumption should be satisfied in order to capture video that does not suffer from obtrusive camera shake or motion blur. When the cameraman is performing a tilting action, the gravitational component changes so that it gets distributed between the X and the Z axes from Fig. 2. Let us define the angle of tilt as the angle between the Z axis and the horizontal plane. The angle of tilt for a given instantaneous static acceleration vector a = [a_X, a_Y, a_Z] is

α = arctan(a_Z / a_X).    (1)

To detect a tilting action, we compute the discrete derivative of the obtained instantaneous angles, which indicates the speed at which the tilt angle changes. Let us denote this derivative as α'. In order to detect the temporal extent of a camera tilting move, we compare the magnitude of the derivative with a predefined threshold, ε, and obtain an array T whose elements indicate the detected tilting action:

T_i = sign(α'_i) if |α'_i| > ε, and T_i = 0 otherwise.    (2)

That is, nonzero values of T indicate that a tilting action is being performed, and their sign gives the tilting direction.
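For concreteness, (1) and (2) can be implemented as in the following minimal Python/NumPy sketch; the threshold value is an illustrative placeholder rather than the one tuned in our experiments, and the function name is ours.

import numpy as np

def detect_tilting(acc_xyz, fs=40.0, threshold_deg_per_s=5.0):
    """Detect camera tilting actions from tri-axial accelerometer samples.

    acc_xyz: array of shape (N, 3) with columns a_X, a_Y, a_Z (see Fig. 2).
    fs: accelerometer sampling rate in Hz (40 samples/sec in our setup).
    threshold_deg_per_s: placeholder threshold on the tilt-angle speed.
    Returns (tilt_angle_deg, tilting), where tilting[i] is +1/-1 during an
    upward/downward tilting action and 0 otherwise, as in (2).
    """
    a_x, a_z = acc_xyz[:, 0], acc_xyz[:, 2]
    # Equation (1): tilt angle between the Z axis and the horizontal plane.
    tilt_angle_deg = np.degrees(np.arctan2(a_z, a_x))
    # Discrete derivative of the tilt angle, converted to degrees per second.
    tilt_speed = np.gradient(tilt_angle_deg) * fs
    # Equation (2): keep only samples whose speed magnitude exceeds the threshold.
    tilting = np.sign(tilt_speed) * (np.abs(tilt_speed) > threshold_deg_per_s)
    return tilt_angle_deg, tilting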

In a similar manner as for the tilting, we also detect individual camera panning movements, by analyzing the data provided by the electronic compass. This sensor directly outputs the orientation angle. As a pre-processing step, we apply low-pass filtering to the raw compass data to avoid obtaining peaks in the derivative which are due to short or shaky camera movements, rather than to intentional camera motion.
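A corresponding sketch for the compass-based panning detection (low-pass filtering of the raw orientation followed by derivative thresholding) is given below; the moving-average length and the threshold are again illustrative placeholders.

import numpy as np

def detect_panning(compass_deg, fs=10.0, smooth_len=5, threshold_deg_per_s=10.0):
    """Detect camera panning actions from compass orientation samples.

    compass_deg: horizontal orientation w.r.t. magnetic north, in degrees.
    fs: compass sampling rate in Hz (10 samples/sec in our setup).
    smooth_len, threshold_deg_per_s: placeholder filter length and threshold.
    Returns an array that is nonzero while a panning action is detected.
    """
    # Unwrap to avoid artificial jumps at the 0/360 degree boundary.
    unwrapped = np.degrees(np.unwrap(np.radians(compass_deg)))
    # Simple moving-average low-pass filter to suppress short, shaky movements.
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(unwrapped, kernel, mode="same")
    pan_speed = np.gradient(smoothed) * fs
    return np.sign(pan_speed) * (np.abs(pan_speed) > threshold_deg_per_s)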

B. Generic Interesting Event Detection from Correlated Camera Motion

When people attend a public show such as a live music performance or a sports happening, some of them use their handheld cameras to record videos. During these public shows there are often certain events that attract the attention of some or most of the people, and thus also of those who are recording. We consider such events as interesting events. For example, an interesting event in a music show is the guitar player starting to play a solo. In the case of a football match, an interesting event occurs when the ball is thrown from one side of the field to the opposite side. In all these examples, the most natural thing that the people who are recording would do is to capture the event, by turning their cameras towards the subject that has triggered the interesting event (e.g., the guitar player) or by following the object of interest (e.g., the ball). As a result, the cameras of these people will be moved in an approximately similar (or correlated) manner; a similar idea is utilized for automatic zooming in [13].

The correlated and simultaneous motion of some cameras is a potentially interesting event for the viewer of the recorded videos, as it was found interesting and worth recording by multiple camera users. In contrast, a camera movement performed solely by one camera user might have been triggered by something of interest only for that particular user. As the specific type of event that has triggered the attention of the camera users will not be identified, we denote such events as generic interesting events. We detect the camera motion for each individual video recording by means of sensor data analysis as described in Section II.A. We assume that the recordings of all users are mutually temporally aligned; the alignment itself is a topic that we do not consider in this work. In the following we describe how we detect interesting events using the compass data and the accelerometer data.

1) Correlated Camera Panning

In order to detect generic interesting events from correlated panning movements we perform correlation analysis on the compass data received from different users. In this way, we aim to discover temporal segments within which the compass data from some users is highly correlated.


Figure 3. Correlated camera panning movements (view from the top). The orientations of the cameras before and after the correlated panning are represented respectively by the dotted lines and the solid lines.

Thus, it is likely that in those time intervals the users have turned their cameras in a similar way (i.e., with a similar temporal extent of the camera movement, similar speed, etc.). Fig. 3 shows an example in which a number of users are recording a stage and at some point a subset of these users turn their cameras towards an interesting point (performer 1). For the correlation analysis we have chosen to compute the Spearman rank correlation coefficient [14], as it allows for detecting similar panning movements even if the cameras are turned in opposite directions or the speed of the panning varies. The method can be summarized in the following steps:

1. Compute the Spearman correlation between the raw compass data streams produced by the cameras.
2. Detect the major peaks of correlation by thresholding the correlation signal using an empirically set value.
3. Detect the camera panning movements for each camera.
4. Detect the co-occurrence of a correlation peak for a certain set of cameras and of panning movements for the same set of cameras.

Steps 3 and 4 are needed for filtering out irrelevant correlated camera motion (i.e., not related to a substantial change in the orientation); thus, we consider as correlated camera panning each of those temporal segments in which there is co-occurrence of a correlation peak and of camera panning movements performed by the camera users.
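As an illustration of these four steps, a windowed two-camera version could look as follows; this is a minimal Python sketch using scipy.stats.spearmanr, in which the window length and the correlation threshold are illustrative placeholders rather than the empirically set values used in our experiments.

import numpy as np
from scipy.stats import spearmanr

def correlated_panning_events(compass_1, compass_2, panning_1, panning_2,
                              fs=10.0, win_s=5.0, corr_threshold=0.9):
    """Detect generic interesting events from correlated panning of two cameras.

    compass_1, compass_2: temporally aligned raw compass streams (degrees).
    panning_1, panning_2: per-sample panning indicators for each camera,
        nonzero while that camera performs a panning movement (Section II.A).
    Returns the start indices of the windows in which (a) the magnitude of
    the Spearman correlation exceeds the threshold and (b) both cameras pan.
    """
    win = int(win_s * fs)
    n = min(len(compass_1), len(compass_2))
    events = []
    for start in range(0, n - win, win):
        stop = start + win
        rho, _ = spearmanr(compass_1[start:stop], compass_2[start:stop])  # step 1
        peak = abs(rho) > corr_threshold                                  # step 2
        both_pan = (np.any(panning_1[start:stop]) and
                    np.any(panning_2[start:stop]))                        # step 3
        if peak and both_pan:                                             # step 4
            events.append(start)
    return events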

Figure 4. Two users are recording an object of interest which moves in the vertical direction.

2) Correlated Camera Tilting

In this section we consider correlated camera tilting performed by two or more users, by comparing their individual detected tilts as given by (2). An illustration of this event is given in Fig. 4, where two users are recording and following an object of interest moving in the vertical direction. Let us consider the individual tilts of any two camera users (detected as described in Section II.A), described by their pairs of beginning and end timestamps, t_beg-1, t_end-1 and t_beg-2, t_end-2. We classify two or more camera tilts as correlated if their temporal boundaries are within a predefined threshold, t_corr, that is, if |t_beg-1 − t_beg-2| < t_corr and |t_end-1 − t_end-2| < t_corr. In addition, the tilt directions, given by the sign of the derivative, must coincide. For the correlated tilt detection we use a different approach than for the correlated panning detection, as we consider as correlated only those tilts performed in the same direction. The reason for this is that camera tilting in opposite directions is highly unlikely to be a result of following the very same object of interest: such an unlikely scenario occurs when two camera users are at substantially different altitudes and the object or scene they are recording is positioned between them in altitude. In contrast, camera panning in opposite directions is rather common, as the camera users can have various locations with respect to the scene of interest. For example, in a football game, camera users from opposing sections of the stadium will pan their cameras in opposite directions to capture a movement of a player.
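A minimal sketch of this correlated-tilt test is given below, with each detected tilt represented as a (t_beg, t_end, direction) triple; the value of t_corr is a placeholder.

def tilts_are_correlated(tilt_1, tilt_2, t_corr=1.0):
    """Decide whether two detected camera tilts are correlated.

    tilt_1, tilt_2: tuples (t_beg, t_end, direction), with timestamps in
    seconds on a common timeline and direction = +1 or -1, as given by the
    sign of the tilt-angle derivative.
    t_corr: placeholder tolerance on the temporal boundaries, in seconds.
    """
    t_beg_1, t_end_1, dir_1 = tilt_1
    t_beg_2, t_end_2, dir_2 = tilt_2
    boundaries_match = (abs(t_beg_1 - t_beg_2) < t_corr and
                        abs(t_end_1 - t_end_2) < t_corr)
    same_direction = (dir_1 == dir_2)   # opposite tilt directions are rejected
    return boundaries_match and same_direction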

C. Specific Event Detection from Co-occurrence of Simple Events

We use the detected generic events from correlated camera panning for discovering some specific interesting events (with higher-level semantics) for the particular use case of live music shows. The algorithm takes advantage of two main ideas. The first one consists of jointly analyzing data streams of the same modality (e.g., compass data) captured by multiple devices, in order to detect simple events (each one obtained from one sensor modality). The second idea consists of exploiting the availability of data from multiple modalities, by combining the detected simple events for obtaining higher-level semantics of the media analysis. By using the same notation as in [8], we can think of the simple events as primitive events and of the specific events as composite events. The combination is based on the temporal co-occurrence of the simple events. In this work we give an example of detecting two specific events that can be defined by means of compound sentences. The first event is: “The song starts and some of the users turn their cameras towards the performance stage” (we denote this as specific event S1). The second event is: “The song ends, the audience starts clapping, and some users turn their cameras towards the audience” (denoted as S2). The reason for


considering these particular events is that in live music shows it is common to record the audience clapping hands or cheering at the end of each performed song, as it is considered a salient moment. It is also common to point the camera back to the stage when the next song starts.

The primitive or simple events used for detecting the considered specific events are the camera panning correlation (as already described in Section II.B.1), the audio class, and the audio similarity degree.

In order to detect the instants when a song starts and ends (i.e., for performing song segmentation) we classify the audio into “Music” and “No music” using the audio classifier described in [15], which we have specifically trained for live music shows. The end of class “No music” and the start of class “Music” represent the start of a song, whereas the end of class “Music” and the start of class “No music” represent the end of a song. As the audio classification is applied on live music audio recorded by mobile phones, the presence of ambient noise and the quality of the microphones reduce the performance of the audio classifier. As a result, we obtain a relatively high rate of missed detections for the start or end of songs. To overcome this problem, we detect the start and end of songs also with an additional method that exploits the following idea. While a song is being played, even if the cameras are spread all over the audience area, they will capture similar audio content, which is produced by the loudspeakers. In contrast, after the song has ended, the cameras will not capture the same sound anymore and the audio similarity among them will likely be low. We take advantage of this by detecting whether the audio content recorded by the users is similar or not. We interpret the transitions in the audio similarity degree for most of the cameras from low to high and from high to low as, respectively, the start and the end of a song. We compute the similarity degree for temporally overlapping audio segments extracted from the video recordings of different cameras. For computing the audio similarity, we use the recent method described in [15]. Then we apply a threshold on the output of the analysis in order to discriminate between low and high similarity degree.
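As an illustration, song boundaries can be derived from the frame-level audio class labels as in the following sketch; the label strings and the frame step are assumptions, and an analogous transition test is applied to the thresholded audio similarity degree.

def song_boundaries(labels, frame_step_s=1.0):
    """Derive song start/end timestamps from audio class labels.

    labels: per-frame labels, e.g. ["No music", "Music", "Music", ...],
            one label every frame_step_s seconds.
    Returns (starts, ends): a "No music" -> "Music" transition marks the
    start of a song, a "Music" -> "No music" transition marks its end.
    """
    starts, ends = [], []
    for i in range(1, len(labels)):
        t = i * frame_step_s
        if labels[i - 1] == "No music" and labels[i] == "Music":
            starts.append(t)
        elif labels[i - 1] == "Music" and labels[i] == "No music":
            ends.append(t)
    return starts, ends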

For detecting the two specific events S1 and S2, we consider the co-occurrence (within a predetermined temporal window) of a panning correlation peak and a state change for at least one of the two song segmentation methods. S1 is detected when a panning correlation peak (for some cameras) occurs and at the same time there is a transition from no-song to song (detected either by the audio classifier or by the audio similarity algorithm, or both). In that case we can assert that the song has started and, very likely, some of the users have pointed their cameras towards something of interest, probably the performance stage. S2 is detected when there is a co-occurrence of a panning correlation peak and a transition from song to no-song (detected either by the audio classifier or by the audio similarity algorithm, or both); we then assert that the song has ended and, very likely, some of the users have pointed their cameras at something they found interesting, for example the audience which is applauding. In case the two song segmentation methods provide conflicting results (i.e., for the same temporal window one method detects the start of a song and the other detects the end), we consider only the

detection provided by the audio classifier, as it has proved to be more reliable than the audio similarity analysis.
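The co-occurrence logic for S1 and S2 can be summarized by the following sketch; the list-of-timestamps representation of the primitive events and the length of the temporal window are illustrative assumptions.

def detect_specific_events(panning_peaks, clf_starts, clf_ends,
                           sim_starts, sim_ends, window_s=10.0):
    """Detect specific events S1 (song start) and S2 (song end).

    panning_peaks: timestamps of correlated-panning peaks (Section II.B.1).
    clf_starts/clf_ends: song starts/ends from the audio classifier.
    sim_starts/sim_ends: song starts/ends from the audio similarity analysis.
    window_s: placeholder co-occurrence window in seconds.
    Returns a list of (timestamp, label) pairs with label "S1" or "S2".
    """
    def near(t, times):
        return any(abs(t - u) <= window_s for u in times)

    events = []
    for t in panning_peaks:
        start = near(t, clf_starts) or near(t, sim_starts)
        end = near(t, clf_ends) or near(t, sim_ends)
        if start and end:
            # Conflicting song segmentation: trust only the audio classifier.
            start, end = near(t, clf_starts), near(t, clf_ends)
        if start:
            events.append((t, "S1"))
        elif end:
            events.append((t, "S2"))
    return events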

III. RESULTS

In this section we evaluate the performance of the proposed multimodal analysis. As our methods exploit streams of sensor (compass and accelerometer) measurements recorded simultaneously with the video recording, there are no publicly available datasets that already contain such sensor data. Therefore, we use test data obtained as described in Section III.A. We used publicly available smart phones and simple dedicated software to collect the sensor data synchronously with video recording. The default sampling rate for each sensor was used, i.e., 40 samples/sec for the accelerometer and 10 samples/sec for the compass, where each sample is further labeled with its timestamp. The time alignment between a recorded video and the recorded sensor data was straightforward to obtain, as the video start- and stop-recording times were obtained from the creation time of the media file and were then matched to the timestamps of the sensor measurements. The software application that we used stored the sensor measurements (and associated timestamps) as data streams associated with the recorded video.
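For clarity, the alignment between a video and its sensor streams can be expressed as in the following sketch, which assumes only the media-file creation time and the per-sample timestamps mentioned above.

def align_sensor_to_video(sensor_timestamps, video_start, video_stop):
    """Select sensor samples falling within the video recording interval.

    sensor_timestamps: absolute timestamps of the sensor samples (seconds).
    video_start, video_stop: absolute recording start/stop times taken from
    the creation time of the media file, as described above.
    Returns the indices of the samples belonging to the recording, together
    with their timestamps relative to the start of the video.
    """
    indices = [i for i, t in enumerate(sensor_timestamps)
               if video_start <= t <= video_stop]
    relative = [sensor_timestamps[i] - video_start for i in indices]
    return indices, relative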

In this section we use the following measures for evaluating the performance of the detection methods:

• Precision (P) – fraction of the detected events which are indeed true interesting events.

• Recall (R) – fraction of the true interesting events which are detected correctly.

• Balanced F-measure (F) – it is computed as the harmonic mean of the precision and recall [16].
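Given the ground-truth count (GT), the true positives (TP) and the false positives (FP) reported in Tables I and II, these measures are computed as in the following sketch.

def detection_scores(gt, tp, fp):
    """Precision, recall and balanced F-measure from GT, TP and FP counts."""
    precision = tp / (tp + fp)   # fraction of detections that are correct
    recall = tp / gt             # fraction of true events that are detected
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

# Example: detection_scores(73, 69, 7) gives approximately (0.91, 0.95, 0.93),
# matching the "Camera Tilting, Dataset 1" row of Table I.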

A. Test Datasets

Dataset 1 contains 47 video recordings from two different live music shows (featuring 10 bands), spanning an overall time duration of about 62 minutes. The data was collected by nine users that were attending the shows and were sparsely located in the audience (so that there is no visual interference between them during the recording).

Dataset 2: in order to obtain a dataset with a significant number of camera movements, we simulated a recording experiment in a large auditorium, where two camera users were asked to record what they considered interesting for a duration of 25 minutes. During that time, moving objects were displayed in different parts of the large auditorium display, serving as potential reasons to perform a camera move.

B. Detection of Individual Camera Moves

We applied the proposed camera panning and tilting detection methods on Datasets 1 and 2. The results are summarized in Table I. These results show good detection accuracy, which is expected as we analyze data from motion sensors (instead of the video content).

C. Detection of Generic Interesting Events

In this section we provide empirical results of the methods described in Section II.B for detecting generic interesting events.


Figure 5. Correlated camera panning movements.

1) Generic Interesting Events from Correlated Camera Panning

We have applied this method on recordings from real live music performances (Dataset 1) and on recordings made in a simulated environment (Dataset 2). Table I summarizes the experimental results for the detection of generic interesting events from correlated camera panning for both datasets. Fig. 5 shows the detection of some interesting events from Dataset 1.

For Dataset 1 we manually annotated 50 interesting events.

Following is a list of the types of detected interesting events for that experiment:

• Audience putting hands up or clapping.
• Person being lifted by the audience.
• Start of singing.
• Start of guitar solo.
• Guitar player playing hi-hat (a cymbal).
• Performers moving across the stage.
• Music starting after a break.

As can be seen in Table I, the detection performance for Dataset 1 is lower than that for Dataset 2. In particular, we obtained an increased number of false positives, caused by inadvertent camera movements performed during the live music performances. Based on these experimental results, we can assert that the proposed method is effective in recognizing such interesting events when recording real live music performances.

(Figure 5 plot: raw compass data of User 1 and User 2, orientation in degrees versus time in seconds, with the true and the detected interesting events marked.)

TABLE I. Test results for detecting individual users' camera movements and generic interesting events. TP stands for true positives, FP for false positives, GT for ground truth (manually annotated events).

                                                       GT    TP   FP    P     R     F
Camera Panning
    Dataset 1                                         130   109   11   0.91  0.84  0.88
    Dataset 2                                         104   104    2   0.98  1.0   0.99
    Total                                             234   213   13   0.94  0.91  0.93
Camera Tilting
    Dataset 1                                          73    69    7   0.91  0.95  0.93
    Dataset 2                                         207   204    9   0.96  0.99  0.97
    Total                                             280   273   16   0.94  0.97  0.96
Generic Interesting Event (from Correlated Panning)
    Dataset 1                                          50    43   17   0.72  0.86  0.78
    Dataset 2                                          58    48    4   0.92  0.83  0.87
    Total                                             108    91   21   0.81  0.84  0.83
Generic Interesting Event (from Correlated Tilting)
    Dataset 1                                           5     5    1   0.83  1.0   0.91
    Dataset 2                                         112   105    9   0.92  0.94  0.93
    Total                                             117   110   10   0.92  0.94  0.93

Figure 6. Example of detecting interesting events from correlated camera tilting.

(Figure 6 plots, over time in seconds: the camera tilt angle during video recording for the two users; the speed of tilt and the detected individual tilts for camera user 1; the speed of tilt and the detected individual tilts for camera user 2; and the detected correlated tilts.)


Figure 7. Illustration of generic interesting events (from correlated panning and tilting) and an example of their usage for video summarization. This test data has been extracted from Dataset 1 (for visualization purposes, only three cameras are shown in the figure).

2) Generic Interesting Events from Correlated Camera Tilting

The detection of interesting events from correlated camera tilting movements is obtained by the method proposed in Section II.B.2. The objective performance evaluation of this method is given in Table I. One note about Dataset 1 is that it contains considerably fewer correlated tilting movements compared with correlated panning; one explanation is that a change in the horizontal direction is more probable when capturing an object of common interest. Plots of the detected camera tilt angles and the detected correlated tilts are illustrated in Fig. 6. In this plot, it can be seen that false positive detections of individual camera tilts (seen in the plots for both camera users 1 and 2) do not result in a false positive for the correlated camera tilt, as there is no match between the two users.

Fig. 7 shows the detection of a few generic interesting events. Also, as an example of usage of the detected events, the automatic generation of a video summary is illustrated, wherein each detected interesting event is included.

D. Detecting Specific Events

In this section we evaluate the performance of the specific event detector presented in Section II.C. For this purpose, we used Dataset 1. Each song from that dataset contained one instance of each of the two specific events considered in this paper. The test results and the performance measures are given in Table II. Fig. 8 shows an example in which the specific event S2 is detected from two camera users recording one of the test songs.

IV. DISCUSSION

Recent works such as [4] have shown that camera motion in user generated video content can be used to detect higher-level events such as interesting events occurring in individual videos. In this work we extend this idea to multiple camera users. Also, we analyze data provided by motion sensors, which directly give information about the camera motion. In contrast, content-based motion estimation is significantly restricted by the presence of moving objects in the scene and by the presence of motion blur in the video frames. Furthermore, the use of motion sensors avoids the computational burden of content analysis. For example, one second of HD video at 25 frames/second contains about 46 million pixels, whereas in our experiments, one second of accelerometer and compass recordings contains only 50 sensor measurements.

When considering the analysis of user generated video content, the proposed methods are among the very first to exploit the sensor modality. To our knowledge, only [9] has previously considered this modality for the purpose of video summarization. Therein, they consider group rotations, indicated by correlations in the compass rotation, which is mentioned but not described in further detail. The setup used in [9] consists of mobile phone cameras attached to the clothes of some users, whereas in our work we rely solely on user generated content without any restrictions on the camera use.


Figure 8. Detection of specific event S2: audio-class change from “Music” to “No music”, audio similarity change from “High” to “Low”, users perform correlated panning. The figure plots the compass data (orientations in degrees) of User 1 and User 2 over time.

TABLE II. Experimental results for the detection of specific events.

Specific event    Ground truth    TP   FP    P     R     F
S1                     19         15    4   0.79  0.79  0.79
S2                     19         16    3   0.84  0.84  0.84
Total                  38         31    7   0.82  0.82  0.82

We would like to mention that other sensors can be employed in multimodal event detection. For example, a tri-axial gyroscope can provide a finer resolution of the rotational movements of the camera as compared with the compass and the accelerometer. However, a gyroscope by itself cannot provide a reference with respect to actual space coordinates and thus cannot replace any of these two sensors.

Due to the availability of information describing the actual scene and the recording activity from different modalities (e.g., motion sensors describe the device attitude, the audio content describes the audio scene, the video content describes the visual scene, etc.), the multimodal analysis can provide results with higher-level semantics (such as the ones presented in Section II.C). The multi-user availability and the multimodal data availability are jointly exploited in this work and constitute one of its main contributions.

V. CONCLUSIONS

In this paper we present a set of methods for event detection in user generated video, which exploit both the analysis of multimodal data and the joint analysis of multiple recordings (i.e., recorded by multiple camera users). We discover the presence of interesting events by detecting correlated camera panning and tilting. We then combine these generic events with other events from audio content analysis, and we detect some more specific events with higher-level semantics, which are valid for the particular domain of live music shows. The contribution of this work is twofold: first, the application of the sensor modality to the detection of camera moves; second, the exploitation of multiple recordings from more than one camera user. Our experiments show that the proposed methods perform well in detecting the various considered events.

REFERENCES

[1] MPEG-7, ISO/IEC 15938, Multimedia content description interface. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=34228

[2] M. Broilo, A. Basso, E. Zavesky, F.G.B. De Natale, “Unsupervised Event Segmentation of News Content with Multimodal Cues”, Proc. of the 3rd Int. Workshop on Automated Information Extraction in Media Production (AIEMPro), 2010.

[3] N. Rasiwasia, J.C. Pereira, E. Coviello, G. Doyle, “A New Approach to Cross-Modal Multimedia Retrieval”, Proc. of ACM Multimedia, 2010.

[4] G. Abdollahian, C.M. Taskiran, Z. Pizlo, E.J. Delp, “Camera Motion-Based Analysis of User Generated Video”, IEEE Transactions on Multimedia, Vol. 12, No. 1, 2010.

[5] C. Poppe, S. De Bruyne, R. Van de Walle, “Generic Architecture for Event Detection in Broadcast Sports Video”, Proc. of the 3rd Int. Workshop on Automated Information Extraction in Media Production (AIEMPro), 2010.

[6] P. Shrestha , P.H.N. de With, H. Weda, M. Barbieri, E.H.L. Aarts, “Automatic Mashup Generation from Multiple-camera Concert Recordings”, Proc. of ACM Multimedia, 2010.

[7] L. Kennedy, M. Naaman, “Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos”, Int. Conf. on World Wide Web, 2009.

[8] S. Velipasalar, L.M. Brown, A. Hampapur, “Specifying, Interpreting and Detecting High-Level, Spatio-Temporal Composite Events in Single and Multi-Camera Systems”, Proc. of the 2006 Conf. on Computer Vision and Pattern Recognition Workshop.

[9] X. Bao, R.R. Choudhury, “MoVi: Mobile Phone based Video Highlights via Collaborative Sensing”, 8th Int. Conf. on Mobile Systems, Applications and Services, 2010.

[10] A.J. Cheng, F.E. Lin, Y.H. Kuo, W.H. Hsu, “GPS, Compass, or Camera?: Investigating Effective Mobile Sensors for Automatic Search-Based Image Annotation”, Proc. of ACM Multimedia, 2010.

[11] S. Järvinen, J. Peltola, J. Plomp, O. Ojutkangas, I. Heino, J. Lahti, J. Heinilä, “Deploying Mobile Multimedia Services for Everyday Experience Sharing”, IEEE Int. Conf. on Multimedia and Expo, 2009.

[12] R. Jain, P. Sinha, “Content Without Context is Meaningless”, Proc. of ACM Multimedia, 2010.

[13] A. Carlier, V. Charvillat, W.T. Ooi, R. Grigoras, G. Morin, “Crowdsourced automatic zoom and scroll for video retargeting”, Proc. of ACM Multimedia, pp. 201-210, 2010.

[14] E.C. Fieller, H.O. Hartley, E.S. Pearson, “Tests for Rank Correlation Coefficients”, Biometrika, Vol. 44, No. 3/4. (Dec., 1957), pp. 470-481.

[15] T. Lahti, “On Low Complexity Techniques for Automatic Speech Recognition and Automatic Audio Content Analysis”, Doctoral Thesis, Tampere University of Technology, 2008.

[16] C.J. Van Rijsbergen, “Information retrieval”, 2nd ed. Butterworth-Heinemann Newton. ISBN: 0408709294, 1979.


