
Multimed Tools Appl (2018) 77:9093–9110 DOI 10.1007/s11042-017-4807-6

Movie summarization using bullet screen comments

Shan Sun¹ · Feng Wang¹ · Liang He¹

Feng Wang (corresponding author): [email protected]
Shan Sun: [email protected]

¹ Shanghai Key Laboratory of Multidimensional Information Processing, Dept. of Computer Science and Technology, East China Normal University, 3663 N. Zhongshan Rd., Shanghai, China

Received: 16 August 2016 / Revised: 17 March 2017 / Accepted: 3 May 2017 / Published online: 15 May 2017
© Springer Science+Business Media New York 2017

Abstract Automatic movie summarization helps users to skim a movie in an efficient way. However, it is challenging because it requires the computer to automatically understand the movie content and the users' opinions. Most previous works rely on the movie data itself without considering the opinions of the audience. In this paper, a novel approach for automatic movie summarization is presented by exploring a new type of user-generated data, i.e. bullet screen comments, which allow the audience to comment on the movie in a real-time manner. The number of comments on a movie segment shows how excited the audience is, while the content of the comments includes the concepts (e.g. the characters and the scenes) that interest the audience. In our approach, given a movie, bullet screen comments are utilized to select the most-commented segments as candidate highlights. The candidates are then scored based on the number and the content of the bullet screen comments; visual diversity is also considered in the scoring process. Finally, the subset of candidates that achieves the highest score is selected to compose a summary. Our experiments carried out on movies of different genres have shown the effectiveness of the proposed approach.

Keywords Movie summarization · Multimedia content understanding · Bullet screen comments · User-generated data

1 Introduction

With the rapid growth of movie production, users hardly have enough time to follow all new movies and to review their favoured classic movies. Therefore, movie summarization is of vital importance because it provides a brief overview of the movie content.



For users who want to review the classic portions of their favorite movie, a summary of the movie is a good choice because it costs little time. A qualified summary is supposed to capture the most interesting content of the movie and match the excitement of its viewers. More specifically, movie summarization can be considered successful as long as the resulting summary reflects the moments that excite the viewers and provides the most important pieces of information in the most complete manner, while eliminating possible redundancies. Manually summarizing movies leads to successful summaries, but it is intellectually expensive and time-consuming. Therefore, there is a great demand for automatic movie summarization. However, this is a challenging task since it requires the computer to automatically understand the movie's semantic content and decide which parts of the movie are relatively more important to users.

Most previous works on automatic movie summarization rely on the movie data alone. The summary is generated based on features extracted from the movies [10, 14, 28, 32] or from the movie scripts [29]. However, relying only on the movie data can hardly capture the audience's attention. Since movie summarization is audience-oriented, the audience's opinions and feelings are more important for producing a qualified summary. In [7] and [13], features extracted from tweets are exploited to represent the viewers' degree of enthusiasm and to generate summaries of sports matches. Compared with features extracted from movies, this kind of feature derived from user-generated data better depicts the audience's opinions and excitement, which are good hints for highlight detection.

In recent years, bullet screens have become an emerging craze on online video sites, especially in China and Japan. Examples of videos along with bullet screen comments can easily be found on video sites such as [2] and [22]. As shown in Fig. 1, a bullet screen allows the audience's typed comments to zoom across the screen like bullets. The audience can post comments on the screen in real time whenever they view something interesting. Meanwhile, they can view the comments from other users about the same clip. Therefore, bullet screen comments give the audience a feeling of "real-time interaction" and are thus adopted by more and more video websites. Since bullet screen comments provide rich information about movies from the perspective of the audience, they can be utilized to help the computer automatically understand the thoughts of users. First, the number of comments on a movie clip indicates its importance or attractiveness, and can be used to detect the highlights. Second, the content of the comments includes the audience's feelings towards a specific movie clip and the important concepts (e.g. the characters and the scenes) which interest the audience. In this paper, we extract features from this new type of user-generated data, i.e. bullet screen comments, to represent the movie content and the audience's opinions for movie summarization.

Fig. 1 Examples of bullet screen and comment content. a A Chinese example of bullet screen [39]. b A Japanese example of bullet screen [40]. c English examples of bullet screen comments


With the temporal enthusiasm degree and the attitude of the audience towards a movie provided by the bullet screen comments, we present our summarization approach. In our approach, multiple features are derived from the bullet screen comments to provide an effective movie representation. In addition, visual features are included to ensure the visual diversity of the resulting summary. The approach is divided into two steps. In the first step, candidate segments are detected based on the number of bullet screen comments. In the second step, subsets composed of a few candidate segments are scored according to the content of the comments, the number of comments, and the visual information. Specifically, the score is composed of four parts: the attraction score H(s), the content saliency score I(s), the content consistency score D(s), and the visual diversity score V(s).

The main contribution of our work is to introduce features derived from the users' perspective into movie highlight detection and summarization. Besides, a comprehensive movie summarization approach is proposed, including: 1) a novel candidate highlight detection algorithm which picks up the most attractive segments using the bullet screen comments; 2) a scoring strategy which measures the importance of the movie segments by considering both the bullet screen comments and the visual information.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents our algorithm for candidate highlight detection by utilizing the bullet screen comments. Our approaches for highlight scoring and summary generation are proposed in Section 4. The experimental results are presented in Section 5. Finally, Section 6 concludes this paper and discusses future work.

2 Related work

Video summarization has been extensively studied for decades. Researchers have proposed many approaches under the stimulus of its wide applications and of competitions such as the MediaEval Emotional Impact of Movies task and the TRECVID video summarization task. Different features have been explored for video content representation and understanding, such as color [34, 37], motion [6], text [1, 10, 18, 26], audio [10, 12, 36], and objects and events [33, 38].

In one of the earliest works [37], color histograms are employed to select representative frames by clustering. In [34], YUV color histograms are employed to generate video summaries. Similarly, color moments are used as features in [20] and [21] for key-frame selection in video summarization. Besides static information, motion is another good hint for selecting important segments. In [6], the cumulative moving average (CMA) and the preceding segment average (PSA) statistical metrics are used to indicate gradual and sudden changes in the instantaneous velocity of moving targets. A finite state machine is proposed to model the trajectory of a moving target and is used for detecting transitions that represent the changes from one state to another when initiated by a triggering event or condition. These low-level features can describe the visual content of videos. However, they cannot grasp the semantic information of the videos.

To semantically understand the video content, high-level features such as objects and events are exploited in video summarization. "Video synopsis" is performed in [27] based on object detection and segmentation. It provides a short video representation while preserving the essential activities of the original video: the activity in the video is condensed into a shorter period by simultaneously showing multiple activities, even when they originally occurred at different times. In [16], normal activity patterns and key regions in a scene are automatically modeled for creating suitable summary videos of abnormal activities in surveillance videos. In [33], a key-pose based video summarization system is proposed.


Video objects are extracted and represented by shape descriptors, and representative frames are then selected as key-poses by clustering these shapes. In [38], scene detectors and descriptors are used to extract scenes as features for shot categorization, and a bag-of-visual-words (BoVW) model is built using shot classes as vocabularies for summarizing movie genres.

For documentary videos and movies, text and speech contain much semantic information about the video content and can be used for summarization. In [31], the video titles are used for summarization. Agnihotri et al. [1] search the closed-caption text for cue words to generate summaries of talk shows. Pickering et al. [26] use key entity detection to identify important keywords in closed-caption text. In [32], speech transcripts are explored for automatic video summarization. In [29], the movie script is used to exploit the movie structure, and a content attention model is built based on character interaction features.

To comprehensively describe and understand video content, an effective way is to combine multiple features. Panagiotakis et al. [24] combine audio and visual features to select key frames for video summarization. In [18], visual and text features are combined to discover the story of an egocentric video for summarization. In [12], a mechanism using audio and video analysis is proposed to produce video summaries. The generated summaries are coupled with intelligible audio and can therefore be considered as an alternative to the original videos. In [36], audio-video affective features are exploited for summarizing home videos. In [35], an event-driven approach using video-tag features is proposed for web video summarization. First, the tags associated with each video are localized into its shots. Then the relevance of the shots with respect to the event query is estimated by matching the shot-level tags with the query. Finally, a set of key-shots with high relevance scores is identified to compose the summary. In [10], aural, visual, and textual features are extracted, and a saliency model is built for summary generation.

Most existing approaches rely mainly on the videos themselves for content understanding and summarization, while the opinions of the audience are not considered. Movies are a special kind of video: they are well edited and have complete storylines. Their summarization requires a more semantic understanding of the whole movie. Furthermore, movies are audience-oriented, so the opinions of the audience should be taken into account when selecting interesting movie segments for summarization. Some previous works attempt to explore human perception models. In [25], a visual attention model is proposed to bridge the semantic gap between low-level descriptors and high-level human concepts. In [19], user attention is defined through multiple sensory perceptions, and a set of models for visual and aural attention is used to summarize videos. In [9], a feature-aggregation based visual saliency detection mechanism is proposed to extract key frames in video summarization. Similarly, in [11], an aggregated saliency map computed from various features is used for extracting key frames. To reduce the computational cost of human attention modeling while summarizing videos, Ejaz et al. [8] present an efficient visual attention model based on dynamic visual saliency detection. These works attempt to understand user attention on videos. However, the attention models are built only on predefined rules (e.g. assuming that visually salient or motion-intensive segments are more attractive). The viewers' actual perceptions of the videos are not used.

In [4], viewers' perceptions are incorporated into video summarization. The saliency scores of frames are first assigned by video experts, and then updated with user-generated profiles. In this way, personalised video summaries are generated; however, additional annotation is needed. Some works utilize existing user data for video summarization. In [13], time-stamped tweets are used to summarize World Cup matches. Two kinds of summaries are generated: frequency-based summaries are composed of the video clips with the highest tweet counts, while content-based summaries are composed of the video clips whose tweets contain a given query.


Though the count and the content of tweets are both taken into consideration in this work, they are not combined to generate summaries containing both the hottest and the most user-interesting clips. Furthermore, the content-based summarization approach is based on Lucene, a search tool; the meaning of the tweet content cannot be understood through basic query search. Similarly to [13], in [30] a tool called Statler is introduced to summarize media using short messages. Using Twitter messages sent at the time of widely televised events, Statler summarizes media by showing the segmentation, trending topics, level of interest, and tweet geo-locations. Statler puts the emphasis on the presentation of the final summaries, which are composed of various statistics such as geographic information; again, the meaning of the short-message content is not taken into consideration. In [7], features extracted from tweets are utilized to represent the viewers' degree of enthusiasm and to generate sports highlight videos. However, only statistical information is used as features, including the frequency of exclamation marks, the number of tweets containing a repeated expression, and the number of retweets. These features cannot represent the semantic information either. By employing tweets, these approaches can catch the users' excitement over the videos. Multiple intuitive features are extracted from the tweets, such as the tweet volume [30], related tweet content retrieved by a given query [13], or abnormal words [7]. However, the semantic meaning of the content is not analyzed during summarization. In this paper, we employ a new kind of user-generated data, i.e. bullet screen comments, and analyze the content of the comments for movie summarization. Multiple features are derived from the bullet screen comments to ensure that the summary is informative, visually diverse, and faithful to the original movie.

3 Candidate segment detection

Our approach consists of two steps, as illustrated in Fig. 2. First, given a movie, we detect the candidate highlight segments by considering the statistical distribution of the bullet screen comments along the time axis. Second, a subset of candidates is selected for summary generation through a scoring strategy based on the number and the content of the comments (Section 4). As shown at the bottom of Fig. 2, the scoring strategy takes the number and content of bullet screen comments as well as the visual features into consideration. It comprises the attraction score, the content saliency score, the content consistency score, and the visual diversity score. In this section, we introduce the approach for candidate segment detection, shown at the top of Fig. 2.

In our approach, the detection of candidate segments is based on the number of comments on each segment. From empirical observation, the densely commented segments of a movie are usually exciting and likely to be highlights. Therefore, we filter out the less commented segments and keep the relatively important and attractive ones using Algorithm 1. The goal of Algorithm 1 is to find a proper number of candidate segments that receive relatively more comments. To this end, given a movie, we count the comments in every time unit; for flexibility, we choose one second as the time unit in this paper. This calculation yields the comment statistics of the movie; Fig. 3a shows an example. Since the statistics curve around each peak is relatively smooth, moving-average strategies (such as those introduced in [5]) are not used. As can be observed in the figure, the number of comments varies over time. The peaks of the statistics curve indicate relatively more comments at the corresponding time points, which means the audience is more interested in these time points. Thus, the corresponding movie segments are more likely to be highlights.



For candidate highlight detection, Algorithm 1 first finds the highest peak of the statistics curve and selects a segment around the peak as a candidate. Two thresholds, a minimum δmin and a maximum δmax, are set to limit the duration of a segment. δmax is set to avoid the situation where a long movie part containing several attractive plots is put into one segment. Similarly, δmin ensures that a segment is not too short to describe a highlight completely. The values of δmin and δmax are determined from observations of successful manual movie trailers and movie highlight summaries. The mean comment count n̄ is calculated as a threshold to differentiate relatively more commented time units from less commented ones. From experimental observation, the mean value is a qualified stop-condition threshold for generating a suitable number of candidate segments. Once a candidate segment is determined, it is cut out and put into the candidate set. The above peak finding and segment selection process repeats until all qualified candidates are detected.
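A minimal sketch of this peak-growing procedure is given below. The function name, the order in which left and right neighbors are absorbed, and the padding used to reach the minimum duration are assumptions not fixed by Algorithm 1; only the per-second counting, the mean-value stop condition, and the erase-and-repeat loop come from the description above.

```python
import numpy as np

def detect_candidates(comment_times, movie_len, d_min=5, d_max=20):
    """Peak-based candidate segment detection (sketch of Algorithm 1)."""
    t = np.clip(np.asarray(comment_times, dtype=int), 0, movie_len - 1)
    counts = np.bincount(t, minlength=movie_len).astype(float)
    mean = counts.mean()                 # stop-condition threshold n-bar
    candidates = []
    while counts.max() > mean:           # repeat until no peak exceeds the mean
        peak = int(counts.argmax())
        lo = hi = peak
        # Grow the segment around the peak while a neighboring count stays
        # above the mean and the duration stays below d_max.
        while hi - lo + 1 < d_max:
            left = counts[lo - 1] if lo > 0 else -1.0
            right = counts[hi + 1] if hi + 1 < len(counts) else -1.0
            if max(left, right) <= mean:
                break
            if left >= right:
                lo -= 1
            else:
                hi += 1
        # Pad up to the minimum duration if the grown segment is too short.
        while hi - lo + 1 < d_min:
            grew = False
            if lo > 0:
                lo, grew = lo - 1, True
            if hi - lo + 1 < d_min and hi + 1 < len(counts):
                hi, grew = hi + 1, True
            if not grew:
                break
        candidates.append((lo, hi))      # segment [start, end] in seconds
        counts[lo:hi + 1] = -1.0         # erase the segment from the curve
    return candidates
```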

Figure 3 illustrates an example of candidate segment detection using our approach. Given a movie, we first compute the comment statistics, as shown in Fig. 3a. Then the peak of the statistics is found, annotated as point A. Starting from the peak, a candidate segment is found and then erased, as shown in Fig. 3c and d. This candidate selection process proceeds until all qualified segments are found, as shown in Fig. 3e.


Fig. 2 Schematization of our approach. The first step is candidate segment detection based on the number of comments. The second step is subset scoring and summary generation. The score is composed of the attraction score H(s), the content saliency score I(s), the content consistency score D(s), and the visual diversity score V(s)

The segments detected by the above approach may include broken shots. Considering the visual fluency of the candidate segments, we adjust the boundary of each segment. There are a number of shot boundary detection approaches, such as color histogram based approaches and macroblock classification based approaches [15]. For simplicity, we employ color information for shot boundary detection. First, we detect the shot boundaries based on RGB color histograms.


Fig. 3 An example of candidate segment detection using Algorithm 1. a The comment-count statistics of a movie. b Corresponding to step 2 of Algorithm 1, the peak A is found at the 60th second. c We expand from point A to its neighbors to find a segment until the number of comments is less than the mean value; Segment 1 is found. d Segment 1 is erased from the movie and a new peak B is waiting to be found. e This process proceeds until all qualified segments are found. f Each segment is extended to its nearest shot boundary

[Figure 3 comprises six panels (a)–(f): bar charts of the number of comments against time in seconds, each with the mean value drawn as a horizontal line. Panel (b) marks peak A at (60, 16); panels (c)–(e) show Segment 1, peak B, and the further numbered segments; panel (f) shows a candidate segment extended to the surrounding shot boundaries.]


Then, for the start (end) boundary of a segment, we extend it to the nearest shot boundary, as shown at the bottom left of Fig. 3f. In this way, the candidate segments are aligned with the shot boundaries.
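A sketch of this boundary alignment is given below. The paper only specifies RGB-histogram based shot detection and snapping to the nearest boundary; the histogram bin count, the 0.5 difference threshold, and the function names are illustrative assumptions.

```python
import cv2
import numpy as np

def shot_boundaries(video_path, thresh=0.5):
    """Detect shot boundaries from frame-to-frame RGB histogram distance.

    The threshold on the L1 distance of normalized histograms is assumed,
    not taken from the paper.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    prev_hist, boundaries, idx = None, [0.0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8-bin RGB histogram, L1-normalized.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).flatten()
        hist /= hist.sum() + 1e-9
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > thresh:
            boundaries.append(idx / fps)   # boundary time in seconds
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

def align_segment(seg, boundaries):
    """Extend a (start, end) segment (seconds) outward to its nearest shot boundaries."""
    start = max((b for b in boundaries if b <= seg[0]), default=seg[0])
    end = min((b for b in boundaries if b >= seg[1]), default=seg[1])
    return (start, end)
```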

4 Highlight scoring and summary generation

Movie summarization can be considered successful as long as the associated summary reflects the moments that excite the viewers and provides the most important pieces of information in the most complete manner, while eliminating possible redundancies. These criteria were thoroughly taken into consideration during the design and implementation of the proposed scoring and summarization approaches.

The summary of a movie is composed of a number of segments selected from the candidate segment set C; in other words, the summary is made up of a subset s of C. To select the optimal subset s̃, a scoring strategy is employed based on the "rules of successful movie summarization" discussed above. As shown in Fig. 2, the score for a subset s (a set of segments) is composed of four terms, which are respectively related to the number of bullet screen comments in s, the saliency of the content in s, the consistency of s with the whole movie in content, and the visual diversity of the segments in s. In our approach, the duration of the resulting summary is limited to L = 5 minutes.

4.1 Attraction score

Viewers tend to comment when they find something interesting or surprising. Therefore, highlight segments are commented on more than others because they attract more attention from the viewers. Based on this observation, the number of comments is utilized to measure how attractive a segment is. For each segment i, we count the number of bullet screen comments count(i) in it. The attraction score of a segment subset s, reflecting the viewers' degree of excitement, is then calculated as

H(s) = \sum_{i \in s} count(i)    (1)

where count(i) is the number of bullet screen comments in segment i, and the attraction score H(s) is the total number of comments over the segment subset s.
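As a minimal sketch (the function name and the shape of the inputs are assumptions), Eq. (1) amounts to:

```python
def attraction_score(subset, comment_counts):
    """H(s) in Eq. (1): total bullet screen comments over the segments in s.

    comment_counts[i] holds count(i) for candidate segment i; `subset` is a
    collection of segment indices.
    """
    return sum(comment_counts[i] for i in subset)
```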

4.2 Content saliency score

Besides the number of bullet screen comments, the content of the comments is also important: it includes useful information about the movie content that attracts the viewers. Therefore, the content of the comments is utilized to measure the importance of the corresponding movie segment. However, unlike other offline comments, bullet screen comments are typed in real time without careful organization; the comment texts are casual and written in an extremely informal way. To filter the comments and keep only the relatively more informative words, our approach uses only the nouns in the comments. They contain the main concepts of movies, such as the characters, the objects, and the scenes.


The extraction of nouns proceeds as follows. First, the bullet screen comments are preprocessed with word tokenization and part-of-speech tagging. Then all nouns are extracted, which include the concepts that the viewers are interested in. In the remainder of this paper, the words of the comments specifically mean these nouns. After the extraction, the content saliency score of each subset s is calculated according to the words it contains. We use Term Frequency - Inverse Document Frequency (TF-IDF) [32] to weight each word w as
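The tokenize-tag-filter pipeline could be sketched as follows. The paper does not name a tagger, and its comments are largely Chinese or Japanese, which would need a tokenizer such as jieba or MeCab; NLTK on English text is used here purely for illustration.

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded

def extract_nouns(comments):
    """Keep only the nouns from a list of comment strings (English sketch)."""
    nouns = []
    for text in comments:
        tokens = nltk.word_tokenize(text)
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith('NN'):   # NN, NNS, NNP, NNPS
                nouns.append(word.lower())
    return nouns
```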

tfidf(w, i) = \frac{(k_1 + 1)\, n_{i,w}}{k_1 \left[ (1 - k_2) + k_2 \frac{L_i}{AL} \right] + n_{i,w}} \cdot \log \frac{N}{n_w}    (2)

where n_{i,w} is the number of occurrences of word w in segment i, L_i is the number of words in segment i, AL is the average number of words per segment in the video sequence, k_1 and k_2 are tuning parameters, N is the total number of segments in the video sequence, and n_w is the number of segments that contain word w. The tuning parameter k_1 determines the sensitivity of the first term to changes in the value of the term frequency n_{i,w}. If k_1 = 0, this term reduces to a counting function which is 1 if word w occurs in the segment and 0 otherwise; if k_1 is large, the term becomes nearly linear in n_{i,w}. Because the segments do not all contain the same number of words, normalization is needed, and k_2 controls the degree of normalization; if k_2 = 0, there is no normalization. Following [32], the tuning parameters are set to k_1 = 2 and k_2 = 0.75.
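Eq. (2) translates directly into code; a minimal sketch (variable names are ours) is:

```python
import math

def tfidf(n_iw, L_i, AL, N, n_w, k1=2.0, k2=0.75):
    """Eq. (2): BM25-style weight of word w in segment i.

    n_iw: occurrences of w in segment i;  L_i: words in segment i;
    AL: average words per segment;  N: number of segments;
    n_w: number of segments containing w.
    """
    tf = ((k1 + 1.0) * n_iw) / (k1 * ((1.0 - k2) + k2 * L_i / AL) + n_iw)
    idf = math.log(N / n_w)
    return tf * idf
```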

With the saliency scores of individual words, the saliency score of a segment is calculated as the sum of the scores of the words it contains, as in (3).

T(i) = \sum_{w \in i} tfidf(w, i)    (3)

The saliency score of a subset s is calculated as the sum of the scores of all segments in it, as in (4).

I(s) = \sum_{i \in s} T(i)    (4)

In summary, the content saliency score I(s) of a subset s is calculated by summing the saliency scores T(i) of all its segments, where each segment saliency score T(i) is calculated from the words the segment contains, as in (3).
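Equations (3) and (4) can be sketched as below, reusing the tfidf helper from the previous sketch; the `stats` bundle holding AL, N and the per-word segment counts n_w is our own packaging, not the paper's.

```python
from collections import Counter

def segment_saliency(words_in_segment, stats):
    """T(i) in Eq. (3): sum of tf-idf weights over the nouns of segment i."""
    counts = Counter(words_in_segment)
    L_i = len(words_in_segment)
    return sum(tfidf(n_iw, L_i, stats['AL'], stats['N'], stats['n_w'][w])
               for w, n_iw in counts.items())

def content_saliency(subset, T):
    """I(s) in Eq. (4): sum of segment saliency scores over the subset."""
    return sum(T[i] for i in subset)
```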

4.3 Content consistency score

Unlike other videos, movies are well edited and have storylines. Therefore, a qualified movie summary is supposed not only to include the most attractive segments, but also to keep content consistency with the original movie. An example of inconsistency is a summary composed of segments only about the leading character, without any supporting roles; such a summary cannot sufficiently describe the original story of the movie. Therefore, we take content consistency into account so that the resulting summary completely covers the movie content.

In this section, we introduce the content consistency scoring scheme based on the keywords of the bullet screen comments. As presented in Section 4.2, the bullet screen comments provide information about the main concepts (such as the characters, objects, and scenes) in the current movie plot. The keywords of a segment can be found by ranking the main concepts and selecting the top ones; the keywords of the whole movie can be found in the same way. A subset is scored high when the ratio of occurrence of each keyword is close to its ratio in the whole movie.


In this way, the content of the resulting summary is consistent with the original movie in terms of the occurrences of the main concepts and the storytelling.

Supposing that we select the top J concepts as the keywords of a movie, the content consistency score is calculated as

D(s) = \log_{10} \left( \prod_{0 < j < k < J} (N_j + N_k)^{\,g\!\left(\frac{n_j}{N_j},\, \frac{n_k}{N_k}\right)} \right)    (5)

where n_j and n_k are the numbers of occurrences of the j-th and the k-th keywords in the subset s respectively, N_j and N_k are the numbers of their occurrences in the whole movie, and g(·) is a normalization function, defined in (6), that reflects how proportionate the occurrence ratios of two keywords are and serves as the exponent in (5). J is tuned according to the diversity of the comments: if the comments on a segment are diverse enough, J is set larger.

g(x, y) = \begin{cases} x/y, & x \le y \\ y/x, & x > y \end{cases}    (6)

Through g(·), the proportionate degree of two keywords is evaluated; multiplying by (N_j + N_k) assigns a weight to each pair of keywords. All pairs of keywords are evaluated when calculating the content consistency score D(s) in (5).
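Taking the log of the product in (5) turns it into a sum of weighted terms, which the following sketch computes directly. Keyword indices are 0-based here, and g is guarded so that two keywords absent from the subset (n_j = n_k = 0) count as perfectly proportionate; both choices are our assumptions.

```python
import math
from itertools import combinations

def g(x, y):
    """Eq. (6): ratio of the smaller proportion to the larger one."""
    if x == y:
        return 1.0          # also covers the 0/0 case for absent keywords
    return x / y if x < y else y / x

def consistency_score(n, N):
    """D(s) in Eq. (5).

    n[j]: occurrences of the j-th keyword in the subset s;
    N[j]: its occurrences in the whole movie (assumed > 0).
    """
    score = 0.0
    for j, k in combinations(range(len(N)), 2):
        # log10 of the product in (5) becomes a sum of weighted log terms.
        score += g(n[j] / N[j], n[k] / N[k]) * math.log10(N[j] + N[k])
    return score
```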

4.4 Visual diversity score

Besides the features derived from the bullet screen comments, visual features are also employed for summary generation. Considering that viewers do not like to watch a summary with visually similar segments, visual diversity is measured in our scoring scheme. To this end, we detect and describe the keypoints of frames (sampled every 10 frames) using Scale-Invariant Feature Transform (SIFT) descriptors [17]. The keypoints are then clustered with the k-means algorithm to construct a visual dictionary, and each keypoint is assigned to its nearest visual word. Thus, a segment can be represented by a histogram over the dictionary. Finally, a distance is calculated to measure the visual difference between two segments; in our work, the Manhattan distance is utilized.

For a given segment i, we calculate the Manhattan distances from i to the other segments and sum them up as

dist(i) = \sum_{j \neq i} dist(i, j)    (7)

The visual diversity score of a subset s is then calculated as

V(s) = \sum_{i \in s} dist(i)    (8)

Note that other visual descriptors and distance metrics can also be applied to our visualdiversity scoring framework.
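A sketch of the SIFT bag-of-words pipeline and Eqs. (7)-(8) is given below, assuming OpenCV with SIFT support and scikit-learn. The dictionary size k = 200, the L1 normalization of histograms, and all function names are our assumptions; the paper only fixes SIFT, k-means, and the Manhattan distance.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(segment_frame_lists, k=200):
    """One bag-of-visual-words histogram per segment.

    segment_frame_lists: one list of BGR frames (every 10th frame) per
    segment; assumes at least some frames yield SIFT keypoints.
    """
    sift = cv2.SIFT_create()
    descs, owners = [], []
    for s, frames in enumerate(segment_frame_lists):
        for frame in frames:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            _, d = sift.detectAndCompute(gray, None)
            if d is not None:
                descs.append(d)
                owners.extend([s] * len(d))
    all_desc = np.vstack(descs)
    # Build the visual dictionary and assign each keypoint to a word.
    words = KMeans(n_clusters=k, n_init=4).fit_predict(all_desc)
    hists = np.zeros((len(segment_frame_lists), k))
    for s, w in zip(owners, words):
        hists[s, w] += 1
    # L1-normalize so segments of different lengths are comparable.
    return hists / (hists.sum(axis=1, keepdims=True) + 1e-9)

def diversity_scores(hists):
    """dist(i) in Eq. (7): summed Manhattan distances to all other segments."""
    d = np.abs(hists[:, None, :] - hists[None, :, :]).sum(axis=2)
    return d.sum(axis=1)
```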

4.5 Summary generation

With the scoring schemes presented above, for each subset s we obtain the attraction score H(s), the content saliency score I(s), the content consistency score D(s), and the visual diversity score V(s). L2 normalization is performed on H(s), I(s), D(s), and V(s) to map them to values between 0 and 1.


For summary generation, we find the optimal subset s̃ which satisfies

\tilde{s} = \arg\max_{s} \left( \omega_1 H(s) + \omega_2 I(s) + \omega_3 D(s) + \omega_4 V(s) \right)    (9)

where ω_1, ω_2, ω_3 and ω_4 are the weights of the scores, and ω_1 + ω_2 + ω_3 + ω_4 = 1. In order to find the optimal subset, we score all subsets of the candidate segment set C whose total duration is within L and select the one with the highest score. Finally, the movie summary is generated by concatenating all the segments in s̃.
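The exhaustive search over subsets could be sketched as follows; the interface (score functions taking a subset, durations in seconds) is our assumption. Enumeration is exponential in the number of candidates and is tractable here only because Algorithm 1 keeps the candidate set small.

```python
from itertools import combinations

def best_subset(candidates, durations, H, I, D, V, L=300,
                w=(0.4, 0.2, 0.2, 0.2)):
    """Exhaustive search for the optimal subset s-tilde of Eq. (9).

    candidates: segment indices; durations[i]: length of segment i in
    seconds; H, I, D, V: already L2-normalized score functions of a subset;
    L: total duration limit (300 s = 5 minutes); w: the weights ω_1..ω_4.
    """
    best, best_score = None, float('-inf')
    for r in range(1, len(candidates) + 1):
        for s in combinations(candidates, r):
            if sum(durations[i] for i in s) > L:
                continue   # enforce the summary duration limit
            score = w[0]*H(s) + w[1]*I(s) + w[2]*D(s) + w[3]*V(s)
            if score > best_score:
                best, best_score = s, score
    return best
```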

5 Experiments

5.1 Dataset

Our experiments are carried out on a dataset of six movies along with their bullet screen comments, as shown in Fig. 4. We downloaded them from the website bilibilijj [3]. The movies differ in genre and language. The details of the movies are listed in Table 1; the last column lists the number of bullet screen comments for each movie.

5.2 Experiment settings

The experiments are performed according to the candidate segment detection approach introduced in Section 3 and the highlight scoring approach proposed in Section 4. In the implementation, for candidate segment detection, the duration of each segment is limited to between 5 and 20 seconds. After H(s), I(s), D(s) and V(s) are calculated using (1) to (8), they are normalized to [0, 1] using L2 normalization. Then the optimal subset is selected using (9). We set ω_1 to 0.4 and ω_2, ω_3, ω_4 to 0.2, since from our observation the viewers' excitement and opinions are more important in movie summarization. For the content consistency score, since the movies in our dataset have diverse comments, the top 5 concepts are selected as the keywords of each movie (J = 5); this value can also be adjusted according to the diversity of the incoming data. The duration of the summary is limited to at most L = 5 minutes.

Three alternative approaches are used for comparison:

– The Uniform approach employs even sampling for summary generation, which is a simple but effective approach. The movie is equally divided into 16 portions, and one segment is selected from each portion. In the implementation, we select the first 15 seconds of each portion, and the summary is composed of these selected segments. According to the TRECVID 2007 BBC Rushes Summarization task [23], the performance of simple sampling is within the top five in all evaluation measures among the twenty-four submissions to that task. This approach is also standard and widely used in video summarization, e.g. the DEFT approach in [32].


Fig. 4 Movies in our dataset. a Insidious: Chapter 3; b Mad Max 4: Fury Road; c Samon Akagi & Tomohisa Yurine; d Who Am I: No System Is Safe; e The Dark Knight; f Ultraviolet


Table 1 The details of the movies used in our experiments

Movie | Duration | Genre                    | Language | Release time   | Number of comments
(a)   | 97 min   | Horror                   | English  | June 2015      | 12000
(b)   | 120 min  | Action, Fantasy, War     | English  | May 2015       | 8000
(c)   | 110 min  | Comedy, Mystery          | Japanese | January 2015   | 8000
(d)   | 101 min  | Crime, Thriller          | German   | September 2014 | 8000
(e)   | 152 min  | Superhero, Action        | English  | July 2008      | 10268
(f)   | 88 min   | Science fiction, Action  | English  | March 2006     | 5906


– The Random approach selects segments at random from the candidates introduced in Section 3, and the selected segments are assembled into the movie summary. This approach is similar to the RAND approach in [32]. In our work, it is designed to show the effectiveness of our scoring scheme (the second step in Fig. 2).

– The Visual approach is based on visual diversity. The score of each segment is calculated using (7), and the top-scoring segments are selected to compose the movie summary subject to the limit L on the total duration. This approach is designed to show the effectiveness of the bullet screen comments compared with visual information. It is similar to the classical cluster-based approaches, which aim to find visually different segments to compose the summary.

Since there is no objective ground truth against which to evaluate the results of movie summarization, we conduct a subjective evaluation. 22 judges are invited to watch and score the summaries. They are mainly young people of 20-40 years old, the main consumers of online videos, split evenly between male and female. The judges have different preferences in movie genres, and have watched all or some of the movies in our dataset. For each movie, we show them the four summaries (one from our approach and three from the compared approaches). To avoid possible bias during judging, the four summaries are displayed in a random order, and the orders differ across movies. The judges are asked to score the four summaries according to the following criteria, which refer to the subjective evaluations in related work [10, 32] and were finalized through a user study of the items users value most.

1. Attraction: Most segments of the movie summary are attractive.
2. Fluency: The summary is fluent, with no or few half shots and half sentences.
3. Coverage: The summary contains most highlights that deeply impressed the judges when they viewed the original movie.
4. Consistency: The summary keeps high consistency with the original movie in terms of the characters and the key storylines.
5. Inspiration: For judges who have not watched the original movie, the summary interests them in watching it.
6. Informativeness: For judges who have not watched the original movie, the summary provides enough information for them to grasp the movie content.


The scores are limited to integer values from 1 to 5, where 1 is the worst and 5 is the best. The judges are asked to score {Attraction, Fluency, Coverage, Consistency} or {Fluency, Inspiration, Informativeness} according to whether they have watched the movie or not. On average, for each movie, 17 judges had watched the movie and 5 had not. That is, {Attraction, Coverage, Consistency} are scored by 17 judges per movie on average, {Inspiration, Informativeness} are scored by 5 judges, and {Fluency} is scored by all 22 judges.

5.3 Experimental results and discussion

The statistical results of our subjective evaluation are shown in Table 2. As can be observed, our approach significantly outperforms the compared approaches (Uniform, Random and Visual) on all criteria. The Visual algorithm performs second best, while the performance of Uniform and Random is comparable, with Random slightly outperforming Uniform. Our approach performs especially well on the Attraction and Coverage criteria: the judges agree that the summaries generated by our approach cover the most attractive segments of the movies. This can be attributed to the utilization of the bullet screen comments, which provide rich information about the viewers' opinions. The number of comments indicates the audience's degree of excitement about a movie segment, while the content of the comments includes the main concepts that interest the audience. By considering both the number and the content of the comments, we can detect most highlights of the movies from the perspective of the audience. Furthermore, our scoring scheme based on the bullet screen comments and the visual information is effective in selecting movie highlights and generating qualified movie summaries. The resulting summaries are enjoyable and informative since they keep high consistency with the content of the original movies. Those who have not watched the movies can easily grasp the storylines and the highlights by watching the summaries.

In order to assess the statistical significance of the differences in the results between our approach and the others, we list in Table 3 the p-values obtained using a two-sided t-test. The p-value is the probability of incorrectly rejecting the hypothesis that two populations have identical mean values; the smaller the p-value, the more different the two populations.

Table 2 The statistical results of our subjective evaluation. μ̂: estimated mean; σ̂: standard deviation

             | Attraction    | Fluency       | Coverage
             | μ̂      σ̂    | μ̂      σ̂    | μ̂      σ̂
Uniform      | 3.294   1.030 | 3.227   0.985 | 3.225   0.911
Random       | 3.451   1.040 | 3.356   0.942 | 3.294   1.011
Visual       | 3.559   0.896 | 3.364   0.967 | 3.412   0.979
Our Approach | 4.186   0.941 | 3.818   0.915 | 3.922   0.951

             | Consistency   | Inspiration   | Informativeness
             | μ̂      σ̂    | μ̂      σ̂    | μ̂      σ̂
Uniform      | 3.373   1.024 | 2.933   1.015 | 3.033   0.928
Random       | 3.431   1.020 | 3.233   1.040 | 3.100   1.125
Visual       | 3.520   0.992 | 3.233   0.858 | 3.300   0.988
Our Approach | 3.931   0.882 | 3.500   1.137 | 3.567   0.971


Table 3 The p-values of the two-sided t-test for the statistical significance of the score differences between our approach and the compared approaches

                | vs. Uniform | vs. Random | vs. Visual
Attraction      | 7.880E-10   | 3.132E-07  | 2.194E-06
Fluency         | 8.338E-07   | 6.941E-05  | 1.120E-04
Coverage        | 2.521E-07   | 8.652E-06  | 2.119E-04
Consistency     | 4.424E-05   | 2.356E-04  | 1.993E-03
Inspiration     | 4.634E-02   | 3.472E-01  | 3.099E-01
Informativeness | 3.376E-02   | 9.093E-02  | 2.962E-01

As shown in Table 3, our approach generates populations different from those of the other approaches, since most p-values are less than 0.05 (the conventional cutoff below which the difference between two populations is considered significant). This means the differences between the scores of our approach and the others are statistically significant. For the Inspiration and Informativeness criteria, the differences are not significant. This is because these two criteria are only scored by the judges who had not watched the original movies; for each movie, only a few judges (5 per movie on average, as mentioned above) were involved in scoring them, which results in a relatively large standard deviation. In addition, the judges have different preferences: for instance, if one is afraid of watching thrillers, a summary will hardly interest him/her in watching the movie. Overall, compared with the baseline approaches, our approach produces more qualified movie summaries by including the most attractive segments from the perspective of the audience and keeping content consistency with the original movies.
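For reference, the significance test amounts to a two-sided independent-samples t-test on the judges' scores. A minimal sketch with SciPy follows; the score lists are illustrative values, not the paper's raw data.

```python
from scipy import stats

# Illustrative judge scores (1-5) for one criterion and one movie.
ours = [5, 4, 4, 5, 3, 4, 5, 4]
uniform = [3, 3, 4, 2, 3, 4, 3, 3]

# ttest_ind is two-sided by default.
t_stat, p_value = stats.ttest_ind(ours, uniform)
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.4f}")
```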

Usually, bullet screen comments lag a little behind the content the viewers want to comment on, since the viewers need time to type. However, in our experiments, no delay is observed between the selected segments and the original highlights of the movies. The reasons lie in two aspects. First, the comment texts are short and take little time to type, which makes the delay negligible. Second, even when a comment is slightly delayed, it still lies in the same shot as the content being commented on; as presented in Section 3, we extend the start boundary of every candidate segment backward to its nearest shot boundary. As a result, the commented clips are included in the selected segments despite the small delay.

6 Conclusion and future work

We have presented a novel approach for movie summarization by exploring a new feature, i.e. the bullet screen comments. The user-generated bullet screen comments provide useful information about the movie highlights from the perspective of the audience, which helps generate qualified movie summaries for users. In this paper, we derive different features based on the content analysis of bullet screen comments for movie summarization. Our experiments show that our approach is general enough for movies of different genres and languages. In future work, we will investigate more effective features from the bullet screen comments, such as abnormal words, the on-screen locations of comments, and sentiment analysis. Besides, generating different summaries according to the viewers' preferences can be studied with the help of bullet screen comments.


Acknowledgments The work described in this paper was supported by the National Natural Science Foundation of China (No. 61375016) and the Science and Technology Commission of Shanghai Municipality (No. 16511102702).

References

1. Agnihotri L, Devera KV, McGee T, Dimitrova N (2001) Summarization of video programs based on closed captions. Storage Retriev Media Databases. doi:10.1117/12.410973
2. Bilibili. http://www.bilibili.com/
3. Bilibilijj. http://www.bilibilijj.com/
4. Darabi K, Ghinea G (2016) User-centred personalised video abstraction approach adopting SIFT features. Multimed Tools Appl. doi:10.1007/s11042-015-3210-4
5. Dimoulas CA, Symeonidis AL (2015) Syncing shared multimedia through audiovisual bimodal segmentation. IEEE MultiMed 22(3):26–42. doi:10.1109/MMUL.2015.33
6. Dogra DP, Ahmed A, Bhaskar H (2015) Smart video summarization using mealy machine-based trajectory modelling for surveillance applications. Multimed Tools Appl 75(11):6373–6401. doi:10.1007/s11042-015-2576-7
7. Doman K, Tomita T, Ide I, Deguchi D, Murase H (2014) Event detection based on twitter enthusiasm degree for generating a sports highlight video. In: Proceedings of the 18th ACM international conference on multimedia. doi:10.1145/2647868.2654973
8. Ejaz N, Mehmood I, Baik SW (2013) Efficient visual attention based framework for extracting key frames from videos. Signal Process Image Commun 28(1):34–44. doi:10.1016/j.image.2012.10.002
9. Ejaz N, Mehmood I, Baik SW (2014) Feature aggregation based visual attention model for video summarization. Comput Electric Eng 40(3):993–1005. doi:10.1016/j.compeleceng.2013.10.005
10. Evangelopoulos G, Zlatintsi A, Potamianos A, Maragos P, Rapantzikos K, Skoumas G, Avrithis Y (2013) Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Trans Multimed 15(7):1553–1568. doi:10.1109/TMM.2013.2267205
11. Ferreira L, Silva Cruz LA, Assuncao P (2015) A generic framework for optimal 2D/3D key-frame extraction driven by aggregated saliency maps. Signal Process Image Commun 39:98–110. doi:10.1016/j.image.2015.09.005
12. Furini M, Ghini V (2006) An audio-video summarization scheme based on audio and video analysis. In: Proceedings of IEEE consumer communications & networking. doi:10.1109/CCNC.2006.1593230
13. Hannon J, McCarthy K, Lynch J, Smyth B (2011) Personalized and automatic social summarization of events in video. In: Proceedings of the 16th international conference on intelligent user interfaces. doi:10.1145/1943403.1943459
14. Li Y, Lee SH, Yeh CH, Kuo CC (2006) Techniques for movie content analysis and skimming: tutorial and overview on video abstraction techniques. IEEE Signal Process Mag 23(2):79–89. doi:10.1109/MSP.2006.1621451
15. Lin W, Sun MT, Li H, Chen Z, Li W, Zhou B (2012) Macroblock classification for video applications involving motions. IEEE Trans Broadcast 58(1):34–46. doi:10.1109/TBC.2011.2170611
16. Lin W, Zhang Y, Lu J, Zhou B, Wang J, Zhou Y (2015) Summarizing surveillance videos with local-patch-learning-based abnormality detection, blob sequence optimization, and type-based synopsis. Neurocomputing 155:84–98. doi:10.1016/j.neucom.2014.12.044
17. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE international conference on computer vision. doi:10.1109/ICCV.1999.790410
18. Lu Z, Grauman K (2013) Story-driven summarization for egocentric video. In: Proceedings of IEEE conference on computer vision and pattern recognition. doi:10.1109/CVPR.2013.350
19. Ma YF, Hua XS, Lu L, Zhang HJ (2005) A generic framework of user attention model and its application in video summarization. IEEE Trans Multimed 7(5):907–919. doi:10.1109/TMM.2005.854410
20. Mei S, Guan G, Wang Z, He M, Hua X, Feng D (2014) L2,0 constrained sparse dictionary selection for video summarization. In: International conference on multimedia & expo. doi:10.1109/ICME.2014.6890179
21. Mei S, Guan G, Wang Z, Wan S, He M, Feng D (2015) Video summarization via minimum sparse reconstruction. Pattern Recogn 48(2):522–533. doi:10.1016/j.patcog.2014.08.002
22. Niconico. http://www.nicovideo.jp/
23. Over P, Smeaton AF, Kelly P (2007) The TRECVID 2007 BBC rushes summarization evaluation pilot. In: Proceedings of the TRECVID workshop on video summarization. doi:10.1145/1290031.1290032
24. Panagiotakis C, Doulamis A, Tziritas G (2009) Equivalent key frames selection based on iso-content principles. IEEE Trans Circ Syst Vid Technol 19(3):447–451. doi:10.1109/TCSVT.2009.2013517
25. Peng J, Lin Q (2007) Keyframe-based video summary using visual attention clues. IEEE Multimed 17(2):64–73. doi:10.1109/MMUL.2009.65
26. Pickering MJ, Wong L, Rüger SM (2003) ANSES: summarization of news video. Image Vid Retriev. doi:10.1007/3-540-45113-7_42
27. Pritch Y, Rav-Acha A, Peleg S (2008) Nonchronological video synopsis and indexing. IEEE Trans Pattern Anal Mach Intell 30(11):1971–1984. doi:10.1109/TPAMI.2008.29
28. Rapantzikos K, Evangelopoulos G, Maragos P, Avrithis YS (2007) An audio-visual saliency model for movie summarization. In: IEEE 9th workshop on multimedia signal processing. doi:10.1109/MMSP.2007.4412882
29. Sang J, Xu C (2010) Character-based movie summarization. In: Proceedings of the 18th ACM international conference on multimedia. doi:10.1145/1873951.1874096
30. Shamma D, Kennedy L, Churchill E (2010) Summarizing media through short-messaging services. In: ACM conference on computer supported cooperative work
31. Song Y, Vallmitjana J, Stent A, Jaimes A (2015) TVSum: summarizing web videos using titles. In: Proceedings of the IEEE conference on computer vision and pattern recognition. doi:10.1109/CVPR.2015.7299154
32. Taskiran CM, Pizlo Z, Amir A, Ponceleon D, Delp EJ (2006) Automated video program summarization using speech transcripts. IEEE Trans Multimed 8(4):775–791. doi:10.1109/TMM.2006.876282
33. Tian Z, Xue J, Lan X, Li C, Zheng N (2014) Object segmentation and key-pose based summarization for motion video. Multimed Tools Appl 72(2):1773–1802. doi:10.1007/s11042-013-1488-7
34. Uchihashi S, Foote J, Girgensohn A, Boreczky J (1999) Video manga: generating semantically meaningful video summaries. In: Proceedings of the 7th ACM international conference on multimedia. doi:10.1145/319463.319654
35. Wang M, Hong R, Li G, Zha Z, Yan S, Chua T (2012) Event driven web video summarization by tag localization and key-shot identification. IEEE Trans Multimed 14(4):975–985. doi:10.1109/TMM.2012.2185041
36. Xiang X, Kankanhalli MS (2011) Affect-based adaptive presentation of home videos. In: Proceedings of the 19th ACM international conference on multimedia. doi:10.1145/2072298.2072370
37. Yeung MM, Yeo BL (1997) Video visualization for compact presentation and fast browsing of pictorial content. IEEE Trans Circ Syst Vid Technol 7(5):771–785. doi:10.1109/76.633496
38. Zhou H, Hermans T, Karandikar A, Rehg J (2010) Movie genre classification via scene categorization. In: Proceedings of ACM international conference on multimedia. doi:10.1145/1873951.1874068

Shan Sun received the B.S. degree in computer science and technology from East China Normal University in 2014. She is now a Ph.D. student at the Institute of Computer Applications, East China Normal University. Her research interests include computer vision and multimedia applications.


Feng Wang received his Ph.D. in Computer Science from the Hong Kong University of Science and Technology in 2007 and his B.Sc. from Fudan University, China, in 2001. Before joining East China Normal University as an associate professor in the Dept. of Computer Science and Technology, he was a research fellow at City University of Hong Kong and at Institut Eurecom, France. His research interests include multimedia information retrieval, pattern recognition, and IT in education.

Liang He received his B.S. and Ph.D. degrees in computer science from East China Normal University. He is a Professor in the Department of Computer Science and Technology, East China Normal University, where he presently serves as the Director of the Department. His current research interests include knowledge processing, user behavior analysis, and context-aware computing.