
USING PRIOR KNOWLEDGE TO ASSESS RELEVANCE IN SPEECH SUMMARIZATION

Ricardo Ribeiro1 and David Martins de Matos2

1 ISCTE/IST/L2F - INESC ID Lisboa
2 IST/L2F - INESC ID Lisboa

Rua Alves Redol, 9, 1000-029 Lisboa, Portugal

ABSTRACT

We explore the use of topic-based, automatically acquired prior knowledge in speech summarization, assessing its influence across several term weighting schemes. All information is combined using latent semantic analysis as a core procedure to compute the relevance of the sentence-like units of the given input source. Evaluation is performed using the self-information measure, which tries to capture the informativeness of the summary in relation to the summarized input source. The similarity of the output summaries produced by the several approaches is also analyzed.

Index Terms— Speech processing, Text processing, Natural languages, Information retrieval, Singular value decomposition

1. INTRODUCTION

Automatic summarization aims to present to the user, in a concise and comprehensible manner, the most relevant content of one or more information sources [1]. Several difficulties arise when addressing this problem, but one of the most important is how to assess the relevant content. While in text summarization up-to-date systems make use of complex information, such as syntactic [2], semantic [3], and discourse [4, 5] information, either to assess relevance or to reduce the length of the output, speech summarization systems rely on lexical, acoustic/prosodic, structural, and discourse features to extract the most relevant sentence-like units from the input (although, concerning discourse features, only simple approaches are used, such as the given/new information score presented by Maskey and Hirschberg [6] or the detection of words like decide or conclude described by Murray et al. [7]). In fact, spoken language summarization is often considered more difficult than text summarization [8, 9, 10]: speech recognition errors, disfluencies, and problems in the accurate identification of sentence boundaries affect the input source to be summarized. However, although current work strives to explore features beyond the spoken document transcriptions, such as structural, acoustic/prosodic, and spoken language features [6, 11], shallow text summarization approaches like Latent Semantic Analysis (LSA) [12] and Maximal Marginal Relevance (MMR) [13] achieve comparable performance [11].

In the current work, we explore the use of prior knowledge to assess relevance in the context of the summarization of broadcast news. SSNT [14] is a system for the selective dissemination of multimedia contents, based on an automatic speech recognition (ASR) module that generates the transcriptions later used by the topic segmentation, topic indexing, and title&summarization modules. Our proposal consists in using news stories previously indexed by topic (by the topic indexing module) to identify the most relevant passages of upcoming news stories. This approach models relevance within topics and its evolution over time, by continuously updating the corresponding set of background news stories (which can be seen as topic-directed semantic spaces), trying to approximate the way humans perform summarization tasks [15, 16, 17]. This is in accordance with Endres-Niggemeyer [15, 16, 17], who identified the following aspects as the most important ones when observing the summarization process from a human perspective: knowledge-based text analysis; combined online processing; and task orientation and selective interpretation.

We address these issues using a statistical approach: the topic-directed semantic spaces establish the knowledge base used by the summarization process; term weighting strategies are used to combine global and local weights (which may also include information like the recognition confidence coefficient to address spoken language summarization specificities); and, finally, the use of LSA [18] provides the selective interpretation step, where terms (keywords), as described by Endres-Niggemeyer, are the most obvious cues for attracting attention.

This document is organized as follows: the next section briefly describes the broadcast news processing system; section 3 presents our approach to speech summarization, detailing how prior knowledge is included in the summarization process; section 4 presents the obtained results; before concluding, we address similar work.

2. SELECTIVE DISSEMINATION OF MULTIMEDIA CONTENTS

SSNT [14] is a system for the selective dissemination of multimedia contents, working primarily with Portuguese broadcast news services. The system is based on an ASR module that generates the transcriptions used by the topic segmentation, topic indexing, and title&summarization modules. Preceding the speech recognition module, an audio preprocessing module classifies the audio according to several criteria: speech/non-speech, speaker segmentation and clustering, gender, and background conditions. The ASR module, with an average word error rate of 24% [14], greatly influences the performance of the subsequent modules. The topic segmentation module [19], reported as based on clustering, groups transcribed segments into stories. The algorithm relies on a heuristic derived from the structure of the news services: each story starts with a segment spoken by the anchor. This module achieved an F-measure of 68% [14]. The main problem identified by the authors was boundary deletion, a problem which impacts the summarization task. Topic indexing [19] is based on a hierarchically organized thematic thesaurus provided by the broadcasting company. The hierarchy has 22 thematic areas on the first level, for which the module achieved a correctness of 91.4% [20, 14]. Batista et al. [21] inserted a module for recovering punctuation marks, based on maximum entropy models, after the ASR module. The punctuation marks addressed were the “full stop” and “comma”, which provide the sentence units necessary for use in the title&summarization module. This module achieved an F-measure of 56% and a SER (Slot Error Rate) of 0.74.

3. SPEECH SUMMARIZATION USING PRIOR KNOWLEDGE TO ASSESS RELEVANCE

As introduced above, human summarization is a knowledge-based task: Endres-Niggemeyer [16] observes that both generic and specific knowledge are used to understand information and assess its relevance.

3.1. Selecting Background Knowledge

The first step of our summarization process consists in the selection of the background knowledge that will improve the detection of the most relevant passages of the news stories to be summarized. This is done by identifying and selecting the news stories from the previous n days that match the topics of the news story to be summarized. The procedure, as currently implemented, relies on the results produced by the topic indexing module, but a clustering approach could also be followed.
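
To make the step concrete, the sketch below renders this selection as a simple filter over an archive of indexed stories. It is a minimal illustration under assumed data structures (a Story record with a broadcast date, a set of topic labels, and its sentence-like units), not the SSNT implementation itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Story:
    """Hypothetical representation of a news story indexed by topic."""
    broadcast_date: date
    topics: set[str]          # labels assigned by the topic indexing module
    sentences: list[str]      # transcribed sentence-like units

def select_background(archive: list[Story], target: Story, n_days: int) -> list[Story]:
    """Return the stories from the previous n days that share at least one topic
    with the story to be summarized (the topic-directed semantic space)."""
    window_start = target.broadcast_date - timedelta(days=n_days)
    return [
        s for s in archive
        if window_start <= s.broadcast_date < target.broadcast_date
        and s.topics & target.topics
    ]
```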

3.2. Using LSA to Combine the Available Information

The summarization process we implemented is characterized by the use of LSA as a core procedure to compute the relevance of the extracts (sentence-like units) of the given input source.

3.2.1. LSA

LSA is based on the singular value decomposition (SVD) of the m × n term-by-sentence frequency matrix M (only possible if m ≥ n): M = UΣVᵀ, where U is an m × n matrix of left singular vectors, Σ is the n × n diagonal matrix of singular values, and V is the n × n matrix of right singular vectors.

The idea behind the method is that the decomposition captures the underlying topics of the document (identifying the most salient ones) by means of term co-occurrence (the latent semantic analysis), and identifies the best representative sentence-like units of each topic. Summary creation can then be done by picking the best representatives of the most relevant topics, according to a defined strategy.

Our implementation follows the original ideas of Gong and Liu [12] and those of Murray, Renals, and Carletta [9] for solving dimensionality problems, using the GNU Scientific Library (http://www.gnu.org/software/gsl/) for matrix operations.
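
As an illustration of the Gong and Liu selection strategy, the sketch below ranks sentences with NumPy rather than the GNU Scientific Library used in our implementation; the Murray, Renals, and Carletta refinements for dimensionality are not shown, and all names are illustrative.

```python
import numpy as np

def lsa_rank_sentences(M: np.ndarray, n_sentences: int) -> list[int]:
    """Pick one sentence per latent topic, in the spirit of Gong and Liu [12].

    M is a terms-by-sentences matrix; returns the column indices of the
    sentences selected for the n_sentences most salient topics."""
    # Reduced SVD: the rows of Vt are the right singular vectors (latent topics),
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    selected: list[int] = []
    for topic in Vt[:n_sentences]:
        # The sentence with the largest (absolute) weight best represents the topic.
        ranked = np.argsort(-np.abs(topic))
        # Skip sentences already chosen for a more salient topic.
        chosen = next(j for j in ranked if j not in selected)
        selected.append(int(chosen))
    return selected
```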

3.2.2. Including the Background Knowledge

Prior to the application of the SVD, it is necessary to create the term-by-sentence matrix that represents the input source. It is in this step that the previously selected news stories are taken into consideration: instead of using only the news story to be summarized to create the term-by-sentence matrix, both the selected stories and the news story are used. The matrix is defined as

\[
\begin{bmatrix}
a^{d_1}_{1,1} & \cdots & a^{d_1}_{1,s} & \cdots & a^{d_D}_{1,1} & \cdots & a^{d_D}_{1,s} & a^{n}_{1,1} & \cdots & a^{n}_{1,s} \\
\vdots & & & & & & & & & \vdots \\
a^{d_1}_{T,1} & \cdots & a^{d_1}_{T,s} & \cdots & a^{d_D}_{T,1} & \cdots & a^{d_D}_{T,s} & a^{n}_{T,1} & \cdots & a^{n}_{T,s}
\end{bmatrix}
\tag{1}
\]

where $a^{d_k}_{i,j}$ represents the weight of term $t_i$, $1 \le i \le T$ ($T$ is the number of terms), in sentence $s^{d_k}_j$, $1 \le j \le s$, of document $d_k$, $1 \le k \le D$ ($D$ is the number of selected documents); and $a^{n}_{i,l}$, $1 \le l \le s$, are the elements associated with the news story $n$ to be summarized. The elements of the matrix are defined as $a_{ij} = L_{ij} \times G_{ij}$, where $L_{ij}$ is a local weight and $G_{ij}$ is a global weight. To form the resulting summary, only sentences corresponding to the last columns of the matrix (which correspond to the sentences of the news story to be summarized) are selected.
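
A minimal sketch of this construction is given below, assuming each story is available as a list of sentences and that the element weighting $a_{ij} = L_{ij} \times G_{ij}$ is supplied as a callable (any global information being captured via closure); the function also returns the indices of the last columns, from which the summary sentences are selected. Names and signatures are illustrative, not those of our implementation.

```python
import numpy as np
from typing import Callable

def build_matrix(
    background: list[list[str]],   # each selected story as a list of sentences
    target: list[str],             # sentences of the news story to summarize
    weight: Callable[[str, list[str]], float],  # a_ij = L_ij x G_ij
) -> tuple[np.ndarray, list[int]]:
    """Build the terms-by-sentences matrix of eq. (1): background sentences
    first, target-story sentences in the last columns."""
    all_sentences = [s for story in background for s in story] + target
    vocabulary = sorted({t for s in all_sentences for t in s.lower().split()})
    M = np.array([[weight(term, s.lower().split()) for s in all_sentences]
                  for term in vocabulary])
    target_columns = list(range(len(all_sentences) - len(target), len(all_sentences)))
    return M, target_columns
```

A plain term-frequency weight, for instance, would be `lambda t, s: float(s.count(t))`.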

3.2.3. Term Weighting Schemes

In order to better assess the relevance of the prior knowledge, we explored different weighting schemes, all of them both with and without prior knowledge. The following local weights $L_{ij}$ were used:
• frequency of term $i$ in sentence $j$;
• summation of the recognition confidence scores of all occurrences of term $i$ in sentence $j$;
• summation of the smoothed recognition confidence scores (using the trigrams $t_{k-2}t_{k-1}t_k$, $t_{k-1}t_kt_{k+1}$, and $t_kt_{k+1}t_{k+2}$, where $t_k$ is an occurrence of term $i$ in a sentence) of all occurrences of term $i$ in sentence $j$.
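
The three local weights can be sketched as follows, assuming each sentence is available as a list of (token, recognition confidence) pairs from the ASR output; the smoothing shown (averaging the confidences over the tokens covered by the three trigram windows) is one plausible reading of the scheme, not necessarily the exact computation used.

```python
def local_frequency(term: str, sentence: list[tuple[str, float]]) -> float:
    """L_ij: frequency of the term in the sentence."""
    return float(sum(1 for token, _ in sentence if token == term))

def local_confidence(term: str, sentence: list[tuple[str, float]]) -> float:
    """L_ij: sum of the recognition confidence scores of all occurrences of the term."""
    return sum(conf for token, conf in sentence if token == term)

def smoothed_confidence(k: int, sentence: list[tuple[str, float]]) -> float:
    """Assumed smoothing: mean confidence over the tokens of the trigram windows
    t_{k-2} t_{k-1} t_k, t_{k-1} t_k t_{k+1}, and t_k t_{k+1} t_{k+2}."""
    window = [conf for i, (_, conf) in enumerate(sentence) if abs(i - k) <= 2]
    return sum(window) / len(window)

def local_smoothed_confidence(term: str, sentence: list[tuple[str, float]]) -> float:
    """L_ij: sum of the smoothed confidence scores of all occurrences of the term."""
    return sum(smoothed_confidence(k, sentence)
               for k, (token, _) in enumerate(sentence) if token == term)
```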

Global weights are entropy-based (more specifically, they are based on the self-information measure), considering the probability of a term $t_i$ characterizing a sentence $s$:

\[
P(t_i) = \frac{\sum_j C(t_i, s_j)}{\sum_i \sum_j C(t_i, s_j)}
\tag{2}
\]

where

\[
C(t_i, s_j) =
\begin{cases}
1 & t_i \in s_j \\
0 & t_i \notin s_j
\end{cases}
\tag{3}
\]

The global weights $G_{ij}$ were defined as follows:
• 1 (not using a global weight);
• $-\log(P(t_i))$;
• same as the previous, but calculating $P(t_i)$ using $C_R(t_i, s_j)$ (eq. 4) instead of $C(t_i, s_j)$; used only when the recognition confidence scores are used in the local weight;
• same as the previous, but using the smoothed recognition confidence scores as described for the local weights; used only when the smoothed recognition confidence scores are used in the local weight.

\[
C_R(t_i, s_j) =
\begin{cases}
\dfrac{\sum \text{recognition confidence scores of the occurrences of } t_i \text{ in } s_j}{\text{frequency of } t_i \text{ in } s_j} & t_i \in s_j \\
0 & t_i \notin s_j
\end{cases}
\tag{4}
\]
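
A minimal sketch of the global weights in the same style is given below: the self-information variant uses the binary indicator $C$ of eq. (3), and the confidence-based variant replaces it with $C_R$ of eq. (4). An element of the matrix in eq. (1) would then be obtained by multiplying a local weight by the corresponding global weight; all names are illustrative.

```python
import math

def term_probabilities(vocabulary: list[str], sentences: list[list[str]]) -> dict[str, float]:
    """P(t_i) of eq. (2): sentence memberships of t_i over all memberships,
    using the binary indicator C(t_i, s_j) of eq. (3)."""
    counts = {t: sum(1 for s in sentences if t in s) for t in vocabulary}
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def global_weight(term: str, probabilities: dict[str, float]) -> float:
    """G_ij = -log P(t_i); a global weight of 1 corresponds to switching this off."""
    return -math.log(probabilities[term])

def term_probabilities_cr(vocabulary: list[str],
                          sentences: list[list[tuple[str, float]]]) -> dict[str, float]:
    """Same as above, but with C_R of eq. (4): the mean recognition confidence of the
    term's occurrences in the sentence instead of a 0/1 indicator."""
    def c_r(term: str, sentence: list[tuple[str, float]]) -> float:
        scores = [conf for token, conf in sentence if token == term]
        return sum(scores) / len(scores) if scores else 0.0
    masses = {t: sum(c_r(t, s) for s in sentences) for t in vocabulary}
    total = sum(masses.values())
    return {t: m / total for t, m in masses.items()}
```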

4. EXPERIMENTAL SETUP

Our goal was to understand whether the inclusion of prior knowledge increases the relevant content of the resulting summaries.

4.1. Data

In this experiment, we tried to summarize the news stories of two episodes of a Portuguese news program, broadcast on 2008/06/09 and 2008/06/29 (their properties are detailed in the top part of table 1). As the source of prior knowledge, we used, for the 2008/06/09 show, the episodes from 2008/06/01 to 2008/06/08, and, for the 2008/06/29 show, the episodes from 2008/06/20 to 2008/06/28.

                                      2008/06/09   2008/06/29
  News shows to be summarized
    # of stories                              28           24
    # of sentences                           443          476
    # of transcribed tokens                 7733         6632
    # of different topics                     45           47
    Average # of topics per story           3.54         4.92
  News shows used as prior knowledge
    # of stories                             205          268
    # of sentences                          4127         4553
    # of transcribed tokens                62025        71470
    # of different topics                    158          195
    Average # of topics per story           3.42         4.10

Table 1. Properties of the news shows to be summarized (top) and of the news shows used as prior knowledge (bottom).

4.2. Evaluation

As previously mentioned, our goal is to understand whether the content of the summaries produced using prior knowledge is more informative than that of the summaries produced by the baseline LSA methods. To do so, we use self-information ($I(t_i) = -\log(P(t_i))$) as the measure of the informativeness of a summary: this meets our objectives and facilitates an automatic assessment of the improvements. The probability of each term $t_i$ is estimated using its distribution in the document to be summarized.
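
A minimal sketch of this measure follows; the probabilities are estimated from the term distribution of the document to be summarized, and the aggregation over the summary (a sum of the self-information of its term occurrences) is an assumption made for the illustration.

```python
import math
from collections import Counter

def summary_informativeness(summary: list[list[str]], document: list[list[str]]) -> float:
    """Informativeness of a summary as the total self-information, I(t) = -log P(t),
    of its term occurrences, with P(t) estimated from the term distribution of the
    document to be summarized. The sum is an assumed aggregation."""
    doc_terms = [t for sentence in document for t in sentence]
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return sum(-math.log(counts[t] / total)
               for sentence in summary for t in sentence if counts[t])
```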

In this experiment, we generated three-sentence summaries of each news story (which corresponds to a compression rate of about 5% to 10%), similar to what is done, for example, on the CNN website (http://edition.cnn.com).

4.3. Results

The obtained results are depicted in figure 1, which shows the overlap between the output summaries, and figure 2, which shows the global performance of all approaches.

As can be seen in figure 1, the use of a global weight increases the differences between using and not using prior knowledge. This is due to the fact that the global weight for each term is computed using not only the news story to be summarized but also the prior knowledge: that also influences the elements of the term-by-sentence matrix related to the news story, whereas when not using global weights, the prior knowledge does not affect those elements. This is shown in the top graphic of figure 1, where all summaries have more overlap.

Figure 2 shows the number of times each method produced the most informative summary. In general, the methods using a global weight achieved the worst results. This can be explained in part by the noise present both in the transcriptions and in the topic indexing phase, since when not using global weights, the use of prior knowledge performs considerably better.

5. RELATED WORK

Furui [10] identifies three main approaches to speech summarization: sentence extraction-based methods, sentence compaction-based methods, and combinations of both. Sentence extractive methods comprise, essentially, methods like LSA [9, 22], MMR [23, 9, 22], and feature-based methods [9, 6, 7, 22]. Sentence compaction methods are based on word removal from the transcription, with recognition confidence scores playing a major role [24]. A combination of these two types of methods was developed by Kikuchi et al. [25], where summarization is performed in two steps: first, sentence extraction is done through feature combination; second, compaction is done by scoring the words in each sentence, after which a dynamic programming technique is applied to select the words that will remain in the sentence to be included in the summary.

Closer to our work is the way topic-adapted summarization is explored by Chatain et al. [26]: one of the components of the sentence scoring function is an n-gram linguistic model computed from given data. However, of the two experiments performed, one using talks and the other using broadcast news, only the one using talks used a topic-adapted linguistic model, and the data used for the adaptation consisted of the papers in the conference proceedings of the talk to be summarized. CollabSum [27] also explores document proximity in single-document text summarization: by means of a clustering algorithm, documents related to the document to be summarized are grouped in a cluster and used to build a graph that reflects the relationships between all sentences in the cluster; informativeness is then computed for each sentence and redundancy is removed to generate the summary.

Fig. 1. Summary overlap: with (B, C, D, E)/without (A, F) RC (C, D: smoothed); with (D, E, F)/without (A, B, C) PK; with (bottom)/without (top) GW. RC: recognition confidence; PK: prior knowledge; GW: global weight.

6. FUTURE WORK

Future work should address the influence of the modules that precede summarization on its performance, and assess the correlation of human preferences in summarization with the informativeness measure used in the current work.

7. REFERENCES

[1] I. Mani, Automatic Summarization, John Benjamins Publishing Company, 2001.

Fig. 2. Comparative results using all variations of the evaluation measure. A: plain LSA; B: plain LSA + GW; C: plain LSA + GW + RC; D: plain LSA + GW + RC smoothed; E: plain LSA + RC; F: plain LSA + RC smoothed; G through L: the same, but with PK.

[2] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, “Beyond SumBasic: Task-focused summarization and lexical expansion,” Information Processing and Management, vol. 43, pp. 1606–1618, 2007.

[3] R. I. Tucker and K. Sparck Jones, “Between shallow and deep: an experiment in automatic summarising,” Tech. Rep. 632, University of Cambridge Computer Laboratory, 2005.

[4] H. Daume III and D. Marcu, “A Noisy-Channel Model for Document Compression,” in Proceedings of the 40th Annual Meeting of the ACL. 2002, pp. 449–456, ACL.

[5] S. Harabagiu and F. Lacatusu, “Topic Themes for Multi-Document Summarization,” in SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2005, pp. 202–209, ACM.

[6] S. Maskey and J. Hirschberg, “Comparing Lexical, Acoustic/Prosodic, Structural and Discourse Features for Speech Summarization,” in Proceedings of the 9th EUROSPEECH - INTERSPEECH 2005. 2005, pp. 621–624, ISCA.

[7] G. Murray, S. Renals, J. Carletta, and J. Moore, “Incorporating Speaker and Discourse Features into Speech Summarization,” in Proceedings of the HLT/NAACL. 2006, pp. 367–374, ACL.

[8] K. R. McKeown, J. Hirschberg, M. Galley, and S. Maskey, “From Text to Speech Summarization,” in 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. 2005, vol. V, pp. 997–1000, IEEE.

[9] G. Murray, S. Renals, and J. Carletta, “Extractive Summarization of Meeting Records,” in Proc. of the 9th EUROSPEECH - INTERSPEECH 2005. 2005, pp. 593–596, ISCA.

[10] S. Furui, “Recent Advances in Automatic Speech Summarization,” in Proc. of the 8th Conference on Recherche d'Information Assistee par Ordinateur. 2007, Centre des Hautes Etudes Internationales d'Informatique Documentaire.

[11] G. Penn and X. Zhu, “A critical reassessment of evaluation baselines for speech summarization,” in Proceedings of ACL-08: HLT. 2008, pp. 470–478, ACL.

[12] Y. Gong and X. Liu, “Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis,” in SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001, pp. 19–25, ACM.

[13] J. Carbonell and J. Goldstein, “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries,” in SIGIR 1998: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1998, pp. 335–336, ACM.

[14] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. P. Neto, “A Prototype System for Selective Dissemination of Broadcast News in European Portuguese,” EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007.

[15] B. Endres-Niggemeyer, Summarizing Information, Springer, 1998.

[16] B. Endres-Niggemeyer, “Human-style WWW summarization,” Tech. Rep., University for Applied Sciences, Department of Information and Communication, 2000.

[17] B. Endres-Niggemeyer, “SimSum: an empirically founded simulation of summarizing,” Information Processing and Management, vol. 36, no. 4, pp. 659–682, 2000.

[18] T. K. Landauer, P. W. Foltz, and D. Laham, “An Introduction to Latent Semantic Analysis,” Discourse Processes, vol. 25, 1998.

[19] R. Amaral and I. Trancoso, “Improving the Topic Indexation and Segmentation Modules of a Media Watch System,” in Proc. of INTERSPEECH 2004 - ICSLP, 2004, pp. 1609–1612.

[20] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. P. Neto, “Automatic vs. Manual Topic Segmentation and Indexation in Broadcast News,” in Proc. of the IV Jornadas en Tecnologia del Habla, 2006.

[21] F. Batista, D. Caseiro, N. J. Mamede, and I. Trancoso, “Recovering Punctuation Marks for Automatic Speech Recognition,” in Proc. of INTERSPEECH 2007. 2007, pp. 2153–2156, ISCA.

[22] R. Ribeiro and D. M. de Matos, “Extractive Summarization of Broadcast News: Comparing Strategies for European Portuguese,” in Text, Speech and Dialogue – 10th International Conference. Proceedings. 2007, vol. 4629 of Lecture Notes in Computer Science (Subseries LNAI), pp. 115–122, Springer.

[23] K. Zechner and A. Waibel, “Minimizing Word Error Rate in Textual Summaries of Spoken Language,” in Proceedings of the 1st Conference of the North American Chapter of the ACL. 2000, pp. 186–193, Morgan Kaufmann.

[24] T. Hori, C. Hori, and Y. Minami, “Speech Summarization using Weighted Finite-State Transducers,” in Proceedings of the 8th EUROSPEECH - INTERSPEECH 2003. 2003, pp. 2817–2820, ISCA.

[25] T. Kikuchi, S. Furui, and C. Hori, “Two-stage Automatic Speech Summarization by Sentence Extraction and Compaction,” in Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR-2003). 2003, pp. 207–210, ISCA.

[26] P. Chatain, E. W. D. Whittaker, J. A. Mrozinski, and S. Furui, “Topic and Stylistic Adaptation for Speech Summarisation,” in Proc. of ICASSP 2006. 2006, vol. I, pp. 977–980, IEEE.

[27] X. Wan, J. Yang, and J. Xiao, “CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations,” in SIGIR 2007: Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007, pp. 143–150, ACM.