
A Low-Rank Approximation Approach to Learning Joint Embeddings of News Stories and Images for Timeline Summarization

William Yang Wang1*, Yashar Mehdad3, Dragomir R. Radev2, Amanda Stent4
1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
2 Department of EECS, University of Michigan, Ann Arbor, MI 48109, USA
3 Yahoo, Sunnyvale, CA 94089, USA and 4 New York, NY 10036, USA

[email protected], {ymehdad,stent}@yahoo-inc.com, [email protected]

Abstract

A key challenge for timeline summarization is to generate a concise, yet complete storyline from large collections of news stories. Previous studies in extractive timeline generation are limited in two ways: first, most prior work focuses on fully-observable ranking models or clustering models with hand-designed features that may not generalize well. Second, most summarization corpora are text-only, which means that text is the sole source of information considered in timeline summarization, and thus, the rich visual content from news images is ignored. To solve these issues, we leverage the success of matrix factorization techniques from recommender systems, and cast the problem as a sentence recommendation task, using a representation learning approach. To augment text-only corpora, for each candidate sentence in a news article, we take advantage of top-ranked relevant images from the Web and model the image using a convolutional neural network architecture. Finally, we propose a scalable low-rank approximation approach for learning joint embeddings of news stories and images. In experiments, we compare our model to various competitive baselines, and demonstrate the state-of-the-art performance of the proposed text-based and multimodal approaches.

1 Introduction

Timeline summarization is the task of organizing crucial milestones of a news story in a temporal order, e.g. (Kedzie et al., 2014; Lin et al., 2012).

*This work was performed when William Wang and Dragomir Radev were visiting Yahoo NYC.

A timeline example for the 2010 BP oil spill generated by our system is shown in Figure 1. The task is challenging, because the input often includes a large number of news articles as the story develops each day, but only a small portion of the key information is needed for timeline generation. In addition to the conciseness requirement, timeline summarization also has to be complete: all key information, in whatever form, must be presented in the final summary.

To distill key insights from news reports, prior work in summarization often relies on feature engineering, and uses clustering techniques (Radev et al., 2004b) to select important events to be included in the final summary. While this approach is unsupervised, the process of feature engineering is always expensive, and the number of clusters is not easy to estimate. To present a complete summary, researchers from the natural language processing (NLP) community often rely solely on the textual information, while studies in the computer vision (CV) community rely solely on the image and video information. However, even though news images are abundantly available together with news stories, approaches that jointly learn textual and visual representations for summarization are not common.

In this paper, we take a more radical approach to timeline summarization. We formulate the problem as a sentence recommendation task: instead of recommending items to users as in a recommender system, we recommend important sentences to a timeline. Our approach does not require feature engineering: by using a matrix factorization framework, we are essentially performing representation learning to model the continuous representation of sentences and words.


Figure 1: A timeline example for the BP oil spill generated by our proposed method. Note that we use Yahoo! Image Search to obtain the top-ranked image for each candidate sentence.

Since most previous timeline summarization work (and therefore, most corpora) focuses only on textual information, we also provide a novel web-based approach for harvesting news images: we query Yahoo! image search with sentences from news articles, and extract visual cues using a 15-layer convolutional neural network architecture. By unifying text and images in the low-rank approximation framework, our approach learns a joint embedding of news story texts and images in a principled manner. In empirical evaluations, we conduct experiments on two publicly available datasets, and demonstrate the efficiency and effectiveness of our approach. By comparing to various baselines, we show that our approach is highly scalable and achieves state-of-the-art performance. Our main contributions are three-fold:

• We propose a novel matrix factorization approach for extractive summarization, leveraging the success of collaborative filtering;

• We are among the first to consider representation learning of a joint embedding for text and images in timeline summarization;

• Our model significantly outperforms various competitive baselines on two publicly available datasets.


2 Related Work

Supervised learning is widely used in summarization. For example, the seminal study by Kupiec et al. (1995) used a Naive Bayes classifier for selecting sentences. Recently, Wang et al. (2015) proposed a regression method that uses a joint loss function, combining news articles and comments. Additionally, unsupervised techniques such as language modeling (Allan et al., 2001) have been used for temporal summarization. In recent years, ranking and graph-based methods (Radev et al., 2004b; Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Fader et al., 2007; Hassan et al., 2008; Mei et al., 2010; Yan et al., 2011b; Yan et al., 2011a; Zhao et al., 2013; Ng et al., 2014; Zhou et al., 2014; Glavas and Snajder, 2014; Tran et al., 2015; Dehghani and Asadpour, 2015) have also proved popular for extractive timeline summarization, often in an unsupervised setting. Dynamic programming (Kiernan and Terzi, 2009) and greedy algorithms (Althoff et al., 2015) have also been considered for constructing summaries over time.

Our work aligns with recent studies on latent variable models for multi-document summarization and storyline clustering. Conroy et al. (2001) were among the first to consider latent variable models, even though it is difficult to incorporate features and high-dimensional latent states in an HMM-based model. Ahmed et al. (2011) proposed a hierarchical nonparametric model that integrates a Recurrent Chinese Restaurant Process with Latent Dirichlet Allocation to cluster words over time. The main issues with this approach are that it does not generate human-readable sentences, and that scaling nonparametric Bayesian models is often challenging. Similarly, Huang and Huang (2013) introduced a joint mixture-event-aspect model using a generative method. Navarro-Colorado and Saquete (2015) combined temporal information with topic modeling, and obtained the best performance in the cross-document event ordering task of SemEval 2015.

There has been prior work (Wang et al., 2008; Lee et al., 2009) using matrix factorization to perform sentence clustering. A key distinction between our work and this previous work is that our method requires no additional sentence selection steps after sentence clustering, so we avoid error cascades.

Zhu and Chen (2007) were among the first to consider multimodal timeline summarization, but they focus on visualization, and do not make use of images. Wang et al. (2012) investigated multimodal timeline summarization by considering cosine similarity among various feature vectors, and then using a graph-based algorithm to select salient topics. In the computer vision community, Kim and Xing (2014) made use of community web photos and generated storyline graphs for image recommendation. Interestingly, Kim et al. (2014) combined images and videos for storyline reconstruction. However, none of the above studies combine textual and visual information for timeline summarization.

3 Our Approach

We now describe the technical details of our low-rank approximation approach. First, we motivate our approach. Next, we explain how we formulate the timeline summarization task as a matrix factorization problem. Then, we introduce a scalable approach for learning low-dimensional embeddings of news stories and images.

3.1 Motivation

We formulate timeline summarization as a low-rank matrix completion task because of the following considerations:

• Simplicity In the past decade, a significant amount of work on summarization has focused on designing various lexical, syntactic and semantic features. In contrast to prior work, we make use of low-rank approximation techniques to learn representations directly from data. This way, our model does not require strong domain knowledge or extensive feature engineering, and it is easy for developers to deploy the system in real-world applications.

• Scalability A major reason that recommender systems and collaborative filtering techniques have been very successful in industrial applications is that matrix completion techniques are relatively sophisticated, and are known to scale up to large recommendation datasets with more than 100 million ratings (Bennett and Lanning, 2007).


[Figure 2 here shows the input matrix. Each row is a candidate sentence, e.g., "Haywood still receives salary as the CEO of BP.", "Some estimates are around 40,000 barrels a day.", "Dudley replaced Haywood as the new CEO of BP.", "The size of the oil spill was one of the largest." The columns span a ROUGE metric section; word features (e.g., Dudley, Haywood, barrels, BP, largest); SVO events (e.g., receives(Haywood, salary), replaced(Dudley, Haywood)); time features (e.g., 2010/04, 2010/05); and 4096 visual features computed from 224 x 224 RGB input images by stacked conv3/maxpool blocks (conv3-64 through conv3-512) followed by two fc-4096 layers. The factorization captures the implicit word, event, time, vision, and text-vision relations.]

Figure 2: Our low-rank approximation framework for learning joint embeddings of news stories and images for timeline summarization.

Therefore, we believe that our approach is practical for processing large datasets in this summarization task.

• Joint Multimodal Modeling A key challenge of supervised learning approaches for summarization is to select informative sentences. In this work, we make use of multimodality to select important sentences.

3.2 Problem Formulation

Since the Netflix competition (Bell and Koren, 2007), collaborative filtering techniques with latent factor models have had huge success in recommender systems. These latent factors, often in the form of low-rank embeddings, capture not only explicit information but also implicit context from the input data. In this work, we propose a novel matrix factorization framework to "recommend" key sentences to a timeline. Figure 2 shows an overview of the framework.

More specifically, we formulate this task as a matrix completion problem. Given a news corpus, we assume that there are m total sentences, which are the rows in the matrix. The first column is the metric section, where we use ROUGE (Lin, 2004) as the metric to pre-compute a sentence importance score between a candidate sentence and a human-generated summary. During training, we use these scores to tune model parameters, and during testing, we predict the sentence importance scores given the features in other columns. That is, we learn the embedding of important sentences.
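To make the metric column concrete, the following is a minimal sketch of a unigram-recall ROUGE-1 score; the exact ROUGE variant and configuration used for the importance scores is not spelled out here, so treat the function, its tokenization, and the example sentences (adapted from Figure 2) as illustrative assumptions.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference-summary unigrams covered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

# A candidate sentence scored against a (made-up) human summary for that day.
print(rouge1_recall("Dudley replaced Haywood as the new CEO of BP.",
                    "BP names Dudley as new CEO, replacing Haywood."))
```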

The second set of columns is the text feature section. In our experiments, this includes word observations, subject-verb-object (SVO) events, and the publication date of the document from which the candidate sentence is extracted. In our preprocessing step, we run the Stanford part-of-speech tagger (Toutanova et al., 2003) and MaltParser (Nivre et al., 2006) to generate SVO events based on dependency parses. Additional features can easily be incorporated into this framework; we leave the consideration of additional features for future work.
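As a rough illustration of the SVO extraction step, here is a sketch that substitutes spaCy for the Stanford tagger and MaltParser pipeline used above; the model name en_core_web_sm and the dependency labels are assumptions tied to that substitute toolkit.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes spaCy's small English model is installed

def extract_svo(sentence: str):
    """Collect (subject, verb, object) lemma triples from the dependency parse."""
    triples = []
    for tok in nlp(sentence):
        if tok.pos_ == "VERB":
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    triples.append((s.lemma_, tok.lemma_, o.lemma_))
    return triples

print(extract_svo("Dudley replaced Haywood as the new CEO of BP."))
# something along the lines of [('Dudley', 'replace', 'Haywood')]
```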

Finally, for each sentence, we use an image search engine to retrieve a top-ranked relevant image, and then we use a convolutional neural network (CNN) architecture to extract visual features in an unsupervised fashion. We use a CNN model from Simonyan and Zisserman (2015), which is trained on the ImageNet Challenge 2014 dataset (Russakovsky et al., 2014). In our work, we keep the 16 convolutional layers and max-pool operations. To extract neural network features, we remove the final fully-connected-1000 layer and the softmax function, resulting in 4096 features for each image.
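This feature extraction step can be approximated with a modern toolkit as follows; this is a sketch assuming torchvision's VGG-16 as a stand-in for the Simonyan and Zisserman network above (using the older pretrained=True argument; newer torchvision versions take a weights= argument) and a hypothetical image file name.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# VGG-16 pretrained on ImageNet; drop the final fc-1000 classification layer
# (classifier[6]) so the forward pass ends at the second fc-4096 layer.
vgg = models.vgg16(pretrained=True)
vgg.classifier[6] = nn.Identity()
vgg.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),  # the 224 x 224 RGB input size used above
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("retrieved_image.jpg").convert("RGB")  # hypothetical file
with torch.no_grad():
    features = vgg(preprocess(img).unsqueeze(0))  # shape: (1, 4096)
```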

The total number of columns in the input matrix is n. Our matrix M now encodes preferences for a sentence, together with its lexical, event, and temporal attributes, and visual features for an image highly relevant to the sentence. Here we use i to index the i-th sentence and j to index the j-th column. We scale the columns by the standard deviation.
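Schematically, the matrix assembly and column scaling might look like the following; the per-sentence feature arrays (rouge_scores, word_feats, event_feats, time_feats, vision_feats) are assumed precomputed and are not names from the original system.

```python
import numpy as np

# One row per candidate sentence:
# [ROUGE metric | word features | SVO events | time | 4096-d vision features]
M = np.hstack([rouge_scores.reshape(-1, 1),
               word_feats, event_feats, time_feats, vision_feats])

# Scale each column by its standard deviation.
std = M.std(axis=0)
std[std == 0] = 1.0  # guard against constant (zero-variance) columns
M = M / std
```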

3.3 Low-Rank Approximation

Following prior work (Koren et al., 2009), we are interested in learning two low-rank matrices P ∈ R^{k×m} and Q ∈ R^{k×n}. The intuition is that P is the embedding of all candidate sentences, and Q is the embedding of the textual and visual features, as well as the sentence importance score, event, and temporal features. Here k is the number of latent dimensions, and we would like to approximate M_{(i,j)} ≈ \vec{p}_i^T \vec{q}_j, where \vec{p}_i is the latent embedding vector for the i-th sentence and \vec{q}_j is the latent embedding vector for the j-th column. We seek to approximate the matrix M by these two low-rank matrices P and Q. We can then formulate the optimization problem for this task:

\min_{P,Q} \sum_{(i,j) \in M} \left( M_{(i,j)} - \vec{p}_i^T \vec{q}_j \right)^2 + \lambda_P \|\vec{p}_i\|^2 + \lambda_Q \|\vec{q}_j\|^2

Here, λ_P and λ_Q are regularization coefficients that prevent the model from overfitting. To solve this optimization problem efficiently, a popular approach is stochastic gradient descent (SGD) (Koren et al., 2009). In contrast to traditional methods that require time-consuming gradient computation, SGD takes only a small number of random samples to compute the gradient. SGD also lends itself naturally to online algorithms in real-time streaming applications, where instead of retraining the model on all the data, parameters can be updated incrementally as new data comes in. Once we have selected a random sample M_{(i,j)}, we can simplify the objective function:

\left( M_{(i,j)} - \vec{p}_i^T \vec{q}_j \right)^2 + \lambda_P (\vec{p}_i^T \vec{p}_i) + \lambda_Q (\vec{q}_j^T \vec{q}_j)

Now, we can calculate the sub-gradients with respect to the two latent vectors \vec{p}_i and \vec{q}_j to derive the following variable update rules:

\vec{p}_i \leftarrow \vec{p}_i + \delta (\ell_{(i,j)} \vec{q}_j - \lambda_P \vec{p}_i)    (1)

\vec{q}_j \leftarrow \vec{q}_j + \delta (\ell_{(i,j)} \vec{p}_i - \lambda_Q \vec{q}_j)    (2)

Here, δ is the learning rate, whereas \ell_{(i,j)} is the loss function that estimates how well the model approximates the ground truth:

\ell_{(i,j)} = M_{(i,j)} - \vec{p}_i^T \vec{q}_j

The low-rank approximation is accomplished by reconstructing the matrix M from the two low-rank matrices P and Q, and we use the row and column regularizers to prevent the model from overfitting to the training data.
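A minimal single-threaded sketch of these update rules follows; the learning rate, regularizers, and initialization scale are illustrative guesses, while k = 200 matches the text-only setting reported in Section 4.

```python
import random
import numpy as np

def factorize(observed, m, n, k=200, delta=0.01, lam_p=0.05, lam_q=0.05,
              epochs=20, seed=0):
    """SGD low-rank factorization over the observed cells of M, following
    update rules (1) and (2). `observed` is a list of (i, j, value) triples.
    Returns P (k x m) and Q (k x n) with M[i, j] ~= P[:, i] @ Q[:, j]."""
    rng = np.random.default_rng(seed)
    random.seed(seed)
    P = 0.1 * rng.standard_normal((k, m))
    Q = 0.1 * rng.standard_normal((k, n))
    for _ in range(epochs):
        random.shuffle(observed)
        for i, j, m_ij in observed:
            err = m_ij - P[:, i] @ Q[:, j]   # the loss l_(i,j)
            p_i_old = P[:, i].copy()         # q_j's update uses the old p_i
            P[:, i] += delta * (err * Q[:, j] - lam_p * P[:, i])
            Q[:, j] += delta * (err * p_i_old - lam_q * Q[:, j])
    return P, Q
```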

SGD-based optimization for matrix factorization can also be easily parallelized. For example, HOGWILD! (Recht et al., 2011) is a lock-free parallelization approach for SGD. In contrast to synchronous approaches, where idle threads have to wait for busy threads to sync up parameters, HOGWILD! is an asynchronous method: it assumes that because text features are sparse, there is no need to synchronize the threads. Although this assumption might not hold for speech- or image-related tasks, it works well in various text-based tasks. In this work, we follow a recently proposed approach called fast parallel stochastic gradient descent (FPSG) (Chin et al., 2015), which is partly inspired by HOGWILD!.
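The lock-free idea can be caricatured in a few lines, reusing observed, P, and Q from the previous sketch; note this is only a toy: Python's global interpreter lock prevents real parallel speedups here, and neither HOGWILD! nor FPSG is implemented this way.

```python
import threading
import numpy as np

def hogwild_worker(observed, P, Q, delta, lam_p, lam_q, steps, seed):
    """Apply SGD updates to the shared P and Q with no locking."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i, j, m_ij = observed[rng.integers(len(observed))]
        err = m_ij - P[:, i] @ Q[:, j]
        p_i_old = P[:, i].copy()
        P[:, i] += delta * (err * Q[:, j] - lam_p * P[:, i])
        Q[:, j] += delta * (err * p_i_old - lam_q * Q[:, j])

# Four workers update the same matrices concurrently with no synchronization;
# sparse features make conflicting updates to the same cells rare.
threads = [threading.Thread(target=hogwild_worker,
                            args=(observed, P, Q, 0.01, 0.05, 0.05, 10_000, s))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```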

3.4 Joint Modeling of Mixed Effects

Matrix factorization is a relatively complex method for modeling latent factors. So, an important question to ask is: in the context of timeline summarization, what is this matrix factorization framework modeling?

From equation (1), we can see that the latent sentence vector \vec{p}_i will be updated whenever we encounter an M_{(i,·)} sample (e.g., all the word, event, time, and visual features for this particular sentence) in a full pass over the training data. An interesting aspect of matrix factorization is that, in addition to the previous row embedding \vec{p}_i, the column embedding \vec{q}_j is also used to update the variables in equation (1). Similarly, when updating the latent column embedding \vec{q}_j in equation (2), the pass will visit all samples that have non-zero items in that column, while making use of the \vec{p}_i vector. Essentially, in timeline summarization, this approach models the mixed effects of sentence importance, lexical features, events, temporal information, and visual factors. For example, if we are predicting the ROUGE score of a new sentence at testing, the model will take the explicit sentence-level features into account, together with the learned latent embedding of ROUGE, which is recursively influenced by other metrics and features during training.

Our approach shares similarities with some recent advances in word embedding techniques. For example, word2vec uses the continuous bag-of-words (CBOW) and SkipGram algorithms (Mikolov et al., 2013) to learn continuous representations of words from large collections of text and relational data. A recent study (Levy and Goldberg, 2014) shows that the technique behind word2vec is very similar to implicit matrix factorization. In our work, we consider multiple sources of information to learn the joint embedding in a unified matrix factorization framework. In addition to word information, we also consider event and temporal cues.

3.5 The Matrix Factorization Based Timeline Summarization

We outline our matrix factorization based timeline summarization method in Algorithm 1. Since this is a supervised learning approach, we assume the corpus includes a collection of news documents S, as well as human-written summaries H for each day of the story. We also assume the publication date of each news document is known (or computable).

During training, we traverse each sentence in this corpus, and compute a sentence importance score (I_i) by comparing the sentence to the human-generated summary for that day using ROUGE (Lin, 2004). If a human summary is not given for that day, I_i will be zero. We also extract subject-verb-object event representations, using the Stanford part-of-speech tagger (Toutanova et al., 2003) and MaltParser (Nivre et al., 2006). We use the publication date of the news document as the publication date of the sentence. Visual features are extracted using a very deep CNN (Simonyan and Zisserman, 2015). Finally, we merge these vectors into a joint vector to represent a row in our matrix factorization framework.

Algorithm 1 A Matrix Factorization Based Timeline Summarization Algorithm

1:  Input: news documents S, human summaries H for each day t.
2:  procedure TRAINING(S^tr, H)
3:    for each training sentence S_i^tr in S^tr do
4:      I_i ← ComputeImportanceScores(S_i^tr, H_t)
5:      E_i ← ExtractSVOEvents(S_i^tr)
6:      D_i ← ExtractPublicationDate(S_i^tr)
7:      V_i ← ExtractVisualFeatures(V_i^tr)
8:      M_i ← MergeVectors(I_i, E_i, D_i, V_i)
9:    end for
10:   for each epoch e do
11:     for each cell (i, j) in M do
12:       p_i^(e) ← p_i^(e) + δ(ℓ_(i,j) q_j^(e) − λ_P p_i^(e))
13:       q_j^(e) ← q_j^(e) + δ(ℓ_(i,j) p_i^(e) − λ_Q q_j^(e))
14:     end for
15:   end for
16: end procedure
17: procedure TESTING(S^te)
18:   for each test sentence S_i^te in S^te do
19:     E_i ← ExtractSVOEvents(S_i^te)
20:     D_i ← ExtractPublicationDate(S_i^te)
21:     V_i ← ExtractVisualFeatures(S_i^te)
22:     M_i ← MergeVectors(E_i, D_i, V_i)
23:     I_i ← PredictROUGE(M_i, P, Q)
24:   end for
25:   for each day t in S^te do
26:     H^te ← SelectTopSentences(S_t^te, I_t)
27:   end for
28: end procedure

Then, we perform stochastic gradient descent training to learn the hidden low-rank embeddings of sentences and features, P and Q, using the update rules outlined earlier.

During testing, we still extract events and publication dates, and the PredictROUGE function estimates the sentence importance score I_i using the trained latent low-rank matrices P and Q. To be more specific, we extract the text, vision, event, and publication date features for a candidate sentence i. Then, given these features, we update the embeddings for this sentence, and make the prediction by taking the dot product of the i-th column of P (i.e., \vec{p}_i) and the ROUGE column of Q (i.e., \vec{q}_1). This predicted scalar value I_i indicates the likelihood of the sentence being included in the final timeline summary. Finally, we go through the predicted results for each sentence in the timeline in temporal order, and include the top-ranked sentences with the highest sentence importance scores.

It is natural to scale this method from daily summaries to weekly or monthly summaries.
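The test-time scoring and per-day selection can be sketched as follows, reusing P and Q from the factorization sketch; the (index, date, text) triple format and the assumption that the ROUGE metric occupies column 0 of Q are illustrative, not the original implementation.

```python
from collections import defaultdict

def predict_importance(p_i, Q):
    """Predicted importance I_i: dot product of the sentence embedding with
    the ROUGE column embedding (assumed to be column 0 of Q)."""
    return float(p_i @ Q[:, 0])

def build_timeline(test_sentences, P, Q, top_k=2):
    """Group candidates by publication date and keep the top-ranked ones."""
    by_day = defaultdict(list)
    for i, day, text in test_sentences:
        by_day[day].append((predict_importance(P[:, i], Q), text))
    return {day: [text for _, text in
                  sorted(cands, key=lambda st: st[0], reverse=True)[:top_k]]
            for day, cands in sorted(by_day.items())}
```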

4 Experiments

In this section, we investigate the empirical performance of the proposed method, comparing to various baselines. We first discuss our experimental settings, including our primary dataset and baselines. Then, we discuss our evaluation results. We demonstrate the robustness of our approach by varying the latent dimensions of the low-rank matrices. Next, we show additional experiments on a headline-based timeline summarization dataset. Finally, we provide a qualitative analysis of the output of our system.

4.1 Comparative Evaluation on the 17 Timelines Dataset

We use the 17 Timelines dataset, which has been used in several prior studies (Tran et al., 2013b; Tran et al., 2013a). It includes 17 timelines from 9 topics¹ from major news agencies such as CNN, BBC, and NBC News. Only English documents are included. The dataset contains 4,650 news documents. We use Yahoo! Image Search to retrieve the top-ranked image for each sentence.² We follow exactly the same topic-based cross-validation setup that was used in prior work (Tran et al., 2013b), sketched in code below: we train on eight topics, test on the remaining topic, and repeat the process eight times. The number of training iterations was set to 20; k was set to 200 for the text-only model, and 300 for the joint text/image model; and the vocabulary is 10K words for all systems. The common summarization metrics ROUGE-1, ROUGE-2, and ROUGE-S are used to evaluate the quality of the machine-generated timelines. We consider the following baselines:

¹The nine topics are the BP oil spill, Egyptian protests, Financial crisis, H1N1, Haiti earthquake, Iraq War, Libya War, Michael Jackson death, and Syrian crisis.

²We are not aware of any publicly available dataset for timeline summarization that includes both text and images. Most of these datasets are text-only, and do not include the original article files or links to accompanying images. We adopted this Web-based corpus enhancement technique as a proxy for news images. Our low-rank approximation technique can be applied to the original news images in the same way.
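A schematic of this leave-one-topic-out protocol, where topics maps a topic name to its documents and train_model / rouge_eval are hypothetical helpers standing in for the full training and ROUGE evaluation pipelines:

```python
scores = {}
for held_out in topics:
    train_docs = [doc for name, docs in topics.items()
                  if name != held_out for doc in docs]
    # k=200, 20 iterations, and a 10K-word vocabulary mirror the settings above.
    model = train_model(train_docs, k=200, epochs=20, vocab_size=10_000)
    scores[held_out] = rouge_eval(model, topics[held_out])
```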

Figure 3: Examples of retrieved Web images: (a) ROUGE = 0; (b) ROUGE = 0.009. The left image was retrieved using a non-informative sentence: "The latest five minute news bulletin from BBC World Service". The right image was retrieved using a crucial sentence with a non-zero ROUGE score vs. a human summary: "Case study: Gulf of Mexico oil spill and BP. On 20 April 2010 a deepwater oil well exploded in the Gulf of Mexico".

• Random: summary sentences are randomly selected from the corpus.

• MEAD: a feature-rich, classic multi-document summarization system (Radev et al., 2004a) that uses centroid-based summarization techniques.

• Chieu et al. (Chieu and Lee, 2004): a multi-document summarization system that uses TF-IDF scores to indicate the "popularity" of a sentence compared to other sentences.

• ETS (Yan et al., 2011b): a state-of-the-art unsupervised timeline summarization system.

• Tran et al. (Tran et al., 2013b): another state-of-the-art timeline summarization system, based on learning-to-rank techniques, for which results on the 17 Timelines dataset have been previously reported.

• Regression: part of a state-of-the-art extractive summarization method (Wang et al., 2015) that formulates the sentence extraction task as a supervised regression problem. We use a state-of-the-art regression implementation in Vowpal Wabbit.³

We report results for our system and the baselines on the 17 Timelines dataset in Table 1. We see that the random baseline clearly performs worse than the other methods.

³https://github.com/JohnLangford/vowpal_wabbit


Methods        ROUGE-1  ROUGE-2  ROUGE-S
Random          0.128    0.021    0.026
Chieu et al.    0.202    0.037    0.041
MEAD            0.208    0.049    0.039
ETS             0.207    0.047    0.042
Tran et al.     0.230    0.053    0.050
Regression      0.303    0.078    0.081
Our approach:
  Text          0.312    0.089    0.112
  Text+Vision   0.331    0.091    0.115

Table 1: Comparing the timeline summarization performance to various baselines on the 17 Timelines dataset. The best-performing results are highlighted in bold.

Even though Chieu et al. (2004) and MEAD (Radev et al., 2004a) are not specifically designed for the timeline summarization task, they perform relatively well against the ETS system for timeline summarization (Yan et al., 2011b). Tran et al. (2013b) was previously the state-of-the-art method on the 17 Timelines dataset. The ROUGE regression method is shown as a strong supervised baseline. Our matrix factorization approach outperforms all of these methods, achieving the best results on all three ROUGE metrics. We also see an extra boost in performance when visual features are considered for timeline summarization. Figure 3 shows an example of the retrieved images we used. In general, images retrieved using more important sentences (as measured by ROUGE) include objects, as well as a more vivid and detailed scene.

4.2 Comparative Evaluation Results for Headline Based Timeline Summarization

To evaluate the robustness of our approach, we show the performance of our method on the recently released crisis dataset (Tran et al., 2015). The main difference between the crisis dataset and the 17 Timelines dataset is that here we focus on a headline based timeline summarization task, rather than using sentences from the news documents. The crisis dataset includes four topics: Egypt revolution, Libya war, Syria war, and Yemen crisis. There are a total of 15,534 news documents in the dataset, and each topic has around 4K documents. There are 25 manually created timelines for these topics, collected from major news agencies such as BBC, CNN, and Reuters.

Methods        ROUGE-1  ROUGE-2  ROUGE-S
Regression      0.207    0.045    0.039
Our approach:
  Text          0.211    0.046    0.040
  Text+Vision   0.232    0.052    0.044

Table 3: Comparing the timeline summarization performance to the state-of-the-art supervised sentence regression approach on the crisis dataset. The best-performing results are highlighted in bold.

We perform standard cross-validation on this dataset: we train on three topics, and test on the remaining one. Here k is set to 300, and the vocabulary is 10K words for all systems. Table 3 shows the performance of our system. Our system is significantly better than the strong supervised regression baseline. When jointly learning from text and vision, we see a further improvement.

4.3 Headline Based Timeline Summarization: A Qualitative Analysis

In this section, we perform a qualitative analysis of the output of our system for the headline based timeline summarization task. We train the system on three topics, and show a sample of the output on the Syria war. Table 2 shows a subset of the timeline for the Syria war generated by our system. We see that most of the daily summaries are relevant to the topic, except the one generated on 2011-11-24. When evaluating the quality, we notice that most of them are of high quality: after the initial hypothesis of the Syria war on 2011-11-18, the following daily summaries concern the world's response to the crisis. Most of the relevant summaries also provide specific information, with an exception on 2011-12-02. We suspect that this is because this headline contains the three keywords "syria", "civil", and "war", as well as the key date information: the model was trained partly on the Libya war timeline, and therefore many features and parameters were activated in the matrix factorization framework, giving a high recommendation in this testing scenario. In contrast, when evaluating the output of the joint text and vision system, we see that this error is eliminated: the selected sentence on 2011-12-02 is "Eleven killed after weekly prayers in Syria on eve of Arab League deadline".


Date        Summary                                                                      Relevant?  Good?
2011-11-18  Syria is heading inexorably for a civil war and an appalling bloodbath         ✓        ✓
2011-11-19  David Ignatius Sorting out the rebel forces in Syria                           ✓        ✓
2011-11-20  Syria committed crimes against humanity, U.N. panel finds                      ✓        ✓
2011-11-21  Iraq joins Syria civil war warnings                                            ✓        ✓
2011-11-22  The Path to a Civil War in Syria                                               ✓        ✓
2011-11-23  Report Iran, Hezbollah setting up militias to prepare for post-Assad Syria     ✓        ✓
2011-11-24  Q&A Syria's daring actress Features Al Jazeera English                         ✗        ✗
2011-11-25  Syria conflict How residents of Aleppo struggle for survival                   ✓        ✓
2011-11-27  Syrian jets bomb rebel areas near Damascus as troops battle                    ✓        ✓
2011-11-28  Is the Regional Showdown in Syria Rekindling Iraq's Civil War?                 ✓        ✓
2011-11-29  Syria Crisis Army Drops Leaflets Over Damascus                                 ✓        ✓
2011-11-30  Russia says West's Syria push "path to civil war"                              ✓        ✓
2011-12-01  UN extends Syria war crimes investigation despite opposition from China        ✓        ✓
2011-12-02  Un syria civil war 12 2 2011                                                   ✓        ✗
2011-12-03  Israel says fires into Syria after Golan attack on troops                      ✓        ✓

Table 2: A timeline example for the Syria war generated by our text-only system.

5 Conclusions

In this paper, we introduce a low-rank approximation based approach for learning joint embeddings of news stories and images for timeline summarization. We leverage the success of matrix factorization techniques in recommender systems, and cast the multi-document extractive summarization task as a sentence recommendation problem. For each sentence in the corpus, we compute its similarity to a human-generated abstract, and extract lexical, event, and temporal features. We use a convolutional neural architecture to extract vision features. We demonstrate the effectiveness of this joint learning method by comparison with several strong baselines on the 17 Timelines dataset and a headline based timeline summarization dataset. We show that image features improve the performance of our model significantly. This further motivates investment in joint multimodal learning for NLP tasks.

Acknowledgments

The authors would like to thank Kapil Thadani and the anonymous reviewers for their thoughtful comments.

References

Amr Ahmed, Qirong Ho, Choon H. Teo, Jacob Eisenstein, Eric Xing, and Alex Smola. 2011. Online inference for the infinite topic-cluster model: Storylines from streaming text. In Proceedings of AISTATS.

James Allan, Rahul Gupta, and Vikas Khandelwal. 2001. Temporal summaries of new topics. In Proceedings of SIGIR.

Tim Althoff, Xin Luna Dong, Kevin Murphy, Safa Alai, Van Dang, and Wei Zhang. 2015. TimeMachine: Timeline generation for knowledge-base entities. In Proceedings of KDD.

Robert Bell and Yehuda Koren. 2007. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2).

James Bennett and Stan Lanning. 2007. The Netflix prize. In Proceedings of the KDD Cup and Workshop.

Hai Leong Chieu and Yoong Keok Lee. 2004. Query based event extraction along a timeline. In Proceedings of SIGIR.

Wei-Sheng Chin, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015. A fast parallel stochastic gradient method for matrix factorization in shared memory systems. ACM Transactions on Intelligent Systems and Technology, 6(1).

James Conroy, Judith Schlesinger, Diane O'Leary, and Mary Okurowski. 2001. Using HMM and logistic regression to generate extract summaries for DUC. In Proceedings of DUC.

Nazanin Dehghani and Masoud Asadpour. 2015. Graph-based method for summarized storyline generation in Twitter. arXiv preprint arXiv:1504.07361.

Gunes Erkan and Dragomir Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1).

Anthony Fader, Dragomir Radev, Michael Crespin, Burt Monroe, Kevin Quinn, and Michael Colaresi. 2007. MavenRank: Identifying influential members of the US Senate using lexical centrality. In Proceedings of EMNLP-CoNLL.

Goran Glavas and Jan Snajder. 2014. Event graphs for information retrieval and multi-document summarization. Expert Systems with Applications, 41(15).

Ahmed Hassan, Anthony Fader, Michael Crespin, Kevin Quinn, Burt Monroe, Michael Colaresi, and Dragomir Radev. 2008. Tracking the dynamic evolution of participant salience in a discussion. In Proceedings of COLING.

Lifu Huang and Lian'en Huang. 2013. Optimized event storyline generation based on mixture-event-aspect model. In Proceedings of EMNLP.

Chris Kedzie, Kathleen McKeown, and Fernando Diaz. 2014. Summarizing disasters over time. In Proceedings of the Bloomberg Workshop on Social Good at KDD.

Jerry Kiernan and Evimaria Terzi. 2009. Constructing comprehensive summaries of large event sequences. ACM Transactions on Knowledge Discovery from Data, 3(4).

Gunhee Kim and Eric Xing. 2014. Reconstructing storyline graphs for image recommendation from web community photos. In Proceedings of CVPR.

Gunhee Kim, Leonid Sigal, and Eric P. Xing. 2014. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In Proceedings of CVPR.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, 8.

Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of SIGIR.

Ju-Hong Lee, Sun Park, Chan-Min Ahn, and Daeho Kim. 2009. Automatic generic document summarization based on non-negative matrix factorization. Information Processing & Management, 45(1).

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Proceedings of NIPS.

Chen Lin, Chun Lin, Jingxuan Li, Dingding Wang, Yang Chen, and Tao Li. 2012. Generating event storylines from microblogs. In Proceedings of CIKM.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL workshop "Text Summarization Branches Out".

Qiaozhu Mei, Jian Guo, and Dragomir Radev. 2010. DivRank: the interplay of prestige and diversity in information networks. In Proceedings of KDD.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Borja Navarro-Colorado and Estela Saquete. 2015. GPLSIUA: Combining temporal information and topic modeling for cross-document event ordering. In Proceedings of SemEval.

Jun-Ping Ng, Yan Chen, Min-Yen Kan, and Zhoujun Li. 2014. Exploiting timelines to enhance multi-document summarization. In Proceedings of ACL.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC.

Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. 2004a. MEAD – a platform for multidocument multilingual text summarization. In Proceedings of LREC.

Dragomir R. Radev, Hongyan Jing, Małgorzata Stys, and Daniel Tam. 2004b. Centroid-based summarization of multiple documents. Information Processing & Management.

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of NIPS.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3).

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL-HLT.

Giang Binh Tran, Mohammad Alrifai, and Dat Quoc Nguyen. 2013a. Predicting relevant news events for timeline summaries. In Proceedings of WWW.

Giang Binh Tran, Tuan A. Tran, Nam-Khanh Tran, Mohammad Alrifai, and Nattiya Kanhabua. 2013b. Leveraging learning to rank in an optimization framework for timeline summarization. In Proceedings of the SIGIR Workshop on Time-Aware Information Access.

Giang Tran, Mohammad Alrifai, and Eelco Herder. 2015. Timeline summarization from relevant headlines. In Proceedings of ECIR.

Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of SIGIR.

Dingding Wang, Tao Li, and Mitsunori Ogihara. 2012. Generating pictorial storylines via minimum-weight connected dominating set approximation in multi-view graphs. In Proceedings of AAAI.

Lu Wang, Claire Cardie, and Galen Marchetti. 2015. Socially-informed timeline generation for complex events. In Proceedings of NAACL-HLT.

Rui Yan, Liang Kong, Congrui Huang, Xiaojun Wan, Xiaoming Li, and Yan Zhang. 2011a. Timeline generation through evolutionary trans-temporal summarization. In Proceedings of EMNLP.

Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Xiaoming Li, and Yan Zhang. 2011b. Evolutionary timeline summarization: a balanced optimization framework via iterative substitution. In Proceedings of SIGIR.

Xin Wayne Zhao, Yanwei Guo, Rui Yan, Yulan He, and Xiaoming Li. 2013. Timeline generation with social attention. In Proceedings of SIGIR.

Wubai Zhou, Chao Shen, Tao Li, Shu-Ching Chen, and Ning Xie. 2014. Generating textual storyline to improve situation awareness in disaster management. In Proceedings of the IEEE International Conference on Information Reuse and Integration.

Weizhong Zhu and Chaomei Chen. 2007. Storylines: Visual exploration and analysis in latent semantic spaces. Computers & Graphics, 31(3).