
Augmenting Neural Encoder-Decoder Model for Natural Language Generation Tasks

Yichen Jiang
Department of Computer Science
UNC Chapel Hill
[email protected]

Abstract

The Neural Encoder-Decoder model has been widely adopted for grounded language generation tasks. Such tasks usually require translating data from one domain to a different domain, including machine translation (language to language), image captioning (image to language), and text summarization (long article to short summary). In this thesis, we aim to improve the Neural Encoder-Decoder model for two different generation tasks: text summarization and image captioning. For summarization, we aim to improve the encoder of a popular pointer-generator model by adding a ‘closed-book’ decoder without attention and pointer mechanisms. We argue that such a decoder forces the encoder to be more selective about the information encoded in its memory state, since it cannot rely on the extra information provided by the attention and copy modules, and hence improves the entire model. On the CNN/Daily Mail dataset, our model outperforms the baseline significantly in terms of ROUGE and METEOR metrics and receives higher saliency scores, for both cross-entropy and reinforced setups. For image captioning, we propose a framework based on the Conditional Generative Adversarial Network (CGAN), which jointly trains a generator that produces captions conditioned on the image and a discriminator that estimates the probability of a caption being human-generated rather than produced by the generator. We present a series of experiments to show that our CGAN-based model consistently and significantly outperforms a strong baseline trained with maximum likelihood, which was not achieved by previous work that adopted a similar approach.

1 Introduction

Natural Language Generation (NLG) is the Natural Language Processing (NLP) task of automatically generating natural language. Most NLG tasks are grounded, which means the model needs to generate language from a given instance of data such as an image, text (language), a knowledge base, etc. Therefore, grounded NLG can be generalized as translating data from a source domain to the target domain of natural language. Such tasks include machine translation (one language to another language), text summarization (long stories to short summaries), QA (question to answer), image captioning (image to description), etc. Models that perform well in grounded Natural Language Generation tasks need to understand the data in the source domain (data that needs to be translated to natural language), as well as the language in the target domain (desired output in natural language).

Since the arrival of the deep learning era, a large number of studies have tried to apply neural networks to NLG tasks. Among them is a specific class of models, called Encoder-Decoder models, that serve as the basic architecture of state-of-the-art systems for many language generation tasks. An Encoder-Decoder model has two components: an encoder that encodes the data in the source domain into a fixed-size vector, and a decoder that takes this vector as input and generates natural language in the target domain. Fig. 1 provides a visualization of a typical Encoder-Decoder model.

In this thesis, we focus on two NLG tasks, automatic text summarization and image captioning, by augmenting the baseline Encoder-Decoder architecture with more intuitive components. In both tasks, our proposed models achieve statistically significant improvements over their respective baselines on a number of automatic metrics. In summarization particularly, our model obtains state-of-the-art results among neural models on a challenging long-text summarization dataset, in terms of ROUGE scores.

Automatic text summarization  Summarization is the task of condensing a long passage into a shorter version that covers only the most salient information from the original text. Extractive summarization models Jing and McKeown (2000); Knight and Marcu (2002); Clarke and Lapata (2008); Filippova et al. (2015) pick words and phrases from the source text to form a summary, while an abstractive model samples words from a fixed-size vocabulary instead of copying from the text directly. Recent successes in neural Encoder-Decoder attention models Sutskever et al. (2014); Bahdanau et al. (2014) have fueled many studies Rush et al. (2015); Nallapati et al. (2016); Chopra et al. (2016); Zeng et al. (2016); Gu et al. (2016); Gulcehre et al. (2016) in abstractive summarization. While these works led to powerful summarizers with strong pointer decoders Vinyals et al. (2015a), here we focus on the encoder of an Encoder-Decoder model and aim to improve its ability to memorize salient information from the source passage. Specifically, we build upon the pointer-generator See et al. (2017), but add another ‘closed-book’ decoder that has neither a pointer nor attention. We provide both the intuition and evidence for its contribution to a better summarization model in the following paragraphs.

The intuition is as follows: imagine there are two students learning to do abstractive summarization from scratch. During training, Student A is allowed to constantly look back at the passage when writing the summary (like a pointer-generator), while Student B occasionally has to write the summary without looking back (like our 2-decoder model with a non-attention/copy decoder). During the final test, both students can look at the passage while writing the summary. Student B will write better summaries after training, because s/he is sometimes not allowed to look at the paragraph when writing summaries and is hence forced to learn to distill important information from the input document.

In terms of back-propagation intuition, during the training of an attentional Encoder-Decoder model (e.g., See et al. (2017)), most gradients are back-propagated to the encoder’s hidden states through the attention layer. This encourages the encoder to correctly encode salient words at the current encoding step, but it does not ensure that this information is not forgotten by the encoder afterwards. For a plain LSTM (closed-book) decoder without attention, however, the only connection with the encoder is through the memory state. Hence, its loss amplifies the gradient flow through the memory state and encourages the encoder to memorize only important information. Thus, we jointly train the two decoders, which share one encoder, by optimizing the sum of their losses. This approximates the training routine of Student B, as the sole encoder has to perform well for both decoders. During inference, we only employ the pointer decoder due to its copying advantage over the closed-book decoder, similar to Student B being able to refer back to the passage during the test for best performance (while still having been trained hard to do well in both situations). Fig. 5 shows an example of our model generating a summary that covers the original passage with more saliency than the baseline model.

Empirically, we test our 2-decoder model on the CNN/Daily Mail dataset Hermann et al. (2015); Nallapati et al. (2016), and it surpasses the strong pointer-generator baselines significantly on both ROUGE Lin (2004) and METEOR Denkowski and Lavie (2014) metrics. This holds true for a cross-entropy baseline as well as a policy-gradient reinforcement learning setup Williams (1992). Our 2-decoder model also achieves improvements on a test-only DUC-2002 setup. We further show that this model achieves higher saliency (computed using the blanks in the original CNN/Daily Mail cloze-Q&A dataset) than the pointer-generator baselines, and hence enables the encoder to memorize more important information from the input document.

Image captioning  Image captioning is the task of accurately describing the visual contents of images using natural language. It has many real-world applications, including assistance to the visually impaired, text-based image retrieval, and human-machine interaction. The last few years have seen significant progress on this task Mao et al. (2014); Vinyals et al. (2015b); Xu et al. (2015); You et al. (2016); Yao et al. (2016); Vinyals et al. (2017), as demonstrated by the MSCOCO Lin et al. (2014) leaderboard updates. This progress has been achieved via a series of innovations in CNN, RNN, Encoder-Decoder architectures with attention, and reinforcement learning models, using one or more human-annotated image-caption pair datasets like MSCOCO and Flickr30k. Models from most works mentioned above are trained with maximum likelihood (teacher forcing), and such models suffer from exposure bias.

Figure 1: Vanilla Encoder-Decoder model

In our work, we designed a framework based on a Conditional Generative Adversarial Network (CGAN) Mirza and Osindero (2014), which was originally proposed for conditioned image generation. We adapt this to conditioned language generation, where our Encoder-Decoder generator G models the conditional distribution of captions given an image, and our discriminator D tries to discriminate captions generated by G from human-generated captions given the same image. The generator is stabilized with maximum likelihood pretraining, and then, in the joint training phase, the generator is optimized to produce captions that D considers to be ‘true’.

Unlike image generation, where the generative function is continuous and deterministic, text generation is a sequential (one word at a time) sampling procedure. This sampling process is non-differentiable, and hence we use policy gradient methods to jointly optimize our generator G with rewards from the discriminator D (similar to Yu et al. (2017)). Finally, we also improve the speed and stability of a previous CGAN-based image captioning model Dai et al. (2017) (which focused on increasing caption diversity) by discarding their Monte-Carlo rollouts and by using a mixed RL+ML loss function for the generator. Empirically, we present a series of experiments to show that our novel image captioning model significantly outperforms simple Encoder-Decoder baselines trained with the same number of image-caption pairs, in terms of several metrics.

In the next section, we give a brief introduction to the Encoder-Decoder model and its attention module. We then present our novel models for summarization and image captioning in Sections 3 and 4, followed by related work, our experimental setup, and results.

2 Encoder-Decoder Model

The Encoder-Decoder model Sutskever et al. (2014) can be used for the task of translating data from a source domain to a target domain. The translation starts with the encoder encoding the source data into a fixed-size vector. This vector is then passed to the decoder as input, and the decoder generates the corresponding data in the target domain. The architectures of the encoder and decoder can be neural networks of any form. For image captioning (image to text), the encoder is a convolutional neural network (CNN) that encodes the image into a feature map, and the decoder is a recurrent neural network (RNN) that generates the caption from this feature map. For text summarization or machine translation (text to text), both the encoder and the decoder are RNNs. Fig. 1 shows an example of an Encoder-Decoder model for text summarization.
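To make the data flow concrete, the following is a minimal PyTorch-style sketch of the text-to-text case: a single-layer LSTM encoder compresses the source tokens into its final (hidden, memory) state, and an LSTM decoder generates the target sequence from that fixed-size state. The module names and dimensions are illustrative, not the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn

class VanillaEncoderDecoder(nn.Module):
    """Illustrative text-to-text Encoder-Decoder model (no attention)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the source into a fixed-size state (h, c).
        _, (h, c) = self.encoder(self.embed(src_ids))
        # Initialize the decoder with that state and run it with teacher forcing.
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h, c))
        return self.out(dec_out)          # (batch, tgt_len, vocab_size) logits

model = VanillaEncoderDecoder(vocab_size=50000)
logits = model(torch.randint(0, 50000, (4, 40)), torch.randint(0, 50000, (4, 12)))
```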

2.1 Attention

The bottleneck of a vanilla Encoder-Decoder model like the one in Fig. 1 is the fact that the decoder has to generate the target-domain data from a fixed-size vector only. The information that can be encoded in this vector is limited, and an RNN encoder is known for its tendency to forget information from early encoding steps (e.g., the first sentence of a long article). Therefore, Bahdanau et al. (2014) augmented the Encoder-Decoder model with a shortcut connection between the decoder and the encoder’s entire history of hidden states. This strategy further developed into the attention mechanism: the shortcut connection is analogous to attending to the source data at different encoding steps. This is enabled by an attention distribution that yields a weighted sum of the encoder’s hidden states at all time steps; this vector is then incorporated into the decoder’s decision process. Fig. 2 provides a visualization of an Encoder-Decoder model with attention. In the next two sections, we describe the details of using the Encoder-Decoder model with the attention mechanism for two NLG tasks, summarization and image captioning.

Figure 2: Encoder-Decoder model with attention
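A minimal sketch of the additive attention step described above, assuming the encoder states for all time steps and the current decoder state are already available; the parameter names mirror Eqn. 1 in the next section, but the exact parameterization here is illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """score_i = v^T tanh(W_h h_i + W_s s_t); context = sum_i a_i h_i."""
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=True)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, src_len, enc_dim); dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_h(enc_states) +
                                   self.W_s(dec_state).unsqueeze(1))).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)                # (batch, src_len)
        # Weighted sum of encoder states, used in the decoder's decision process.
        context = torch.bmm(attn.unsqueeze(1), enc_states).squeeze(1)
        return attn, context
```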

3 Summarization Model

3.1 Pointer-Generator

The pointer-generator network proposed in See et al. (2017) is an Encoder-Decoder model with an attention mechanism. At each decoding step, the model can either sample a word from its vocabulary or copy a word directly from the source passage. This is enabled by the attention mechanism Bahdanau et al. (2014), which includes a distribution $a^t_i$ over all encoding steps and a context vector $h^*_t$ that is the weighted sum of the encoder’s hidden states. The probability of generating $w$ at step $t$ is modeled as below:

$$e^t_i = v^T \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t), \qquad h^*_t = \sum_i a^t_i h_i \tag{1}$$

where $v$, $W_h$, $W_s$, and $b_{attn}$ are learnable parameters, $h_i$ is the encoder’s hidden state at the $i$-th encoding step, and $s_t$ is the decoder’s hidden state at the $t$-th decoding step. The distribution $a^t_i$ can be seen as the amount of attention at decoding step $t$ towards the $i$-th encoder state. Therefore, the context vector $h^*_t$ is the sum of the encoder’s hidden states weighted by the attention distribution $a^t$. The context vector $h^*_t$ is concatenated with the decoder state $s_t$ to produce the logits for the vocabulary distribution $P_{vocab}$ at decoding step $t$:

$$P^t_{vocab} = \mathrm{softmax}(V_2(V_1[s_t, h^*_t] + b_1) + b_2) \tag{2}$$

where $V_1$, $V_2$, $b_1$, and $b_2$ are learnable parameters. To enable copying out-of-vocabulary words from the source text, a pointer similar to Vinyals et al. (2015a) is built upon the attention distribution and controlled by the generation probability $p_{gen}$:

$$p^t_{gen} = \sigma(U_{h^*} h^*_t + U_s s_t + U_x x_t + b_{ptr})$$
$$P^t_{pg}(w) = p^t_{gen} P^t_{vocab}(w) + (1 - p^t_{gen}) \sum_{i: w_i = w} a^t_i \tag{3}$$

where $U_{h^*}$, $U_s$, $U_x$, and $b_{ptr}$ are learnable parameters, $x_t$ and $s_t$ are the input token and the decoder’s state at the $t$-th decoding step, and $\sigma$ is the sigmoid function. We can see $p_{gen}$ as a soft gate that controls the model’s behavior of copying from the text versus sampling from the vocabulary. Here the word $w$ comes from the extended vocabulary, which is the union of the preset vocabulary and all words appearing in the source passage. We train the model by minimizing the negative log-likelihood in Eqn. 4.

Figure 3: Our 2-decoder model with a pointer-generator decoder and a closed-book decoder.

$$L_{pg} = \frac{1}{T} \sum_{t=1}^{T} -\log P^t_{pg}(w \mid x_{1:t}) \tag{4}$$

where $T$ is the maximum decoding step.
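The mixture in Eqns. 2–3 can be sketched as follows, building on the attention module above, with the copy mass scattered into an extended vocabulary; this is one common implementation pattern, not necessarily the exact code of See et al. (2017).

```python
import torch
import torch.nn as nn

class PointerGeneratorHead(nn.Module):
    """Mix the vocabulary distribution with a copy distribution (Eqns. 2-3)."""
    def __init__(self, dec_dim, enc_dim, emb_dim, vocab_size):
        super().__init__()
        self.to_vocab = nn.Sequential(
            nn.Linear(dec_dim + enc_dim, dec_dim),
            nn.Linear(dec_dim, vocab_size))
        self.p_gen_lin = nn.Linear(enc_dim + dec_dim + emb_dim, 1)
        self.vocab_size = vocab_size

    def forward(self, s_t, context, x_t, attn, src_ext_ids, n_extended):
        # P_vocab (Eqn. 2): two linear layers over [s_t; h*_t], then softmax.
        p_vocab = torch.softmax(self.to_vocab(torch.cat([s_t, context], dim=-1)), dim=-1)
        # p_gen (Eqn. 3): soft switch between generating and copying.
        p_gen = torch.sigmoid(self.p_gen_lin(torch.cat([context, s_t, x_t], dim=-1)))
        # Final distribution over the extended vocabulary (Eqn. 3): generated
        # mass for in-vocabulary words plus copy mass scattered to source tokens.
        extra = torch.zeros(s_t.size(0), n_extended - self.vocab_size)
        p_final = torch.cat([p_gen * p_vocab, extra], dim=1)
        return p_final.scatter_add(1, src_ext_ids, (1.0 - p_gen) * attn)
```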

See et al. (2017) also apply a coverage mechanism to penalize repetition. They maintain a coverage vector $c^t$, the sum of the attention distributions over all previous decoding steps $1 : t-1$, and incorporate it in calculating the attention distribution at the current step $t$:

$$c^t = \sum_{t'=0}^{t-1} a^{t'}, \qquad e^t_i = v^T \tanh(W_h h_i + W_s s_t + W_c c^t_i + b_{attn}) \tag{5}$$

where $W_c$ is a learnable parameter. They define the coverage loss $\mathrm{covloss}$ and combine it with the primary loss in Eqn. 4 to form a new loss function:

$$\mathrm{covloss}^t = \sum_i \min(a^t_i, c^t_i), \qquad L_{total} = \frac{1}{T} \sum_{t=1}^{T} \left( -\log P^t_{pg}(w \mid x_{1:t}) + \lambda\, \mathrm{covloss}^t \right) \tag{6}$$
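A small sketch of the coverage penalty in Eqn. 6, computed from the running sum of past attention distributions; in the full model the coverage vector also feeds back into the attention scores through $W_c$ (Eqn. 5), which is omitted here.

```python
import torch

def coverage_loss(attn_history):
    """covloss_t = sum_i min(a^t_i, c^t_i), with c^t the running sum of past
    attention distributions (Eqns. 5-6). attn_history: (batch, T, src_len)."""
    coverage = torch.zeros_like(attn_history[:, 0])
    losses = []
    for t in range(attn_history.size(1)):
        a_t = attn_history[:, t]
        losses.append(torch.min(a_t, coverage).sum(dim=-1))
        coverage = coverage + a_t
    return torch.stack(losses, dim=1).mean()
```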

3.2 Closed-Book Decoder

As shown in Bahdanau et al. (2014), the attention distribution $a_i$ partly depends on the decoder’s hidden state $s_t$, which is derived from the decoder’s memory state $c_t$. Therefore, if $c_t$ does not encode salient information from the source passage, or encodes too much unimportant information, the decoder will have a hard time locating the encoder’s states with attention. To enhance the encoder’s memory so that only important information is encoded into the memory state $c_t$, we add a closed-book decoder that shares the same encoder with the pointer decoder (see Fig. 3). During training, we optimize the sum of the negative log-likelihoods from the two decoders:

$$L = \frac{1}{T} \sum_{t=1}^{T} -\left( \log P^t_{pg}(w \mid x_{1:t}) + \log P^t_{cb}(w \mid x_{1:t}) \right) \tag{7}$$


Figure 4: Conditional Generative Adversarial Network for image captioning. The Generator is on the left and the Discriminator is on the right.

where $P_{pg}$ takes the form in Eqn. 3 and $P_{cb}$ is the corresponding output distribution of the closed-book decoder.
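A minimal sketch of the joint loss in Eqn. 7, assuming the per-step output distributions of the two decoders have already been computed; target indices are assumed to be in-vocabulary for the closed-book decoder for simplicity.

```python
import torch

def two_decoder_loss(p_pg_steps, p_cb_steps, target_ids):
    """Joint loss of Eqn. 7: summed per-step NLL from the pointer-generator
    decoder and the closed-book (no attention/copy) decoder, which share one
    encoder. p_pg_steps: (batch, T, V_ext), p_cb_steps: (batch, T, V)
    probabilities; target_ids: (batch, T) gold summary tokens."""
    idx = target_ids.unsqueeze(-1)
    nll_pg = -torch.log(p_pg_steps.gather(2, idx) + 1e-12)
    nll_cb = -torch.log(p_cb_steps.gather(2, idx) + 1e-12)
    # Averaging over batch and time steps plays the role of the 1/T factor.
    return (nll_pg + nll_cb).squeeze(-1).mean()
```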

3.3 Reinforcement Learning

Following Paulus et al. (2017), we also use a self-critical policy gradient training algorithm Rennie et al. (2016); Williams (1992) for both our baseline and our 2-decoder model. At each iteration, we sample a summary $y^s = w^s_{1:T+1}$ and greedily generate a baseline summary $\hat{y}$, both according to $P_{pg}$. These two summaries are then fed to a reward function $r$ that evaluates their closeness to the ground truth. We choose the ROUGE-L score as the reward function $r$, as in previous work Paulus et al. (2017). The RL loss function is as follows:

$$L_{rl} = \frac{1}{T} \sum_{t=1}^{T} \left( r(\hat{y}) - r(y^s) \right) \log P^t_{pg}(w^s_{t+1} \mid w^s_{1:t}) \tag{8}$$

We train our reinforced model using a mixture of Eqn. 8 and Eqn. 7. Paulus et al. (2017) showed that a pure RL objective leads to summaries that receive high rewards but are not fluent. The mixed loss function is as follows:

$$L_{mixed} = \gamma L_{rl} + (1 - \gamma) L_{pg} \tag{9}$$
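The self-critical update of Eqn. 8 and the mixture of Eqn. 9 can be sketched as below; `rouge_l` stands for a caller-supplied ROUGE-L reward function and the mixing weight is a hyperparameter, so both are assumptions rather than the exact training code.

```python
import torch

def self_critical_loss(log_probs_sample, sample_ids, greedy_ids, target, rouge_l):
    """Self-critical policy gradient (Eqn. 8): the greedy summary is the
    baseline, so only samples that beat it get reinforced.
    log_probs_sample: (batch, T) log P_pg of each sampled token.
    rouge_l: assumed helper mapping (ids, target) -> (batch,) rewards."""
    with torch.no_grad():
        advantage = rouge_l(greedy_ids, target) - rouge_l(sample_ids, target)
    # Minimizing this pushes up the log-prob of samples that beat the baseline.
    return (advantage.unsqueeze(1) * log_probs_sample).mean()

def mixed_loss(rl_loss, ml_loss, gamma):
    """Eqn. 9: gamma * L_rl + (1 - gamma) * L_ml; gamma is a hyperparameter."""
    return gamma * rl_loss + (1.0 - gamma) * ml_loss
```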

4 Image Captioning Model

4.1 Baseline

The baseline we use is a standard Encoder-Decoder model as in Vinyals et al. (2015b). At the initial step, the encoded image vector $f(I)$ is fed into the decoder LSTM, which outputs an initial state. At each following step $t = 1, \ldots, T-1$, the word $w_t$ from the human-generated caption is fed into the decoder, which outputs a conditional distribution $P(w^*_{t+1} \mid w_{1:t}, f(I))$ over $w^*_{t+1}$ in the vocabulary. Maximum likelihood optimizes the parameters by minimizing the negative conditional log-likelihood $L_{ML}(I, w_{1:T})$ given $N$ training $(I, w_{1:T})$ pairs:

$$L_{ML} = -\sum_{n=1}^{N} \sum_{t=1}^{T-1} \log P(w_{n,t+1} \mid f(I_n), w_{n,1}, \ldots, w_{n,t}) \tag{10}$$

We also incorporate the aforementioned attention mechanism Bahdanau et al. (2014); Xu et al. (2015) to get a stronger baseline.
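A minimal sketch of this baseline under Eqn. 10: the projected image feature is fed at the initial step and the model is trained with teacher forcing. The CNN feature extractor is assumed to be computed elsewhere, the attention module mentioned above is omitted for brevity, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ShowAndTellBaseline(nn.Module):
    """Image feature fed at step 0, then teacher-forced word prediction (Eqn. 10)."""
    def __init__(self, feat_dim, vocab_size, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, emb_dim)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, caption_ids):
        # Prepend the projected image feature as the first "token".
        inputs = torch.cat([self.img_proj(img_feat).unsqueeze(1),
                            self.embed(caption_ids[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        logits = self.out(hidden)               # predicts w_1 .. w_T
        # Negative conditional log-likelihood, averaged over all positions.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), caption_ids.reshape(-1))
```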

4.2 A CGAN-based Framework

In a typical GAN setting, the generator G and the discriminator D are trained jointly: D tries to distinguish between real and generated data, and G is updated in order to generate samples that D would classify as real. In the Conditional GAN Mirza and Osindero (2014), the generator takes random noise $z$ and a condition, which is the image $I$ in our case, as input. The complete loss function $L(G_\theta, D_\phi)$ of our model is:

$$L(G_\theta, D_\phi) = \mathbb{E}_{c \in C(I)}[\log D_\phi(I, c)] + \mathbb{E}[\log(1 - D_\phi(I, G_\theta(z \mid I)))] \tag{11}$$

where $C(I)$ denotes the set of human-generated captions for image $I$, and the second expectation is taken w.r.t. the sequential sampling procedure. Our overall CGAN framework is shown in Fig. 4.


Generator  After being trained with maximum likelihood and evaluated for later comparison, the baseline is used as the pre-trained generator of our GAN in the next phase of training. Dai et al. (2017) incorporated random noise $z$ by concatenating it with the image embedding $I$. We simply discard $z$ in our generator, because the sequential sampling procedure has already introduced randomness into the generation process. Empirical results also show no difference in performance with or without $z$.

Discriminator  The goal of our discriminator D is, given an image-caption pair $(I, c)$, to approximate the probability of $c$ being human-generated conditioned on $I$. D has a CNN and an LSTM to encode the visual and textual information, respectively. The output of the CNN encoder and the final state of the LSTM encoder are concatenated and fed into a 2-layer MLP to get the final score. Following Reed et al. (2016) and Dai et al. (2017), we update D using this loss function:

$$L_D(\phi) = \alpha \, \mathbb{E}_{c \in C(I)}[-\log D_\phi(I, c)] + \beta \, \mathbb{E}[-\log(1 - D_\phi(I, G_\theta(I)))] + \gamma \, \mathbb{E}_{c \in C_c(I)}[-\log(1 - D_\phi(I, c))] \tag{12}$$

where $C_c(I)$ represents human-generated captions for images other than $I$.
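A sketch of the three-term discriminator objective in Eqn. 12, assuming the discriminator already returns probabilities for matched, generated, and mismatched image-caption pairs; the default weights shown are illustrative, not the values used in our experiments.

```python
import torch

def discriminator_loss(d_real, d_fake, d_mismatch, alpha=1.0, beta=0.5, gamma=0.5):
    """Eqn. 12: real captions of I, captions sampled from G, and human captions
    of other images. d_* are discriminator probabilities in (0, 1); the weights
    alpha/beta/gamma here are illustrative hyperparameters."""
    eps = 1e-12
    loss_real = -torch.log(d_real + eps).mean()
    loss_fake = -torch.log(1.0 - d_fake + eps).mean()
    loss_mismatch = -torch.log(1.0 - d_mismatch + eps).mean()
    return alpha * loss_real + beta * loss_fake + gamma * loss_mismatch
```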

Training G with Policy Gradient  In order to overcome the non-differentiability caused by sampling, we follow previous work Dai et al. (2017); Yu et al. (2017) and adopt Policy Gradient Sutton et al. (2000). In this setting, the generation of the discrete words of a sentence is a sequence of actions. The decision of which action to take is based on a policy network $\pi_\theta$, which outputs a distribution over all possible actions at that step. In our case, $\pi_\theta$ is simply the generator $G_\theta$ without the last sampling step. After the entire sequence $c_T = w_{1:T}$ is sampled, it is fed into the discriminator $D_\phi$, which outputs an approximation of the reward $R(w_{1:T} \mid I)$. This reward is used to estimate the state-action value $Q(w_{1:t-1}, w_t \mid I)$, i.e., the value of taking action $w_t$ at state $w_{1:t-1}$ given image $I$, which should be calculated as follows:

$$Q(w_{1:t-1}, w_t \mid I) = \mathbb{E}_{w_{t+1:T}}[R(w_{1:t}; w_{t+1:T} \mid I)] \tag{13}$$

Previous work Dai et al. (2017); Liu et al. (2016) used Monte Carlo rollouts to approximate this expectation. Here we simply use the terminal reward $R(w_{1:T} \mid I)$ as a one-sample estimate of Eqn. 13 with large variance. The experiment presented in Table 7 shows that, without rollouts, our model still performs well on the evaluation metrics. To compensate for the variance, for each image $I$ we sample $M$ sentences instead of just one. The detailed derivation is in the supplementary materials; we present the complete loss gradient of G as follows:

$$\nabla_\theta J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \sum_{t=2}^{T} \left[ \nabla_\theta \log \pi_\theta(w^m_t \mid w^m_{1:t-1}, I_n) \times Q(w^m_{1:t-1}, w^m_t \mid I_n) \right] \tag{14}$$

Mixed Loss  To alleviate the instability of approximating rewards with the outputs from D, we reuse the loss function from maximum likelihood (ML) and update G w.r.t. this mixed loss Wu et al. (2016); Paulus et al. (2017); Pasunuru and Bansal (2017):

$$L_G(\theta) = \gamma J(\theta) + (1 - \gamma) L_{ML}(\theta) \tag{15}$$

where $L_{ML}$ takes the form of Eqn. 10. This mixed loss encourages G to explore the action space and select actions that lead to higher scores from D through the reinforcement loss $J(\theta)$, but also ensures readability and fluency through the cross-entropy loss $L_{ML}$. Finally, we have the loss function for our generator G in the form of Eqn. 15 and for our discriminator D in the form of Eqn. 12.
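One generator update under Eqns. 14–15 can be sketched as below, using the terminal discriminator score as the shared reward for every sampled word (no rollouts) and mixing with the ML loss; `generator.sample`, `generator.ml_loss`, and the `discriminator` call are assumed interfaces, not the actual code.

```python
import torch

def generator_step(generator, discriminator, images, captions, gamma=0.99, M=1):
    """One policy-gradient + ML update for G (Eqns. 14-15), without MC rollouts:
    the terminal reward D(I, sampled caption) is shared by every sampled word.
    `generator.sample` returns (sampled_ids, per-step log-probs) and
    `generator.ml_loss` returns the Eqn. 10 loss; both are assumed interfaces."""
    pg_terms = []
    for _ in range(M):                                     # M samples per image
        sample_ids, log_probs = generator.sample(images)   # log_probs: (batch, T)
        with torch.no_grad():
            reward = discriminator(images, sample_ids)     # (batch,) in (0, 1)
        # REINFORCE: maximizing expected reward == minimizing -reward * log-prob.
        pg_terms.append(-(reward.unsqueeze(1) * log_probs).mean())
    loss_pg = torch.stack(pg_terms).mean()
    loss_ml = generator.ml_loss(images, captions)
    return gamma * loss_pg + (1.0 - gamma) * loss_ml
```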

5 Related Work

5.1 Summarization

Extractive and Abstractive Summarization  Early models for automatic text summarization were usually extractive Jing and McKeown (2000); Knight and Marcu (2002); Clarke and Lapata (2008); Filippova et al. (2015). For abstractive summarization, different early non-neural approaches were also applied, based on graphs Giannakopoulos (2009); Ganesan et al. (2010), discourse trees Gerani et al. (2014), syntactic parse trees Cheung and Penn (2014); Wang et al. (2016), and a combination of linguistic compression and topic detection Zajic et al. (2004). Recent neural-network models have tackled abstractive summarization using more advanced methods (e.g., hierarchical, coverage, distraction, etc.) Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016); Chen et al. (2016); Takase et al. (2016). In order to usher the abstractive approach into the domain of long-text summarization, Nallapati et al. (2016) adapted the CNN/Daily Mail Hermann et al. (2015) dataset for long-text summarization and provided an abstractive baseline. The same authors then published an extractive model in another work Nallapati et al. (2017) that significantly outperformed the abstractive baseline in terms of ROUGE scores.

Pointer Network  A pointer network Vinyals et al. (2015a) is a kind of Encoder-Decoder network that models a distribution over elements of the input sequence using attention Bahdanau et al. (2014) and is therefore able to copy those input elements directly to the output. It is useful for automatic text summarization because summaries often need to copy/contain a large number of words that appear in the source text. This provides the advantages of both extractive and abstractive approaches, and such models usually include a gating function to model the distribution over the extended vocabulary, consisting of the preset vocabulary and words from the source text Zeng et al. (2016); Nallapati et al. (2016); Gu et al. (2016); Gulcehre et al. (2016); Miao and Blunsom (2016); See et al. (2017).

Reinforcement Learning  Teacher-forcing-style maximum likelihood (ML) training suffers from exposure bias Bengio et al. (2015), so recent work instead applies reinforcement-learning-style policy gradient algorithms (REINFORCE, Williams (1992)) to directly optimize metric scores Henß et al. (2015); Paulus et al. (2017).

5.2 Image Captioning

Image Captioning  Generating natural language descriptions for images has been studied comprehensively as a task that lies at the intersection of Computer Vision and Natural Language Processing. Recently, methods based on Deep Neural Networks in both fields have pushed the boundary of state-of-the-art image captioning results. In particular, Convolutional Neural Networks (CNNs) achieved great improvements on image classification (Krizhevsky et al. (2012)) and inspired a large branch of research on their properties. Recent works (Szegedy et al. (2015); He et al. (2016); Szegedy et al. (2016)) continuously improved the structure of CNNs in both depth and width, and achieved super-human results on the ImageNet (Deng et al. (2009)) image classification task. These works fuel the study of image captioning by providing stronger image models that can accurately capture visual contents and features from images. The Encoder-Decoder (Seq2Seq) paradigm Sutskever et al. (2014), with attention Bahdanau et al. (2014), as well as policy gradient approaches Liu et al. (2016), have achieved significant improvements in supervised setups.

GANs and CGANs  The Generative Adversarial Network (GAN) was introduced in Goodfellow et al. (2014) to generate realistic-looking images. It has two components: a generative part G that, given random noise z, generates an image that approximates the real-data distribution, and a discriminative part D that models the probability of an image being real as opposed to being generated by G. Through this adversarial training, G gradually learns to generate realistic-looking samples that follow the data distribution.

As one of the many variants, the Conditional GAN (CGAN) Mirza and Osindero (2014) enables directed data generation by providing an extra condition to the generator, and has been used for image captioning Dai et al. (2017) and image-to-image translation Isola et al. (2016). Recent works extended the usage of GANs and CGANs to generating natural language. Yu et al. (2017) first applied a GAN to generate sequential data, using Policy Gradient with Monte Carlo Rollouts to back-propagate gradients to their generator. However, their experiments were mostly conducted on synthetic data. More relevantly, Dai et al. (2017) proposed an image captioning model based on a CGAN. Their work aimed at naturalness and diversity of the generated captions, at the cost of lower scores on essentially all metrics.


6 Experiments

6.1 Dataset

For summarization, we use the CNN/Daily Mail dataset Hermann et al. (2015); Nallapati et al. (2016), which is a large-scale, long-paragraph, abstractive summarization dataset. It contains online news articles (781 tokens or ~40 sentences on average) paired with human-generated summaries (56 tokens or 3.75 sentences on average). The entire dataset has 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs. We use the same version of the data as See et al. (2017), which is the original text with no preprocessing to replace named entities. We also present results on a DUC-2002 test-only setup, where we reuse our model trained on CNN/Daily Mail to generate summaries for DUC-2002, which contains 566 news articles with corresponding human-generated summaries.

For image captioning, our experiments were conducted on the MSCOCO dataset Lin et al. (2014), which contains 82,081 training images and 40,137 validation images, each with 5 human-generated captions. To better facilitate training, we rearranged the dataset to produce a labeled training set of ~116,000 images and a validation set of ~6,000 images. We evaluated the model on the MSCOCO hidden (non-public captions) test set, which contains ~41,000 images.

6.2 Experimental setup

Summarization  We train our models on the CNN/Daily Mail dataset. Our model is directly adapted from the source code of See et al. (2017), and we keep most of their settings. The model is first trained to convergence, and is then trained with the coverage loss See et al. (2017) for another 3k steps. Both our encoder and decoder LSTMs have a hidden state dimension of 256, and the word embedding dimension is set to 128. Our preset vocabulary is the same for source texts and target summaries, with a total of 50k word tokens including special tokens for start, end, and out-of-vocabulary (OOV) signals. These word embeddings are learned from scratch instead of pretrained as in Nallapati et al. (2016). All of our teacher-forcing models are trained with Adagrad Duchi et al. (2011) with learning rate 0.15 and an initial accumulator value of 0.1. The gradients are clipped to a maximum norm of 2.0. As reported in See et al. (2017), we found Adagrad to yield the best results, while Adam Kingma and Ba (2014) provides faster convergence in the early phase of training. The batch size is set to 16. Our model with the closed-book decoder converged in about 270,000 iterations and achieved its best result on the validation set after another 3,000 iterations with coverage added. This is similar to the numbers reported in See et al. (2017). We then restore the best checkpoint and apply policy gradient training. For this phase of training, we choose the Adam optimizer Kingma and Ba (2014) because of its time efficiency, and the learning rate is set to 0.000001.

Image captioning  For the encoder CNN, we use an Inception-V3 Szegedy et al. (2016) pre-trained on ImageNet. The decoder is a one-layer LSTM with state size 512 for both c and h. The attention mechanism Bahdanau et al. (2014); Xu et al. (2015) is incorporated to get a stronger baseline. Each symbol in the vocabulary is represented as a 512-dimensional dense word embedding vector that is randomly initialized.

In summary, we first pre-train G and D using maximum likelihood on the labeled set until they converge. It is important to stop this training early, before they overfit. Then we train G and D jointly by applying gradient descent to them alternately. Note that G and D share the same word embedding map. This map is updated along with G in pre-training, and with both G and D in joint GAN training. We do not update the embedding map during the pre-training of D, as doing so would break G.

We use the Adam optimizer Kingma and Ba (2014) for both G and D. The learning rate is set to 0.0001 for the pre-training of G and D, but we use a smaller value (0.00001 to 0.000001) to update G during joint training. M in Eqn. 14 is set to 1, and γ in Eqn. 15 is set to 0.99.

7 Results

7.1 Summarization

We report our evaluation results on the CNN/Daily Mail dataset in Table 1. As shown, our pointer-generator with the extra closed-book decoder (pg + cbdec) receives significantly higher scores than the original pointer-generator (pg) from See et al. (2017) and our own pg baseline, on all ROUGE Lin (2004) and METEOR Denkowski and Lavie (2014) metrics.1


Figure 5: Summarization: The baseline makes factual errors and repeats itself (red), while our closed-book-decoder model covers information that is not mentioned by the baseline (green), while summarizing from a number of places in the source passage (blue).

Model           | R-1   | R-2   | R-L   | MTR (Full)
pg (See17)      | 39.53 | 17.28 | 36.38 | 18.72
lead-3 (See17)  | 40.57 | 17.70 | 36.57 | 22.21
RL† (Paulus17)  | 39.87 | 15.82 | 36.90 | –
pg (baseline)   | 39.22 | 17.02 | 35.95 | 18.70
pg + cbdec      | 40.05 | 17.66 | 36.73 | 19.48
RL + pg         | 39.67 | 17.36 | 36.45 | 20.84
RL + pg + cbdec | 40.43 | 17.84 | 36.94 | 21.41

Table 1: Summarization: ROUGE F1 and METEOR scores on the test set. "pg + cbdec" is our adapted pointer-generator with a closed-book decoder. The coverage mechanism is used in all models except the RL summarizer and the lead-3 baseline. The result marked with † was trained and evaluated on the anonymized version of the data.

In the reinforced setting, our 2-decoder model (RL + pg + cbdec) again outperforms our strong RL baseline by a considerable margin (statistical significance of p < 0.001). In terms of ROUGE-L, our teacher-forcing model even beats the lead-3 baseline that generates summaries by selecting the first 3 sentences from the source text. We also evaluated our 2-decoder model on the DUC-2002 test-only transfer setup by decoding the entire dataset with our model pretrained on CNN/Daily Mail, again achieving improvements over the single-decoder baseline as well as See et al. (2017) (shown in Table 2).

To further validate and sanity-check that the improvement results from the inclusion of our closed-book decoder and not from some trivial effect of having two decoders, we ran a variant of our model with two duplicated attention-pointer decoders. Table 3 shows that this variant yields roughly the same scores as the single-decoder baseline, and hence the improvements of our model are indeed achieved by the closed-book decoder.2

Saliency: We also evaluate our model’s ability to encode and address the most salient information from the source text. For this, we designed a keyword-matching test based on the original CNN/Daily Mail cloze blank-filling task. Each news article in the dataset is paired with a few extracted keywords that represent the most salient entities, including names, locations, etc.

1 All of our improvements in Table 1 are statistically significant with p < 0.001 (using a bootstrapped randomization test Efron and Tibshirani (1994)) and have a 95% ROUGE-significance level of at most ±0.25.

2 It is also important to point out that our model is not a 2-decoder ensemble, because we only use the pointer decoder during inference.


Model         | R-1   | R-2   | R-L   | MTR (Full)
pg (See17)    | 37.22 | 15.78 | 33.90 | 13.69
pg (baseline) | 37.15 | 15.68 | 33.92 | 13.65
pg + cbdec    | 37.59 | 16.84 | 34.43 | 13.82

Table 2: Summarization: ROUGE F1 and METEOR scores on DUC-2002 (test-only transfer setup).

Model         | R-1   | R-2   | R-L   | MTR
pg (baseline) | 39.22 | 17.02 | 35.95 | 18.70
pg + ptrdec   | 39.26 | 17.03 | 36.02 | 18.72

Table 3: Summarization: ROUGE F1 and METEOR scores of the sanity check. pg is the baseline with one pointer decoder; pg + ptrdec has an extra pointer decoder.

We count the number of keywords that appear in the generated summaries, and find that the output of our best teacher-forcing model (pg + cbdec) contains 62.1% of those keywords, while the output provided by See et al. (2017) covers only 60.5%. Our reinforced model (RL + pg + cbdec) further increases this percentage to 67.6%. The full comparison is shown in Table 4. This again demonstrates our encoder’s ability to memorize important information and address it properly in the generated summary.
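A small sketch of the keyword-matching saliency score described above, assuming one set of cloze keywords per article; the simple substring match here may differ from the exact matching procedure used for Table 4.

```python
def saliency_score(summaries, keyword_sets):
    """Fraction of cloze keywords (salient entities) that appear in the
    corresponding generated summary; a simple case-insensitive match."""
    matched = total = 0
    for summary, keywords in zip(summaries, keyword_sets):
        text = summary.lower()
        matched += sum(1 for kw in keywords if kw.lower() in text)
        total += len(keywords)
    return matched / max(total, 1)
```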

7.2 Image captioning

The evaluation results of our experiments are presented in Table 5. Compared to the baselines trained with maximum likelihood, our CGAN-based model generates captions that receive considerably higher scores on all metrics. Table 6 reports the statistical significance levels3 of these improvements, which are significant on all metrics.

Speed and Result Comparison with the Rollouts Model  Table 7 compares the CIDEr scores as well as the speed of the Monte Carlo rollouts model used by Dai et al. (2017) against our model without these rollouts (see Sec. 4.2). As shown, our model achieves better performance at higher speed.

8 Conclusion

In this thesis, we augmented a standard Encoder-Decoder model for two Natural Language Generation tasks. For summarization, we added a second decoder that forces the encoder to present a better representation in its memory state. We showed that our proposed model significantly outperforms the state-of-the-art pointer-generator baselines in terms of ROUGE and METEOR scores, as well as saliency (in both a teacher-forcing setup and a reinforcement learning setup). It also achieves improvements in a test-only transfer setup on the DUC-2002 dataset. For image captioning, we trained a discriminator along with the Encoder-Decoder generator and optimized the entire model adversarially. Experiments showed that our CGAN-based model surpasses the baseline in terms of metric scores with statistically significant improvements. Overall, our experiments suggest that designing more intuitive architectures is a promising direction for future studies of neural networks and their applications.

Acknowledgments

I’m grateful to have Prof. Bansal as my advisor. You accepted me into the group when I had no experience in research and NLP, and you kept pushing me to my potential and showed me how to be a qualified researcher (although I’m still not a very good one) who works hard and thinks critically.

To Prof. Pozefsky, thanks for your kind advice and mentorship that dates back to COMP523. You always inspire me, and you made me a better presenter during the tech talk.

3 Stat. signif. is based on the bootstrap test Efron and Tibshirani (1994); we split our val set into a smaller 2k val set and an internal 4k test set, because the CodaLab evaluation server does not provide individual scores for its test set captions. Here the model is tuned on the validation set of 2k images, following the same procedure described above.


Model           | Saliency
pg (See17)      | 60.4%
pg (baseline)   | 59.6%
pg + cbdec      | 62.1%
RL + pg         | 65.1%
RL + pg + cbdec | 67.6%

Table 4: Summarization: Saliency scores.

Model                  | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr
Vinyals et al. (2015b) | 0.713  | 0.542  | 0.407  | 0.309  | 0.254  | 0.943
Mao et al. (2014)      | 0.716  | 0.545  | 0.404  | 0.299  | 0.252  | 0.917
Xu et al. (2015)       | 0.712  | 0.537  | 0.398  | 0.295  | 0.247  | 0.915
You et al. (2016)      | 0.731  | 0.565  | 0.424  | 0.316  | 0.250  | 0.943
baseline               | 0.712  | 0.537  | 0.398  | 0.295  | 0.247  | 0.915
GAN                    | 0.711  | 0.538  | 0.400  | 0.299  | 0.250  | 0.934

Table 5: Image captioning: Results on the MSCOCO test set.

And finally, to my beloved friends and family members, thanks for the huge support during my four years at UNC. You guys were always with me throughout my ups and downs. Hopefully my work can make you proud.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv e-prints.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In IJCAI.

Jackie Chi Kit Cheung and Gerald Penn. 2014. Unsupervised sentence enhancement for automatic summarization. In EMNLP, pages 775–786.

Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:399–429.

Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. 2017. Towards diverse and natural image descriptions via a conditional GAN. arXiv preprint arXiv:1703.06029.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.


Metric | GAN / baseline | signif. (p)
B-4    | 0.310 / 0.305  | 0.046
MTR    | 0.255 / 0.251  | 0.001
CDr    | 0.986 / 0.964  | 0.028
SPICE  | 0.186 / 0.182  | 0.005

Table 6: Image captioning: Statistical significance levels (p-values in a bootstrap test) for our GAN versus baseline comparison, on BLEU-4, METEOR, CIDEr, and SPICE Anderson et al. (2016) metrics.

Model        | CIDEr | Min/epoch
GAN          | 0.941 | 71
GAN_rollouts | 0.926 | 281

Table 7: Image captioning: CIDEr scores and speed (minutes per epoch) on a single GPU, with and without rollouts.

Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press.

Katja Filippova, Enrique Alfonseca, Carlos A Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP, pages 360–368.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd international conference on computational linguistics, pages 340–348. ACL.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In EMNLP, volume 14, pages 1602–1613.

George Giannakopoulos. 2009. Automatic summarization from multiple documents. Ph.D. dissertation.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

Stefan Henß, Margot Mieskes, and Iryna Gurevych. 2015. A reinforcement learning approach for adaptive single- and multi-document summarization. In GSCL, pages 3–12.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.


Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, pages 178–185, Stroudsburg, PA, USA. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8. Barcelona, Spain.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. Springer International Publishing, Cham.

S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. 2016. Improved Image Captioning via Policy Gradient optimization of SPIDEr. ArXiv e-prints.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

M. Mirza and S. Osindero. 2014. Conditional Generative Adversarial Nets. ArXiv e-prints.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, pages 3075–3081.

Ramesh Nallapati, Bowen Zhou, Çaglar Gulçehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Computational Natural Language Learning.

Ramakanth Pasunuru and Mohit Bansal. 2017. Reinforced video captioning with entailment rewards. arXiv preprint arXiv:1708.02300.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Empirical Methods in Natural Language Processing.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.


Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1054–1059.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015b. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2017. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663.

Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. 2016. Memory-enhanced decoder for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ArXiv e-prints.

T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. 2016. Boosting Image Captioning with Attributes. ArXiv e-prints.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.

David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In Proceedings of the HLT-NAACL 2004 Document Understanding Workshop, Boston, pages 112–119.

Wenyuan Zeng, Wenjie Luo, Sanja Fidler, and Raquel Urtasun. 2016. Efficient summarization with read-again and copy mechanism. arXiv preprint arXiv:1611.03382.
