
Page 1: Adversarial Reward Learning for Visual Storytelling

Adversarial Reward Learning for Visual Storytelling

Presented by Maria Fabiano

Paper by Xin Wang, Wenhu Chen, Yuan-Fang Wang, William Yang Wang

Page 2: Adversarial Reward Learning for Visual Storytelling

Outline
1. Motivation
2. AREL Model Overview
3. Policy Model
4. Reward Model
5. AREL Objective
6. Data
7. Training and Testing
8. Evaluation
9. Critique

Page 3: Adversarial Reward Learning for Visual Storytelling

Motivation
The authors want to explore how well a computer can create a story from a set of images.

Prior to this paper, little research had been done on visual storytelling.

Visual storytelling represents a deeper understanding of images. This goes beyond image captioning because it requires understanding more complicated visual scenarios, relating sequential images, and associating implicit concepts in the image (e.g., emotions).

Page 4: Adversarial Reward Learning for Visual Storytelling

Motivation: Problems with Previous Storytelling Approaches

● RL
  ○ Hand-crafted rewards (e.g., METEOR) are too biased or too sparse to drive the policy search
  ○ Fail to learn implicit semantics (coherence, expressiveness, etc.)
  ○ Require extensive feature and reward engineering

● GANs
  ○ Prone to unstable or vanishing gradients

● IRL
  ○ Existing methods: maximum-margin approaches, probabilistic approaches

Page 5: Adversarial Reward Learning for Visual Storytelling

AREL Model Overview
AREL: Adversarial REward Learning

● Policy model: produces the story sequence from an image sequence

● Reward model: learns an implicit reward function from human-annotated stories and sampled predictions

● The two models are trained alternately via SGD

Page 6: Adversarial Reward Learning for Visual Storytelling

Policy Model
Takes an image sequence and sequentially chooses words from the vocabulary to create a story.

● Images go through a pre-trained CNN

● An encoder (bidirectional GRUs) extracts high-level features of the images

● Five decoders (single-layer GRUs with shared weights) generate five substories

● The substories are concatenated into the final story
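For illustration only, here is a minimal PyTorch sketch of an encoder-decoder policy of this shape. The layer sizes, the teacher-forcing interface, and the way the image context is fed to the decoder are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    """Sketch: bidirectional-GRU encoder over CNN image features, plus one
    shared single-layer GRU decoder reused for the five substories."""

    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512, embed_dim=512):
        super().__init__()
        # High-level context over the image sequence (bidirectional GRU encoder)
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A single decoder GRU whose weights are shared across all five substories
        self.decoder = nn.GRU(embed_dim + 2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, substory_tokens):
        # image_feats: (batch, 5, feat_dim) from a pre-trained CNN (e.g., ResNet)
        # substory_tokens: (batch, 5, max_len) reference words for teacher forcing
        context, _ = self.encoder(image_feats)            # (batch, 5, 2 * hidden_dim)
        logits = []
        for i in range(image_feats.size(1)):              # one substory per image
            words = self.embed(substory_tokens[:, i])     # (batch, max_len, embed_dim)
            ctx = context[:, i:i + 1].expand(-1, words.size(1), -1)
            h, _ = self.decoder(torch.cat([words, ctx], dim=-1))
            logits.append(self.out(h))                    # (batch, max_len, vocab_size)
        # Downstream, the five substories are concatenated into one story
        return torch.stack(logits, dim=1)
```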

Page 7: Adversarial Reward Learning for Visual Storytelling

(Partial) Reward Model
Aims to derive a human-like reward from human-annotated stories and sampled predictions.
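To make the "partial" (per-substory) reward concrete, here is a rough PyTorch sketch: a small CNN over word embeddings pooled into a sentence vector, concatenated with the image feature, and projected to a scalar score. The kernel widths, the use of max pooling, and all dimensions are assumptions (the critique later notes the paper does not specify the pooling).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch: text CNN + image feature -> scalar reward for one substory."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=300, num_filters=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolutions over the word sequence with a few kernel widths
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in (2, 3, 4)]
        )
        self.score = nn.Linear(3 * num_filters + feat_dim, 1)

    def forward(self, tokens, image_feat):
        # tokens: (batch, seq_len) word ids of one substory
        # image_feat: (batch, feat_dim) CNN feature of the corresponding image
        x = self.embed(tokens).transpose(1, 2)             # (batch, embed_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        sent = torch.cat(pooled, dim=1)                    # (batch, 3 * num_filters)
        return self.score(torch.cat([sent, image_feat], dim=1)).squeeze(-1)
```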

Page 8: Adversarial Reward Learning for Visual Storytelling

Adversarial Reward Learning: Reward Boltzmann Distribution

We achieve the optimal reward function R* when the Reward-Boltzmann distribution pθ equals the actual data distribution p*.

● W = story
● Rθ = reward function
● Zθ = partition function (a normalizing constant)
● pθ = approximate data distribution
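Using the symbols in this glossary, the Reward-Boltzmann distribution (as defined in the AREL paper) takes the form:

```latex
% Reward-Boltzmann distribution: stories with higher reward get higher probability.
p_\theta(W) = \frac{\exp\bigl(R_\theta(W)\bigr)}{Z_\theta},
\qquad
Z_\theta = \sum_{W} \exp\bigl(R_\theta(W)\bigr)
```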

Page 9: Adversarial Reward Learning for Visual Storytelling

Adversarial Reward Learning
We want the Reward Boltzmann distribution pθ to get close to the actual data distribution p*.

● Adversarial objective: a min-max two-player game. Maximize the similarity of pθ with the empirical distribution pe while minimizing the similarity of pθ with the data generated by the policy πβ. Meanwhile, πβ wants to maximize its similarity with pθ (the objective is written out compactly after this list).

● Distribution similarity is measured using KL-divergence.

● The objective of the reward model is to distinguish between human-annotated stories and machine-generated stories.
  ○ Minimize the KL-divergence with pe and maximize the KL-divergence with πβ

● The objective of the policy is to create stories indistinguishable from human-written stories.
  ○ Minimize the KL-divergence with pθ
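One compact way to write the min-max game described in these bullets (the paper's exact notation may differ slightly): the reward parameters θ push pθ toward the empirical distribution pe and away from the policy πβ, while the policy parameters β pull πβ toward pθ.

```latex
% Reward model (theta) maximizes; policy (beta) minimizes.
\min_{\beta} \max_{\theta} \;
\mathrm{KL}\bigl(\pi_\beta \,\|\, p_\theta\bigr)
- \mathrm{KL}\bigl(p_e \,\|\, p_\theta\bigr)
```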

Page 10: Adversarial Reward Learning for Visual Storytelling

Data
● VIST dataset of Flickr photos aligned to stories

● One sample is a story for five images from a photo album

● The same album is paired with five different stories as references

● Vocabulary of size 9,837 words (words have to appear more than three times in the training set)
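As a small illustration of this vocabulary rule, a Python sketch; the function name, special tokens, and whitespace tokenization are assumptions rather than the authors' preprocessing.

```python
from collections import Counter

def build_vocab(training_stories, min_count=4):
    """Keep words appearing more than three times (count >= 4) in the
    training stories; everything else will map to an <unk> token."""
    counts = Counter(word for story in training_stories for word in story.split())
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}
```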

Page 11: Adversarial Reward Learning for Visual Storytelling

Training and Testing
1. Create a baseline model XE-ss (cross-entropy loss with scheduled sampling) with the same architecture as the policy model
   a. Scheduled sampling uses a sampling probability to decide whether the decoder's next input is the ground-truth word or the model's own prediction (a minimal sketch follows this list)
2. Use XE-ss to initialize the policy model
3. Train with the AREL framework
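A minimal sketch of the scheduled-sampling decision in step 1a, assuming the standard formulation; how the sampling probability is scheduled over training is not shown.

```python
import random

def next_decoder_input(gold_token, predicted_token, sampling_prob):
    """Scheduled sampling: feed the model's own prediction with probability
    `sampling_prob`, otherwise feed the ground-truth token (teacher forcing)."""
    return predicted_token if random.random() < sampling_prob else gold_token
```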

Page 12: Adversarial Reward Learning for Visual Storytelling

Training and Testing
● Objective of the policy model: maximize similarity with pθ

● Objective of the reward model: distinguish between human-generated and machine-generated stories

● Alternate between training the policy and the reward model using SGD (sketched below)
  ○ N = 50 or 100

● For testing, the policy uses beam search to generate the story
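A high-level sketch of this alternating schedule, assuming the switch happens every N mini-batches (N = 50 or 100 per the slide). `policy_step` and `reward_step` are hypothetical callables that each perform one SGD update; the AREL losses themselves are not spelled out here.

```python
def train_arel(policy_step, reward_step, data_loader, num_epochs, n_steps=50):
    """Alternate SGD updates between the policy model and the reward model."""
    training_policy = True
    step = 0
    for _ in range(num_epochs):
        for batch in data_loader:
            if training_policy:
                policy_step(batch)   # move the policy pi_beta toward p_theta
            else:
                reward_step(batch)   # train R_theta to separate human vs. sampled stories
            step += 1
            if step % n_steps == 0:
                training_policy = not training_policy  # switch every N mini-batches
```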

Page 13: Adversarial Reward Learning for Visual Storytelling
Page 14: Adversarial Reward Learning for Visual Storytelling

Automatic Evaluation
● AREL achieves SOTA on all metrics except ROUGE; however, the gains are very small, and the scores are close to those of the baseline model and a vanilla GAN

[Table: automatic evaluation scores, showing each new method's gain and the range of the new methods' scores per metric; the per-metric differences are small, roughly 0.2 to 2.2 points.]

Page 15: Adversarial Reward Learning for Visual Storytelling

Human Evaluation
AREL greatly outperforms all other models in human evaluations:
● Turing test
● Relevance
● Expressiveness
● Concreteness

[Figure: comparison of Turing test results]

Page 16: Adversarial Reward Learning for Visual Storytelling

Critique: The “Good”
● AREL: a novel framework of adversarial reward learning for visual storytelling

● SOTA on the VIST dataset under automatic metrics

● Empirically shows that automatic metrics are not well suited for training or evaluation

● Comprehensive human evaluation via Mechanical Turk
  ○ Better results on relevance, expressiveness, and concreteness
  ○ Clear description of how the human evaluation was conducted

Page 17: Adversarial Reward Learning for Visual Storytelling

Critique: The “Not so Good”
● Motivation: an interesting problem to solve, but what are the practical applications?
  ○ Limited to five photos per story

● XE-ss: not mentioned until the evaluation section, even though it initializes AREL

● Partial rewards: more discussion and motivation needed for this approach

● Missing details
  ○ The type of pooling in the reward model is not specified (average? max?)
  ○ Is the pre-trained ResNet fine-tuned?

● Data bias (gender and event): the model amplifies the influence of the largest majority

● Small gains on automatic evaluation metrics, and XE-ss performs similarly to AREL; no direct human-evaluation comparison between AREL and previous methods

● Human evaluation improvements
  ○ Ask evaluators to give a reason why they judged a sentence to be machine-generated or not
  ○ Use rankings instead of pairwise comparisons

● Decoder shared weights: maybe there is something specific about an image's position that requires different weights (e.g., the structure of a narrative: setting, problem, rising action, climax, falling action, resolution)