
Page 1: Adversarial Reward Learning for Visual Storytelling

Adversarial Reward Learning for Visual Storytelling

Presented by Maria Fabiano

Paper by Xin Wang, Wenhu Chen, Yuan-Fang Wang, William Yang Wang

Page 2: Adversarial Reward Learning for Visual Storytelling

Outline
1. Motivation
2. AREL Model Overview
3. Policy Model
4. Reward Model
5. AREL Objective
6. Data
7. Training and Testing
8. Evaluation
9. Critique

Page 3: Adversarial Reward Learning for Visual Storytelling

Motivation
The authors want to explore how well a computer can create a story from a set of images.

Prior to this paper, little research had been done on visual storytelling.

Visual storytelling represents a deeper understanding of images. This goes beyond image captioning because it requires understanding more complicated visual scenarios, relating sequential images, and associating implicit concepts in the image (e.g., emotions).

Page 4: Adversarial Reward Learning for Visual Storytelling

Motivation: Problems with Previous Storytelling Approaches

● RL
  ○ Hand-crafted rewards (e.g., METEOR) are too biased or too sparse to drive the policy search
  ○ Fail to learn implicit semantics (coherence, expressiveness, etc.)
  ○ Require extensive feature and reward engineering

● GANs
  ○ Prone to unstable or vanishing gradients

● IRL
  ○ Existing methods: maximum-margin approaches, probabilistic approaches

Page 5: Adversarial Reward Learning for Visual Storytelling

AREL Model Overview
AREL: Adversarial REward Learning

● Policy model: produces the story sequence from an image sequence

● Reward model: learns an implicit reward function from human-annotated stories and sampled predictions

● The two models are trained alternately via SGD

Page 6: Adversarial Reward Learning for Visual Storytelling

Policy Model
Takes an image sequence and sequentially chooses words from the vocabulary to create a story.

● Images go through a pre-trained CNN

● An encoder (bidirectional GRUs) extracts high-level features of the images

● Five decoders (single-layer GRUs with shared weights) generate five substories

● The substories are concatenated into the final story
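For illustration only, here is a minimal PyTorch sketch of an encoder-decoder policy of this shape. The layer sizes, the teacher-forcing interface, and the way the image context is fed to the decoder are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    """Sketch: bidirectional-GRU encoder over CNN image features, plus one
    shared single-layer GRU decoder reused for the five substories."""

    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512, embed_dim=512):
        super().__init__()
        # High-level context over the image sequence (bidirectional GRU encoder)
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A single decoder GRU whose weights are shared across all five substories
        self.decoder = nn.GRU(embed_dim + 2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, substory_tokens):
        # image_feats: (batch, 5, feat_dim) from a pre-trained CNN (e.g., ResNet)
        # substory_tokens: (batch, 5, max_len) reference words for teacher forcing
        context, _ = self.encoder(image_feats)            # (batch, 5, 2 * hidden_dim)
        logits = []
        for i in range(image_feats.size(1)):              # one substory per image
            words = self.embed(substory_tokens[:, i])     # (batch, max_len, embed_dim)
            ctx = context[:, i:i + 1].expand(-1, words.size(1), -1)
            h, _ = self.decoder(torch.cat([words, ctx], dim=-1))
            logits.append(self.out(h))                    # (batch, max_len, vocab_size)
        # Downstream, the five substories are concatenated into one story
        return torch.stack(logits, dim=1)
```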

Page 7: Adversarial Reward Learning for Visual Storytelling

(Partial) Reward Model
Aims to derive a human-like reward from human-annotated stories and sampled predictions.
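To make the "partial" (per-substory) reward concrete, here is a rough PyTorch sketch: a small CNN over word embeddings pooled into a sentence vector, concatenated with the image feature, and projected to a scalar score. The kernel widths, the use of max pooling, and all dimensions are assumptions (the critique later notes the paper does not specify the pooling).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch: text CNN + image feature -> scalar reward for one substory."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=300, num_filters=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolutions over the word sequence with a few kernel widths
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in (2, 3, 4)]
        )
        self.score = nn.Linear(3 * num_filters + feat_dim, 1)

    def forward(self, tokens, image_feat):
        # tokens: (batch, seq_len) word ids of one substory
        # image_feat: (batch, feat_dim) CNN feature of the corresponding image
        x = self.embed(tokens).transpose(1, 2)             # (batch, embed_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        sent = torch.cat(pooled, dim=1)                    # (batch, 3 * num_filters)
        return self.score(torch.cat([sent, image_feat], dim=1)).squeeze(-1)
```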

Page 8: Adversarial Reward Learning for Visual Storytelling

Adversarial Reward Learning: Reward Boltzmann Distribution

We achieve the optimal reward function R* when the Reward-Boltzmann distribution pθ equals the actual data distribution p*.

● W = story
● Rθ = reward function
● Zθ = partition function (a normalizing constant)
● pθ = approximate data distribution
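Using the symbols in this glossary, the Reward-Boltzmann distribution (as defined in the AREL paper) takes the form:

```latex
% Reward-Boltzmann distribution: stories with higher reward get higher probability.
p_\theta(W) = \frac{\exp\bigl(R_\theta(W)\bigr)}{Z_\theta},
\qquad
Z_\theta = \sum_{W} \exp\bigl(R_\theta(W)\bigr)
```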

Page 9: Adversarial Reward Learning for Visual Storytelling

Adversarial Reward Learning
We want the Reward Boltzmann distribution pθ to get close to the actual data distribution p*.

● Adversarial objective: a min-max two-player game. Maximize the similarity of pθ with the empirical distribution pe while minimizing the similarity of pθ with the data generated by the policy πβ. Meanwhile, πβ wants to maximize its similarity with pθ (the objective is written out compactly after this list).

● Distribution similarity is measured using KL-divergence.

● The objective of the reward model is to distinguish between human-annotated stories and machine-generated stories.
  ○ Minimize the KL-divergence with pe and maximize the KL-divergence with πβ

● The objective of the policy is to create stories indistinguishable from human-written stories.
  ○ Minimize the KL-divergence with pθ
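One compact way to write the min-max game described in these bullets (the paper's exact notation may differ slightly): the reward parameters θ push pθ toward the empirical distribution pe and away from the policy πβ, while the policy parameters β pull πβ toward pθ.

```latex
% Reward model (theta) maximizes; policy (beta) minimizes.
\min_{\beta} \max_{\theta} \;
\mathrm{KL}\bigl(\pi_\beta \,\|\, p_\theta\bigr)
- \mathrm{KL}\bigl(p_e \,\|\, p_\theta\bigr)
```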

Page 10: Adversarial Reward Learning for Visual Storytelling

Data
● VIST dataset of Flickr photos aligned to stories

● One sample is a story for five images from a photo album

● The same album is paired with five different stories as references

● Vocabulary of size 9,837 words (words have to appear more than three times in the training set)
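As a small illustration of this vocabulary rule, a Python sketch; the function name, special tokens, and whitespace tokenization are assumptions rather than the authors' preprocessing.

```python
from collections import Counter

def build_vocab(training_stories, min_count=4):
    """Keep words appearing more than three times (count >= 4) in the
    training stories; everything else will map to an <unk> token."""
    counts = Counter(word for story in training_stories for word in story.split())
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}
```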

Page 11: Adversarial Reward Learning for Visual Storytelling

Training and Testing
1. Create a baseline model XE-ss (cross-entropy loss with scheduled sampling) with the same architecture as the policy model
   a. Scheduled sampling uses a sampling probability to decide whether the decoder's next input is the ground-truth word or the model's own prediction (a minimal sketch follows this list)
2. Use XE-ss to initialize the policy model
3. Train with the AREL framework
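A minimal sketch of the scheduled-sampling decision in step 1a, assuming the standard formulation; how the sampling probability is scheduled over training is not shown.

```python
import random

def next_decoder_input(gold_token, predicted_token, sampling_prob):
    """Scheduled sampling: feed the model's own prediction with probability
    `sampling_prob`, otherwise feed the ground-truth token (teacher forcing)."""
    return predicted_token if random.random() < sampling_prob else gold_token
```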

Page 12: Adversarial Reward Learning for Visual Storytelling

Training and Testing
● Objective of the policy model: maximize similarity with pθ

● Objective of the reward model: distinguish between human-generated and machine-generated stories

● Alternate between training the policy and the reward model using SGD (sketched below)
  ○ N = 50 or 100

● For testing, the policy uses beam search to generate the story
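A high-level sketch of this alternating schedule, assuming the switch happens every N mini-batches (N = 50 or 100 per the slide). `policy_step` and `reward_step` are hypothetical callables that each perform one SGD update; the AREL losses themselves are not spelled out here.

```python
def train_arel(policy_step, reward_step, data_loader, num_epochs, n_steps=50):
    """Alternate SGD updates between the policy model and the reward model."""
    training_policy = True
    step = 0
    for _ in range(num_epochs):
        for batch in data_loader:
            if training_policy:
                policy_step(batch)   # move the policy pi_beta toward p_theta
            else:
                reward_step(batch)   # train R_theta to separate human vs. sampled stories
            step += 1
            if step % n_steps == 0:
                training_policy = not training_policy  # switch every N mini-batches
```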

Page 13: Adversarial Reward Learning for Visual Storytelling
Page 14: Adversarial Reward Learning for Visual Storytelling

Automatic Evaluation
● AREL achieves SOTA on all metrics except ROUGE; however, the gains are very small, and the scores are close to those of the baseline model and a vanilla GAN

[Table: automatic evaluation scores, showing each new method's gain and the range of the new methods' scores per metric; the per-metric differences are small, roughly 0.2 to 2.2 points.]

Page 15: Adversarial Reward Learning for Visual Storytelling

Human Evaluation
AREL greatly outperforms all other models in human evaluations:
● Turing test
● Relevance
● Expressiveness
● Concreteness

[Figure: comparison of Turing test results]

Page 16: Adversarial Reward Learning for Visual Storytelling

Critique: The “Good”
● AREL: a novel framework of adversarial reward learning for visual storytelling

● SOTA on the VIST dataset under automatic metrics

● Empirically shows that automatic metrics are not well suited for training or evaluation

● Comprehensive human evaluation via Mechanical Turk
  ○ Better results on relevance, expressiveness, and concreteness
  ○ Clear description of how the human evaluation was conducted

Page 17: Adversarial Reward Learning for Visual Storytelling

Critique: The “Not so Good”
● Motivation: an interesting problem to solve, but what are the practical applications?
  ○ Limited to five photos per story

● XE-ss: not mentioned until the evaluation section, even though it initializes AREL

● Partial rewards: more discussion and motivation needed for this approach

● Missing details
  ○ The type of pooling in the reward model is not specified (average? max?)
  ○ Is the pre-trained ResNet fine-tuned?

● Data bias (gender and event): the model amplifies the influence of the largest majority

● Small gains on automatic evaluation metrics, and XE-ss performs similarly to AREL; no direct human-evaluation comparison between AREL and previous methods

● Human evaluation improvements
  ○ Ask evaluators to give a reason why they judged a sentence to be machine-generated or not
  ○ Use rankings instead of pairwise comparisons

● Decoder shared weights: maybe there is something specific about an image's position that requires different weights (e.g., the structure of a narrative: setting, problem, rising action, climax, falling action, resolution)