Visual Storytelling (NAACL 2016, poster)

DII (descriptions of images in isolation):

A black frisbee is sitting on top of a roof.

A man playing soccer outside of a white house with a red door.

The boy is throwing a soccer ball by the red door.

A soccer ball is over a roof by a frisbee in a rain gutter.

Two balls and a frisbee are on top of a roof.

SIS (stories for images in sequence):

A discus got stuck up on the roof.

Why not try getting it down with a soccer ball?

Up the soccer ball goes.

It didn't work so we tried a volley ball.

Now the discus, soccer ball, and volleyball are all stuck on the roof.

*Ting-Hao (Kenneth) Huang1, *Francis Ferraro2, Nasrin Mostafazadeh3, Ishan Misra1, Jacob Devlin6, Aishwarya Agrawal4, Ross Girshick5, Xiaodong He6, Pushmeet Kohli6, Dhruv Batra4, Larry Zitnick5, Devi Parikh5, Lucy Vanderwende6, Michel Galley6 and Margaret Mitchell6

1 Carnegie Mellon University, 2 Johns Hopkins University, 3 University of Rochester, 4 Virginia Tech, 5 Facebook AI Research, 6 Microsoft Research

Stories ≠ Consecutive Captions ≠ Descriptive Text

Motivation

Dataset                          Text/Image Pairs (K)   Vocab Size (K)   Words/Sent.   Web Ppl. (30B words)
Brown (comparison only)          52.1 (text only)       47.7             20.8          194.0
DII (Description-in-isolation)   151.8                  13.8             11.0          147.0
SIS (Stories-in-sequence)        252.9                  18.2             10.2          116.0
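As a rough illustration of how the vocabulary-size and words-per-sentence columns could be computed, here is a minimal sketch. It assumes lowercased whitespace tokenization, which is almost certainly simpler than the tokenization behind the reported figures, and the helper name `corpus_stats` is hypothetical.

```python
def corpus_stats(sentences):
    """Return (vocabulary size, mean words per sentence).

    Assumption: tokens are lowercased and split on whitespace, so
    punctuation stays attached to words ("roof." is one token).
    """
    vocab = set()
    total_words = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        vocab.update(tokens)
        total_words += len(tokens)
    mean_len = total_words / len(sentences) if sentences else 0.0
    return len(vocab), mean_len


# Two SIS sentences taken from the example story on this poster.
sis_example = [
    "A discus got stuck up on the roof.",
    "Up the soccer ball goes.",
]
vocab_size, words_per_sentence = corpus_stats(sis_example)
```

On a real corpus the counts depend heavily on the tokenizer, so numbers from this sketch would not be directly comparable to the table.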

Getting Humans to Tell Stories

Metric         Pearson's r
BLEU           0.08
SkipThoughts   0.18
METEOR         0.22

Caption Output: This is a picture of a family. This is a picture of a cake. This is a picture of a dog. This is a picture of a beach. This is a picture of a beach.

Greedy Stories: The family gathered together for a meal. The food was delicious. The dog was excited to be there. The dog was enjoying the water. The dog was happy to be in the water.

-Dups: The family gathered together for a meal. The food was delicious. The dog was excited to be there. The kids were playing in the water. The boat was a little too much to drink.

+Grounded: The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.

A solid next move in Artificial Intelligence is to go beyond basic description of visual scenes towards human-like understanding of grounded event structure and subjective expression. We introduce the first dataset for sequential vision-to-language and explore how modeling concrete description as well as figurative and social language enables visual storytelling. Our data is at sind.ai.

Get Better Stories with Uniqueness & Visually Grounded Constraints
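The uniqueness ("-Dups") constraint can be pictured with a toy greedy decoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-step score dictionaries stand in for a real model's output distribution, the `FUNCTION_WORDS` list is hypothetical, and the rule used here (a content word may appear at most once per story) is one plausible reading of the constraint.

```python
# Hypothetical list of function words that are allowed to repeat.
FUNCTION_WORDS = {"the", "a", "was", "."}


def greedy_decode(step_scores, forbid_duplicates=False):
    """Pick the top-scoring candidate word at each step.

    step_scores: list of dicts mapping candidate word -> score, one per
    step (a stand-in for a real model's per-step distribution).
    With forbid_duplicates=True, a content word that was already emitted
    is skipped in favor of the next-best candidate.
    """
    output, used = [], set()
    for scores in step_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for word in ranked:
            is_content = word not in FUNCTION_WORDS
            if forbid_duplicates and is_content and word in used:
                continue  # blocked: this content word was already used
            output.append(word)
            if is_content:
                used.add(word)
            break
    return output


# Toy scores in which plain greedy decoding repeats itself.
steps = [
    {"the": 0.9, "a": 0.1},
    {"dog": 0.6, "cat": 0.4},
    {"was": 1.0},
    {"happy": 0.7, "excited": 0.3},
    {"the": 0.9, "a": 0.1},
    {"dog": 0.6, "cat": 0.4},
    {"was": 1.0},
    {"happy": 0.55, "playing": 0.45},
]
plain = greedy_decode(steps)
unique = greedy_decode(steps, forbid_duplicates=True)
print(" ".join(plain))   # the dog was happy the dog was happy
print(" ".join(unique))  # the dog was happy the cat was playing
```

A real decoder would condition each step on the words already generated; the fixed candidate lists here only serve to show how blocking repeated content words changes the output.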


Automatic Evaluation and Results

See our paper for the description-in-sequence tier (DIS) and more!

We define 80-5-5-10 train-dev-validation-test splits for all three tiers.
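One way to realize an 80-5-5-10 split reproducibly is to hash each item ID into a bucket. The sketch below is an assumption for illustration only: the poster does not say how the splits were drawn or at what granularity, and the `album-N` IDs and the `assign_split` helper are hypothetical.

```python
import hashlib

# Split names and percentages as stated on the poster.
SPLITS = [("train", 80), ("dev", 5), ("validation", 5), ("test", 10)]


def assign_split(item_id: str) -> str:
    """Deterministically map an ID to a split via a hash bucket in [0, 100)."""
    digest = hashlib.md5(item_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for name, percent in SPLITS:
        upper += percent
        if bucket < upper:
            return name
    raise AssertionError("split percentages must sum to 100")


# Hypothetical album IDs, used only to show the resulting proportions.
counts = {name: 0 for name, _ in SPLITS}
for i in range(10000):
    counts[assign_split(f"album-{i}")] += 1
print(counts)
```

Hashing keeps the assignment stable across runs and machines, so an item never moves between splits when the data is reprocessed.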

Data Analysis

       Beam = 10   Beam = 1   -Dups   +Grounded
DII    23.55       19.10      19.21   ----
SIS    23.13       27.76      30.11   31.42

Correlations of automatic scores against human judgments on 3K random SIS training stories. All values are statistically significant (p < 1e-5).
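For reference, Pearson's r between a metric's scores and human ratings can be computed as in the following minimal sketch. The toy numbers are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one automatic score and one human rating per story.
metric_scores = [0.21, 0.35, 0.30, 0.42, 0.28]
human_ratings = [2.0, 4.0, 3.0, 5.0, 3.0]
r = pearson_r(metric_scores, human_ratings)
```

Values near 0, such as the 0.08 reported for BLEU, indicate that the metric barely tracks human judgments; higher values indicate closer agreement.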

METEOR scores on the validation split, using a sequence-to-sequence neural network with gated recurrent units.

[Figure: crowdsourcing workflow. Flickr album; description for images in isolation and in sequences; storytelling; re-telling with the preferred photo sequence; Stories 1-5.]

Conclusion

Several strong baselines for the task of visual storytelling demonstrate that intelligent machines can now begin to generate inferential, conceptual, and evaluative language to share humanlike experience. METEOR serves as the automatic evaluation metric because, of the metrics tested, it correlates best with human judgments (Pearson's r = 0.22). Combining a fully grounded model with a model free to dream yields the best automatically generated stories to date, but much more work remains to be done.
