Visual Storytelling (NAACL 2016, Poster)


Upload: ting-hao-huang

Post on 21-Jan-2018


TRANSCRIPT

Page 1: Visual Storytelling (NAACL 2016, Poster)

DII (Description-in-isolation):
A black frisbee is sitting on top of a roof.
A man playing soccer outside of a white house with a red door.
The boy is throwing a soccer ball by the red door.
A soccer ball is over a roof by a frisbee in a rain gutter.
Two balls and a frisbee are on top of a roof.

SIS (Stories-in-sequence):
A discus got stuck up on the roof.
Why not try getting it down with a soccer ball?
Up the soccer ball goes.
It didn't work so we tried a volley ball.
Now the discus, soccer ball, and volleyball are all stuck on the roof.

*Ting-Hao (Kenneth) Huang1, *Francis Ferraro2, Nasrin Mostafazadeh3, Ishan Misra1, Jacob Devlin6, Aishwarya Agrawal4, Ross Girshick5, Xiaodong He6, Pushmeet Kohli6, Dhruv Batra4, Larry Zitnick5, Devi Parikh5, Lucy Vanderwende6, Michel Galley6 and Margaret Mitchell6

1 Carnegie Mellon University, 2 Johns Hopkins University, 3 University of Rochester, 4 Virginia Tech, 5 Facebook AI Research, 6 Microsoft Research

Stories ≠ Consecutive Captions ≠ Descriptive Text

Motivation

Dataset                         Text/Image Pairs (K)  Vocab Size (K)  Words/Sent.  Web Ppl. (30B words)
Brown (comparison only)         52.1 (text only)      47.7            20.8         194.0
DII (Description-in-isolation)  151.8                 13.8            11.0         147.0
SIS (Stories-in-sequence)       252.9                 18.2            10.2         116.0
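To make the table columns concrete, here is a minimal sketch of how a vocabulary size and a words-per-sentence figure could be computed from raw text. The poster does not say how its text was tokenized; the lowercase whitespace tokenization below and the function name `corpus_stats` are assumptions for illustration only, so the numbers it produces are not comparable to the table.

```python
def corpus_stats(sentences):
    """Return (vocabulary size, mean words per sentence).

    Assumption: naive lowercase whitespace tokenization with
    surrounding punctuation stripped; the poster's tokenizer is unknown.
    """
    vocab = set()
    total_tokens = 0
    for sentence in sentences:
        tokens = [t.strip(".,!?").lower() for t in sentence.split()]
        tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
        vocab.update(tokens)
        total_tokens += len(tokens)
    return len(vocab), total_tokens / len(sentences)


# Two sentences taken from the poster's example story.
sentences = [
    "A discus got stuck up on the roof.",
    "Up the soccer ball goes.",
]
vocab_size, words_per_sentence = corpus_stats(sentences)
```

The table reports these quantities in thousands over the full tiers; the sketch only shows the shape of the computation on two sentences.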

Getting Humans to Tell Stories

Metric        Pearson’s r
BLEU          0.08
SkipThoughts  0.18
METEOR        0.22

Caption Output: This is a picture of a family. This is a picture of a cake. This is a picture of a dog. This is a picture of a beach. This is a picture of a beach.

Greedy Stories: The family gathered together for a meal. The food was delicious. The dog was excited to be there. The dog was enjoying the water. The dog was happy to be in the water.

-Dups: The family gathered together for a meal. The food was delicious. The dog was excited to be there. The kids were playing in the water. The boat was a little too much to drink.

+Grounded: The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.
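The difference between the greedy and -Dups outputs can be illustrated with a toy decoder. The poster does not spell out how the uniqueness constraint is implemented, so this is only a sketch of one plausible reading: a content word may not be produced twice within a story. The word list `FUNCTION_WORDS`, the function `greedy_decode_unique`, and the per-step scores are all hypothetical, and unlike a real sequence-to-sequence model the scores here do not depend on earlier choices.

```python
# Hypothetical list of words that may repeat freely; everything else
# is treated as a content word (an assumption, not from the poster).
FUNCTION_WORDS = {"the", "a", "was", "to", "be", "in", "of", "."}


def greedy_decode_unique(step_scores):
    """Pick the best-scoring word at each step, skipping any content
    word that already appeared earlier in the story.

    step_scores: list of dicts mapping candidate word -> score.
    If every candidate at a step is an already-used content word,
    that step emits nothing.
    """
    used = set()
    story = []
    for scores in step_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for word in ranked:
            if word in FUNCTION_WORDS or word not in used:
                story.append(word)
                used.add(word)
                break
    return story


# Toy scores: plain greedy decoding would emit "the dog played ." twice.
steps = [
    {"the": 0.9, "a": 0.1},
    {"dog": 0.8, "kids": 0.2},
    {"played": 0.7, "swam": 0.3},
    {".": 1.0},
    {"the": 0.9, "a": 0.1},
    {"dog": 0.6, "kids": 0.4},
    {"played": 0.7, "swam": 0.3},
    {".": 1.0},
]
story = greedy_decode_unique(steps)
```

With the constraint, the second sentence falls back to the next-best candidates ("kids", "swam") instead of repeating the first sentence.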

A solid next move in Artificial Intelligence is to go beyond basic description of visual scenes towards human-like understanding of grounded event structure and subjective expression. We introduce the first dataset for sequential vision-to-language and explore how modeling concrete description as well as figurative and social language enables visual storytelling. Our data is at sind.ai.

Get Better Stories with Uniqueness & Visually Grounded Constraints

DII

SIS

Automatic Evaluation and Results

See our paper for the description-in-sequence tiers (DIS) and more!

We define 80-5-5-10 train-dev-validation-test splits for all three tiers.
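One way to obtain fixed proportions like these is a deterministic hash-based assignment, sketched below. The poster gives only the percentages; splitting by album, the `album-N` ID format, and the function `assign_split` are assumptions made for illustration, not the authors' procedure.

```python
import hashlib

# Split names and percentages as stated on the poster (80-5-5-10).
SPLITS = [("train", 80), ("dev", 5), ("validation", 5), ("test", 10)]


def assign_split(album_id: str) -> str:
    """Map an ID to a split using a stable hash bucket in [0, 100)."""
    digest = hashlib.md5(album_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for name, percent in SPLITS:
        upper += percent
        if bucket < upper:
            return name
    raise AssertionError("split percentages must sum to 100")


# Count assignments over hypothetical album IDs.
counts = {name: 0 for name, _ in SPLITS}
for i in range(10000):
    counts[assign_split(f"album-{i}")] += 1
```

Because the assignment depends only on the ID, the same item always lands in the same split, and the counts approach the target proportions as the number of items grows.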

Data Analysis

     Beam = 10  Beam = 1  -Dups  +Grounded
DII  23.55      19.10     19.21  ----
SIS  23.13      27.76     30.11  31.42

All values are statistically significant (p < 1e-5).

Correlations of automatic scores against human judgments on 3K random SIS training stories.
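For reference, Pearson's r between a metric's scores and human ratings can be computed as in the minimal sketch below. The function name `pearson_r` and the sample numbers are illustrative assumptions; they are not data from the poster.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only (not from the poster).
metric_scores = [0.10, 0.25, 0.30, 0.55, 0.60]
human_ratings = [2.0, 3.0, 2.5, 4.0, 4.5]
r = pearson_r(metric_scores, human_ratings)
```

A value near 0, such as the 0.08 reported for BLEU, means the metric's scores carry almost no linear information about the human ratings.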

METEOR scores on the validation split, using a sequence-to-sequence NN with gated recurrent units.

[Figure: Visual Storytelling data collection. Labels: Flickr Album; Description for Images in Isolation & in Sequences; Storytelling (Story 1, Story 2, Story 3); Preferred Photo Sequence; Re-telling (Story 4, Story 5).]

Conclusion

Several strong baselines for the task of visual storytelling demonstrate that intelligent machines can now begin to generate inferential, conceptual, and evaluative language to share humanlike experience. METEOR serves as an automatic metric for evaluation, best correlated with human descriptions. Much more work to be done: Combining a fully grounded model with a model free to dream yields the best automatically generated stories to date.