The Amazing Mysteries of the Gutter:
Drawing Inferences Between Panels in Comic Book Narratives
Mohit Iyyer∗1 Varun Manjunatha∗1 Anupam Guha1 Yogarshi Vyas1
Jordan Boyd-Graber2 Hal Daume III1 Larry Davis1
1University of Maryland, College Park 2University of Colorado, Boulder
{miyyer,varunm,aguha,yogarshi,hal,lsd}@umiacs.umd.edu [email protected]
Abstract
Visual narrative is often a combination of explicit infor-
mation and judicious omissions, relying on the viewer to
supply missing details. In comics, most movements in time
and space are hidden in the “gutters” between panels. To
follow the story, readers logically connect panels together
by inferring unseen actions through a process called “clo-
sure”. While computers can now describe what is explic-
itly depicted in natural images, in this paper we examine
whether they can understand the closure-driven narratives
conveyed by stylized artwork and dialogue in comic book
panels. We construct a dataset, COMICS, that consists of
over 1.2 million panels (120 GB) paired with automatic
textbox transcriptions. An in-depth analysis of COMICS
demonstrates that neither text nor image alone can tell a
comic book story, so a computer must understand both
modalities to keep up with the plot. We introduce three
cloze-style tasks that ask models to predict narrative and
character-centric aspects of a panel given n preceding pan-
els as context. Various deep neural architectures under-
perform human baselines on these tasks, suggesting that
COMICS contains fundamental challenges for both vision
and language.
1. Introduction
Comics are fragmented scenes forged into full-fledged
stories by the imagination of their readers. A comics creator
can condense anything from a centuries-long intergalactic
war to an ordinary family dinner into a single panel. But it
is what the creator hides from their pages that makes comics
truly interesting: the unspoken conversations and unseen
actions that lurk in the spaces (or gutters) between adja-
cent panels. For example, the dialogue in Figure 1 suggests
that between the second and third panels, Gilda commands
her snakes to chase after a frightened Michael in some
sort of strange cult initiation. Through a process called
closure [40], which involves (1) understanding individual
panels and (2) making connective inferences across panels,
readers form coherent storylines from seemingly disparate
panels such as these. In this paper, we study whether
computers can do the same by collecting a dataset of comic
books (COMICS) and designing several tasks that require
closure to solve.
∗ Authors contributed equally
Figure 1. Where did the snake in the last panel come from? Why
is it biting the man? Is the man in the second panel the same as
the man in the first panel? To answer these questions, readers form
a larger meaning out of the narration boxes, speech bubbles, and
artwork by applying closure across panels.
Section 2 describes how we create COMICS,1 which
contains ∼1.2 million panels drawn from almost 4,000
publicly-available comic books published during the
“Golden Age” of American comics (1938–1954). COMICS
is challenging in both style and content compared to natural
images (e.g., photographs), which are the focus of most ex-
isting datasets and methods [32, 56, 55]. Much like painters,
comic artists can render a single object or concept in mul-
tiple artistic styles to evoke different emotional responses
from the reader. For example, the lions in Figure 2 are
drawn with varying degrees of realism: the more cartoonish
lions, from humorous comics, take on human expressions
(e.g., surprise, nastiness), while those from adventure
comics are more photorealistic.
1Data, code, and annotations to be made available after blind review.
Figure 2. Different artistic renderings of lions taken from the
COMICS dataset. The left-facing lions are more cartoonish (and
humorous) than the ones facing right, which come from action and
adventure comics that rely on realism to provide thrills.
Comics are not just visual: creators push their stories for-
ward through text—speech balloons, thought clouds, and
narrative boxes—which we identify and transcribe using
optical character recognition (OCR). Together, text and im-
age are often intricately woven together to tell a story that
neither could tell on its own (Section 3). To understand a
story, readers must connect dialogue and narration to char-
acters and environments; furthermore, the text must be read
in the proper order, as panels often depict long scenes rather
than individual moments [10]. Text plays a much larger role
in COMICS than it does for existing datasets of visual sto-
ries [25].
To test machines’ ability to perform closure, we present
three novel cloze-style tasks in Section 4 that require a deep
understanding of narrative and character to solve. In Sec-
tion 5, we design four neural architectures to examine the
impact of multimodality and contextual understanding via
closure. All of these models perform significantly worse
than humans on our tasks; we conclude with an error anal-
ysis (Section 6) that suggests future avenues for improve-
ment.
2. Creating a dataset of comic books
Comics, defined by cartoonist Will Eisner as sequential
art [13], tell their stories in sequences of panels, or sin-
gle frames that can contain both images and text. Existing
comics datasets [19, 39] are too small to train data-hungry
machine learning models for narrative understanding;
additionally, they lack diversity in visual style and genres. Thus,
we build our own dataset, COMICS, by (1) downloading
comics in the public domain, (2) segmenting each page into
panels, (3) extracting textbox locations from panels, and (4)
running OCR on textboxes and post-processing the output.
Table 1 summarizes the contents of COMICS. The rest of
this section describes each step of our data creation pipeline.
# Books                        3,948
# Pages                      198,657
# Panels                   1,229,664
# Textboxes                2,498,657
------------------------------------
Text cloze instances          89,412
Visual cloze instances       587,797
Char. coherence instances     72,313
Table 1. Statistics describing dataset size (top) and the number of
total instances for each of our three tasks (bottom).
2.1. Where do our comics come from?
The “Golden Age of Comics” began during America’s
Great Depression and lasted through World War II, ending
in the mid-1950s with the passage of strict censorship reg-
ulations. In contrast to the long, world-building story arcs
popular in later eras, Golden Age comics tend to be small
and self-contained; a single book usually contains multi-
ple different stories sharing a common theme (e.g., crime
or mystery). While the best-selling Golden Age comics
tell of American superheroes triumphing over German and
Japanese villains, a variety of other genres (such as ro-
mance, humor, and horror) also enjoyed popularity [18].
The Digital Comics Museum (DCM)2 hosts user-uploaded
scans of many comics by lesser-known Golden Age pub-
lishers that are now in the public domain due to copyright
expiration. To avoid off-square images and missing pages,
as the scans vary in resolution and quality, we download the
4,000 highest-rated comic books from DCM.3
2.2. Breaking comics into their basic elements
The DCM comics are distributed as compressed archives
of JPEG page scans. To analyze closure, which occurs from
panel-to-panel, we first extract panels from the page images.
Next, we extract textboxes from the panels, as both location
and content of textboxes are important for character and nar-
rative understanding.
Panel segmentation: Previous work on panel segmenta-
tion uses heuristics [34] or algorithms such as density gra-
dients and recursive cuts [52, 43, 48] that rely on pages
with uniformly white backgrounds and clean gutters.
Unfortunately, scanned images of eighty-year-old comics do
not particularly adhere to these standards; furthermore,
many DCM comics have non-standard panel layouts and/or
textboxes that extend across gutters to multiple panels.
2http://digitalcomicmuseum.com/
3Some of the panels in COMICS contain offensive caricatures and
opinions reflective of that period in American history.
After our attempts to use existing panel segmentation
software failed, we turned to deep learning. We annotate
500 randomly-selected pages from our dataset with rect-
angular bounding boxes for panels. Each bounding box
encloses both the panel artwork and the textboxes within
the panel; in cases where a textbox spans multiple pan-
els, we necessarily also include portions of the neighbor-
ing panel. After annotation, we train a region-based con-
volutional neural network to automatically detect panels.
In particular, we use Faster R-CNN [45] initialized with a
pretrained VGG_CNN_M_1024 model [9] and alternately
optimize the region proposal network and the detection net-
work. In Western comics, panels are usually read left-to-
right, top-to-bottom, so we also have to properly order all
of the panels within a page after extraction. We compute
the midpoint of each panel and sort them using Morton or-
der [41], which gives incorrect orderings only for rare and
complicated panel layouts.
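The ordering step can be illustrated with a minimal sketch that sorts panels by the Morton (Z-order) code of their midpoints. This is an illustration under stated assumptions, not the pipeline's actual code: the `(x0, y0, x1, y1)` box format, the function names, and the quantization of midpoints onto a 16 x 16 grid (`bits=4`) are all hypothetical choices made for the example.

```python
def morton_code(x, y, bits=4):
    """Interleave the bits of x and y, placing y in the more significant slot.

    With y more significant, cells are visited left-to-right within a row
    pair before moving down, which approximates Western reading order.
    """
    code = 0
    for k in range(bits):
        code |= ((x >> k) & 1) << (2 * k)
        code |= ((y >> k) & 1) << (2 * k + 1)
    return code


def order_panels(boxes, page_w, page_h, bits=4):
    """Sort panel boxes (x0, y0, x1, y1) by the Morton code of their midpoints.

    Midpoints are quantized to a (2**bits x 2**bits) grid first; the grid
    size is an assumption of this sketch.
    """
    cells = 1 << bits

    def key(box):
        x0, y0, x1, y1 = box
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        qx = min(int(mx / page_w * cells), cells - 1)
        qy = min(int(my / page_h * cells), cells - 1)
        return morton_code(qx, qy, bits)

    return sorted(boxes, key=key)
```

Because a Z-order curve visits cells quadrant by quadrant, the result depends on where midpoints fall relative to cell boundaries; this is one way the incorrect orderings mentioned above can arise.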
Textbox segmentation: Since we are particularly inter-
ested in modeling the interplay between text and artwork,
we need to also convert the text in each panel to a machine-
readable format.4 As with panel segmentation, existing
comic textbox detection algorithms [22, 47] could not ac-
curately localize textboxes for our data. Thus, we re-
sort again to Faster R-CNN: we annotate 1,500 panels for
textboxes,5 train a Faster R-CNN, and sort the extracted
textboxes within each panel using Morton order.
2.3. OCR
The final step of our data creation pipeline is applying
OCR to the extracted textbox images. We unsuccessfully
experimented with two trainable open-source OCR systems,
Tesseract [50] and Ocular [6], as well as Abbyy’s consumer-
grade FineReader.6 The ineffectiveness of these systems is
likely due to the considerable variation in comic fonts as
well as domain mismatches with pretrained language mod-
els (comics text is always capitalized, and dialogue phe-
nomena such as dialects may not be adequately represented
in training data). Google’s Cloud Vision OCR7 performs
much better on comics than any other system we tried.
While it sometimes struggles to detect short words or
punctuation marks, the quality of the transcriptions is good
considering the image domain and quality. We use the Cloud
Vision API to run OCR on all 2.5 million textboxes for a cost
of $3,000. We post-process the transcriptions by removing
systematic spelling errors (e.g., failing to recognize the first
letter of a word). Finally, each book in our dataset contains
three or four full-page product advertisements; since they
are irrelevant for our purposes, we train a classifier on the
transcriptions to remove them.8
4Alternatively, modules for text spotting and recognition [27] could be
built into architectures for our downstream tasks, but since comic dialogues
can be quite lengthy, these modules would likely perform poorly.
5We make a distinction between narration and dialogue; the former
usually occurs in strictly rectangular boxes at the top of each panel and
contains text describing or introducing a new scene, while the latter is
usually found in speech balloons or thought clouds.
6http://www.abbyy.com
7http://cloud.google.com/vision
3. Data Analysis
In this section, we explore what makes understanding
narratives in COMICS difficult, focusing specifically on in-
trapanel behavior (how images and text interact within a
panel) and interpanel transitions (how the narrative ad-
vances from one panel to the next). We characterize panels
and transitions using a modified version of the annotation
scheme in Scott McCloud’s “Understanding Comics” [40].
Over 90% of panels rely on both text and image to con-
vey information, as opposed to just using a single modal-
ity. Closure is also important: to understand most tran-
sitions between panels, readers must make complex infer-
ences that often require common sense (e.g., connecting
jumps in space and/or time, recognizing when new char-
acters have been introduced to an existing scene). We con-
clude that any model trained to understand narrative flow
in COMICS will have to effectively tie together multimodal
inputs through closure.
To perform our analysis, we manually annotate
250 randomly-selected pairs of consecutive panels from
COMICS. Each panel of a pair is annotated for intrapanel
behavior, while an interpanel annotation is assigned to the
transition between the panels. Two annotators indepen-
dently categorize each pair, and a third annotator makes
the final decision when they disagree. We use four intra-
panel categories (definitions from McCloud, percentages
from our annotations):
1. Word-specific, 4.4%: The pictures illustrate, but do
not significantly add to a largely complete text.
2. Picture-specific, 2.8%: The words do little more than
add a soundtrack to a visually-told sequence.
3. Parallel, 0.6%: Words and pictures seem to follow
very different courses without intersecting.
4. Interdependent, 92.1%: Words and pictures go hand-
in-hand to convey an idea that neither could convey
alone.
We group interpanel transitions into five categories:
1. Moment-to-moment, 0.4%: Almost no time passes
between panels, much like adjacent frames in a video.
2. Action-to-action, 34.6%: The same subjects progress
through an action within the same scene.
3. Subject-to-subject, 32.7%: New subjects are
introduced while staying within the same scene or idea.
4. Scene-to-scene, 13.8%: Significant changes in time
or space between the two panels.
5. Continued conversation, 17.7%: Subjects continue a
conversation across panels without any other changes.
8See supplementary material for specifics about our post-processing.
Figure 3. Five example panel sequences from COMICS, one for each type of interpanel transition. Individual panel borders are color-coded
to match their intrapanel categories (legend in bottom-left). Moment-to-moment transitions unfold like frames in a movie, while
scene-to-scene transitions are loosely strung together by narrative boxes. Percentages are the relative prevalence of the transition or panel type in an
annotated subset of COMICS.
The two annotators agree on 96% of the intrapanel an-
notations (Cohen’s κ = 0.657), which is unsurprising be-
cause almost every panel is interdependent. The interpanel
task is significantly harder: agreement is only 68% (Co-
hen’s κ = 0.605). Panel transitions are more diverse, as
all types except moment-to-moment are relatively common
(Figure 3); interestingly, moment-to-moment transitions re-
quire the least amount of closure as there is almost no
change in time or space between the panels. Multiple tran-
sition types may occur in the same panel, such as simultane-
ous changes in subjects and actions, which also contributes
to the lower interpanel agreement.
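To make the agreement statistic concrete, the following is a small sketch of how Cohen's κ is computed from two annotators' labels. The function name and the toy label lists used with it are illustrative assumptions and are not drawn from our annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label distribution.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

When one category dominates, as interdependent does for intrapanel behavior, chance agreement p_e is already high, so a high raw agreement can coexist with a moderate κ.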
4. Tasks that test closure
To explore closure in COMICS, we design three novel
tasks (text cloze, visual cloze, and character coherence) that
test a model’s ability to understand narratives and characters
given a few panels of context. As shown in the previous
section’s analysis, a high percentage of panel transitions re-
quire non-trivial inferences from the reader; to successfully
solve our proposed tasks, a model must be able to make the
same kinds of connections.
While their objectives are different, all three tasks
follow the same format: given preceding panels
$p_{i-1}, p_{i-2}, \ldots, p_{i-n}$ as context, a model is asked to
predict some aspect of panel $p_i$. While previous work
on visual storytelling focuses on generating text given
some context [24], the dialogue-heavy text in COMICS
makes evaluation difficult (e.g., dialects, grammatical
variations, many rare words). We want our evaluations to
focus specifically on closure, not generated text quality,
so we instead use a cloze-style framework [53]: given $c$
candidates, exactly one of which is correct, models must use
the context panels to rank the correct candidate higher than
the others. The rest of this section describes each of the
three tasks in detail; Table 1 provides the total instances of
each task with the number of context panels $n = 3$.
Text Cloze: In the text cloze task, we ask the model to
predict what text out of a set of candidates belongs in a par-
ticular textbox, given both context panels (text and image)
as well as the current panel image. While initially we did
not put any constraints on the task design, we quickly no-
ticed two major issues. First, since the panel images include
textboxes, any model trained on this task could in princi-
ple learn to crudely imitate OCR by matching text candi-
dates to the actual image of the text. To solve this problem,
we “black out” the rectangle given by the bounding boxes
for each textbox in a panel (see Figure 4).9 Second, panels
often have multiple textboxes (e.g., conversations between
characters); to focus on interpanel transitions rather
than intrapanel complexity, we restrict $p_i$ to panels that
contain only a single textbox. Thus, nothing from the current
panel matters other than the artwork; the majority of the
predictive information comes from previous panels.
9To reduce the chance of models trivially correlating candidate length
to textbox size, we remove very short and very long candidates.
Figure 4. In the character coherence task (top), a model must order the dialogues in the final panel, while visual cloze (bottom) requires
choosing the image of the panel that follows the given context. For visualization purposes, we show the original context panels; during
model training and evaluation, textboxes are blacked out in every panel. The two dialogues to be ordered in the top example are
“THANKS OLD TIMER! THE BATS WOULD HAVE GOT US, SURE! WHERE’D THEY COME FROM?” and
“SCOTTY’S MY NAME. I’M THE SHERIFF. MEAN TO TELL YOU’VE NEVER HEARD OF THE BATS?”
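The black-out step described for text cloze can be sketched as follows. This is only an illustration: the function name, the `(x0, y0, x1, y1)` pixel box convention with exclusive right and bottom edges, and the use of a NumPy height x width x channel array are assumptions of the example.

```python
import numpy as np


def black_out_textboxes(panel, boxes):
    """Return a copy of `panel` with every textbox rectangle set to black.

    panel: uint8 array of shape (height, width, channels).
    boxes: iterable of (x0, y0, x1, y1) pixel coordinates (assumed format).
    """
    masked = panel.copy()
    for x0, y0, x1, y1 in boxes:
        # Rows index the vertical axis, columns the horizontal axis.
        masked[y0:y1, x0:x1] = 0
    return masked
```

Masking a copy rather than the original keeps the unmasked panel available for visualization, as in Figure 4.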
Visual Cloze: We know from Section 3 that in most cases,
text and image work interdependently to tell a story. In the
visual cloze task, we follow the same set-up as in text cloze,
but our candidates are images instead of text. A key differ-
ence is that models are not given text from the final panel;
in text cloze, models are allowed to look at the final panel’s
artwork. This design is motivated by eyetracking studies in
single-panel cartoons, which show that readers look at art-
work before reading the text [7], although atypical font style
and text length can invert this order [16].
Character Coherence: While the previous two tasks fo-
cus mainly on narrative structure, our third task attempts to
isolate character understanding through a re-ordering task.
Given a jumbled set of text from the textboxes in panel $p_i$, a
model must learn to match each candidate to its correspond-
ing textbox. We restrict this task to panels that contain ex-
actly two dialogue boxes (narration boxes are excluded to
focus the task on characters). While it is often easy to order
the text based on the language alone (e.g., “how’s it going”
always comes before “fine, how about you?”), many cases
require inferring which character is likely to utter a partic-
ular bit of dialogue based on both their previous utterances
and their appearance (e.g., Figure 4, top).
4.1. Task Difficulty
For text cloze and visual cloze, we have two difficulty
settings that vary in how cloze candidates are chosen. In the
easy setting, we sample textboxes (or panel images) from
the entire COMICS dataset at random. Most incorrect can-
didates in the easy setting have no relation to the provided
context, as they come from completely different books and
genres. This setting is thus easier for models to “cheat” on
by relying on stylistic indicators instead of contextual in-
formation. With that said, the task is still non-trivial; for
example, many bits of short dialogue can be applicable in a
variety of scenarios. In the hard case, the candidates come
from nearby pages, so models must rely on the context to
perform well. For text cloze, all candidates are likely to
mention the same character names and entities, while color
schemes and textures become much less distinguishing for
visual cloze.
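The difference between the two settings can be sketched as a choice of distractor pool. The record layout (`book`, `page`, `text`), the function names, and the `window` parameter that defines "nearby pages" are hypothetical; the text above does not specify how many pages count as nearby.

```python
import random


def distractor_pool(records, target, setting, window=2):
    """Select the records that incorrect candidates may be drawn from.

    easy: anything else in the dataset.
    hard: other records from the same book within `window` pages
          (the window size is an assumption of this sketch).
    """
    if setting == "easy":
        return [r for r in records if r is not target]
    if setting == "hard":
        return [
            r for r in records
            if r is not target
            and r["book"] == target["book"]
            and abs(r["page"] - target["page"]) <= window
        ]
    raise ValueError("setting must be 'easy' or 'hard'")


def build_instance(records, target, setting, rng, num_candidates=3, window=2):
    """Return (shuffled candidates, index of the correct candidate)."""
    pool = distractor_pool(records, target, setting, window)
    distractors = rng.sample(pool, num_candidates - 1)
    candidates = distractors + [target]
    rng.shuffle(candidates)
    label = next(i for i, r in enumerate(candidates) if r is target)
    return candidates, label
```

For example, `build_instance(records, target, "hard", random.Random(0))` would return three candidates drawn from the pages around the target.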
5. Models & Experiments
To measure the difficulty of these tasks for deep learn-
ing models, we adapt strong baselines for multimodal lan-
guage and vision understanding tasks to the comics do-
main. We evaluate four different neural models, variants
of which were also used to benchmark the Visual Ques-
tion Answering dataset [2] and encode context for visual
storytelling [25]: text-only, image-only, and two image-text
models. Our best-performing model encodes panels with a
hierarchical LSTM architecture (see Figure 5).
Figure 5. The image-text architecture applied to an instance of the text cloze task. Pretrained image features are combined with learned
text features in a hierarchical LSTM architecture to form a context representation, which is then used to score text candidates.
On text cloze, accuracy increases when models are given
images (in the form of pretrained VGG-16 features) in addi-
tion to text; on the other tasks, incorporating both modali-
ties is less important. Additionally, for the text cloze and vi-
sual cloze tasks, models perform far worse on the hard set-
ting than the easy setting, confirming our intuition that these
tasks are non-trivial when we control for stylistic dissimilar-
ities between candidates. Finally, none of the architectures
outperform human baselines, which demonstrates the diffi-
culty of understanding COMICS: image features obtained
from models trained on natural images cannot capture the
vast variation in artistic styles, and textual models strug-
gle with the richness and ambiguity of colloquial dialogue
highly dependent on visual contexts. In the rest of this sec-
tion, we first introduce a shared notation and then use it to
specify all of our models.
5.1. Model definitions
In all of our tasks, we are asked to make a prediction
about a particular panel given the preceding $n$ panels as
context.10 Each panel consists of three distinct elements:
image, text (OCR output), and textbox bounding box
coordinates. For any panel $p_i$, the corresponding image is
$z_i$. Since there can be multiple textboxes per panel, we
refer to individual textbox contents and bounding boxes
as $t_{i_x}$ and $b_{i_x}$, respectively. Each of our tasks has a
different set of answer candidates $A$: text cloze has three
text candidates $t_{a_{1 \ldots 3}}$, visual cloze has three image
candidates $z_{a_{1 \ldots 3}}$, and character coherence has two
combinations of text / bounding box pairs,
$\{t_{a_1}/b_{a_1}, t_{a_2}/b_{a_2}\}$ and $\{t_{a_1}/b_{a_2}, t_{a_2}/b_{a_1}\}$.
Our architectures differ mainly in the
encoding function $g$ that converts a sequence of context
panels $p_{i-1}, p_{i-2}, \ldots, p_{i-n}$ into a fixed-length vector $c$. We
score the answer candidates by taking their inner product
with $c$ and normalizing with the softmax function,
$$s = \mathrm{softmax}(A^{T} c), \tag{1}$$
and we minimize the cross-entropy loss against the
ground-truth labels.11
10Test and validation instances for all tasks come from comic books that
are unseen during training.
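The scoring rule in Equation 1 and its cross-entropy loss can be written out in a few lines of NumPy. This is a sketch rather than the training code: the function names are hypothetical, and candidates are assumed to be stored as the columns of `A` so that `A.T @ c` yields one score per candidate.

```python
import numpy as np


def score_candidates(A, c):
    """Softmax over inner products between candidates and the context.

    A: array of shape (d, k), one candidate representation per column.
    c: context vector of shape (d,).
    """
    logits = A.T @ c
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


def cross_entropy(scores, label):
    """Negative log-probability assigned to the correct candidate."""
    return -np.log(scores[label])
```

At test time, the predicted answer is simply the index of the largest entry of the score vector.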
Text-only: The text-only baseline only has access to the
text tix within each panel. Our g function encodes this text
on multiple levels: we first compute a representation for
each tix with a word embedding sum12 and then combine
multiple textboxes within the same panel using an intra-
panel LSTM [23]. Finally, we feed the panel-level represen-
tations to an interpanel LSTM and take its final hidden state
as the context representation (Figure 5). For text cloze, the
answer candidates are also encoded with a word embedding
sum; for visual cloze, we project the 4096-d fc7 layer of
VGG-16 down to the word embedding dimensionality with
a fully-connected layer.13
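The hierarchy used by the text-only encoder (word embedding sum, then intrapanel LSTM, then interpanel LSTM) can be sketched in plain NumPy. This is an untrained forward pass for illustration only: the parameter layout, the gate ordering, the helper names, and the assumption that context panels are fed in reading order are all choices of the example, and a real implementation would use a deep learning framework with learned 256-d parameters.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random LSTM parameters; the four gates are stacked as [i, f, o, g]."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim)),
        "U": rng.normal(0.0, 0.1, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }


def lstm_final_state(inputs, params):
    """Run an LSTM over a sequence of vectors and return the last hidden state."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x in inputs:
        gates = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def encode_context(panels, embeddings, intra, inter):
    """Encode context panels into a single vector.

    panels: list of panels; each panel is a list of textboxes; each
            textbox is a non-empty list of integer word ids.
    """
    panel_reps = []
    for textboxes in panels:
        # Textbox representation: sum of its word embeddings.
        box_reps = [embeddings[word_ids].sum(axis=0) for word_ids in textboxes]
        # Intrapanel LSTM combines the textboxes of one panel.
        panel_reps.append(lstm_final_state(box_reps, intra))
    # Interpanel LSTM combines the panel-level representations.
    return lstm_final_state(panel_reps, inter)
```

The returned vector plays the role of the context representation that is compared against each answer candidate.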
11Performance falters slightly on a development set with contrastive
max-margin loss functions [51] in place of our softmax alternative.
12As in previous work for visual question answering [57], we observe no
noticeable improvement with more sophisticated encoding architectures.
13For training and testing, we use three panels of context and three
candidates. We use a vocabulary size of 30,000 words, restrict the maximum
number of textboxes per panel to three, and set the dimensionality of word
embeddings and LSTM hidden states to 256. Models are optimized using
Adam [29] for ten epochs, after which we select the best-performing model
on the dev set.
Model            Text Cloze      Visual Cloze    Char. Coherence
                 easy   hard     easy   hard
Random           33.3   33.3     33.3   33.3     50.0
Text-only        63.4   52.9     55.9   48.4     68.2
Image-only       51.7   49.4     85.7   63.2     70.9
NC-image-text    63.1   59.6     –      –        65.2
Image-text       68.6   61.0     81.3   59.1     69.3
Human            –      84       –      88       87
Table 2. Combining image and text in neural architectures im-
proves their ability to predict the next image or dialogue in
COMICS narratives. The contextual information present in pre-
ceding panels is useful for all tasks: the model that only looks at
a single panel (NC-image-text) always underperforms its context-
aware counterpart. However, even the best performing models lag
well behind humans.
Image-only: The image-only baseline is even simpler:
we feed the fc7 features of each context panel to an LSTM
and use the same objective function as before to score
candidates. For visual cloze, we project both the context
and answer representations to 512-d with additional fully-
connected layers before scoring. While the COMICS dataset
is certainly large, we do not attempt learning visual fea-
tures from scratch as our task-specific signals are far more
complicated than simple image classification. We also try
fine-tuning the lower-level layers of VGG-16 [4]; however,
this substantially lowers task accuracy even with very small
learning rates for the fine-tuned layers.
Image-text: We combine the previous two models by
concatenating the output of the intrapanel LSTM with the
fc7 representation of the image and passing the result
through a fully-connected layer before feeding it to the in-
terpanel LSTM (Figure 5). For text cloze and character co-
herence, we also experiment with a variant of the image-
text baseline that has no access to the context panels, which
we dub NC-image-text. In this model, the scoring function
computes inner products between the image features of $p_i$
and the text candidates.14
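The per-panel fusion step of the image-text model can be sketched as a concatenation followed by a fully-connected layer. The function name and parameter shapes are hypothetical, and the ReLU nonlinearity is an assumption read off the labels in Figure 5 rather than something stated in the text.

```python
import numpy as np


def fuse_panel(text_rep, fc7, W, b):
    """Combine a panel's text and image features into one vector.

    text_rep: output of the intrapanel LSTM, shape (d_text,).
    fc7:      pretrained image features, shape (d_img,).
    W, b:     fully-connected layer of shape (d_out, d_text + d_img), (d_out,).
    """
    joint = np.concatenate([text_rep, fc7])
    return np.maximum(0.0, W @ joint + b)  # ReLU (assumed from Figure 5)
```

The fused vectors, one per context panel, are what the interpanel LSTM would consume in place of the text-only panel representations.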
6. Error Analysis
Table 2 contains our full experimental results, which we
briefly summarize here. On text cloze, the image-text model
dominates those trained on a single modality. However, text
is much less helpful for visual cloze than it is for text cloze,
suggesting that visual similarity dominates the former task.
Having the context of the preceding panels helps across the
board, although the improvements are lower in the hard
setting. There is more variation across the models in the easy
setting; we hypothesize that the hard case requires moving
away from pretrained image features, and transfer learning
methods may prove effective here. Differences between
models on character coherence are minor; we suspect that
more complicated attentional architectures that leverage the
bounding box locations $b_{i_x}$ are necessary to “follow” speech
bubble tails to the characters who speak them.
14We cannot apply this model to visual cloze because we are not allowed
access to the artwork in panel $p_i$.
We also compare all models to a human baseline, for
which the authors manually solve one hundred instances of
each task (in the hard setting) given the same preprocessed
input that is fed to the neural architectures. Most human
errors are the result of poor OCR quality (e.g., misspelled
words) or low image resolution. Humans comfortably out-
perform all models, making it worthwhile to look at where
computers fail but humans succeed.
The top row in Figure 6 demonstrates an instance (from
easy text cloze) where the image helps the model make the
correct prediction. The text-only model has no idea that
an airplane (referred to here as a “ship”) is present in the
panel sequence, as the dialogue in the context panels makes
no mention of it. In contrast, the image-text model is able
to use the artwork to rule out the two incorrect candidates.
The bottom two rows in Figure 6 show hard text cloze
instances in which the image-text model is deceived by the
artwork in the final panel. While the final panel of the mid-
dle row does contain what looks to be a creek, “catfish creek
jail” is more suited for a narrative box than a speech bubble,
while the meaning of the correct candidate is obscured by
the dialect and out-of-vocabulary token. Similarly, a camera
films a fight scene in the last row; the model selects a candi-
date that describes a fight instead of focusing on the context
in which the scene occurs. These examples suggest that
the contextual information is overridden by strong associa-
tions between text and image, motivating architectures that
go beyond similarity by leveraging external world knowl-
edge to determine whether an utterance is truly appropriate
in a given situation.
7. Related Work
Our work is related to three main areas: (1) multimodal
tasks that require language and vision understanding, (2)
computational methods that focus on non-natural images,
and (3) models that characterize language-based narratives.
Deep learning has renewed interest in jointly reasoning
about vision and language. Datasets such as MS COCO [35]
and Visual Genome [31] have enabled image caption-
ing [54, 28, 56] and visual question answering [37, 36].
Similar to our character coherence task, researchers have
built models that match TV show characters with their vi-
sual attributes [15] and speech patterns [21].
Closest to our own comic book setting is the visual
storytelling task, in which systems must generate [24] or
re-order [1] stories given a dataset (SIND) of photos from
Flickr galleries of “storyable” events such as weddings and
birthday parties. SIND’s images are fundamentally different
from COMICS in that they lack coherent characters and
accompanying dialogue. Comics are created by skilled
professionals, not crowdsourced workers, and they offer a far
greater variety of character-centric stories that depend on
dialogue to further the narrative; with that said, the text in
COMICS is less suited for generation because of OCR errors.
Figure 6. Three text cloze examples from the development set,
shown with a single panel of context (boxed candidates are
predictions by the image-text model). The airplane artwork in the top
row helps the image-text model choose the correct answer, while
the text-only model fails because the dialogue lacks contextual
information. Conversely, the bottom two rows show the image-text
model ignoring the context in favor of choosing a candidate that
mentions something visually present in the last panel.
We build here on previous work that attempts to under-
stand non-natural images. Zitnick et al. [58] discover se-
mantic scene properties from a clip art dataset featuring
characters and objects in a limited variety of settings. Ap-
plications of deep learning to paintings include tasks such
as detecting objects in oil paintings [11, 12] and answering
questions about artwork [20]. Previous computational work
on comics focuses primarily on extracting elements such as
panels and textboxes [46]; in addition to the references in
Section 2, there is a large body of segmentation research on
manga [3, 44, 38, 30].
To the best of our knowledge, we are the first to com-
putationally model content in comic books as opposed to
just extracting their elements. We follow previous work
in language-based narrative understanding; very similar to
our text cloze task is the “Story Cloze Test” [42], in which
models must predict the ending to a short (four sentences
long) story. Just like our tasks, the Story Cloze Test proves
difficult for computers and motivates future research into
commonsense knowledge acquisition. Others have studied
characters [14, 5, 26] and narrative structure [49, 33, 8] in
novels.
8. Conclusion & Future Work
We present the COMICS dataset, which contains over 1.2
million panels from “Golden Age” comic books. We de-
sign three cloze-style tasks on COMICS to explore closure,
or how readers connect disparate panels into coherent sto-
ries. Experiments with different neural architectures, along
with a manual data analysis, confirm the importance of mul-
timodal models that combine text and image for comics un-
derstanding. We additionally show that context is crucial for
predicting narrative or character-centric aspects of panels.
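To make the cloze setup concrete, the following is a minimal sketch of how a model can be scored on such a task: given the text of the preceding panels and a set of candidates, it picks the highest-scoring candidate, and accuracy is measured against the correct one. The ClozeExample structure, the toy word-overlap scorer, and the example dialogue are all assumptions made for illustration; they are not the architectures or data evaluated in this paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ClozeExample:
    """One cloze instance (hypothetical format, not the COMICS file layout)."""
    context: List[str]     # text of the n preceding panels
    candidates: List[str]  # candidate texts for the final panel
    answer: int            # index of the correct candidate


def word_overlap_score(context: Sequence[str], candidate: str) -> float:
    """Toy scorer (an assumption): fraction of candidate tokens
    that also occur somewhere in the context panels."""
    context_tokens = {tok for panel in context for tok in panel.lower().split()}
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    return sum(tok in context_tokens for tok in cand_tokens) / len(cand_tokens)


def cloze_accuracy(
    examples: Sequence[ClozeExample],
    score: Callable[[Sequence[str], str], float],
) -> float:
    """Fraction of examples where the top-scoring candidate is the answer."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for ex in examples:
        scores = [score(ex.context, cand) for cand in ex.candidates]
        # Index of the best candidate; ties go to the earliest one.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == ex.answer
    return correct / len(examples)


# Made-up dialogue, for illustration only.
examples = [
    ClozeExample(
        context=["the ship is sinking", "grab the life boat"],
        candidates=["row the boat to shore", "i love apple pie", "where is my hat"],
        answer=0,
    ),
    ClozeExample(
        context=["the sheriff rides into town"],
        candidates=["the town is quiet", "he draws his gun", "stars shine"],
        answer=1,
    ),
]

print(cloze_accuracy(examples, word_overlap_score))  # prints 0.5
```

The second example is answered incorrectly because the overlap scorer favors a candidate that repeats words from the context, a small reminder that surface cues alone do not capture what happens between panels.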
However, for computers to reach human performance,
they will need to become better at leveraging context. Read-
ers rely on commonsense knowledge to make sense of dra-
matic scene and camera changes; how can we inject such
knowledge into our models? Another potentially intrigu-
ing direction, especially given recent advances in generative
adversarial networks [17], is generating artwork given dia-
logue (or vice versa). Finally, COMICS presents a golden
opportunity for transfer learning; can we train models that
generalize across natural and non-natural images much like
humans do?
9. Acknowledgments
We thank the anonymous reviewers for their insight-
ful comments and the UMIACS and Google Cloud Sup-
port staff for their help with OCR. Manjunatha and
Davis were supported by Office of Naval Research grant
N000141612713, while Iyyer, Boyd-Graber, and Daume
were supported by NSF grant IIS-1320538. Any opinions,
findings, or conclusions expressed here are those of the au-
thors and do not necessarily reflect the view of the sponsor.
References
[1] H. Agrawal, A. Chandrasekaran, D. Batra, D. Parikh, and
M. Bansal. Sort story: Sorting jumbled images and captions
into stories. In Proceedings of Empirical Methods in Natural
Language Processing, 2016. 7
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra,
C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question
answering. In International Conference on Computer Vision,
2015. 5
[3] Y. Aramaki, Y. Matsui, T. Yamasaki, and K. Aizawa. Inter-
active segmentation for manga. In Special Interest Group on
Computer Graphics and Interactive Techniques Conference,
2014. 8
[4] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and
A. Torralba. Cross-modal scene networks. arXiv, 2016. 7
[5] D. Bamman, T. Underwood, and N. A. Smith. A Bayesian
mixed effects model of literary character. In Proceedings of
the Association for Computational Linguistics, 2014. 8
[6] T. Berg-Kirkpatrick, G. Durrett, and D. Klein. Unsupervised
transcription of historical documents. In Proceedings of the
Association for Computational Linguistics, 2013. 3
[7] P. J. Carroll, J. R. Young, and M. S. Guertin. Visual analysis
of cartoons: A view from the far side. In Eye movements and
visual cognition. Springer, 1992. 5
[8] N. Chambers and D. Jurafsky. Unsupervised learning of nar-
rative schemas and their participants. In Proceedings of the
Association for Computational Linguistics, 2009. 8
[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman.
Return of the devil in the details: Delving deep into convo-
lutional nets. In British Machine Vision Conference, 2014.
3
[10] N. Cohn. The limits of time and transitions: challenges
to theories of sequential image comprehension. Studies in
Comics, 1(1), 2010. 2
[11] E. Crowley and A. Zisserman. The state of the art: Object
retrieval in paintings using discriminative regions. In British
Machine Vision Conference, 2014. 8
[12] E. J. Crowley, O. M. Parkhi, and A. Zisserman. Face paint-
ing: querying art with photos. In British Machine Vision
Conference, 2015. 8
[13] W. Eisner. Comics & Sequential Art. Poorhouse Press, 1990.
2
[14] D. K. Elson, N. Dames, and K. R. McKeown. Extracting
social networks from literary fiction. In Proceedings of the
Association for Computational Linguistics, 2010. 8
[15] M. Everingham, J. Sivic, and A. Zisserman. “Hello! my name
is... Buffy” – automatic naming of characters in TV video. In
Proceedings of the British Machine Vision Conference, 2006.
7
[16] T. Foulsham, D. Wybrow, and N. Cohn. Reading with-
out words: Eye movements in the comprehension of comic
strips. Applied Cognitive Psychology, 30, 2016. 5
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen-
erative adversarial nets. In Proceedings of Advances in Neu-
ral Information Processing Systems, 2014. 8
[18] R. Goulart. Comic Book Encyclopedia: The Ultimate Guide
to Characters, Graphic Novels, Writers, and Artists in the
Comic Book Universe. HarperCollins, 2004. 2
[19] C. Guerin, C. Rigaud, A. Mercier, F. Ammar-Boudjelal,
K. Bertet, A. Bouju, J.-C. Burie, G. Louis, J.-M. Ogier,
and A. Revel. eBDtheque: a representative database of
comics. In International Conference on Document Analysis
and Recognition, 2013. 2
[20] A. Guha, M. Iyyer, and J. Boyd-Graber. A distorted skull
lies in the bottom center: Identifying paintings from text de-
scriptions. In NAACL Human-Computer Question Answer-
ing Workshop, 2016. 8
[21] M. Haurilet, M. Tapaswi, Z. Al-Halah, and R. Stiefelhagen.
Naming TV characters by watching and analyzing dialogs.
In IEEE Winter Conference on Applications of Computer Vi-
sion, 2016. 7
[22] A. K. N. Ho, J.-C. Burie, and J.-M. Ogier. Panel and speech
balloon extraction from comic books. In IAPR International
Workshop on Document Analysis Systems, 2012. 3
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural computation, 1997. 6
[24] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra,
A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra,
et al. Visual storytelling. In Conference of the North Amer-
ican Chapter of the Association for Computational Linguis-
tics, 2016. 4, 7
[25] T. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra,
A. Agrawal, J. Devlin, R. B. Girshick, X. He, P. Kohli, D. Ba-
tra, C. L. Zitnick, D. Parikh, L. Vanderwende, M. Galley, and
M. Mitchell. Visual storytelling. In Conference of the North
American Chapter of the Association for Computational Lin-
guistics, 2016. 2, 5
[26] M. Iyyer, A. Guha, S. Chaturvedi, J. Boyd-Graber, and
H. Daume III. Feuding families and former friends: Un-
supervised learning for dynamic fictional relationships. In
Conference of the North American Chapter of the Associa-
tion for Computational Linguistics, 2016. 8
[27] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman.
Reading text in the wild with convolutional neural networks.
International Journal of Computer Vision, 116(1), 2016. 3
[28] A. Karpathy and F. Li. Deep visual-semantic alignments for
generating image descriptions. In IEEE Conference on Com-
puter Vision and Pattern Recognition, 2015. 7
[29] D. Kingma and J. Ba. Adam: A method for stochastic opti-
mization. In Proceedings of the International Conference on
Learning Representations, 2014. 6
[30] S. Kovanen and K. Aizawa. A layered method for deter-
mining manga text bubble reading order. In International
Conference on Image Processing, 2015. 8
[31] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz,
S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bern-
stein, and L. Fei-Fei. Visual genome: Connecting language
and vision using crowdsourced dense image annotations.
2016. 7
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Proceedings of Advances in Neural Information Processing
Systems, 2012. 1
[33] W. G. Lehnert. Plot units and narrative summarization. Cog-
nitive Science, 5(4), 1981. 8
[34] L. Li, Y. Wang, Z. Tang, and L. Gao. Automatic comic page
segmentation based on polygon detection. Multimedia Tools
and Applications, 69(1), 2014. 2
[35] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Gir-
shick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L.
Zitnick. Microsoft COCO: common objects in context. 2014.
7
[36] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical
question-image co-attention for visual question answering,
2016. 7
[37] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neu-
rons: A neural-based approach to answering questions about
images. In Computer Vision and Pattern Recognition, 2015.
7
[38] Y. Matsui. Challenge for manga processing: Sketch-based
manga retrieval. In Proceedings of the 23rd Annual ACM
Conference on Multimedia, 2015. 8
[39] Y. Matsui, K. Ito, Y. Aramaki, T. Yamasaki, and K. Aizawa.
Sketch-based manga retrieval using manga109 dataset. arXiv
preprint arXiv:1510.04389, 2015. 2
[40] S. McCloud. Understanding Comics. HarperCollins, 1994.
1, 3
[41] G. M. Morton. A computer oriented geodetic data base and
a new technique in file sequencing. International Business
Machines Co, 1966. 3
[42] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Ba-
tra, L. Vanderwende, P. Kohli, and J. Allen. A corpus and
cloze evaluation for deeper understanding of commonsense
stories. In Conference of the North American Chapter of the
Association for Computational Linguistics, 2016. 8
[43] X. Pang, Y. Cao, R. W. Lau, and A. B. Chan. A robust panel
extraction method for manga. In Proceedings of the ACM
International Conference on Multimedia, 2014. 2
[44] X. Pang, Y. Cao, R. W. H. Lau, and A. B. Chan. A robust
panel extraction method for manga. In Proceedings of the
ACM International Conference on Multimedia, 2014. 8
[45] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To-
wards real-time object detection with region proposal net-
works. In Proceedings of Advances in Neural Information
Processing Systems, 2015. 3
[46] C. Rigaud. Segmentation and indexation of complex objects
in comic book images. PhD thesis, University of La Rochelle,
France, 2014. 8
[47] C. Rigaud, J.-C. Burie, J.-M. Ogier, D. Karatzas, and
J. Van de Weijer. An active contour model for speech bal-
loon detection in comics. In International Conference on
Document Analysis and Recognition, 2013. 3
[48] C. Rigaud, C. Guerin, D. Karatzas, J.-C. Burie, and J.-M.
Ogier. Knowledge-driven understanding of images in comic
books. International Journal on Document Analysis and
Recognition, 18(3), 2015. 2
[49] R. Schank and R. Abelson. Scripts, Plans, Goals and Un-
derstanding: an Inquiry into Human Knowledge Structures.
L. Erlbaum, 1977. 8
[50] R. Smith. An overview of the tesseract ocr engine. In Inter-
national Conference on Document Analysis and Recognition,
2007. 3
[51] R. Socher, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded
compositional semantics for finding and describing images
with sentences. Transactions of the Association for Compu-
tational Linguistics, 2014. 6
[52] T. Tanaka, K. Shoji, F. Toyama, and J. Miyamichi. Lay-
out analysis of tree-structured scene frames in comic images.
In International Joint Conference on Artificial Intelligence,
2007. 2
[53] W. L. Taylor. Cloze procedure: a new tool for measuring
readability. Journalism and Mass Communication Quarterly,
30(4), 1953. 4
[54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and
tell: A neural image caption generator. In Computer Vision
and Pattern Recognition, 2015. 7
[55] C. Xiong, S. Merity, and R. Socher. Dynamic memory net-
works for visual and textual question answering. In Proceed-
ings of the International Conference on Machine Learning,
2016. 1
[56] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdi-
nov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural
image caption generation with visual attention. In Proceed-
ings of the International Conference on Machine Learning,
2015. 1, 7
[57] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fer-
gus. Simple baseline for visual question answering. arXiv
preprint arXiv:1512.02167, 2015. 6
[58] C. L. Zitnick, R. Vedantam, and D. Parikh. Adopting abstract
images for semantic scene understanding. IEEE Trans. Pat-
tern Anal. Mach. Intell., 38(4):627–638, 2016. 8