
Proceedings of the First Workshop on Storytelling, pages 20–32, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

A Pipeline for Creative Visual Storytelling

Stephanie M. Lukin, Reginald Hobbs, Clare R. Voss
U.S. Army Research Laboratory

Adelphi, MD, USA
[email protected]

Abstract

Computational visual storytelling produces a textual description of events and interpretations depicted in a sequence of images. These texts are made possible by advances and cross-disciplinary approaches in natural language processing, generation, and computer vision. We define a computational creative visual storytelling as one with the ability to alter the telling of a story along three aspects: to speak about different environments, to produce variations based on narrative goals, and to adapt the narrative to the audience. These aspects of creative storytelling and their effect on the narrative have yet to be explored in visual storytelling. This paper presents a pipeline of task-modules, Object Identification, Single-Image Inferencing, and Multi-Image Narration, that serve as a preliminary design for building a creative visual storyteller. We have piloted this design for a sequence of images in an annotation task. We present and analyze the collected corpus and describe plans towards automation.

1 Introduction

Telling stories from multiple images is a creative challenge that involves visually analyzing the images, drawing connections between them, and producing language to convey the message of the narrative. To computationally model this creative phenomenon, a visual storyteller must take into consideration several aspects that will influence the narrative: the environment and presentation of imagery (Madden, 2006), the narrative goals which affect the desired response of the reader or listener (Bohanek et al., 2006; Thorne and McLean, 2003), and the audience, who may prefer to read or hear different narrative styles (Thorne, 1987).

The environment is the content of the imagery, but also its interpretability (e.g., image quality). Canonical images are available from a number of high-quality datasets (Everingham et al., 2010; Plummer et al., 2015; Lin et al., 2014; Ordonez et al., 2011); however, there is little coverage of low-resourced domains with low-quality images or atypical camera perspectives that might appear in a sequence of pictures taken by blind persons, a child learning to use a camera, or a robot surveying a site. For this work, we studied an environment with odd surroundings taken from a camera mounted on a ground robot.

Narrative goals guide the selection of what objects or inferences in the image are relevant or uncharacteristic. The result is a narrative tailored to different goals such as a general "describe the scene", or a more focused "look for suspicious activity". The most salient narrative may shift as new information, in the form of images, is presented, offering different possible interpretations of the scene. This work posed a forensic task with the narrative goal to describe what may have occurred within a scene, assuming some temporal consistency across images. This open-endedness evoked creativity in the resulting narratives.

The telling of the narrative will also differ based upon the target audience. A concise narrative is more appropriate if the audience is expecting to hear news or information, while a verbose and humorous narrative is suited for entertainment. Audiences may differ in how they would best experience the narrative: immersed in the first person or through an omniscient narrator. The audience in this work was unspecified, thus the audience was the same as the storyteller defining the narrative.

To build a computational creative visual storyteller that customizes a narrative along these three aspects, we propose a creative visual storytelling pipeline requiring separate task-modules for Object Identification, Single-Image Inferencing, and Multi-Image Narration. We have conducted an exploratory pilot experiment following this pipeline to collect data from each task-module to train the computational storyteller. The collected data provides instances of creative storytelling from which we have analyzed what people see and pay attention to, what they interpret, and how they weave together a story across a series of images.

Figure 1: Creative Visual Storytelling Pipeline: T1 (Object Identification), T2 (Single-Image Inferencing), T3 (Multi-Image Narration)

Creative visual storytelling requires an understanding of the creative processes. We argue that existing systems cannot achieve these creative aspects of visual storytelling. Current object identification algorithms may perform poorly on low-resourced environments with minimal training data. Computer vision algorithms may over-identify objects, that is, describe more objects than are ultimately needed for the goal of a coherent narrative. Algorithms that generate captions of an image often produce generic language, rather than language tailored to a specific audience. Our pilot experiment is an attempt to reveal the creative processes involved when humans perform this task, and then to computationally model the phenomena from the observed data.

Our pipeline is introduced in Section 2, where we also discuss computational considerations and the application of this pipeline to our pilot experiment. In Section 3 we describe the exploratory pilot experiment, in which we presented images of a low-quality and atypical environment and had annotators answer "what may have happened here?" This open-ended narrative goal has the potential to elicit diverse and creative narratives. We did not specify the audience, leaving the annotator free to write in a style that appeals to them. The data and analysis of the pilot are presented in Section 4, as well as observations for extending to crowdsourcing a larger corpus and how to use these creative insights to build computational models that follow this pipeline. In Section 5 we compare our approach to recent works in other storytelling methodologies, then conclude and describe future directions of this work in Section 6.

2 Creative Visual Storytelling Pipeline

The pipeline and interaction of task-modules we have designed to perform creative visual storytelling over multiple images are depicted in Figure 1. Each task-module answers a question critical to creative visual storytelling: "what is here?" (T1: Object Identification), "what happens here?" (T2: Single-Image Inferencing), and "what has happened so far?" (T3: Multi-Image Narration). We discuss the purpose, expected inputs and outputs of each module, and explore computational implementations of the pipeline.

2.1 Pipeline

This section describes the task-modules we designed that provide answers to our questions for creative visual storytelling.

Task-Module 1: Object Identification (T1). Objects in an image are the building blocks for storytelling that answer the question, literally, "what is here?" This question is asked of every image in a sequence for the purposes of object curation. From a single image, the expected outputs are objects and their descriptors. We anticipate that two categories of object descriptors will be informative for interfacing with the subsequent task-modules: spatial descriptors, consisting of object co-locations and orientation, and observational attribute descriptors, including color, shape, or texture of the object. Confidence level will provide information about the expectedness of the object and its descriptors, or if the object is difficult or uncertain to decipher given the environment.
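For illustration only, the sketch below shows one way a T1 output record could be represented, with spatial and attribute descriptors and a confidence value; the dataclass and field names are our own assumptions rather than a specification of the module.

```python
# A minimal sketch of a T1 output record; field names are illustrative,
# not prescribed by the pipeline.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DetectedObject:
    label: str                                           # e.g., "calendar"
    spatial: List[str] = field(default_factory=list)     # co-locations, orientation
    attributes: List[str] = field(default_factory=list)  # color, shape, texture
    confidence: float = 1.0                              # low values flag uncertain or unexpected objects

# One object an annotator reported for image1
calendar = DetectedObject(
    label="calendar",
    spatial=["hanging off the table"],
    attributes=["marked up", "red circle on calendar"],
    confidence=0.9,
)
print(calendar)
```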

Task-Module 2: Single-Image Inferencing (T2). Dependent upon T1, the Single-Image Inferencing task-module is a literal interpretation derived from the objects previously identified in the context of the current image. After the curation of objects in T1, a second round of content selection commences in the form of inference determination and selection. Using the selected objects, descriptors, and expectations about the objects, this task-module answers the question "what happens here?" For example, the function of "kitchen" might be extrapolated from the co-location of a cereal box, pan, and crockpot.
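Purely as an illustration of this kind of extrapolation, the sketch below maps co-occurring T1 labels to a room function with hand-written rules; the rule table and overlap score are invented for the example and are not an implementation of T2.

```python
# A minimal rule-based sketch of T2: infer a room function from co-located
# object labels identified in T1. The rules and scoring are illustrative only.
ROOM_RULES = {
    "kitchen": {"cereal box", "pan", "crockpot", "coffee pot"},
    "office": {"computer", "table/desk", "chair", "calendar"},
}

def infer_function(object_labels):
    """Return (room_function, overlap_score, supporting_objects), best evidence first."""
    labels = set(object_labels)
    scored = []
    for room, evidence in ROOM_RULES.items():
        overlap = labels & evidence
        if overlap:
            scored.append((room, len(overlap) / len(evidence), sorted(overlap)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(infer_function(["cereal box", "pan", "crockpot", "table"]))
# [('kitchen', 0.75, ['cereal box', 'crockpot', 'pan'])]
```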

Separating T2 from T1 creates a modular system where each task-module can make the best decision given the information available. However, these task-modules are also interdependent: as the inferences in T2 depend upon T1 for object selection, so too does the object selection depend upon the inferences drawn so far.

Task-Module 3: Multi-Image Narration (T3). A narrative can indeed be constructed from a single image; however, we designed our pipeline to consider when additional context, in the form of additional images, is provided. The Multi-Image Narration task-module draws from T1 and T2 to construct the larger narrative. All images, objects, and inferences are taken into consideration when determining "what has happened so far?" and "what has happened from one image to the next?" This task-module performs narrative planning by referencing the inferences and objects from the previous images. It then produces a natural language output in the form of a narrative text. Plausible narrative interpretations are formed from global knowledge about how the addition of new images confirms or disproves prior hypotheses and expectations.

2.2 From Pipeline Design to Pilot

Our first step towards building this automated pipeline is to pilot it. We will use the dataset collected and the results from the exploratory study to build an informed computational, creative visual storyteller. When piloting, we refer to this pipeline as a sequence of annotation tasks.

T1 is based on computer vision technology. Of particular interest are our collected annotations on the low-quality and atypical environments that traditionally do not have readily available object annotations. Commonsense reasoning and knowledge bases drive the technology behind deriving T2 inferences. T3 narratives consist of two sub-task-modules: narrative planning and natural language generation. Each technology can be matched to our pipeline, and be built up separately, leveraging existing works, but tuned to this task.

Our annotators are required to write in natural language (though we do not specify full sentences) the answers to the questions posed in each task-module. While this natural language intermediate representation of T1 and T2 is appropriate for a pilot study, a semantic representation of these task-modules might be more feasible for computation until the final rendering of the narrative text. For example, drawing inferences in T2 with the objects identified in T1 might be better achieved with an ontological representation of objects and attributes, such as WordNet (Fellbaum, 1998), and inferences mined from a knowledge base.
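As a hedged sketch of what such an ontological look-up could look like, the snippet below queries WordNet hypernyms through NLTK (assuming nltk is installed and the WordNet corpus has been downloaded with nltk.download("wordnet")); the set of target concepts is an illustrative choice, not part of the pipeline.

```python
# A sketch of an ontological interface between T1 and T2 using WordNet via NLTK.
from nltk.corpus import wordnet as wn

def hypernym_names(word):
    """Collect the names of all hypernyms reachable from any noun sense of the word."""
    names = set()
    for sense in wn.synsets(word, pos=wn.NOUN):
        for hyper in sense.closure(lambda s: s.hypernyms()):
            names.add(hyper.name().split(".")[0])
    return names

# Objects whose hypernyms include cooking-related concepts hint at a kitchen.
for obj in ["pan", "cereal", "calendar"]:
    evidence = hypernym_names(obj) & {"cooking_utensil", "kitchen_utensil", "food"}
    print(obj, "->", sorted(evidence) or "no kitchen-related hypernyms")
```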

In our annotation, the sub-task-modules of narrative planning and natural language generation are implicitly intertwined. The annotator does not record intermediary narrative planning in the exercise before writing the final text. In computation, T3 may generate the final narrative text word-by-word (combining narrative planning and natural language generation). Another approach might first perform narrative planning, followed by generation from a semantic or syntactic representation that is compatible with intermediate representations from T1 and T2.
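The toy sketch below illustrates the second, plan-then-generate alternative under our own simplifying assumptions; the planning heuristic and surface templates are invented for exposition and are not the T3 module.

```python
# An illustrative two-stage T3 sketch: plan over per-image inferences, then
# realize the plan with templates. Inferences and templates are toy examples.
def plan_narrative(per_image_inferences):
    """Keep, per image, the inference plus a flag for whether it repeats earlier evidence."""
    plan, seen = [], set()
    for image_id, inference in per_image_inferences:
        plan.append((image_id, inference, inference in seen))
        seen.add(inference)
    return plan

def realize(plan):
    """Turn the plan into narrative text with simple templates."""
    sentences = []
    for image_id, inference, repeated in plan:
        connective = "again suggests" if repeated else "suggests"
        sentences.append(f"{image_id} {connective} {inference}.")
    return " ".join(sentences)

inferences = [("image1", "an office"),
              ("image2", "temporary living quarters"),
              ("image3", "a makeshift kitchen")]
print(realize(plan_narrative(inferences)))
```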

3 Pilot Experiment

A paper-based pilot experiment implementing this pipeline was conducted. Ten annotators (A1-A10) [1] participated in the annotation of the three images in Figure 2 (image1-image3). These images were taken from a camera mounted on a ground robot while it navigated an unfamiliar environment. The environment was static; thus, presenting these images in temporal order was not as critical as it would have been if the images were still-frames taken from a video or if the images contained a progression of actions or events.

[1] A5, an author of this paper, designed the experiment and examples. All annotators had varying degrees of familiarity with the environment in the images.

Figure 2: image1, image2, and image3 in Pilot Experiment Scene

Annotators first addressed the questions posed in the Object Identification (T1) and Single-Image Inferencing (T2) task-modules for image1. They repeated the process for image2 and image3, and authored a Multi-Image Narrative (T3). The annotator workflow mimicked the pipeline presented in Figure 1. For each subsequent image, the time allotted increased from five, to eight, to eleven minutes to allow more time for the narrative to be constructed after annotators processed the additional images. An example image sequence with answers was provided prior to the experiment. A5 gave a brief, oral, open-ended explanation of the experiment so as not to bias annotators to what they should focus on in the scene or what kind of language they should use. The goal of this data collection is to gather data that models the creative storytelling processes, not to track these processes in real-time. A future web-based interface will allow us to track the timing of annotation, what information is added when, and how each task-module influences the other task-modules for each image.

Object Identification did not require annotators to define a bounding box for labeled objects, nor were annotators required to provide object descriptors [2]. Annotators authored natural language labels, phrases, or sentences to describe objects, attributes, and spatial relations while indicating confidence levels if appropriate.

[2] As we design a web-based version of this experiment, we will enforce interfaces explicitly linked to object annotations, and the desire to view previously annotated images.

During Single-Image Inferencing, annotators were shown their response from T1 as they authored a natural language description of activity or functions of the image, as well as a natural language explanation of inferences for that determination, citing supporting evidence from T1 output. For a single image, annotators may answer the questions posed by T1 and T2 in any order to build the most informed narrative.

Annotators authored a Multi-Image Narrative to explain what has happened in the sequence of images presented so far. For each image seen in the sequence, annotators were shown their own natural language responses from T1 and T2 for those images. Annotators were encouraged to look back to their responses in previous images (as the bottom row of Figure 1 indicates), but not to make changes to their responses about the previous images. They were, however, encouraged to incorporate previous feedback into the context of the current image. From this task-module, annotators wrote a natural language narrative connecting activity or functions in the images, which will be used to learn how to weave together a story across the images.

The open-ended "what has happened here?" narrative goal has no single answer. These annotations may be treated as ground truth, but we run the risk of potentially missing out on creative alternatives. Bootstrapping all possible objects and inferences would achieve greater coverage, yet this quickly becomes infeasible. We lean toward the middle, where the answers collected will help determine what annotators deem as important.

4 Results and Analysis

In this section, we discuss and analyze the collected data and provide insights for incorporating each task-module into a computational system.


# Annotators   Objects
10             calendar, water bottle
9              computer, table/desk
8              chair
4              walls, window
2              blue triangles
1              floor, praying rug

Table 1: Objects identified by annotators in image1

# Annotators   Objects
10             suitcase, shirt
8              sign
6              green object
5              fire extinguisher
4              walls
3              bag
2              floor, window
1              coat hanger, shoes, rug

Table 2: Objects identified by annotators in image2

# Annotators   Objects
7              crockpot, cereal box, table
6              pan
5              container
2              walls, label
1              thread and needle, coffee pot, jam, door frame

Table 3: Objects identified by annotators in image3 (total of 7 annotators)

4.1 Object Identification (T1)

Thirty-three objects were identified across the images [3]. A5 identified the most of these objects (20), and A1 the least (10). Tables 1-3 show the objects identified and how many annotators referenced each object. A set of objects emerged in each image that captured the annotators' attention. Object descriptor categories are tabulated in Table 4 [4]. Not surprisingly, the most common descriptors were attributes, e.g., color and shape, followed by co-locations. Orientation was not observed in this dataset; however, this category may be useful for other disrupted environments. We observed instances of uncertainty, e.g., "a suitcase, not entirely sure, because of zipper and size", and unexpected objects, "unfinished floor", whereas "floors" may have not been labeled otherwise.

[3] Due to time constraints, A2-A4 did not complete image3.

[4] Tabulation of descriptors in Tables 7-9 in Appendix.

                            Total   Average   Min   Max
Spatial: Co-Location           51       6.3     0    14
Observational: Attribute       99      12.2     3    22
Confidence: Unexpected          7       0.7     0     4
Confidence: Uncertainty        28       3.3     0     8
Total                         185      22.6     7    39

Table 4: Object descriptor summary with counts per annotator (A2-A4 excluded from average, min, and max; see footnote 4)

Lack of coverage and overlap in this task with respect to objects and descriptors is not discouraging. In fact, we argue that exhaustive object identification is counter-intuitive and detrimental to creative visual storytelling. Annotators may have identified only the objects of interest to the narrative they were forming, and viewed other objects as distractors. The most frequent of the identified objects are likely to be the most influential in T2, where the calendar, computer, and chair provide more information than the "blue triangles".

Not only can selective object identification provide the most salient objects for deriving interpretations, but the Object Identification exercise with respect to storytelling can differentiate between objects and descriptors that are commonplace or otherwise irrelevant. For instance, if a fire extinguisher was not annotated as red, we are inclined to deduce it is because this fact is well known or unimportant, rather than the result of a distracted annotator [5].

When automating this task-module, new object identification algorithms should account for the following: a sampling of relevant objects specific to the storytelling challenge, and attention to potential outlier descriptors which may be more indicative than a standard descriptor, depending on the environment.
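As one hedged illustration of these two recommendations, the sketch below filters hypothetical detections by a narrative goal and flags attribute descriptors that fall outside an expected profile; the goal keywords, expected attributes, and confidence threshold are all invented for the example, not part of our pilot.

```python
# A sketch of post-processing raw detections for storytelling-oriented T1:
# keep objects relevant to the narrative goal (or uncertain ones worth narrating)
# and flag descriptors that deviate from an expected profile. Illustrative only.
GOAL_KEYWORDS = {"suspicious activity": {"suitcase", "sign", "bag"}}
EXPECTED = {"fire extinguisher": {"red"}, "floor": {"finished"}}

def select_for_story(detections, goal):
    relevant = [d for d in detections
                if d["label"] in GOAL_KEYWORDS.get(goal, set())
                or d["confidence"] < 0.5]   # keep uncertain objects as potential outliers
    for d in relevant:
        expected = EXPECTED.get(d["label"], set(d.get("attributes", [])))
        d["outlier_attributes"] = sorted(set(d.get("attributes", [])) - expected)
    return relevant

detections = [
    {"label": "suitcase", "confidence": 0.8, "attributes": ["black", "orange stripes"]},
    {"label": "fire extinguisher", "confidence": 0.4, "attributes": ["obscured"]},
    {"label": "wall", "confidence": 0.9, "attributes": ["unfinished"]},
]
print(select_for_story(detections, "suspicious activity"))
```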

4.2 Single-Image Inferencing (T2)

We highlight A1 and A8 for the remainder of the discussion [6]. Table 5 shows A1's annotation of Single-Image Inferencing and Multi-Image Narration. In the Single-Image Inferencing (T2) for image1, A1 noted the "office" theme by referencing the desk and computer, and expressed uncertainty with respect to the window looking "weird" and unlike a typical office building.

[5] We expect this to be revealed in the web-based version of the task with a stricter annotation interface.

[6] Other annotation results in Tables 10-17 in Appendix.


Image1
  Single-Image Inference: Looks like a dingy, sparse office. The computer desk, calendar indicate an office, but the space is unfinished (no dry wall, carpet) and area outside window looks weird, not like an office building.

Image2
  Single-Image Inference: Looks like someone was staying here temporarily, using this now to store clothes, or maybe as a bedroom. Again, it's atypical because its an unfinished space that looks uncomfortable.
  Multi-Image Narrative: I think this person was hiding out here to get ready for some event. The space isn't finished enough to be intended for habitation, but someone had to stay here, perhaps because they didn't want to be found, and you wouldn't expect someone to be living in a construction zone.

Image3
  Single-Image Inference: This area was used as a sort of kitchen or food storage prep area.
  Multi-Image Narrative: Someone was definitely living here even though it wasn't finished or intended to be a house. They were probably using a crock pot because you can make food in this without having larger appliances like a stove, oven. There's no milk, so this person may be lactose intolerant. The robot should vanquish them with milk.

Table 5: A1's annotation (previously identified objects in Single-Image Inference text in italics)

Image1
  Single-Image Inference: This is likely a workplace of some sort. It is unclear if it is an unfinished part of a current/suspended construction project or it is just a utilitarian space inside of an industrial facility. The presence of a computer monitor suggest it is in use or a low crime area.

Image2
  Single-Image Inference: This is a jobsite of some sort. It has unfinished walls and what may be a paper shredder.
  Multi-Image Narrative: This is an unfinished building. There is some evidence of office-type work (i.e. work involving paper and computers). The existence of "windows" between rooms suggests that this is not a dwelling (or intended to become one), that is, a building designed to be a dwelling, but what it is remains unclear.

Image3
  Single-Image Inference: A room in a building is being used as a cooking and eating station, based upon presence of food, table, and cooking instruments.
  Multi-Image Narrative: This building is being used by a likely small number of individuals for unclear purposes including cooking, eating, and basic office work.

Table 6: A8's annotation (previously identified objects in Single-Image Inference text in italics)

A1 kept clear the distinction between images in their annotation of image2, as there were no references to the office observed only in image1. Instead, references in image2 were to the storage of clothes. In the single-image interpretation of image3, A1 suggested that this was a food preparation area from the presence of the crockpot, cereal, and the other food items that appeared together. A8, whose annotation is in Table 6, also noted the "workplace" theme from the desk and computer, though A8 leaned more towards a construction site, citing the utilitarian space. Due to uncertainty of the environment, A8 misidentified the suitcase in image2 as a shredder, and incorporated it prominently into their interpretation. Similar to A1, A8 also indicated in image3 that this was a food preparation area.

A8's misinterpretation of the suitcase raises an implementation question: are the inferences and algorithms we develop only as good as our environment data allows them to be? How might a misunderstanding of the environment affect the inferences? This environment showcased the uniqueness of the physical space and low quality of images, yet all annotators indicated, without prompting or instruction, varying degrees of confidence in their interpretations based upon the evidence. A8 indicated their uncertainty about the suitcase object by hedging that it was "what may be a paper shredder". This expression of uncertainty should be preserved in an automated system for instances such as this when an answer is unknown or has a low confidence level.
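A minimal sketch of how such confidence could surface in generated text is shown below; the thresholds and hedge phrasings are illustrative assumptions rather than a component of our system.

```python
# A sketch of carrying T1/T2 confidence into the surface text as hedges,
# mirroring annotator phrasings like "what may be a paper shredder".
def hedge(label, confidence):
    if confidence >= 0.8:
        return label
    if confidence >= 0.5:
        return f"what appears to be a {label}"
    return f"what may be a {label}"

for label, conf in [("suitcase", 0.9), ("paper shredder", 0.3)]:
    print(hedge(label, conf))
# suitcase
# what may be a paper shredder
```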

T2 is intended to inform a commonsense reasoner and knowledge base based on T1 to deduce the setting. This task-module describes functions of rooms or spaces, e.g., food preparation areas and office space. Additional interpretations about the space were made by annotators from the overall appearance of objects in the image, such as the atmospheric observation "lighting of rooms is not very good" (A7, Table 15 in Appendix). These inferences might not be easily deducible from T1 alone, but the combination of these task-modules allows for these to occur.

Evaluating this annotation in a computational system will require some ground truth, though we have previously stated that it is impossible to claim such a gold standard in a creative storytelling task. Evaluation must therefore be subject to both qualitative and quantitative analyses, including, but not limited to, commonsense reasoning on validation sets and determining plausible alternatives to commonsense interpretations.

4.3 Multi-Image Narration (T3)

The narrative begins to form across the first two images in the Multi-Image Narration task-module (T3). A1 hypothesized that someone was "hiding out", going a step beyond their T2 inference of an "office space" in image1, to extrapolate "what has happened here" rather than "what happens here". In image2, A1 had hedged their narrative with "I think", but the language became stronger and more confident in image3, in which A1 "definitely" thought that the space was inhabited. A1 pointed out that a lack of milk was unexpected in a canonical kitchen, and supplemented their narrative with a joke, suggesting to "vanquish them with milk". In image2, A8 interpreted that the space was not intended for long-term dwelling. Their narrative shifted in image3 when another scene was revealed. A8 concluded that this space was inhabited by a group, despite the annotator's previous assumption in image2 that it was not suited for this purpose.

There is no guaranteed "correct" narrative that unfolds, especially if we are seeking creativity. Some narrative pieces may fall into place as additional images provide context, but in the case of these environments, annotators were challenged to make sense of the sequence and pull together a plausible, if not uncertain, narrative.

The narrative goal and audience aspects of creative visual storytelling will directly inform T3. A variety of creative narratives and interpretations emerged from this pilot, despite the particularly sparse and odd environment and openness of the narrative goal. Based on the responses from each successive task-module, all annotators' interpretations and narratives are correct. Even with annotator misunderstandings, the narratives presented were their own interpretation of the environment. As the audience in this task was not specified, annotators could use any style to tell their story. The data collected expressed creativity through jokes (A1), lists and structured information (A5), concise deductions (A6, A8), uncertain deductions (A4), first person (A1, A3, A5), omniscient narrators (A2), and the use of "we" inclusive of the robot navigating the space (A7, A9, A10).

Future annotations may assign an audience or a style prompt in order to observe the varied language use. This will inform computational models by curating stylistic features and learning from appropriate data sources.

5 Related Work

Visual storytelling is still a relatively new subfield of research that has not yet begun to capture the highly creative stories generated by text-based storytelling systems to date. The latter support the definition of specific goals or present alternate narrative interpretations by generating stories according to character goals (e.g., Meehan (1977)) and author goals (e.g., Lebowitz (1985)). Other interactive, co-constructed, text-based narrative systems make use of information retrieval methods by implicitly linking the text generation to the interpretation. As a result, systems incorporating these methods cannot be adjusted for different narrative goals or audiences (Cychosz et al., 2017; Swanson and Gordon, 2008; Munishkina et al., 2013).

Other research in text-based storytelling focuses on answering the question "what happens next?" to infer the selection of the most appropriate next sentence. This method indirectly relies on the selection of sentences to evaluate the results of a forced choice between the "best" or "correct" next sentence of the choices when given a narrative context (as in the Story Cloze Test (Mostafazadeh et al., 2016) and the Children's Book Test (Hill et al., 2015)). Our pipeline, by contrast, builds on a series of open-ended questions, for which there is no single gold-standard or reference answer. Instead, we expect in time to follow prior work by Roemmele et al. (2011), where evaluation will entail generating and ranking plausible interpretations.

Recent work on caption generation combines computer vision with a simplified narration, or single-sentence text description of an image (Vinyals et al., 2015). Image processing typically takes place in one phase, while text generation follows in a second phase. Superficially, this separation of phases resembles the division of labor in our approach, where T1 and T2 involve image-specific analysis, and T3 involves text generation. However, this form of caption generation depends solely on training data where individual images are paired with individual sentences. It assumes the T3 sub-task-modules can be learned from the same data source, and generates the same sentences on a per-image basis, regardless of the order of images. One can readily imagine the inadequacy of stringing together captions to construct a narrative, where the same captions describe both images of a waterfall flowing down, and those same images in reverse order where instead the water seems to be flowing up.

The work most similar in approach to our visual storyteller annotation pipeline is Huang et al. (2016), who separate their tasks into three tiers: the first over single images, generating literal descriptions of images in isolation (DII); the second over multiple images, generating literal descriptions of images in sequence (DIS); and the third over multiple images, generating stories for images in sequence (SIS). While these tiers may seem analogous to ours, there are different assumptions underlying the tasks in data collection. For each task, their images are annotated independently by different annotators, while in our approach, all images are annotated by annotators performing all of our tasks. The DII task is an exhaustive object identification task on single images, yet we leave T1 up to our annotators to determine how many objects and attributes to describe in an image to avoid the potential for object over-identification. The SIS task involves a set of images over which annotators select and possibly reorder, then write one sentence per image to create a narrative, with the opportunity to skip images. In our pipeline, we have intentionally designed our task-modules to allow each task-module to build off of and influence the others. It is possible in our approach for an annotator's inference in T2 of one image to feed forward and affect their T1 annotations in the subsequent image, which might in turn affect the resulting T3 narrative. In short, Huang et al. (2016) capture the thread of storytelling in one tier only, their SIS condition, while our annotators build their narratives across task-modules as they progress from image to image.

6 Conclusion and Future Work

This paper introduces a creative visual storytelling pipeline for a sequence of images that delegates separate task-modules for Object Identification, Single-Image Inferencing, and Multi-Image Narration. These task-modules can be implemented to computationally describe diverse environments and customize the telling based on narrative goals and different audiences. The pilot annotation has collected data for this visual storyteller in a low-resourced environment, and analyzed how creative visual storytelling is performed in this pipeline for the purposes of training a computational, creative visual storyteller. The pipeline is grounded in narrative decision-making processes, and we expect it to perform well on both low- and high-quality datasets. Using only curated datasets, however, runs the risk of training algorithms that are not general-purpose.

We are now positioned to conduct a crowdsourcing annotation effort, followed by an implementation of this storyteller following the outlined task-modules for automation. Our pipeline and implementation details are algorithmically agnostic. We anticipate off-the-shelf and state-of-the-art computer vision and language generation methodologies will provide a number of baselines for creative visual storytelling: to test environments, compare an object identification algorithm trained on high-quality data against one trained on low-quality data; to test narrative goals, compare a computer vision algorithm that may over-identify objects against one focused on a specific set to form a story; to test audience, compare a caption generation algorithm that may generate generic language against one tailored to the audience's desires.

The streamlined approach of our experimental annotation pipeline allows us to easily prompt for different narrative goals and audiences in future crowdsourcing to obtain and compare different narratives. Evaluation of the final narrative must take into consideration the narrative goal and audience. In addition, evaluation must balance the correctness of the interpretation with expressing creativity, as well as the grammaticality of the generated story, suggesting new quantitative and qualitative metrics must be developed.


References

Jennifer G Bohanek, Kelly A Marin, Robyn Fivush, and Marshall P Duke. 2006. Family narrative interaction and children's sense of self. Family Process, 45(1):39–54.

Margaret Cychosz, Andrew S Gordon, Obiageli Odimegwu, Olivia Connolly, Jenna Bellassai, and Melissa Roemmele. 2017. Effective scenario designs for free-text interactive fiction. In International Conference on Interactive Digital Storytelling, pages 12–23. Springer.

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.

Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239.

Michael Lebowitz. 1985. Story-telling as planning and learning. Poetics, 14(6):483–502.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Matt Madden. 2006. 99 Ways to Tell a Story: Exercises in Style. Random House.

James R Meehan. 1977. Tale-spin, an interactive program that writes stories. In IJCAI, volume 77, pages 91–98.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of NAACL-HLT, pages 839–849.

Larissa Munishkina, Jennifer Parrish, and Marilyn A Walker. 2013. Fully-automatic interactive story design from film scripts. In International Conference on Interactive Digital Storytelling, pages 229–232. Springer.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS).

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2641–2649. IEEE.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Reid Swanson and Andrew S Gordon. 2008. Say Anything: A massively collaborative open domain story writing companion. In Joint International Conference on Interactive Digital Storytelling, pages 32–40. Springer.

Avril Thorne. 1987. The press of personality: A study of conversations between introverts and extraverts. Journal of Personality and Social Psychology, 53(4):718.

Avril Thorne and Kate C McLean. 2003. Telling traumatic events in adolescence: A study of master narrative positioning. Connecting Culture and Memory: The Development of an Autobiographical Self, pages 169–185.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE.


Appendix: Additional Annotations

Object           #    Descriptor Text                                             #   Descriptor
Calendar         10   hanging off the table, taped to table top                   4   co-location
                      marked up, ink, red circle on calendar, marked with pen     4   attribute
                      foreign language                                            1   attribute
                      paper                                                       1   attribute
                      picture on top                                              1   attribute
Water bottle     10   on the floor, on ground, on floor                           3   co-location
                      to the right of table                                       1   co-location
                      mostly empty, unclear if it has been opened                 2   attribute
                      plastic                                                     2   attribute
                      closed with lid                                             1   attribute
Computer         9    screen, black turned off; monitor, black                    3   attribute
Table / Desk     9    has computer on it, [computer] on table                     2   co-location
                      gray, black                                                 2   attribute
                      wood                                                        2   attribute
                      metal                                                       1   attribute
                      (presumed) rectangular                                      1   uncertainty
Chair            8    folding                                                     6   attribute
                      metal                                                       5   attribute
                      grey                                                        3   attribute
Walls            4    wood                                                        1   attribute
                      unfinished and showing beams, unfinished construction       3   unexpected
Window           4    in wall behind chair                                        1   co-location
                      window to another room; perhaps chairs in other room        1   uncertainty
                      no glass                                                    1   unexpected
Blue triangles   2    blue objects in windowsill                                  1   unexpected
Floor            1    unfinished                                                  1   unexpected
Praying rug      1

Table 7: Object Identification for image1

Object             #    Descriptor Text                                                                    #   Descriptor
Suitcase           10   black, orange stripes; black and red; black with red trim; blue and copper         5   attribute
                        (not entirely sure) because of zipper item and size                                1   uncertainty
                        resembles a paper shredder                                                         1   uncertainty
                        a suitcase or a heater                                                             1   uncertainty
Shirt              10   on hanger; on fire extinguisher; on wall; hanging                                  6   co-location
                        black                                                                              5   attribute
                        long sleeves                                                                       2   attribute
                        black thing hanging on wall (unclear what it is); black object                     2   uncertainty
Sign               8    on the wall                                                                        4   co-location
                        maybe indicating '3'?; roman numerals; 3 dashes; Arabic numbers; foreign language; room number 111   6   attribute
                        poster                                                                             1   attribute
                        map or blueprints                                                                  1   uncertainty
Green object       6    spherical                                                                          1   attribute
                        hanging in window; in windowsill                                                   2   co-location
                        green thing outside room; green object; unidentifiable object; lime green object   4   uncertainty
                        light post? fan?                                                                   1   uncertainty
Fire extinguisher  5    hanging off of black thing (also unclear as to what this is or does); on wall      3   co-location
                        obscured                                                                           1   co-location
                        cylindrical                                                                        1   attribute
                        white and red thing; red object, white and red piece of object                     3   uncertainty
Wall               4    wooden                                                                             1   attribute
                        unfinished; visible plywood studs                                                  2   unexpected
Bag                3    backpack or bag; something round; pile of clothes                                  2   uncertainty
                        on the ground                                                                      2   co-location
                        next to suitcase                                                                   1   co-location
Floor              2    marking of industry grade particle board, unfinished                               2   attribute
Window             2
Coat hanger        1    hanging on wall                                                                    1   co-location
                        wire                                                                               1   attribute
                        white                                                                              1   attribute
Shoes              1    shoes or hat                                                                       1   uncertainty
Rug                1

Table 8: Object Identification for image2


Object             #    Descriptor Text                          #   Descriptor
Crockpot           7    on table                                 1   co-location
                        green                                    2   attribute
                        old fashioned                            1   attribute
                        kitchen appliance                        1   attribute
                        white or silver                          1   uncertainty
Cereal box         7    on table                                 1   co-location
                        to the right of crockpot                 1   co-location
                        shredded wheat                           4   attribute
                        cardboard                                1   attribute
                        printed black letters                    1   attribute
Table              7    wood                                     3   attribute
                        coffee table style                       2   attribute
                        pale                                     1   attribute
Pan                6    on ground; on floor                      3   co-location
                        blue handle                              1   attribute
                        medium-size                              1   attribute
Container          5    clear                                    2   attribute
                        plastic                                  3   attribute
                        empty                                    2   attribute
                        rectangular                              1   attribute
                        hinged top                               1   attribute
Walls              2    lined with paper                         1   attribute
Label              2    on pressure cooker                       2   co-location
                        white                                    1   attribute
Thread and needle  1    to the right of cereal box               1   co-location
Coffee pot         1    what looks like a coffee pot             1   uncertainty
                        empty                                    1   attribute
                        behind cereal box                        1   attribute
Jam                1    plaid red and white lid                  1   attribute
Door frame         1

Table 9: Object Identification for image3

Image1
  Single-Image Inferencing: Someone sits at table and puts water bottle on floor while perhaps taking notes for others in some room. Folding chair suggests temporary or new use of space while building under construction.

Image2
  Single-Image Inferencing: Hallway view, suggesting exit path where someone might leave luggage while being in building
  Multi-Image Narration: Same building as in the first scene because same type of wood for walls, floor, and opening/window construction. Arabic numbers on paper sign loosely attached (because wavy surface of paper e.g. not rigid, not laminated) to the wall suggests temporary designation of space for specific use, as an organized arrangement by some people for others.

Image3
  Single-Image Inferencing: N/A
  Multi-Image Narration: N/A

Table 10: A2's annotation

Image1
  Single-Image Inferencing: I believe this is an office, because there is a computer monitor on a table, the table is serving as a desk, and there is a metal chair next to the monitor and the desk. A calendar is typically found in an office, however the calendar here is not in a location that is convenient for a person

Image2
  Single-Image Inferencing: I believe that this is a standard room that serves as a storage area. The absence of other objects does not hint at this room serving any other purpose.
  Multi-Image Narration: I believe that this storage room is located in a home since personal items such as a luggage bag and spare shirt are not typically found in a public building. From the marked calendar in the previous picture, it appears that the occupants are preparing to travel very soon.

Image3
  Single-Image Inferencing: N/A
  Multi-Image Narration: N/A

Table 11: A3's annotation


Image1
  Single-Image Inferencing: An office or computer setting/workstation. Actual computer maybe under desk (not visible) or missing. Water bottle suggests someone used this space recently. Chair not facing desk suggests person left in a hurry (not pushed under desk). Red circled date suggests some significance.

Image2
  Single-Image Inferencing: Shirt and suitcase suggests someone stored their personal items in this space. Room being labeled suggests recent occupants used more than 1 part of this space. Space does not look comfortable, but personal effects are here anyway. Holiday?
  Multi-Image Narration: Someone camped out here and planned activities. They left in a hurry and didn't spend time putting things in their suitcases, or they had a visitor and the visitor left abruptly. The occupant may have left on the date marked in the calendar. The date may have had personal significance for an operation.

Image3
  Single-Image Inferencing: N/A
  Multi-Image Narration: N/A

Table 12: A4's annotation

Image1
  Single-Image Inferencing: Office (chair, desk, computer, calendar). Unfinished building (walls, floor, window)

Image2
  Single-Image Inferencing: In an unfinished building closet, common space. Things thrown to the side. Doesn't care much about office safety because fire extinguisher is covered, therefore not easily accessible.
  Multi-Image Narration: Not sure about either workzone because randomly placed clothes and unsafe work environment. Could be a factory with unsafe conditions. Someone living or storing clothes in a "break room"?

Image3
  Single-Image Inferencing: "Camp" site but not outdoors. Items on floor indicate some disarray or disregard for cleanliness. Why is the crock pot on the coffee table with cereal? Breakfast? But why are the walls strange?
  Multi-Image Narration: Food like this shouldn't appear in a safe work environment, so I no longer think that. Someone seems to be living here in an unsafe and probably unregulated (re: fire extinguisher) way. Someone is hiding out in an uninhabited warehouse or work site (walls, floors, windows)

Table 13: A5's annotation

Image1
  Single-Image Inferencing: This is an office space because there is a desk, chair, computer and calendar. These items are typical items that would be in an office space.

Image2
  Single-Image Inferencing: This looks like a storage space, a closet, or the entrance/exit to a building. People typically pile things such as a suitcase, hanging clothes, backpack, etc. at one of those locations. A storage space or closet would allow for the items to be stored for a long time but would also be due to people being ready to leave on travel.
  Multi-Image Narration: Due to the lack of decorations I would say these pictures were taken in a location where people were staying or working temporarily (like a headquarters safe house, etc.)

Image3
  Single-Image Inferencing: These are items that would typically be found in a kitchen or break area. You would see a table or counter in a kitchen or break room. The pan and crock pot are not items that would be seen in other rooms, like a living room, office, bathroom, bedroom.
  Multi-Image Narration: I would say this is a house or temporary space because the items are not organized and the surrounding area is not decorative. The scenes look messy and it doesn't look like it gets cleaned or has been cleaned recently. Plus the space contains a suitcase which gives the impressions that the person has not unpacked.

Table 14: A6's annotation


Image1
  Single-Image Inferencing: Looks like a place to work, with chair, table, monitor. Calendar is out of place because people don't have calendars from the edge of a table, so it can only be seen from the floor. Walls are unfurnished, only wood and plywood. A window in the wall, like an interior window. Not sure what it is a window for; why is the window in that location?

Image2
  Single-Image Inferencing: We are looking through a doorway or hallway. Shirt and suitcase belong together. Not sure what other objects are (green, red, black on ground).
  Multi-Image Narration: Might be same location as Image 1, because the wooden/plywood walls and floor are similar. Not sure what the images have to do with each other, but might be 2 different rooms in same location. We're viewing this image from another room, because this room has a poster in it. Lighting of rooms is not very good, almost looks like spot lights, so not like an ordinary, prototypical house.

Image3
  Single-Image Inferencing: A bunch of objects on a table, with a few objects underneath. The objects on/under the table all have to do with food or preparing food. Walls are light colored. In the foreground appears to be a wooden door jam. Although there are some kitchen items, this does not look like a typical kitchen
  Multi-Image Narration: It is difficult to tell if this is in the same location as the previous 2 images. The wood door jam might be the same, but hard to know if wall is plywood and we don't see any other wooden framing. Rooms from all 3 images don't appear connected physically. No understandable context or connections.

Table 15: A7's annotation

Image1
  Single-Image Inferencing: This looks like a make-shift room or space. Has a military of intel feel to it. Could be a briefing or an interrogation room. Given the prayer rug, definitely interaction between parties of different backgrounds, etc.

Image2
  Single-Image Inferencing: This view or room reflects living quarters. Given the nature of the condition of the wall, it is a make-shift. The existence of a number identifying this room indicates that it is one of many.
  Multi-Image Narration: Combining the 2 pictures, this is beginning to look like part of a structure used for military/intel purposes. The location is most likely somewhere in the Middle East given how the numbers are written in Hindi indicating Arabic language. This also means we have multi-party/individual interactions.

Image3
  Single-Image Inferencing: This picture has all the ingredients to presenting a kitchen: food and cookware leads to a kitchen. Given the "rough" look of the setting, this has the hallmarks of a make-shift kitchen.
  Multi-Image Narration: This confirms, more than anything else, the scenario described in picture 2. As a whole, looks like some sort of post or output or a make-shift temporary type. Only necessities are present and the place couldn't quickly be abandoned.

Table 16: A9's annotation

Image1
  Single-Image Inferencing: This was probably used as a workspace, given the chair and table with the monitor and the calendar. Someone was recently there because the bottle is upright.

Image2
  Single-Image Inferencing: This was a space that someone lived in given the clothes, fan(?), heater/suitcase(?). Given the mess, they left abruptly. The fire extinguisher indicates a presence because it is a safety aid.
  Multi-Image Narration: This suggests we're in a space occupied by someone because of the office type and "living room" type room setup. It was purposefully made and left very abruptly (messy clothes, chair not pushed in).

Image3
  Single-Image Inferencing: This seems to be a kitchen area because all objects are food related. It is messy. The rice cooker has a blue light and may be on. There is a window letting in light, visible on the back wall.
  Multi-Image Narration: This supports the assumption that the environment was recently occupied. Food is opened, rice cooker is on, mess suggests it was abruptly abandoned, much like image 2's mess. The robot appears to be in the doorway at an angle.

Table 17: A10's annotation
