
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6930–6942, November 7–11, 2021. ©2021 Association for Computational Linguistics


Aligning Actions Across Recipe Graphs

Lucia Donatelli, Theresa Schmidt, Debanjali Biswas, Arne Köhn, Fangzhou Zhai, Alexander Koller
Department of Language Science and Technology
Saarland Informatics Campus, Saarland University

{donatelli,theresas,dbiswas,koehn,fzhai,koller}@coli.uni-saarland.de

Abstract

Recipe texts are an idiosyncratic form of instructional language that pose unique challenges for automatic understanding. One challenge is that a cooking step in one recipe can be explained in another recipe in different words, at a different level of abstraction, or not at all. Previous work has annotated correspondences between recipe instructions at the sentence level, often glossing over important correspondences between cooking steps across recipes. We present a novel and fully-parsed English recipe corpus, ARA (Aligned Recipe Actions), which annotates correspondences between individual actions across similar recipes with the goal of capturing information implicit for accurate recipe understanding. We represent this information in the form of recipe graphs, and we train a neural model for predicting correspondences on ARA. We find that substantial gains in accuracy can be obtained by taking fine-grained structural information about the recipes into account.

1 Introduction

Cooking recipes are a type of instructional text that many people interact with in their everyday lives. A recipe explains step by step how to cook a certain dish, describing the actions a chef needs to perform as well as the ingredients and intermediate products of the cooking process. However, recipes for the same dish often differ in which cooking actions they describe explicitly, how they describe them, and in which order.1 For instance, in the three recipes in Fig. 1, the overall process of assembling a batter and making waffles is explained with different levels of detail: recipe (a) explains the process with three distinct cooking actions (in bold); recipe (b) with eight distinct actions; and recipe (c) with nineteen actions.

1 For now, we understand an action as synonymous to a semantic event. We use the former term for consistency with other recipe work.

(a) Beat eggs. Mix in remaining ingredients. Cook on hot waffle iron.2

(b) Preheat your waffle iron. In a large bowl, mix together the flour, salt, baking powder, and sugar. In another bowl, beat the eggs. Add the milk, butter, and vanilla to the eggs. Pour the liquid into the flour mixture and beat until blended. Ladle the batter into the waffle iron and cook until crisp and golden.3

(c) Sift together in a large mixing bowl flour, baking powder, salt, and sugar. In a jug, measure out milk. Separate eggs, placing egg whites in the bowl of standing mixer. Add yolks and vanilla essence to milk and whisk together. Pour over the flour mixture and very gently stir until about combined. Stir in the melted butter and continue mixing very gently until combined. Beat egg whites until stiff and slowly fold into batter. Spoon the batter into preheated waffle iron in batches and cook according to its directions. Remove immediately and serve with maple syrup and fruits.4

Figure 1: Three recipe texts for making waffles with cooking actions in bold. The recipes are ordered from least amount of detail (a) to most (c).

The set of recipes in Fig. 1 provides more general information about how to make waffles than each individual recipe can. To our knowledge, only one previous work focuses on alignment of instructions across recipes to facilitate recipe interpretation on a dish level: Lin et al. (2020) (henceforth referred to as L'20) present the Microsoft Research Multimodal Aligned Recipe Corpus to align English recipe text instructions to each other and to video sequences. L'20 focus on alignment of instructions as sentences. Yet, as sentences can be quite long and contain multiple actions, defining instructions at the sentence level often glosses over the relationship between individual actions and excludes the complex event structure that makes recipe interpretation challenging and compelling.

2 http://myhappytribe.com/recipes/waffles/
3 http://www.tastygardener.com/waffles/
4 http://bakesbychichi.com/2014/03/waffles.html


For example, in L'20, "Ladle the batter into the waffle iron and cook [...]" in Recipe (b) is aligned to "Spoon the batter into preheated waffle iron in batches and cook [...]" in Recipe (c). If aligned at the sentence level only, the action correspondence of (b)'s recipe-initial "Preheat" and (c)'s "preheated" much later in the recipe is not (and cannot be) accounted for. The relationship between the individual actions ((b) "Ladle," (c) "Spoon") as well as between both instances of "cook" is additionally obscured by coarse-grained sentence alignment.

Aligning recipes at the action level (i.e. the bold items in Fig. 1) instead of the sentence level is practically useful to aggregate detailed information about how to cook a dish in different ways. This alignment would additionally offer greater insight into how recipe actions implicitly contain information about ingredients, tools, cooking processes, and other semantic information necessary for accurate recipe interpretation.

In this paper, we make two contributions towards the goal of complete and accurate alignment of actions across similar recipes. First, we collect a novel corpus that aligns cooking actions across recipes, Aligned Recipe Actions (ARA) (Section 4), by crowdsourcing alignment annotations on top of the L'20 corpus for a select number of dishes. In order to determine the annotation possibilities, we develop a neural recipe parser to identify descriptions of actions and substances and arrange them in a recipe graph (Section 3). Second, we present an alignment model which identifies correspondences between actions across recipe graphs based on our parsed corpus (Section 5). Error analysis of both human and machine performance illustrates that, though complex, the task of aligning recipe actions is achievable with our methodology and can inform future work on aligning sets of instructions. Our corpus and code are publicly available.5

2 Background and Related Work

Recipe text. The recipes in Fig. 1 illustrate some of the idiosyncrasies of recipe text: all texts are written entirely in imperative mood ((a) "cook"; (b) "preheat"; (c) "sift"); definite noun phrases frequently drop their determiners ((a) "eggs"; (c) "milk," "egg whites"); many arguments are elided or left implicit ((b) "beat ∅"; (c) "pour ∅ over"); bare adjectives are used to describe desired end states ((b) "until crisp and golden"); and many anaphoric expressions refer to entities which were not explicitly introduced before ((b) "the liquid"; (c) "the flour mixture"). Accurate interpretation of single recipe instructions requires familiarity with situated food ingredients, knowledge of verb semantics to identify how each cooking step relates to the others, and general commonsense about the cooking environment and instructional syntax.

5 https://github.com/coli-saar/ara

Recipe graphs and corpora. Representing recipes as graphs is the dominant choice (Mori et al., 2014; Jermsurawong and Habash, 2015; Kiddon et al., 2015; Yamakata et al., 2016; Chen, 2017; Chang et al., 2018; Ozgen, 2019). Of relevance to this paper is the recipe corpus of Yamakata et al. (2020) (Y'20), which consists of 300 English recipes annotated with graphs as in Fig. 2. We train a recipe parser on this corpus (Section 3) and use the trained parser to identify actions in the L'20 corpus. As noted earlier, Lin et al. (2020) (L'20) created the Microsoft Research Multimodal Aligned Recipe Corpus of roughly 50k text recipes and 77k recipe videos across 4000 dishes in English. The recipes and video transcriptions were segmented into sentences (but not parsed), and 200 recipe pairs were manually annotated for correspondences between recipe sentences. L'20 cluster dishes based on exact match of recipe title; they perform sentence alignments on pairs of recipes using the unsupervised algorithm of Naim et al. (2014). We use L'20's dataset as a basis for our alignment task in Sections 4 and 5.

Recipe parsing. Parsing recipes into graphs usually comprises two steps: (i) tagging mentions of substances and cooking steps, and (ii) linking these mentions with input and output edges. Recent work on English recipes has achieved F-Scores above 90 for identifying mentions (Chen, 2017) and F-Scores above 80 for adding the edges (Jermsurawong and Habash, 2015; Chen, 2017; Ozgen, 2019). Most of this work uses supervised learning based on hand-annotated recipe datasets. Using unsupervised methods, Kiddon et al. (2015) train a generative neural model on a large corpus of unannotated recipe texts and achieve an F-Score of 80 on predicting edges given gold information about the nodes; the output graphs are less detailed than ours. Ozgen (2019) achieves an F-Score of 75 on the same task and presents a subtask of creating action graphs similar to ours in Section 3.4.


Event alignment. Our work shares an interest in modelling procedural knowledge with the detection and alignment of script events. Chambers and Jurafsky (2008, 2009) identified event types from text according to their predicate-argument structures and behavior in event chains via count-based statistics. We capture similar information in a crowdsourcing task reminiscent of Wanzare et al. (2016, 2017) to automatically align actions without all surface text (Regneri et al., 2010).

3 Parsing Recipes into Graphs

The main contribution of this paper is a corpus of action alignments between action graphs of cooking recipes. Basing our corpus on unannotated recipe texts from L'20, we are dependent on an accurate tagger and parser for pre-processing. The tagger identifies the alignable actions in a recipe, and the parser structures recipes into graph representations. For both tasks, we train neural models on the data of Yamakata et al. (2020) (Y'20); we set a new state of the art on this dataset. Finally, we distill the Y'20-style recipe graphs into more focused action graphs (Section 3.4) which the alignment model (Section 5) takes as input.

3.1 Recipe Graphs

The recipe graphs in the Y'20 dataset are directed acyclic graphs with node and edge labels (Fig. 2). Nodes represent entities of ten different types, such as ingredients, cooking tools, and various types of actions. Edges represent actions' predicate-argument structure, as well as part-whole relations and temporal information. The full graph shows the states of ingredients and tools throughout the cooking process; it can be read from top to bottom as inputs transformed into consecutive outputs.

3.2 Tagging Recipes

We split the parsing task into two steps. In the first step, we tag the tokens of a recipe with their respective node types. We implement the sequence tagger as a neural network (NN) with a two-layered BiLSTM encoder generating predictions in a CRF output layer:

y = CRF(BiLSTM^(2)(BiLSTM^(1)(x))),

where y are the predicted tags over the input sequence with the embedding x. For comparison, Y'20 employ BERT-NER6 for their 'recipe named entity (r-NE)' tagging task.

6https://github.com/kyzhouhzau/BERT-NER


Figure 2: Full graph for recipe (b) (Fig. 1) in the style of Y'20. Actions are displayed as diamonds, foods as circles, tools as rectangles, and all else as ellipses.

Contrary to common expectation, we find that this tagger performs better with ELMo embeddings (Peters et al., 2018) than with BERT embeddings (Devlin et al., 2019) (Table 1). Trained and evaluated on Y'20's 300-r corpus, our tagger performs two points better than Y'20's tagger and reaches Y'20's inter-annotator agreement.
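
As a rough illustration of this architecture (contextual token embeddings fed through a stacked BiLSTM into a CRF output layer), the sketch below shows one way such a tagger could be implemented in PyTorch. It assumes pre-computed ELMo or BERT embeddings as input and the third-party pytorch-crf package for the CRF layer; module names and sizes are illustrative, not the released implementation.

# Illustrative BiLSTM-CRF tagger (not the released code): contextual
# embeddings -> 2-layer BiLSTM -> CRF decoding, as in Section 3.2.
# Assumes the third-party `pytorch-crf` package for the CRF layer.
import torch
import torch.nn as nn
from torchcrf import CRF


class RecipeTagger(nn.Module):
    def __init__(self, emb_dim: int, hidden_dim: int, num_tags: int):
        super().__init__()
        # Two stacked BiLSTM layers, mirroring BiLSTM^(2)(BiLSTM^(1)(x)).
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True, dropout=0.5)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        # embeddings: (batch, seq_len, emb_dim) pre-computed ELMo/BERT vectors.
        states, _ = self.encoder(embeddings)
        scores = self.emissions(states)
        if tags is not None:
            # Negative log-likelihood for training.
            return -self.crf(scores, tags, mask=mask)
        # Viterbi-decoded tag sequences for prediction.
        return self.crf.decode(scores, mask=mask)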

3.3 Parsing Recipes

In the second step, we predict the edges of the graph. We use the biaffine dependency parser by Dozat and Manning (2017), implemented by Gardner et al. (2018). The model consists of a biaffine classifier upon a three-layered BiLSTM with multilingual BERT embeddings. The model takes as input the tagged recipe text generated by the tagger and generates a dependency tree over the recipe. We use the parser to generate connected recipe graphs of full recipes, so we parse the entire recipe as a single "sentence".


Model        Corpus          Embedder            Precision    Recall       F-Score
IAA          100-r by Y'20                       89.9         92.2         90.5
Y'20         300-r by Y'20                       86.5         88.8         87.6
Our tagger   300-r by Y'20   English ELMo        89.9 ± 0.5   89.2 ± 0.4   89.6 ± 0.3
Our tagger   300-r by Y'20   multilingual BERT   88.7 ± 0.4   88.4 ± 0.1   88.5 ± 0.2

Table 1: Recipe tagging performance compared to Y'20's performance and inter-annotator agreement (IAA).

As this is a dependency tree parser, we remove edges from the recipe graphs in the training data to make them into trees: we preserve the edge that is listed first in the raw annotation data and ignore all other edges. This results in dependency edges that point from the inputs to the actions and from the actions to the outputs, such that the final dish only has incoming edges. We still evaluate against the complete recipe graphs in the test set.
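
As a concrete reading of this edge-filtering step, the sketch below keeps, for each node, only the first edge that appears in the raw annotation, so every node ends up with a single head. The (head, dependent, label) tuple format is an assumption about the data layout, not the exact preprocessing code.

# Hedged sketch of the graph-to-tree conversion described above.
def graph_to_tree(edges):
    """edges: list of (head, dependent, label) tuples in raw annotation order."""
    seen_dependents = set()
    tree_edges = []
    for head, dep, label in edges:
        if dep in seen_dependents:
            continue  # a later edge for this node is ignored
        seen_dependents.add(dep)
        tree_edges.append((head, dep, label))
    return tree_edges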

Our parsing results are presented in Table 2. We train the parser on the English 300-r corpus by Y'20. Our parser sets a new state of the art on this corpus, with an F-Score of 78.2 on gold-tagged evaluation data. Moreover, the combined tagger and parser achieve an F-Score of 72.3 on unannotated recipe text, compared to 43.3 for Y'20's own parser on automatically tagged recipe text. See Supplementary Material A for model hyperparameters and label-specific performance on actions.

3.4 Action Graphs for Recipe Alignment

To automatically align actions across recipe graphs, we abstract the output of our parser to action graphs, which only retain the action nodes from the full graphs: the full graph in Fig. 2 is transformed into the action graph in Fig. 3. We accomplish this by removing all non-action nodes. The paths between action nodes in the full graph become edges in the action graph. Similar to full recipe graphs, action graphs can be read from top to bottom as a temporal and sometimes causal sequence of actions. Each interior node has parent action nodes it is dependent on, as well as child nodes conditional upon it. We utilize this information in our automatic alignment model (Section 5).
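
One way to read this abstraction step is as a graph contraction: drop every non-action node and connect two actions whenever a path between them runs only through removed nodes. The sketch below illustrates this with networkx, assuming nodes carry a "type" attribute; it is an illustration, not the actual conversion code.

# Illustrative sketch: distil a full recipe graph into an action graph by
# keeping only action nodes and adding an edge a -> b whenever the full graph
# contains a path from a to b that passes only through non-action nodes.
import networkx as nx


def to_action_graph(full_graph: nx.DiGraph) -> nx.DiGraph:
    is_action = lambda n: full_graph.nodes[n].get("type") == "action"
    action_graph = nx.DiGraph()
    action_graph.add_nodes_from(n for n in full_graph if is_action(n))

    for source in action_graph.nodes:
        # Walk forward through non-action nodes until the next action is hit.
        stack = list(full_graph.successors(source))
        visited = set()
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            if is_action(node):
                action_graph.add_edge(source, node)
            else:
                stack.extend(full_graph.successors(node))
    return action_graph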

In creating our action graphs, we unify all Y'20 action types into a single type action: actions by chef, both continuous (e.g. "chop," "add") and discontinuous ("sift[a_i] ingredients together[a_i+1]"); actions by food ("until the butter melts[a_j]"); and actions by tool ("until skewer comes out[a_k] clean").7

7 On the specific task of action tagging, we also outperform Y'20 on gold labels (89.6 F-Score); see Table 5, Supplementary Materials.


Figure 3: An example of how actions align across action graphs for the recipes in Fig. 1 (expert annotation).

While most actions have surface realizations of exactly one verb token, actions can span up to four consecutive tokens (see Table 3).

4 A Corpus of Recipe Action Alignments

We collect a corpus of crowdsourced annotations of action alignments between recipes for a subset of L'20's corpus. Our Aligned Recipe Actions (ARA) corpus makes two important contributions: (i) manual alignments, as opposed to the majority of L'20's automatically aligned corpus; (ii) alignments at the action level, as opposed to the more coarse-grained sentence level.

4.1 Data Preparation

We draw the recipes for the crowdsourcing from the training data of L'20. We manually chose 10 dishes from the subset of dishes with exactly 11 pairwise-matched recipes.8



Model        Corpus          Tag source        Precision    Recall       F-Score
IAA          100-r by Y'20   gold tags         84.4         80.4         82.3
Y'20         300-r by Y'20   gold tags         73.7         68.6         71.1
Our parser   300-r by Y'20   gold tags         80.4 ± 0.0   76.1 ± 0.0   78.2 ± 0.0
Y'20         300-r by Y'20   Y'20 tagger       51.1         37.7         43.3
Our parser   300-r by Y'20   our ELMo tagger   74.4 ± 0.5   70.4 ± 1.0   72.3 ± 0.8

Table 2: Recipe parsing performance compared to Y'20 and inter-annotator agreement (IAA).

                                count        mean
dishes                          10
recipes                         110
recipes per dish                11
total action pairs annotated    1592
annotations per source action   at least 3
actions per recipe              3–37         15.1
actions per source recipe       4–37         15.9
actions per sentence            0–9          1.7
tokens per action               1–4          1.2

Table 3: Makeup of our ARA annotated corpus.

We chose the 10 dishes and corresponding recipes for their variation and to ensure our work generalizes to new cuisines and recipes: they span different cuisines (Italian, Chinese, Indian, American, German) and dish types (appetizer, side, main, dessert). A full list of dishes we annotate is in Supplementary Material B. In all recipes, we detect actions with our tagger (Section 3.2) using ELMo embeddings and trained on the English data of Y'20.

For each dish, we define ten pairs of recipes such that one recipe is the source recipe of an alignment to the next shorter target recipe of the same dish. We select these pairings by measuring the length of recipes in number of actions. Using this methodology, all recipes except for the longest and shortest recipes of each dish are annotated once as source recipe and once as target recipe, respectively. The pairing procedure is motivated by two rationales. 1. Long-to-Short: in order to limit annotator disagreement from 1:n alignments in our data, we always present the longer recipe as the source recipe, as we expect 1:n alignments to be less common if the source recipe is longer than the target recipe. 2. Transitive closure: an alignment from a source recipe to a secondary target recipe can be approximated by (i) the alignments between the source recipe and the target recipe and (ii) the alignments between the target recipe and the secondary target recipe.
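
Read this way, the pairing procedure amounts to sorting a dish's recipes by action count and pairing each recipe with the next shorter one. The small sketch below illustrates this under that reading; the recipe representation is hypothetical.

# Sketch of the Long-to-Short pairing: within one dish, sort recipes by number
# of actions and pair each recipe (source) with the next shorter one (target).
def make_pairs(recipes):
    """recipes: list of (recipe_id, num_actions) tuples for one dish."""
    ordered = sorted(recipes, key=lambda r: r[1], reverse=True)
    # Every recipe except the shortest serves as a source; except the longest,
    # as a target (11 recipes per dish -> 10 pairs).
    return [(ordered[i][0], ordered[i + 1][0]) for i in range(len(ordered) - 1)]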

8 Dishes in L'20 have between 3 and 100 recipes.

4.2 Data Collection

We obtain the alignments in the corpus in a multi-step process: for every source action a, we initially ask three crowd-workers to vote for the correct target action. If we find a unique plurality vote for a target action b, a is aligned to b. If there is no agreement between the crowd-workers, we iteratively ask more crowd-workers to select a target action, until we have a unique plurality vote for the target. This approach optimizes resource spending, as it takes into account that some alignments are very easy (needing few votes) whereas others are harder, resulting in more noise in the votes and therefore needing more votes to obtain a reliable annotation. In extreme cases, where we did not obtain a plurality even after five or six votes, we deferred to an expert annotator: 190 out of 1592 source actions did not receive a unique plurality target; these were adjudicated by two of the authors.
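
A hedged sketch of this vote-aggregation scheme follows; the request_more_votes helper and the cut-off of six votes are illustrative stand-ins for the actual crowdsourcing workflow.

# Sketch of the plurality-vote aggregation: start from the initial (>= 3)
# votes, keep requesting more until a unique plurality target emerges, and
# hand unresolved cases to an expert annotator.
from collections import Counter


def aggregate(initial_votes, request_more_votes, max_votes=6):
    votes = list(initial_votes)               # initial crowd votes
    while True:
        counts = Counter(votes).most_common(2)
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]               # unique plurality reached
        if len(votes) >= max_votes:
            return None                       # defer to an expert annotator
        votes.append(request_more_votes())    # ask one more crowd-worker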

Crowd-sourcing setup. We implemented our crowdsourcing experiment with Lingoturk (Pusse et al., 2016). Participants were hired through Prolific.9 In total we hired 250 participants, each of whom answered on average 32 questions. Each participant was paid 1.47 GBP; the average hourly payment was 8.82 GBP.

A participant's task was to align the set of actions in a source recipe to its target recipe. Importantly, this task is not a simple verb matching task. Participants were instructed to read both recipes in their entirety and were then presented with two source actions at a time in the order in which they appear in the recipe.10 For each source action, the participant was asked to choose one action from the target recipe that it "best corresponds to", or "None of these"; the entire set of actions in the target recipe was available for selection.

9 https://www.prolific.co/; as part of screening for the platform, all participants were native English speakers and had normal vision (in particular, they distinguish colors correctly).

10 If a source recipe has an odd number of actions, the last set of questions for the recipe pair has only one question.


Both recipes were displayed at all times with all actions bolded and in unique colors for ease of identification. Each participant did this for two recipe pairs of differing length and for different dishes, such that the total length of each experiment was roughly comparable. See Supplementary Material B for more details on the annotation setup.

Annotator agreement. To quantitatively assess agreement between our annotators, we compute how often the votes by the participants agreed with the alignment we obtained (i.e., the probability that a participant's vote aligns with the plurality). A distribution of the support behind the plurality vote is shown in Fig. 4. While some questions are easy and receive 100% agreement, many of them only receive a plurality from 50% of the answers. For some questions, disagreement between annotators was so high that the most-chosen answer was only chosen by 20% of the annotators. In such cases, we collected a high number of votes from different participants to obtain a plurality.

Overall, 69.3% of the target selections by our annotators agreed with the annotation in the dataset (selected by plurality vote). This measure is not an upper bound for system performance, because incorrect annotations by an annotator are only reflected in the dataset if by chance several annotators chose the same incorrect target. Otherwise, the incorrect decision by an annotator will be remedied by the requirement of having a plurality vote from at least three different annotators. It is also not an upper bound for human performance, because some annotators were more reliable than others and we report the average over all annotators.

IAA measures. Due to the data collection design (a large group of people answering questions with variable answer sets and a skewed but unknown prior distribution over the answers), common inter-annotator metrics such as Krippendorff's α or Cohen's κ are not applicable: these measures require either a fixed set of possible answers for all questions, knowledge of the prior distribution of the answers, or a fixed number of annotators (or a combination of these requirements). Thus, computing such a reliability metric would require making (incorrect) assumptions about the data and render the resulting number uninterpretable.

Human performance and baseline. To obtain a human accuracy score, a subset of the authors manually annotated ten recipe pairs and obtained a 79% accuracy with respect to the gold standard. The majority baseline for the annotations is 31.3% (always choosing "None").


Figure 4: For each question we computed how many answers agreed with the plurality vote. The plot shows the distribution of questions over this agreement score.


Corpus statistics. Details about ARA, our full annotated corpus, are in Table 3. On average, 16 action pairs are collected for each recipe pair. Notably, the average number of actions per recipe (15.1) is almost double the average of 8 sentences per dish in L'20, further motivating our task of annotating fine-grained actions to collect more detailed recipe information. Of particular interest for our task, 70% of the recipes have re-occurring action words (e.g. "add") that do not signal re-occurring actions based on the surrounding recipe context. There are 366 unique actions distributed over 1659 action instances. The majority of recipes have two repeated verbs; the highest repetition is 5 identical verbs in one recipe, which occurs in two recipes.

4.3 Disagreement Analysis

Qualitative analysis reveals that inter-annotator disagreement has several sources (actions bolded). We expect 10% of actions to be mistagged (Table 1); typos in the original L'20 corpus are a special case of this ("from" misspelled as "form"). Several recipes consist of more than 20 actions and contain multiple identical actions in their surface form (see paragraph Corpus statistics). Though these actions appear in different colors on the interface, it is easy to forget which color corresponds to which action and subsequently misalign it. This can also cause annotators to simply choose the most similar superficial action without considering context: for example, "mix" aligned to "mash" regardless of surrounding ingredients or stage of recipe.


Even given the full recipe context, linguistic peculiarities of recipe text make deciding whether actions between two recipes correspond difficult. We discuss several cases of this and quantify their frequency based on 100 of the questions that initially received no majority (percentages in parentheses). Notably, some of these categories overlap.11

One-to-many alignments (6%). Our Long-to-Short principle (Section 4.1) pays off, although we still find cases where one source action can align to two distinct target actions. We see this result with infinitive actions: "set aside to cool[a_i]," as one action, can be aligned to either action in "transfer[a_j] to cool[a_j+1]." We also see this with prepositional result phrases: "stir[a_k] until combined[a_k+1]." Additional manual analysis shows missed cases are minimal enough to not impact downstream performance.

Many-to-one alignments (26%). The reverse case of one-to-many alignments causes disagreement about whether individual actions that comprise a more complex action should be aligned. This happens frequently in baking recipes, where recipes often differ in when and how they combine wet and dry ingredients. For example in Fig. 3, "Add" adds the milk, butter, and vanilla to eggs in (b), while the same action in (c) adds egg yolks and vanilla essence to milk. Though these high-level actions may sequentially align across recipes, the different ingredients that act as arguments can impact whether they are judged to correspond. Conflated actions contribute to this phenomenon: "Add dry ingredients alternately with wet ingredients"; "Let cool in pan or on wire rack."

Implicit actions (24%). Implicit actions come in many forms in recipe texts and may cause confusion as to whether an action should be aligned to the implicit step or not aligned at all. We see this with nouns that imply actions: "return to continue cooking" versus "return to cooker." We also see actions that must be inferred from their surrounding actions: "Spoon onto baking tray. Take out after 8 minutes." to connote "Bake." Finally, ingredients themselves can imply actions for the chef ("crushed garlic"), or not ("canned tomatoes").

Let and light verbs (22%). Light causative verbs such as let and allow are frequent sources of disagreement. Part of this disagreement stems from our action tagger: while sequences such as "let rest[a_i]" and "allow to stand[a_j]" are tagged as one action, "allow[a_k] to return[a_k+1]" is tagged as two. If a noun intervenes, such as "let[a_l] dough rest[a_l+1]," one action can become two.12 Disagreement can then arise as to which of the two actions carries the semantic weight and should be aligned. By-phrases also introduce confusing subevent structure: "create layers by putting pasta..." versus "layer pasta."

11 For example, 'let' and light verbs can cause 1:n alignments. The 3% we do not report in the text are due to typos (2) and a noun misclassified by our tagger as an action (1).

Negative and conditional events (2%). Some recipes include negative events: "Avoid over mixing" or "without kneading." Conditional events may also cause disagreement based on whether an annotator judges them as real or not: "If refrigerated..."; "When you are ready to bake..."

No best alignment (17%). Though some actions have no best alignment, the crowdsourcing task may bias participants to choose an answer.

The sources of disagreement we find in ARA illustrate the need for analyzing recipe text at the fine-grained level of actions. In refining the sentence-level alignments of L'20 to the cooking action level, we find that judging when and how specific actions "correspond" is a complex task that requires intricate knowledge of the cooking domain.

5 Automatic Alignment of Actions

We develop two automatic alignment models that mimic our crowdsourcing task; these models align action nodes between our parsed action graphs (Section 3.4). We evaluate both models on the crowdsourced alignment dataset (Section 4): one using features solely from the actions to be aligned, and one incorporating parent and child action nodes into the alignment decision.

5.1 Alignment Model Structure

We treat alignment as an assignment problem: for each action in the source recipe R1, we independently decide which action in the target recipe R2 it should align to. Alternatively, a source action may have no adequate alignment and be unaligned.

12 This is a peculiarity of English verb constructions. In pilot annotation on a German corpus, this disagreement does not occur, as all objects stand alone from complex verb constructions and action sequences are not interrupted. Yamakata et al. (2020) notice this difficulty with English, too, and distinguish 'actions by chef' and 'actions by food,' such that "let dough rest" consists of one action by chef and one action by food (Table 6). The authors still notice inconsistency in how annotators evaluate these sequences: while Ac is annotated with 92.7% accuracy, other actions show accuracy between 17% and 46%.


We use a two-block architecture for this classification task.

Encoder. The Encoder generates an encoding vector enc(i) for each action a(i) of a recipe. We re-tokenize each recipe with the BERT tokenizer and obtain token embeddings emb(j) from BERT. For each action a(i) we track the list of tokens t(a(i)) that correspond to this action. We then obtain enc(i) for the two versions of our alignment model in the following way.

In the base model, we run an LSTM over the embeddings to generate the representation for each action, as follows:

enc_b(i) = LSTM_seq([emb(j) | j ∈ t(a(i))])
enc(i) = enc_b(i)

The extended model incorporates structural information about the recipe graph. We extract the child and parent action nodes for each source action. We then combine the base encoding of each action with the base encodings of its children and parents. As each action can have multiple parents and children, we run an LSTM_p over the parent base encodings for all parents p(a(i)) of an action a(i) and an LSTM_c over the child base encodings for all child nodes c(a(i)). We obtain enc_ext(i) by concatenating enc_b(i) with the outputs of these two LSTMs:

enc_p(i) = LSTM_p([enc_b(p) | p ∈ p(a(i))])
enc_c(i) = LSTM_c([enc_b(c) | c ∈ c(a(i))])
enc_ext(i) = [enc_b(i); enc_c(i); enc_p(i)]
enc(i) = enc_ext(i)

Whenever the parent or child list is empty, we replace the LSTM output with a trained embedding representing the empty child/parent list.
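
An illustrative PyTorch sketch of the two encoders (base and extended) follows. The 768-dimensional size matches Table 10, but tensor shapes, module names, and other details are assumptions rather than the released model code.

# Sketch of the action encoders described above. `token_embs` holds the BERT
# embeddings for the tokens of one action; parent_encs / child_encs are lists
# of base encodings of neighbouring actions in the action graph.
import torch
import torch.nn as nn


class ActionEncoder(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.seq_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.parent_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.child_lstm = nn.LSTM(dim, dim, batch_first=True)
        # Trained placeholders for actions with no parents / no children.
        self.empty_parent = nn.Parameter(torch.randn(dim))
        self.empty_child = nn.Parameter(torch.randn(dim))

    def base(self, token_embs):                       # (num_tokens, dim)
        _, (h, _) = self.seq_lstm(token_embs.unsqueeze(0))
        return h[-1].squeeze(0)                       # enc_b(i)

    def extended(self, token_embs, parent_encs, child_encs):
        def pool(lstm, encs, empty):
            if not encs:
                return empty
            _, (h, _) = lstm(torch.stack(encs).unsqueeze(0))
            return h[-1].squeeze(0)
        enc_b = self.base(token_embs)
        enc_p = pool(self.parent_lstm, parent_encs, self.empty_parent)
        enc_c = pool(self.child_lstm, child_encs, self.empty_child)
        return torch.cat([enc_b, enc_c, enc_p], dim=-1)   # enc_ext(i)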

Scorer. The Scorer predicts the alignment target for an action using one-versus-all classification. Given a source action a1(i) ∈ R1, we compute scores s(enc(a1(i)), enc(a2(j))) for the alignment to every target action (including the "None" target) a2(j) ∈ R2 ∪ {none}. For both the base and the extended model, the encoding of the "None" alignment target is a trained vector of the same size as the action encoding. We compute s(enc(a1(i)), enc(a2(j))) using a multi-layer perceptron with two hidden layers and the element-wise product enc(a1(i)) ⊙ enc(a2(j)) as input.
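
A matching sketch of the scorer: an MLP over the element-wise product of source and candidate encodings, with a trained "None" encoding appended to the candidate list. The layer sizes follow Table 10; the code itself is illustrative rather than the released implementation.

# Sketch of the one-versus-all alignment scorer described above.
import torch
import torch.nn as nn


class AlignmentScorer(nn.Module):
    def __init__(self, enc_dim: int):
        super().__init__()
        self.none_encoding = nn.Parameter(torch.randn(enc_dim))
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, source_enc, target_encs):
        # Candidate targets: all actions of the target recipe plus "None".
        candidates = torch.stack(list(target_encs) + [self.none_encoding])
        scores = self.mlp(source_enc * candidates).squeeze(-1)
        return scores  # argmax gives the predicted alignment target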

Model Name                       Accuracy
Human Upper Bound                79.0
Sequential Order                 16.5
Cosine Similarity                41.5
Common Action Pairs              52.1
Our Alignment Model (base)       66.3
Our Alignment Model (extended)   72.4

Table 4: Performance comparison of the three baseline models, alignment model (base), and alignment model (extended) on the Action Alignment Corpus.

Training and evaluation. We train both models using cross-entropy loss, with updates only on incorrect predictions to avoid overfitting. We use 10-fold cross-validation on the ten different dishes, with one dish serving as the test dish, such that the aligner is always evaluated on an unknown domain.
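
The evaluation protocol, in sketch form: each fold holds out one dish entirely, so the aligner is always tested on an unseen dish. The example data structure is hypothetical.

# Sketch of the dish-level 10-fold cross-validation described above.
def dish_folds(examples, dishes):
    """examples: list of dicts with a hypothetical "dish" key; dishes: the ten dish names."""
    for held_out in dishes:                            # ten dishes -> ten folds
        train = [ex for ex in examples if ex["dish"] != held_out]
        test = [ex for ex in examples if ex["dish"] == held_out]
        yield train, test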

5.2 Experiment Results

We implement three baselines for comparison: (i) sequential ordering of alignments, such that a1(i) from recipe R1 is aligned to a2(i) from recipe R2; (ii) cosine similarity scores between the BERT embeddings of action pairs as the scorer; and (iii) common action pair frequencies with 10-fold cross-validation (Table 4). We further compare our model to the human upper bound for alignment accuracy discussed in Section 4.2. Our base alignment model outperforms all the baselines, but we observe a substantial gain in accuracy with the extended alignment model, illustrating the importance of structural information about the recipe in aligning actions. The extended model still does not reach human performance, illustrating the difficulty of the task.
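
For baseline (ii), the scoring reduces to a nearest-neighbour lookup in BERT embedding space. A minimal sketch, assuming action embeddings have already been extracted:

# Sketch of the cosine-similarity baseline: score every (source, target)
# action pair by cosine similarity of their BERT embeddings and pick the best.
import torch
import torch.nn.functional as F


def cosine_baseline(source_emb, target_embs):
    """source_emb: (dim,) tensor; target_embs: (num_targets, dim) tensor."""
    sims = F.cosine_similarity(source_emb.unsqueeze(0).expand_as(target_embs),
                               target_embs, dim=-1)
    return int(torch.argmax(sims))  # index of the predicted target action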

The extended alignment model achieves an accuracy of 69.8% on aligning actions of the longer recipe to actions of the shorter recipe. This suggests that our Long-to-Short approach to data collection yields valid data in both directions.

As a point of comparison, L'20 achieve an F-Score of 54.6 for text-to-text alignments and 70.3 for text-to-video alignment at the sentence level, suggesting that alignments of actions may be harder to annotate than alignments of entire sentences, but easier to align automatically.

6 Conclusion & Future Work

In this paper, we have made two contributions: (i) ARA, a novel corpus of human-annotated alignments between corresponding cooking actions, and (ii) a first neural model to align actions across recipe graphs.


We find that incorporating structural information about recipes improves the accuracy of the neural model, highlighting the usefulness of recipe graphs and of recipe parsers.

Compared to previous work, our corpus and model represent alignments at the level of individual actions and not of entire sentences. In refining the sentence-level alignments of L'20 to the cooking action level, we find that judging when and how specific actions "correspond" is a complex task that requires intricate knowledge of the cooking domain. Alternatively, the complexity of recipe interpretation can be framed as a matter of recognizing nuances in how meaning is construed in recipe text given the genre and its preferred syntactic constructions (Langacker, 1993; Trott et al., 2020).

Looking ahead, our work lays a foundation for research which automatically aggregates multiple recipe graphs for the same dish, identifying common and distinct parts of the different recipes. This opens up a variety of applications in the cooking domain, including dialogue systems which can explain a recipe at different levels of abstraction.

Acknowledgments

Partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102.

References

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio. Association for Computational Linguistics.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610, Suntec, Singapore. Association for Computational Linguistics.

Minsuk Chang, Leonore V. Guillain, Hyeungshik Jung, Vivian M. Hare, Juho Kim, and Maneesh Agrawala. 2018. RecipeScape: An interactive tool for analyzing cooking instructions at scale. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI 2018, Montreal, QC, Canada, April 21–26, 2018, page 451. ACM.

Yuzhe Chen. 2017. A statistical machine learning approach to generating graph structures from food recipes. Master's thesis, Brandeis University.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Jermsak Jermsurawong and Nizar Habash. 2015. Predicting the structure of cooking recipes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 781–786, Lisbon, Portugal. Association for Computational Linguistics.

Chloe Kiddon, Ganesa Thandavam Ponnuraj, Luke Zettlemoyer, and Yejin Choi. 2015. Mise en place: Unsupervised interpretation of instructional recipes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ronald W. Langacker. 1993. Universals of construal. In Annual Meeting of the Berkeley Linguistics Society, volume 19, pages 447–463.

Angela Lin, Sudha Rao, Asli Celikyilmaz, Elnaz Nouri, Chris Brockett, Debadeepta Dey, and Bill Dolan. 2020. A recipe for creating multimodal aligned datasets for sequential tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4871–4884, Online. Association for Computational Linguistics.

Shinsuke Mori, Hirokuni Maeta, Tetsuro Sasada, Koichiro Yoshino, Atsushi Hashimoto, Takuya Funatomi, and Yoko Yamakata. 2014. FlowGraph2Text: Automatic sentence skeleton compilation for procedural text generation. In Proceedings of the 8th International Conference on Natural Language Generation (INLG).

Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry A. Kautz, Jiebo Luo, and Daniel Gildea. 2014. Unsupervised alignment of natural language instructions with video segments. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, 2014, Quebec City, Quebec, Canada, pages 1558–1564. AAAI Press.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Florian Pusse, Asad Sayeed, and Vera Demberg. 2016. LingoTurk: Managing crowdsourced tasks for psycholinguistics. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 57–61, San Diego, California. Association for Computational Linguistics.

Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 979–988, Uppsala, Sweden. Association for Computational Linguistics.

Sean Trott, Tiago Timponi Torrent, Nancy Chang, and Nathan Schneider. 2020. (Re)construing meaning in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5170–5184, Online. Association for Computational Linguistics.

Lilian Wanzare, Alessandra Zarcone, Stefan Thater, and Manfred Pinkal. 2017. Inducing script structure from crowdsourced event descriptions via semi-supervised clustering. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 1–11, Valencia, Spain. Association for Computational Linguistics.

Lilian D. A. Wanzare, Alessandra Zarcone, Stefan Thater, and Manfred Pinkal. 2016. A crowdsourced database of event sequence descriptions for the acquisition of high-quality script knowledge. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3494–3501, Portorož, Slovenia. European Language Resources Association (ELRA).

Yoko Yamakata, Shinji Imahori, Hirokuni Maeta, and Shinsuke Mori. 2016. A method for extracting major workflow composed of ingredients, tools, and actions from cooking procedural text. In 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

Yoko Yamakata, Shinsuke Mori, and John Carroll. 2020. English recipe flow graph corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5187–5194, Marseille, France. European Language Resources Association.

Mehmet Ozgen. 2019. Tagging and action graph generation for recipes. MSc thesis, Hacettepe University.

A Tagger and Parser Evaluation

A.1 Methodology

All our measures are computed as averages of the mean values of four individually trained instances of our model and are reported together with the respective standard deviations. For the evaluation of parsing automatically tagged recipe text (Table 2 in the paper), the recipes are tagged by four individually trained instances of our tagger (with ELMo embeddings) before they are parsed by four individually trained instances of the parser.

While Y'20 cross-validate over the whole 300-r corpus, we train on a subset of 240 recipes and evaluate on a subset of 30 recipes. Each subset is randomly chosen from the full corpus such that the proportions of 100-r to 200-r are preserved as 1:2.

A.2 Tagger

For hyper-parameters, see Table 5. The hyper-parameters were fine-tuned separately for the pre-trained embeddings (ELMo and BERT, respectively).

Hyper-parameter         ELMo            BERT
Hidden size BiLSTM      50              200
Layers BiLSTM           2               2
Dropout BiLSTM          0.5             0.5
Dropout CRF             0.5             0.5
Regularization method   L2 (α = 0.5)    L2 (α = 0.1)
Optimization method     Adam            Adam
Learning rate           0.0075          0.001
Gradient norm           10.0            10.0
Training epochs         50              50

Table 5: Hyper-parameters for the tagger after fine-tuning on the German data set. The number of epochs is an observation.

For reference, we display an overview of the labels defined by Y'20 in Table 6. The label-specific performance for action sequences is itemized in detail in Table 7.

A.3 Parser

We did not perform any fine-tuning on the parser. The hyper-parameters are reported in Table 8.

Hyper-parameter         Value
Tag embedding dim       100
Hidden size BiLSTM      400
Layers BiLSTM           3
Input dropout           0.3
BiLSTM dropout          0.3
Classifier dropout      0.3
Optimization method     Dense Sparse Adam (betas (0.9, 0.9))
Gradient norm           5
Training epochs         20 ± 10

Table 8: Hyper-parameters for the parser; no fine-tuning performed. The number of training epochs is an observation.


Label   Meaning                           Explanation
F       Food                              Eatable; also intermediate products
T       Tool                              Knife, container, etc.
D       Duration                          Duration of cooking
Q       Quantity                          Quantity of food
Ac      Action by chef                    Verb representing a chef's action
Ac2     Discontinuous Ac (English only)   Second, non-contiguous part of a single action by chef
Af      Action by food                    Verb representing action of a food
At      Action by tool (English only)     Verb representing a tool's action
Sf      Food state                        Food's initial or intermediate state
St      Tool state                        Tool's initial or intermediate state

Table 6: Y'20 labels for the tagging task.

Model             Data   Label               # Instances   Precision    Recall        F-Score
Y'20              Y'20   {Ac, Ac2, Af, At}                 88.7         89.3          89.0
Our model ('21)   Y'20   {Ac, Ac2, Af, At}                 92.0 ± 1.6   88.4 ± 1.8    90.1 ± 1.7
Y'20              Y'20   Ac                  4977          92.3         93.1          92.7
Our model ('21)   Y'20   Ac                  483           94.6 ± 1.1   91.7 ± 1.6    93.1 ± 1.2
Y'20              Y'20   Ac2                 178           43.8         46.8          45.3
Our model ('21)   Y'20   Ac2                 16            69.2 ± 9.0   68.8 ± 10.2   68.1 ± 3.2
Y'20              Y'20   Af                  255           51.4         50.4          50.9
Our model ('21)   Y'20   Af                  32            64.8 ± 9.9   47.7 ± 7.8    54.9 ± 8.7
Y'20              Y'20   At                  15            60.0         10.0          17.1
Our model ('21)   Y'20   At                  2             100 ± 0.0    100 ± 0.0     100 ± 0.0

Table 7: Label-specific performance for action sequences; comparison of our model and that of Y'20. The values in lines 1 and 2 are micro-averaged over the four individual action labels.

The set of edge labels as determined by Y'20 is displayed in Table 9.

B Crowdsourcing

B.1 Dishes Annotated

(1) Baked Ziti; (2) Blueberry Banana Bread; (3) Cauliflower Mash; (4) Chewy Chocolate Chip Cookies; (5) Garam Masala; (6) Homemade Pizza Dough; (7) Orange Chicken; (8) Pumpkin Chocolate Chip Bread; (9) Slow Cooker Chicken Tortilla Soup; (10) Waffles.

B.2 Experiment Platform

Figure 5 gives a screenshot of the instruction page, and Figure 6 gives a screenshot of the question page.

C Alignment Model Training Details

In order to re-tokenize and generate token embeddings for the recipes, we utilized the bert-base-uncased BERT model with an embedding dimension of 768. We trained both alignment models (base and extended) for 10 folds with 1 dish in the test set and 9 dishes for training. In each fold, the model runs for 40 epochs with a train/dev split of 8/1 dishes, respectively. We use Adam as the optimizer with a 0.0001 learning rate and cross-entropy loss as the loss function during training. For the aligner's hyper-parameters, see Table 10.

Hyper-parameter          Value
BERT embedding dim       768
LSTM (seq) hidden dim    768
LSTM (p) hidden dim      768
LSTM (c) hidden dim      768
MLP layer 1 output dim   128
MLP layer 2 output dim   32
MLP layer 3 output dim   1

Table 10: Hyper-parameters for the Alignment model.


Label       Meaning                       Explanation
Agent       Subject                       Relationship with actions (Ac or Af)
Targ        Direct object                 Relationship with actions (Ac or Af)
Dest        Indirect object (container)   Relationship with actions (Ac or Af)
t-comp      Tool complement               Tool used in an action
F-comp      Food complement               Food used as a tool
F-eq        Food equality                 Identical food
F-part-of   Food part-of                  Refer to a part of a food
F-set       Food set                      Refer to a set of foods
T-eq        Tool equality                 Identical tool
T-part-of   Tool part-of                  Refer to a part of a tool
A-eq        Action equality               Identical action (Ac, Af)
V-tm        Head verb for timing, etc.
other-mod   Other relationships

Table 9: Y'20 edge labels for the parsing task.

Figure 5: A screen shot of the instruction page.


Figure 6: A screen shot of the experiment interface.