


Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1676–1690, Brussels, Belgium, October 31 – November 4, 2018. ©2018 Association for Computational Linguistics


Learning to Learn Semantic Parsers from Natural Language Supervision

Igor Labutov∗
LAER AI, Inc.
New York, NY
[email protected]

Bishan Yang∗
LAER AI, Inc.
New York, NY
[email protected]

Tom Mitchell
Carnegie Mellon University
Pittsburgh, PA
[email protected]

Abstract

As humans, we often rely on language to learn language. For example, when corrected in a conversation, we may learn from that correction, over time improving our language fluency. Inspired by this observation, we propose a learning algorithm for training semantic parsers from supervision (feedback) expressed in natural language. Our algorithm learns a semantic parser from users' corrections such as "no, what I really meant was before his job, not after", by also simultaneously learning to parse this natural language feedback in order to leverage it as a form of supervision. Unlike supervision with gold-standard logical forms, our method does not require the user to be familiar with the underlying logical formalism, and unlike supervision from denotation, it does not require the user to know the correct answer to their query. This makes our learning algorithm naturally scalable in settings where existing conversational logs are available and can be leveraged as training data. We construct a novel dataset of natural language feedback in a conversational setting, and show that our method is effective at learning a semantic parser from such natural language supervision.

1 Introduction

Semantic parsing is the problem of mapping a natural language utterance into a formal meaning representation, e.g., an executable logical form (Zelle and Mooney, 1996). Because the space of all logical forms is large but constrained by an underlying structure (i.e., all trees), the problem of learning a semantic parser is commonly formulated as an instance of structured prediction.

∗ Work done while at Carnegie Mellon University.

Historically, approaches based on supervised learning of structured prediction models emerged among the first in the semantic parsing community, and they remain common (Zettlemoyer and Collins, 2005, 2009; Kwiatkowski et al., 2010). A well-recognized practical challenge in supervised learning of structured models is that the fully annotated structures (e.g., logical forms) needed for training are often highly labor-intensive to collect. This problem is further exacerbated in semantic parsing by the fact that these annotations can only be produced by people familiar with the underlying logical language, making it challenging for non-experts to construct large-scale datasets.

Over the years, this practical observation has spurred many creative solutions to training semantic parsers that are capable of leveraging weaker forms of supervision, amenable to non-experts. One such weaker form of supervision relies on logical form denotations (i.e., the results of a logical form's execution), rather than the logical form itself, as the "supervisory" signal (Clarke et al., 2010; Liang et al., 2013; Berant et al., 2013; Pasupat and Liang, 2015; Liang et al., 2016; Krishnamurthy et al., 2017). In Question Answering (QA), for example, this means the annotator needs only to know the answer to a question, rather than the full SQL query needed to obtain that answer. Paraphrasing of utterances already annotated with logical forms is another practical approach to scaling up annotation without requiring experts with knowledge of the underlying logical formalism (Berant and Liang, 2014; Wang et al., 2015).

Although these and similar methods do reduce the difficulty of the annotation task, collecting even these weaker forms of supervision (e.g., denotations and paraphrases) still requires a dedicated annotation event that occurs outside of the normal interactions between the end-user and the semantic parser.¹ Furthermore, expert knowledge may still be required, even if the underlying logical form does not need to be given by the annotator (QA denotations, for example, require the annotator to know the correct answer to the question – an assumption which does not hold for end-users who asked the question with the goal of obtaining the answer). In contrast, our goal is to leverage natural language feedback and corrections that may occur naturally as part of the continuous interaction with the non-expert end-user, as training signal to learn a semantic parser.

¹ Although in some cases existing datasets can be leveraged to extract this information.

Figure 1: Example of (i) a user's original utterance, (ii) the confirmation query generated by inverting the original parse, and (iii) the user's feedback towards the confirmation query (i.e., the original parse):

User Utterance: "Before [September 10th], how many places was [Brad] employed?"
System NLG: "find employment [Brad] had, prior to [September 10th]"
User Feedback: "Yes, but I also want you to count how many places Brad was employed at."

The core challenge in leveraging natural language feedback as a form of supervision in training semantic parsers, however, is the challenge of correctly parsing that feedback to extract the supervisory signal embedded in it. Parsing the feedback, just like parsing the original utterance, requires its own semantic parser trained to interpret that feedback. Motivated by this observation, our main contribution in this work is a semi-supervised learning algorithm that learns a task parser (e.g., a question parser) from feedback utterances while simultaneously learning a parser to interpret the feedback utterances. Our algorithm relies only on a small number of annotated logical forms, and can continue learning as more feedback is collected from interactions with the user.

Because our model learns from supervision that it simultaneously learns to interpret, we call our approach learning to learn semantic parsers from natural language supervision.

2 Problem Formulation

Formally, the setting proposed in this work can be modelled as follows: (i) the user poses a natural language input ui (e.g., a question) to the system, (ii) the system parses the user's utterance ui, producing a logical form ŷi, (iii) the system communicates ŷi to the user in natural language in the form of a confirmation (i.e., "did you mean ..."), and (iv) in response to the system's confirmation, the user generates a feedback utterance fi, which may be a correction of ŷi expressed in natural language. The observed variables in a single interaction are the task utterance ui, the predicted task logical form ŷi, and the user's feedback fi; the true logical form yi is hidden. See Figure 1 for an illustration.

A key observation that we make from the above formulation is that learning from such interactions can be done effectively in an offline (i.e., non-interactive) setting, using only the logs of past interactions with the user. Our aim is to formulate a model that can learn a task semantic parser (i.e., one parsing the original utterance u) from such interaction logs, without access to the true logical forms (or denotations) of the users' requests.

2.1 Modelling conversational logs

Formally, we propose to learn a semantic parser from conversational logs represented as follows:

D = {(ui, ŷi, fi)}i=1,...,N

where ui is the user's task utterance (e.g., a question), ŷi is the system's original parse (logical form) of that utterance, fi is the user's natural language feedback towards that original parse, and N is the number of dialog turns in the log. Note that the original parse ŷi of utterance ui could come from any semantic parser that was deployed at the time the data was logged – no assumptions are made about how ŷi was produced.

Contrast this with the traditional learning settings for semantic parsers, where the user (or annotator) provides the correct logical form yi or the execution of the correct logical form ⟦yi⟧ (denotation) (Table 1). In our setting, instead of the correct logical form yi, we only have access to the logical form produced by whatever semantic parser was interacting with the user at the time the data was collected, i.e., ŷi is not necessarily the correct logical form (though it could be). In this work, however, we focus only on corrective feedback (i.e., cases where ŷ ≠ y), as our main objective is to evaluate the ability to interpret rich natural language feedback (which is only given when the original parse is incorrect). Our key hypothesis in this work is that although we do not observe yi directly, combining ŷi with the user's feedback fi should be sufficient to get a better estimate of the correct logical form (closer to yi), which we can then leverage as a "training example" to improve the task parser (and the feedback parser).



Supervision              | Dataset
Full logical forms       | {(ui, yi)}i=1,...,N
Denotations              | {(ui, ⟦yi⟧)}i=1,...,N
Binary feedback          | {(ui, ⟦ŷi = yi⟧)}i=1,...,N
NL feedback (this work)  | {(ui, ŷi, fi)}i=1,...,N

Table 1: Different types of supervision used in the literature for training semantic parsers, and the corresponding data needed for each type of supervision. Notation used in the table: u corresponds to a user utterance (language), y corresponds to a gold-standard logical form parse of u, ŷ corresponds to a predicted logical form, ⟦·⟧ is the result of executing the expression inside the brackets, and f is the user's feedback expressed in natural language in the context of an utterance and a predicted logical form.
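To make the contrast in Table 1 concrete, the conversational-log supervision used in this work can be represented as a plain record type. A minimal sketch; the field names are illustrative (the paper does not prescribe a data format), and the logical-form string is invented:

```python
from typing import NamedTuple

class NLFeedbackExample(NamedTuple):
    """One dialog turn (u_i, predicted parse, f_i) from the conversational log."""
    utterance: str        # u_i: user's original task utterance
    predicted_parse: str  # logical form produced by whatever parser was deployed
    feedback: str         # f_i: user's natural language feedback on that parse

# The running example from Figure 1 (logical form here is a made-up placeholder):
ex = NLFeedbackExample(
    utterance="Before September 10th, how many places was Brad employed?",
    predicted_parse="find_employment(person=Brad, before=Sept_10)",
    feedback="Yes, but I also want you to count how many places Brad was employed at.",
)
```

Note that, unlike full-logical-form supervision, no field of this record requires an expert annotator: both the utterance and the feedback are produced by the end-user, and the predicted parse is logged by the system.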

3 Natural Language Supervision

3.1 Learning problem formulation

In this section, we propose a learning algorithm for training a semantic parser from natural language feedback. We will use the terms task parser and feedback parser to refer to the two distinct parsers used for parsing the user's original task utterance (e.g., a question) and parsing the user's follow-up feedback utterance, respectively. Our learning algorithm does not assume any specific underlying model for the two parsers, aside from the requirement that each parser specifies a probability distribution over logical forms given the utterance (for the task parser) and the utterance plus feedback (for the feedback parser):

Task Parser: P(y | u; θt)
Feedback Parser: P(y | u, f, ŷ; θf)

where θt and θf parametrize the task and feedback parsers respectively. Note that the key distinction between the task and feedback parser models is that, in addition to the user's original task utterance u, the feedback parser also has access to the user's feedback utterance f and the original parser's prediction ŷ. We now introduce a joint model that combines the task and feedback parsers in a way that encourages the two models to agree:
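The learning algorithm only needs each parser to expose these two conditional distributions. A toy sketch of that interface; the class and method names are my assumptions, and the scoring rules are purely illustrative stand-ins for real parser models:

```python
import random

class TaskParser:
    """P(y | u; theta_t): scores and samples logical forms given the utterance.
    Toy version: uniform over a fixed candidate set."""
    def __init__(self, candidate_parses):
        self.candidates = candidate_parses
    def prob(self, y, u):
        return 1.0 / len(self.candidates) if y in self.candidates else 0.0
    def sample(self, u):
        return random.choice(self.candidates)

class FeedbackParser:
    """P(y | u, f, y_hat; theta_f): additionally conditions on the feedback f
    and the original predicted parse y_hat. Toy version: fixed unnormalized
    weights standing in for feedback-conditioned scores."""
    def __init__(self, weights):
        self.weights = weights  # parse -> unnormalized score
    def prob(self, y, u, f, y_hat):
        return self.weights.get(y, 0.0) / sum(self.weights.values())
    def sample(self, u, f, y_hat):
        parses = list(self.weights)
        return random.choices(parses, weights=[self.weights[p] for p in parses])[0]
```

In the paper both parsers are neural encoder-decoders (Sections 3.3 and 3.4); the point of the sketch is only that the training algorithm below treats them as black boxes behind `prob` and `sample`.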

P(y | u, f, ŷ; θt, θf) = (1/Z) · P(y | u; θt) · P(y | u, f, ŷ; θf)

where the first factor on the right-hand side is the task parser, the second is the feedback parser, and Z is a normalizer.

At training time, our objective is to maximize the above joint likelihood by optimizing the parser parameters θt and θf of the task and feedback parsers respectively. The intuition behind optimizing the joint objective is that it encourages the "weaker" model that does not have access to the feedback utterance to agree with the "stronger" model that does (i.e., using the feedback parser's prediction as a noisy "label" to bootstrap the task parser), while conversely encouraging the more "complex" model to agree with the "simpler" model (i.e., using the simpler task parser model as a "regularizer" for the more complex feedback parser; note that the feedback parser generally has higher model complexity compared to the task parser because it incorporates additional parameters to account for processing the feedback input).² Note that the feedback parser's output is not simply the meaning of the feedback utterance in isolation, but instead the revised semantic interpretation of the original utterance, guided by the feedback utterance. In this sense, the task faced by the feedback parser is to both determine the meaning of the feedback and to apply that feedback to repair the original interpretation of ui. Note that this model is closely related to co-training (Blum and Mitchell, 1998) (and more generally multi-view learning (Xu et al., 2013)) and is also a special case of a product-of-experts (PoE) model (Hinton, 1999).
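Up to the normalizer Z, the joint model is simply the product of the two parsers' probabilities, so in log space it prefers parses on which both "experts" agree. A minimal sketch:

```python
import math

def joint_log_score(p_task, p_feedback):
    """Unnormalized log of the joint model:
    log P(y | u; theta_t) + log P(y | u, f, y_hat; theta_f),
    dropping the constant log Z."""
    return math.log(p_task) + math.log(p_feedback)

# A parse both models find plausible outscores one that only a single
# model is confident about:
consensus = joint_log_score(0.4, 0.5)   # both moderately confident
one_sided = joint_log_score(0.9, 0.01)  # task parser confident, feedback parser not
assert consensus > one_sided
```

This is the product-of-experts behavior the text describes: an expert assigning very low probability to a parse can veto it regardless of the other expert's confidence.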

3.2 Learning

The problem of maximizing the joint likelihood P(y | u, f, ŷ; θt, θf) can be approached as a standard problem of maximum likelihood estimation in the presence of latent variables (i.e., the unobserved logical form y), suggesting the application of the Expectation Maximization (EM) algorithm for learning the parameters θt and θf. The direct application of EM in our setting, however, faces a complication in the E-step. Because the hidden variables (logical forms y) are structured, computing the posterior over the space of all logical forms and taking the expectation of the log-likelihood with respect to that posterior is generally intractable (unless we assume certain factorized forms for the logical form likelihood).

Instead, we propose to approximate the E-step by replacing the expectation of the log joint-likelihood with its point estimate, using the maximum a posteriori (MAP) estimate of the posterior over logical forms y (this is sometimes referred to as "hard EM"). Obtaining the MAP estimate of the posterior over logical forms may itself be a difficult optimization problem, and the specific algorithm for obtaining it would depend on the internal models of the task and feedback parsers.

² Note that in practice, to avoid locally optimal degenerate solutions (e.g., where the feedback parser learns to trivially agree with the task parser by learning to ignore the feedback), some amount of labeled logical forms would be required to pre-train both models.
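The resulting training procedure alternates a hard E-step (a MAP estimate of the latent parse) with a single SGD step per parser, as laid out in Algorithm 1. A schematic sketch with the inference and update steps passed in as callables; all names here are mine:

```python
def hard_em_epoch(data, map_infer, update_task, update_feedback):
    """One pass of hard-EM training over the conversational log.

    data:            iterable of (u, y_hat, f) triples
    map_infer:       (u, y_hat, f) -> MAP estimate of the latent parse y
    update_task:     one SGD step on log P(y | u; theta_t)
    update_feedback: one SGD step on log P(y | u, f, y_hat; theta_f)
    """
    for u, y_hat, f in data:
        y_star = map_infer(u, y_hat, f)       # hard E-step: point estimate of y
        update_task(y_star, u)                # M-step for the task parser
        update_feedback(y_star, u, f, y_hat)  # M-step for the feedback parser
```

Taking only a single gradient step per example before re-estimating the latent parse (rather than running each M-step to convergence) matches the incremental flavor of Algorithm 1.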

As an alternative, we propose a simple MCMC-based algorithm for approximating the MAP estimate of the posterior over logical forms that does not require access to the internals of the parser models, i.e., allowing us to conveniently treat both parsers as "black boxes". The only assumption that we make is that it is easy to sample logical forms from the individual parsers (though not necessarily from the joint model), as well as to compute the likelihoods of those sampled logical forms under at least one of the models.

We use Metropolis Hastings to sample from the posterior over logical forms, using one of the parser models as the proposal distribution. Specifically, if we choose the feedback parser as the proposal distribution, and initialize the Markov chain with a logical form sampled from the task parser, it can be shown that the acceptance probability r for the first sample conveniently simplifies to the following expression:

r = min(1, P(yf | u; θt) / P(yt | u; θt))    (1)

where yt and yf are logical forms sampled from the task and feedback parsers respectively. Intuitively, the above expression compares the likelihood of the logical form proposed by the task parser to the likelihood of the logical form proposed by the feedback parser, but with both likelihoods computed under the same task parser model (making the comparison fair). The proposed parse is then accepted with a probability proportional to that ratio. See Algorithm 2 for details.
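This sampler can be sketched in a few lines, assuming each parser exposes `sample` and `prob` methods (an illustrative interface, not the paper's API); the small floor on the denominator is my addition to keep the ratio defined:

```python
import random

def mh_map(u, y_hat, f, task, feedback, n_iters=100, rng=random):
    """Metropolis Hastings-based MAP estimation of the latent parse.

    Proposes parses from the feedback parser, scores acceptance under the
    task parser only (as in Eq. 1), and returns the accepted sample with
    the highest unnormalized joint probability."""
    samples = {}
    y_curr = task.sample(u)                    # initialize the chain from the task parser
    for _ in range(n_iters):
        y_prop = feedback.sample(u, f, y_hat)  # propose from the feedback parser
        r = min(1.0, task.prob(y_prop, u) / max(task.prob(y_curr, u), 1e-12))
        if rng.random() < r:                   # accept with probability r
            # score accepted samples under the (unnormalized) joint model
            samples[y_prop] = task.prob(y_prop, u) * feedback.prob(y_prop, u, f, y_hat)
            y_curr = y_prop
    return max(samples, key=samples.get) if samples else y_curr
```

Because both the proposal and the acceptance test only call `sample` and `prob`, the parsers can be arbitrary models; nothing about their internals (gradients, beam structure, etc.) is needed for inference.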

Finally, given this MAP estimate of y, we can perform optimization over the parser parameters θt and θf using a single step of stochastic gradient descent before re-estimating the latent logical form y. We also approximate the gradients of each parser model by ignoring the gradient terms associated with the log of the normalizer Z.³

³ We have experimented with a sampling-based approximation of the true gradients using contrastive divergence (Hinton, 2002); however, we found that our approximation works sufficiently well empirically. See Algorithm 1 for more details of the complete algorithm.

Algorithm 1: Semantic Parser Training from Natural Language Supervision

Input: D = {(ui, ŷi, fi)}i=1,...,N
Output: Task parser parameters θt; feedback parser parameters θf
Parameter: Number of training epochs T

for t = 1 to T do
    for i = 1 to N do
        yf_i ← MH-MAP(ui, ŷi, fi, θt, θf)
        ∇θt ← ∇ log P(yf_i | ui; θt)
        ∇θf ← ∇ log P(yf_i | ui, fi, ŷi; θf)
        θt ← SGD(θt, ∇θt)
        θf ← SGD(θf, ∇θf)
    end
end

Algorithm 2: Metropolis Hastings-based MAP estimation of latent semantic parse

Input: u, ŷ, f, θt, θf
Output: Latent parse yf
Parameter: Number of sampling iterations N

Function MH-MAP(u, ŷ, f, θt, θf):
    samples ← [ ]
    // sample parse from task parser
    Sample yt ∼ P(y | u; θt)
    ycurr ← yt
    for i = 1 to N do
        // sample parse from feedback parser
        Sample yf ∼ P(y | u, f, ŷ; θf)
        r ← min(1, P(yf | u; θt) / P(ycurr | u; θt))
        Sample accept ∼ Bernoulli(r)
        if accept then
            pt ← P(yf | u; θt)
            pf ← P(yf | u, f, ŷ; θf)
            samples[yf] ← pt · pf
            ycurr ← yf
        end
    end
    yf ← argmax samples
    return yf

3.3 Task Parser Model

Our task parser model is implemented based on existing attention-based encoder-decoder models (Bahdanau et al., 2014; Luong et al., 2015). The encoder takes in a (tokenized) utterance u and computes a context-sensitive embedding hi for each token ui using a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), where hi is the concatenation of the hidden states at position i output by the forward LSTM and the backward LSTM. The decoder generates the output logical form y one token at a time using another LSTM. At each time step j, it generates yj based on the current LSTM hidden state sj, a summary of the input context cj, and attention scores aji, which are used for attention-based copying as in (Jia and Liang, 2016). Specifically,

p(yj = w, w ∈ Vout | u, y1:j−1) ∝ exp(Wo [sj; cj])

p(yj = ui | u, y1:j−1) ∝ exp(aji)

where yj = w denotes that yj is chosen from the output vocabulary Vout; yj = ui denotes that yj is a copy of ui; aji = sjᵀ Wa hi is an attention score on the input word ui; cj = Σi αi hi with αi ∝ exp(aji) is a context vector that summarizes the encoder states; and Wo and Wa are matrix parameters to be learned. After generating yj, the decoder LSTM updates its hidden state sj+1 by taking as input the concatenation of the embedding vector for yj and the context vector cj.
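The generate and copy scores are normalized together into a single output distribution (one common reading of the attention-based copying setup cited above). A pure-Python sketch with shapes flattened for clarity:

```python
import math

def output_distribution(vocab_logits, copy_logits):
    """One softmax over generate-from-vocabulary scores (Wo[s_j; c_j]) and
    copy-from-input scores (a_ji), so that the probabilities of generating
    any vocabulary word and copying any input token sum to 1."""
    logits = list(vocab_logits) + list(copy_logits)
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    n = len(vocab_logits)
    return probs[:n], probs[n:]          # (vocab probs, copy probs)

vocab_p, copy_p = output_distribution([1.0, 0.5], [2.0])
assert abs(sum(vocab_p) + sum(copy_p) - 1.0) < 1e-9
```

In a real implementation these would be tensor operations over a batch, but the normalization structure is the same.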

An important problem in semantic parsing for conversation is resolving references to people and things mentioned in the dialog context. Instead of treating coreference resolution as a separate problem, we propose a simple way to resolve it as part of semantic parsing. For each input utterance, we record a list of previously-occurring entity mentions m = {m1, ..., mL}. We consider entity mentions of four types: persons, organizations, times, and topics. The encoder now takes in the utterance u concatenated with m,⁴ and the decoder can generate a referenced mention through copying.
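The mention-augmented encoder input can be sketched as follows; the bracket boundary symbols follow footnote 4, while the tokenization details and the particular type-ordering key are my assumptions:

```python
def build_encoder_input(utterance_tokens, mentions):
    """Append previously-mentioned entities to the utterance, each wrapped
    in the boundary symbols '[' and ']' so the decoder can copy a whole
    mention. `mentions` is a list of (entity_type, mention_text) pairs."""
    type_order = {"person": 0, "organization": 1, "time": 2, "topic": 3}
    tokens = list(utterance_tokens)
    for mtype, text in sorted(mentions, key=lambda m: type_order.get(m[0], 99)):
        tokens += ["["] + text.split() + ["]"]
    return tokens

inp = build_encoder_input(
    ["how", "many", "places", "was", "he", "employed", "?"],
    [("time", "September 10th"), ("person", "Brad")],
)
# the person mention is appended before the time mention
```

Because the mentions become ordinary encoder positions, the copy mechanism from the previous section can resolve "he" by copying "Brad" without any separate coreference component.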

The top parts of Figure 7 and Figure 8 visualize the attention mechanism of the task parser, where the decoder attends to both the utterance and the conversational context during decoding. Note that the conversational context is only useful when the utterance contains reference mentions.

3.4 Feedback Parser Model

Our feedback parser model is an extension of the encoder-decoder model of the previous section. The encoder consists of two bidirectional LSTMs: one encodes the input utterance u along with the history of entity mentions m, and the other encodes the user feedback f. At each time step j during decoding, the decoder computes attention scores over each word ui in utterance u as well as each feedback word fk, based on the decoder hidden state sj, the context embedding bk output by the feedback encoder, and a learnable weight matrix We: ejk = sjᵀ We bk. The input context vector cj of the previous section is updated by adding a context vector that attends to both the utterance and the feedback:

c′j = Σi αi hi + Σk βk bk

where αi ∝ exp(aji) and βk ∝ exp(ejk). Accordingly, the decoder is allowed to copy words from the feedback:

p(yj = fk | u, f, y1:j−1) ∝ exp(ejk)

⁴ The mentions are ordered based on their types, and each mention is wrapped by the special boundary symbols [ and ].
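The combined context vector follows directly from the equation above; a small sketch (normalizing the two sets of attention weights separately is an assumption, since the text only gives each set up to proportionality):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def combined_context(h, a, b, e):
    """c'_j = sum_i alpha_i h_i + sum_k beta_k b_k, where h are utterance
    encoder states with attention scores a, and b are feedback encoder
    states with attention scores e (one score per state)."""
    alpha, beta = softmax(a), softmax(e)
    dim = len(h[0])
    return [sum(w * v[d] for w, v in zip(alpha, h)) +
            sum(w * v[d] for w, v in zip(beta, b)) for d in range(dim)]

c = combined_context(h=[[1.0, 0.0]], a=[0.0], b=[[0.0, 1.0]], e=[0.0])
# with a single state per source, c'_j is just h_0 + b_0
```

The decoder thus conditions on both what the user originally asked and what the feedback says, which is what lets it output a repaired interpretation rather than a parse of the feedback alone.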

The bottom parts of Figure 7 and Figure 8 visualize the attention mechanism of the feedback parser, where the decoder attends to the utterance, the conversational context, and the feedback during decoding.

Note that our feedback model does not explicitly incorporate the original logical form ŷ that the user was generating their feedback towards. We experimented with a number of ways to incorporate ŷ into the feedback parser model, but found it most effective to instead use it during MAP inference for the latent parse y. When sampling a logical form in Algorithm 2, we simply reject the sample if the sampled logical form matches ŷ.
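That rejection step can be sketched as a thin wrapper around the proposal sampler; the retry cap is my addition to keep the sketch terminating:

```python
def propose_excluding(sample_fn, y_hat, max_tries=20):
    """Draw a proposal but reject any sample equal to the original
    (incorrect) parse y_hat, mirroring the filter applied during MAP
    inference over the latent parse."""
    for _ in range(max_tries):
        y = sample_fn()
        if y != y_hat:
            return y
    return None  # sampler kept proposing y_hat; caller may skip this step

proposals = iter(["bad_parse", "bad_parse", "good_parse"])
result = propose_excluding(lambda: next(proposals), "bad_parse")
```

This is a natural fit for the corrective-feedback setting: since feedback is only collected when ŷ is wrong, ŷ can be excluded from the hypothesis space outright.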

4 Dataset

In this work we focus on the problem of semantic parsing in a conversational setting and construct a new dataset for this task. Note that a parser is required to resolve references to the conversational context during parsing, which in turn may further amplify ambiguity and propagate errors to the final logical form. Conveniently, the same conversational setting also offers a natural channel for correcting such errors via natural language feedback that users can express as part of the conversation.

4.1 Dataset construction

We choose conversational search in the domain of email and biographical research as the setting for our dataset. Large enterprises produce large amounts of text during daily operations, e.g., emails, reports, and meeting memos. There is an increasing need for systems that allow users to quickly find information over text through search. Unlike routine tasks like booking airline tickets, search tasks often involve many facets, some of which may or may not be known to the users when the search begins. For example, knowing where a person works may trigger a followup question about whether the person communicates with someone internally about a certain topic. Such search tasks will often be handled most naturally through dialog.

Figure 1 shows example dialog turns containing the user's utterance (question) u, the system's confirmation of the original parse ŷ, and the user's feedback f. To simplify and scale data collection, we decouple the dialog turns and collect feedback for each dialog turn in isolation. We do that by showing a worker on Mechanical Turk the original utterance u, the system's natural language confirmation generated by inverting the original logical form ŷ, and the dialog context summarized in a table. See Figure 2 for a screenshot of the Mechanical Turk interface. The dialog context is summarized by the entities (of types person, organization, time and topic) that were mentioned earlier in the conversation and could be referenced in the original question u shown on the screen.⁵ The worker is then asked to type their feedback given (u, ŷ) and the dialog context displayed on the screen. Turkers are instructed to write a natural language feedback utterance that they might otherwise say in a real conversation when attempting to correct another person's incorrect understanding of their question.

We recognize that our data collection process results in only an approximation of a true multi-turn conversation; however, we find that this approach offers a convenient trade-off for collecting a large number of controlled and diverse context-grounded interactions. Qualitatively, we find that turkers are generally able to imagine themselves in the hypothetical ongoing dialog, and are able to generate realistic contextual feedback utterances using only the context summary table provided to them.

The initial dataset of questions paired with original logical form parses {ui, ŷi}1,...,N that we use to solicit feedback from turkers is prepared offline. In this separate offline task we collect a dataset of 3556 natural language questions, annotated with gold-standard logical forms, in the same domain of email and biographical research. We parse each utterance in this dataset with a floating grammar-based semantic parser trained using a structured perceptron algorithm (implemented in SEMPRE (Berant et al., 2013)) on a subset of the questions. We then construct the dataset for feedback collection by sampling the logical form ŷ from the first three candidate parses in the beam produced by the grammar-based parser. We use this grammar-based parser intentionally, as a very different model from the one that we would ultimately train (an LSTM-based parser) on the conversational logs produced by the original parser.

We retain 1285 of the 3556 annotated questions to form a test set. The remaining 2271 questions we pair with between one and three predicted parses ŷ sampled from the beam produced by the grammar-based parser, and present each pair of original utterance and predicted logical form (ui, ŷi) to a turker, who then generates a feedback utterance fi. In total, we collect 4321 question/original parse/feedback triples (ui, ŷi, fi), averaging approximately 1.9 feedback utterances per question.

⁵ Zero or more of the context entities may actually be referenced in the original utterance.

Figure 2: Screenshot of the Mechanical Turk web interface used to collect natural language feedback.

5 Experiments

The key hypothesis that we aim to evaluate in our experiments is whether natural language feedback is an effective form of supervision for training a semantic parser. To achieve this goal, we control and measure the effect that the number of feedback utterances used during training has on the resulting performance of the task semantic parser on a held-out test set. Across all our experiments we also use a small seed training set (300 questions) that contains gold-standard logical forms to pre-train the task and feedback parsers. The number of "unlabeled" questions⁶ (i.e., questions not labeled with gold-standard logical forms but that have natural language feedback) ranges over 300, 500, 1000, and 1700, representing different experimental settings. For each experimental setting, we rerun the experiment 10 times, re-sampling the questions in both the training and unlabeled sets, and report the averaged results. The test set remains fixed across all experiments and contains 1285 questions labeled with gold-standard logical forms. The implementation details for the task parser and the feedback parser are included in the appendix.

⁶ Note that from here on we will refer to the portion of the data that contains natural language feedback as the only form of supervision as "unlabeled data", to emphasize that it is not labeled with gold-standard logical forms (in contrast to the seed training set).


5.1 Models and evaluation metrics

In our evaluations, we compare the following four models:

• MH (full model): the joint model described in Section 3 and in Algorithm 1 (using the Metropolis-Hastings-based MAP inference described in Algorithm 2).

• MH (no feedback): same as the full model, except we ignore the feedback f and the original logical form y in the feedback model. Effectively, this reduces the model of the feedback parser to that of the task parser. Because the two models are initialized differently during training, we may still expect the resulting model-averaging effects to aid learning.

• MH (no feedback + reject y): same as the above baseline without feedback, but we incorporate the knowledge of the original logical form y during training, using the same method as described in Section ??.

• Self-training: latent logical form inference is performed using only the task parser (via beam search). The feedback utterance f and the original logical form y are ignored. Task parser parameters θt are updated in the same way as in Algorithm 1.
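As a rough sketch of this baseline (the names and the toy dict-based beam model below are illustrative, not the paper's implementation), self-training pseudo-labels each unlabeled question with the task parser's own top beam hypothesis and adds it back to the training set:

```python
def beam_search(parser, utterance, beam_size=5):
    """Toy stand-in for beam search under the task parser:
    'parser' here is just a dict from utterances to ranked logical forms."""
    return parser.get(utterance, [])[:beam_size]

def self_train_step(parser, unlabeled, train_set):
    """Pseudo-label unlabeled questions with the parser's own top
    hypothesis; feedback f and original parse y are ignored."""
    for u in unlabeled:
        beam = beam_search(parser, u)
        if beam:
            train_set.append((u, beam[0]))  # treat own prediction as gold
    return train_set

# Toy usage: one unlabeled question gets its top hypothesis as a pseudo-label.
toy_parser = {"largest city in ohio": ["argmax(city_in(ohio), population)"]}
train = self_train_step(toy_parser, ["largest city in ohio"], [])
print(train[0][1])  # argmax(city_in(ohio), population)
```

Because the pseudo-labels come from the model itself, errors can compound; the results below show exactly this failure mode once unlabeled questions greatly outnumber labeled ones.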

Note that all models are exposed to the same seed training set and differ only in how they take advantage of the unlabeled data. We perform two types of evaluation for each model:

• Generalization performance: we use the learned task parser to make predictions on held-out data. This evaluation tests the ability of a parser trained with natural language feedback to generalize to unseen utterances.

• Unlabeled data performance: we use the learned task parser to make predictions on the unlabeled data used to train it. Note that for each experimental condition, we perform this evaluation only on the portion of the unlabeled data that was used during training. This tests the model's ability to “recover” correct logical forms for the questions that have natural language feedback associated with them.

6 Results

6.1 Generalization performance

Figure 3a shows test accuracy as a function of the number of unlabeled questions (i.e., questions containing only feedback supervision without gold-standard logical forms) used during training, across all four models. As expected, using more unlabeled data generally improves generalization performance. The self-training baseline is the only exception: its performance starts to deteriorate once the ratio of unlabeled to labeled questions grows beyond a certain point. This behavior is not necessarily surprising. When the unlabeled examples significantly outnumber the labeled examples, the model can more easily veer toward local optima without being strongly regularized by the loss on the small number of labeled examples.

Interestingly, the MH (no feedback) baseline is very similar to self-training, but has a significant performance advantage that does not deteriorate with more unlabeled examples. Recall that the MH (no feedback) model modifies the full model described in Algorithm 1 by ignoring the feedback f and the original logical form y in the feedback parser model P(y | u, y, f; θf). This has the effect of reducing the model of the feedback parser to the model of the task parser P(y | u; θt). Training on unlabeled data otherwise proceeds as described in Algorithm 1. As a result, the principle behind the MH (no feedback) model is the same as that behind self-training, i.e., a single model learns from its own predictions. However, the different initializations of the two copies of the task parser, combined with model averaging, appear to improve robustness sufficiently to keep the model from diverging as the amount of unlabeled data increases.

The MH (no feedback + reject y) baseline also does not observe the feedback utterance f, but incorporates the knowledge of the original parse y. As described in Section ??, this knowledge is incorporated during MAP inference of the latent parse by rejecting any logical form samples that match y. As expected, incorporating the knowledge of y improves the performance of this baseline over the one that does not.

Finally, the full model, which incorporates both the feedback utterance f and the original logical form y, outperforms the baselines that incorporate only some of that information. The performance


gain over these baselines grows as more questions with natural language feedback supervision are made available during training. Note that both this model and MH (no feedback + reject y) incorporate the knowledge of the original logical form y; however, the performance gain from incorporating y without the feedback is relatively small, indicating that the gains of the model that observes feedback come primarily from its ability to interpret that feedback.

6.2 Performance on unlabeled data

Figure 3b shows accuracy on the unlabeled data as a function of the number of unlabeled questions used during training, across all four models. The questions used in evaluating the model's accuracy on unlabeled data are the same unlabeled questions used during training in each experimental condition. The general trend, and the relationship between baselines, is consistent with the generalization performance on held-out data in Figure 3a. The main observation is that accuracy on unlabeled training examples remains relatively flat, but consistently high (> 80%), across all models, regardless of the number of unlabeled questions used in training (within the range we experimented with). This suggests that while the models can accurately recover the underlying logical forms of the unlabeled questions regardless of the amount of unlabeled data (within our experimental range), the generalization performance of the learned models is significantly affected by the amount of unlabeled data (more is better).

6.3 Effect of feedback complexity on performance

Figure 4 reports parsing accuracy on unlabeled questions as a function of the number of corrections expressed in the feedback utterance paired with each question. Our main observation is that the performance of the full model (i.e., the joint model that uses natural language feedback) deteriorates for questions paired with more complex feedback (i.e., feedback containing more corrections). Perhaps surprisingly, however, all models (including those that do not incorporate feedback) deteriorate in performance on questions paired with more complex feedback. This is explained by the fact that more complex feedback is generally provided for more difficult-to-parse questions. In Figure 4, we overlay the parsing performance with statistics on the average length of the target logical form (number of predicates).

A more important take-away from the results in Figure 4 is that the model that takes advantage of natural language feedback gains an even greater advantage over models that do not use feedback when parsing more difficult questions. This means that the model is able to exploit more complex feedback (i.e., feedback with more corrections) even for harder-to-parse questions.


Figure 3: (a) Parsing accuracy on held-out questions as a function of the number of unlabeled questions used during training. (b) Parsing accuracy on unlabeled questions as a function of the number of unlabeled questions used during training. In both panels, all parsers were initialized using 300 labeled examples consisting of questions and their corresponding logical forms.

7 Related Work

Early semantic parsing systems mapped natural language to logical forms using inductive logic programming (Zelle and Mooney, 1996). Modern systems apply statistical models to learn from pairs of sentences and logical forms (Zettlemoyer and Collins, 2005, 2009; Kwiatkowski et al., 2010). As hand-labeled logical forms are very costly to obtain, different forms of weak supervision have been explored. Examples include learning from pairs of sentences and answers by querying a database (Clarke et al., 2010;



Figure 4: Parsing accuracy on unlabeled questions, partitioned by feedback complexity (i.e., number of corrections expressed in a single feedback utterance).

Liang et al., 2013; Berant et al., 2013; Pasupat and Liang, 2015; Liang et al., 2016; Krishnamurthy et al., 2017); learning from indirect supervision from a large-scale knowledge base (Reddy et al., 2014; Krishnamurthy and Mitchell, 2012); learning from conversations of systems asking for and confirming information (Artzi and Zettlemoyer, 2011; Thomason et al., 2015; Padmakumar et al., 2017); and learning from interactions with a simulated world environment (Branavan et al., 2009; Artzi and Zettlemoyer, 2013; Goldwasser and Roth, 2014; Misra et al., 2015). The supervision used in these methods is mostly in the form of binary feedback, partial logical forms (e.g., slots), or execution results. In this paper, we explore a new form of supervision: natural language feedback. We demonstrate that such feedback not only provides rich and expressive supervisory signals for learning but can also be easily collected via crowd-sourcing. Recent work (Iyer et al., 2017) trains an online language-to-SQL parser from user feedback. Unlike our work, their collected feedback is structured and is used for acquiring more labeled data during training. Our model jointly learns from questions and feedback and can be trained with limited labeled data.

There has been growing interest in machine learning from natural language instructions. Much work has been done in the setting where an autonomous agent learns to complete a task in an environment, for example, learning to play games by utilizing text manuals (Branavan et al., 2012; Eisenstein et al., 2009; Narasimhan et al., 2015) and guiding policy learning using high-level human advice (Kuhlmann et al., 2004; Squire et al., 2015; Harrison et al., 2017). Recently, natural language explanations have been used to augment labeled examples for concept learning (Srivastava et al., 2017) and to help induce programs that solve algebraic word problems (Ling et al., 2017). Our work is similar in that natural language is used as additional supervision during learning; however, our natural language annotations consist of user feedback on system predictions rather than explanations of the training data.

8 Conclusion and Future Work

In this work, we proposed the novel task of learning a semantic parser directly from end users' open-ended natural language feedback during a conversation. The key advantage of being able to learn from natural language feedback is that it opens the door to learning continuously through natural interactions with the user, but it also presents the challenge of how to interpret such feedback. We introduced an effective approach that simultaneously learns two parsers: one that interprets natural language questions, and a second that interprets natural language feedback regarding errors made by the first parser.

Our work is, however, limited to interpreting feedback contained in a single utterance. A natural generalization of learning from natural language feedback is to view it as part of an integrated dialog system capable of both interpreting feedback and asking appropriate questions to solicit it (e.g., connecting to the work of Padmakumar et al. (2017)). We hope that the problem we introduce in this work, together with the dataset we release, inspires the community to develop models that can learn language (e.g., semantic parsers) through flexible natural language conversation with end users.

Acknowledgements

This research was supported in part by AFOSR under grant FA95501710218, and in part by the CMU InMind project funded by Oath.

References

Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 421–432. Association for Computational Linguistics.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM.

Satchuthananthavale R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90. Association for Computational Linguistics.

S. R. K. Branavan, David Silver, and Regina Barzilay. 2012. Learning to win by reading manuals in a Monte-Carlo framework. Journal of Artificial Intelligence Research, 43:661–704.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 18–27. Association for Computational Linguistics.

Jacob Eisenstein, James Clarke, Dan Goldwasser, and Dan Roth. 2009. Reading to learn: Constructing features from semantic abstracts. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 958–967. Association for Computational Linguistics.

Dan Goldwasser and Dan Roth. 2014. Learning from natural instructions. Machine Learning, 94(2):205–232.

Brent Harrison, Upol Ehsan, and Mark O. Riedl. 2017. Guiding reinforcement learning exploration using natural language. arXiv preprint arXiv:1707.08616.

Geoffrey E. Hinton. 1999. Products of experts.

Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. arXiv preprint arXiv:1606.03622.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 754–765. Association for Computational Linguistics.

Gregory Kuhlmann, Peter Stone, Raymond Mooney, and Jude Shavlik. 2004. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. In The AAAI-2004 Workshop on Supervisory Control of Learning and Adaptive Systems.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1223–1233. Association for Computational Linguistics.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 992–1002.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941.

Aishwarya Padmakumar, Jesse Thomason, and Raymond J. Mooney. 2017. Integrated learning of dialog strategies and semantic parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 547–557.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2(1):377–392.

Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. 2015. Grounding English commands to reward functions. Robotics: Science and Systems.

Shashank Srivastava, Igor Labutov, and Tom Mitchell. 2017. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1527–1536.

Jesse Thomason, Shiqi Zhang, Raymond J. Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In IJCAI, pages 1923–1929.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.

Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 976–984. Association for Computational Linguistics.



Figure 5: A histogram of the number of corrections expressed in a natural language feedback utterance in our data (0 corrections means that the user affirmed the original parse as correct). We partition the feedback utterances according to whether the original parse y that the feedback was provided towards is known to be correct (i.e., matches the gold-standard logical form).

A Analysis of natural language feedback

The dataset we collect creates a unique opportunity to study the nature and the limitations of the corrective feedback that users generate in response to an incorrect parse. One of our hypotheses, stated in Section 1, is that natural language allows users to express richer feedback than is possible with, for example, a binary (correct/incorrect) mechanism, by letting users explicitly refer to and fix what they see as incorrect in the original prediction. In this section we analyze the feedback utterances in our dataset to gain deeper insight into the types of feedback users generate, and the possible limitations of natural language as a source of supervision for semantic parsers.

Figure 5 breaks down the collected feedback by the number of corrections expressed in the feedback utterance. The number of corrections ranges from 0 (no corrections, i.e., the worker considers the original parse y to be the correct parse of u) to more than 3 corrections.7 The number of corrections is self-annotated by the workers who write the feedback: we instruct workers to count a correction constrained to a single predicate or a single entity as a single correction, and to tally all such corrections in their feedback utterance after they have

7 Note that in cases where the user indicates that they made 0 corrections, the feedback utterance they write is often a variation of a confirmation, such as “that's right” or “that's correct”.


Figure 6: Analysis of noise in the feedback utterances contained in our dataset. Feedback false positives refers to feedback that incorrectly identifies a wrong original parse y as correct. Feedback false negatives refers to feedback that incorrectly identifies a correct original parse y as wrong. Users are more likely to generate false positives (i.e., miss the error) in their feedback for parses of more complex utterances (as measured by the number of predicates in the gold-standard logical form).

written it.8 Because we also know the ground truth of whether the original parse y (i.e., the parse that the user provided feedback towards) was correct (i.e., whether y matches the gold-standard parse), we can partition the number of corrections by whether the original parse y was correct (green bars in Figure 5) or incorrect (red bars), allowing us to evaluate the accuracy of some of the feedback.

From Figure 5, we observe that users provide feedback that ranges in the number of corrections, with the majority of feedback utterances making one correction to the original logical form y (only 4 feedback utterances expressed more than 3 corrections). Although we do not have gold-standard annotation for the true number of errors in the original logical form y, we can nevertheless obtain some estimate of the noise in natural language feedback by analyzing cases where we know that the original logical form y was correct, yet the user generated feedback with at least one correction. We refer to such feedback instances as feedback false negatives. Similarly, cases where the original logical form y is incorrect, yet the user provided no corrections in their feedback, we refer to as feedback false positives. The number of feedback false negatives and false positives can be

8 We give these instructions to workers as an easy-to-understand explanation without invoking technical jargon.


obtained directly from Figure 5. Generally, we observe that users generate more false negatives (≈ 4% of all feedback) than false positives (≈ 1%).
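The two noise rates defined above can be computed directly from per-item annotations. A minimal sketch, assuming each item is reduced to a (parse-was-correct, number-of-corrections) pair; the function name and data layout are hypothetical:

```python
def feedback_noise_rates(items):
    """items: list of (parse_was_correct, n_corrections) pairs.
    False positive: the parse was wrong, but the feedback made 0 corrections.
    False negative: the parse was correct, but the feedback made >= 1 correction."""
    fp = sum(1 for correct, n in items if not correct and n == 0)
    fn = sum(1 for correct, n in items if correct and n > 0)
    total = len(items)
    return fp / total, fn / total

# Toy usage: 1 false positive and 2 false negatives out of 8 items.
toy = [(False, 0), (True, 1), (True, 2), (True, 0),
       (False, 1), (False, 2), (True, 0), (False, 1)]
fp_rate, fn_rate = feedback_noise_rates(toy)
print(fp_rate, fn_rate)  # 0.125 0.25
```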

It is also instructive to consider what factors may contribute to the observed noise in user-generated natural language feedback. Our hypothesis is that more complex queries (i.e., longer utterances and longer logical forms) impose a greater cognitive load when identifying and correcting the error(s) in the original parse y. In Figure 6 we investigate this hypothesis by decomposing the percentage of feedback false positives and false negatives as a function of the number of predicates in the gold-standard logical form (i.e., the one the user is trying to recover by making corrections to the original logical form y). Our main observation is that users tend to generate more false positives (i.e., incorrectly identify an originally incorrect logical form y as correct) when the target logical form is longer (i.e., the query utterance u is more complex). The number of false negatives (i.e., incorrectly identifying a correct logical form as incorrect) is relatively unaffected by the complexity of the query (i.e., the number of predicates in the target logical form). One conclusion we can draw from this analysis is that user-generated feedback can be expected to miss errors in more complex queries, and models that learn from users' natural language feedback need a degree of robustness to such noise.

B Implementation Details

We tokenize user utterances (questions) and feedback using the Stanford CoreNLP package. In all experiments, we use 300-dimensional word embeddings, initialized with word vectors trained on the Paraphrase Database PPDB (Wieting et al., 2015), and 128 hidden units for LSTMs. All parameters are initialized uniformly at random. We train all models using Adam with an initial learning rate of 10−4 and apply L2 gradient-norm clipping with a threshold of 10. In all experiments, we pre-train the task parser and the feedback parser for 20 epochs, and then switch to semi-supervised training for 10 epochs. Pre-training takes roughly 30 minutes, and the semi-supervised training process takes up to 7 hours on a Titan X GPU.
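A minimal sketch of this optimization setup, assuming PyTorch; the tiny model below is a stand-in using the stated sizes (300-dim embeddings, 128 LSTM hidden units), not the paper's actual parser architecture:

```python
import torch
import torch.nn as nn

class ToyParser(nn.Module):
    """Stand-in encoder with the hyperparameters from the text."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)       # 300-dim word embeddings
        self.lstm = nn.LSTM(300, 128, batch_first=True)  # 128 LSTM hidden units
        self.out = nn.Linear(128, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)

model = ToyParser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial lr 10^-4

tokens = torch.randint(0, 1000, (2, 7))  # toy batch of token ids
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tokens.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip L2 norm at 10
optimizer.step()
print(logits.shape)  # torch.Size([2, 7, 1000])
```

The clipping call rescales the gradients in place before the optimizer step, which is the standard way to apply the threshold described above.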



Figure 7: Visualization of the attention mechanism in the task parser (top) and the feedback parser (bottom) for parsing the same utterance. We partition the input to the parser into three groups: the original utterance u being parsed (blue), the conversational context (green), and the feedback utterance f (red). This example was parsed incorrectly before incorporating feedback, but parsed correctly after its incorporation.



Figure 8: Visualization of the attention mechanism in the task parser (top) and the feedback parser (bottom) for parsing the same utterance. We partition the input to the parser into three groups: the original utterance u being parsed (blue), the conversational context (green), and the feedback utterance f (red). This example was parsed incorrectly before incorporating feedback, but parsed correctly after its incorporation.