
Proceedings of the Society for Computation in Linguistics

Volume 3, Article 3

2020

Modeling Behavior in Truth Value Judgment Task Experiments

Brandon Waldon, Stanford University, [email protected]
Judith Degen, Stanford University, [email protected]

Follow this and additional works at: https://scholarworks.umass.edu/scil

Part of the Computational Linguistics Commons, Psycholinguistics and Neurolinguistics Commons, and the Semantics and Pragmatics Commons

Recommended Citation
Waldon, Brandon and Degen, Judith (2020) "Modeling Behavior in Truth Value Judgment Task Experiments," Proceedings of the Society for Computation in Linguistics: Vol. 3, Article 3. DOI: https://doi.org/10.7275/sg32-aq30. Available at: https://scholarworks.umass.edu/scil/vol3/iss1/3

This Paper is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Proceedings of the Society for Computation in Linguistics by an authorized editor of ScholarWorks@UMass Amherst. For more information, please contact [email protected].


Modeling Behavior in Truth Value Judgment Task Experiments

Brandon Waldon
Stanford University
[email protected]

Judith Degen
Stanford University
[email protected]

Abstract

Truth Value Judgment Task experiments (TVJTs) are a common means of investigating pragmatic competence, particularly with regards to scalar inference. We present a novel quantitative linking function from pragmatic competence to participant behavior on TVJTs, based upon a Bayesian probabilistic model of linguistic production. Our model captures a range of observed phenomena on TVJTs, including intermediate responses on a non-binary scale, population- and individual-level variation, participant endorsement of false utterances, and variation in response due to so-called scalar diversity.

1 Introduction

In Truth Value Judgment Task experiments (TVJTs), participants are asked whether a given sentence is, e.g., 'right' or 'wrong' (or 'true' or 'false', etc.), often in a context of evaluation. In the field of experimental pragmatics, participant judgments in TVJT paradigms have been particularly important for investigating pragmatic competence, especially as it relates to scalar implicature (Noveck, 2001; Noveck and Posada, 2003; Bott and Noveck, 2004; De Neys and Schaeken, 2007; Geurts and Pouscoulous, 2009; Chemla and Spector, 2011; Degen and Goodman, 2014; Degen and Tanenhaus, 2015). On the traditional view of pragmatic competence and its link to TVJT responses, scalar implicature is assumed - following Grice (1975) - to be a binary and categorical phenomenon, in the sense that a given utterance is assumed to categorically either give rise to an implicature or not, depending on contextual, cognitive, and linguistic factors. To experimentalists operating on this assumption, a participant's judgment on a particular trial in a TVJT reflects whether or not a scalar implicature was computed in context.

For example, a 'wrong' judgment of the sentence John ate pizza or a sandwich, in a context in which the stronger utterance alternative John ate pizza and a sandwich is true and equally relevant, is typically interpreted as a "pragmatic" judgment: participants must have recognized that in such a context, the or-sentence is true yet underinformative. Pragmatically enriching it to John didn't eat both pizza and a sandwich via scalar inference makes it contextually false. Conversely, an answer of 'right' on this view reflects a "literal" semantic interpretation whereby the implicature is not computed (i.e. John ate pizza or a sandwich - and possibly both).

This linking assumption underpins the vast majority of TVJT literature relating to scalar inference (Noveck, 2001; Papafragou and Musolino, 2003; Geurts and Pouscoulous, 2009; Doran et al., 2012; Potts et al., 2015). In an early example, Papafragou and Musolino (2003) observe that children accept true but underinformative sentences in a TVJT at a relatively high rate relative to adults, and that this rate is modulated by the particular linguistic scale invoked on a given trial of the experiment (i.e. some/all vs. finish/start vs. cardinal numbers). The authors argue from this result that scalar implicature computation is dependent upon linguistic scale as well as on a child's recognition of the communicative goals of her interlocutor.

Though widely employed, this linking assumption for TVJTs is associated with a host of problems discussed by Jasbi et al. (2019). Following those authors as well as Tanenhaus (2004), we take these open problems to be indicative of a larger issue in linguistics, namely that the linking hypotheses which bridge linguistic theory and experimentally-elicited behavior are often underdeveloped, underspecified, or (in some cases) absent in experimental studies. In the service of providing a proof of concept for how this issue may be addressed by future researchers, we propose and evaluate a novel account of participant response in TVJT paradigms based on an explicit and quantitatively specified linking function rooted in a probabilistic theory of pragmatic competence. The general idea is that participants' responses in TVJT experiments are related to the probability with which a cooperative pragmatic speaker would have produced the observed utterance (e.g., John ate pizza or a sandwich) in order to communicate the meaning presented to participants as fact (e.g., that John ate both pizza and a sandwich). This probabilistic production-based view departs substantially from the previous widespread assumption that truth-value judgments are a measure of interpretation.

Proceedings of the Society for Computation in Linguistics (SCiL) 2020, pages 10-19. New Orleans, Louisiana, January 2-5, 2020.

Before turning to the specifics of the account, we briefly review some of the open problems in the TVJT literature that motivate the re-thinking of linking functions in TVJT paradigms:

Intermediate judgments: When provided more than two response options in a TVJT, a sizable proportion of participants rates underinformative sentences using the intermediate response options - for example, as only 'kind of [right / wrong]', or 'neither [right nor wrong]'. Katsos and Bishop (2011), for example, provided participants with three response options and observed substantial selection of the intermediate option. They interpreted the choice of this intermediate option as being the result of the computation of an implicature, but a priori, there is no reason to favor this linking assumption over one whereby the intermediate response is associated with a literal semantic interpretation. More generally, it is not clear how the outputs of a binary model of scalar implicature (i.e. implicature or ¬implicature) should relate to non-binary responses on TVJTs.

Population-level variation: In order to explain behavioral variation in contexts where one expects a scalar inference, an adherent to the categorical view of scalar implicature must stipulate that a) not all participants calculated the implicature; or b) some participants who calculated the implicature showed divergent behavior due to some independent mechanism which masked the 'correct' implicature behavior; or some combination of (a) and (b). However, and despite the prevalence of variation at the population level in reported TVJT experiments, even a qualitative analysis of this kind of variation is largely absent from the experimental scalar implicature literature.

Scalar diversity: Doran et al. (2012), following Papafragou and Musolino (2003) inter alia, report that judgments of true but underinformative sentences vary according to the particular linguistic material contained within the sentence, in particular the relevant linguistic scale. They conclude that variation among scalar implicatures is a function of the scale itself (see also van Tiel et al. 2014 for further support for scale-based scalar diversity in a non-TVJT paradigm).

Whether this variation is truly due to inherent features of the linguistic scale (or, e.g., prior world knowledge, or other linguistic material, or other confounding features of the experimental context) is an open question which warrants investigation beyond the scope of this paper. Below, we analyze data from a TVJT where different rates of exhaustive interpretation were observed between a putative lexical scale (<and, or>) and a putative ad-hoc, context-dependent pragmatic scale. Our analysis of the data suggests that in this instance, (at least some) variation at the level of linguistic scale may be reduced to more general aspects of pragmatic competence.

Endorsement of false utterances: Invariably, a proportion of participants in TVJTs accepts strictly false sentences. For example, in the study we analyze below, a substantial number of participants rated conjunctions A ∧ B as partially correct in contexts where only A was true. The most common approaches to this type of data are either to use it as the basis of an exclusion criterion or to simply consider it meaningless noise. Doran et al. (2012), for example, exclude participants whose performance deviates by more than two standard deviations from the mean response on 'control' sentences whose semantic contents are consistent with the context of evaluation (and which do not admit of potentially contradictory pragmatic enrichments) or whose semantic contents contradict the context. Katsos and Bishop (2011) report that 2.5% of false sentences in their experiment were endorsed by child participants. On the standard linking assumption, these data are difficult to make sense of, but we will show that they are within the scope of a satisfactory analysis of TVJT behavior.

The remainder of the paper is structured as follows: in Section 2, we summarize the results of a recently reported TVJT study that exemplifies the features discussed above: intermediate judgments, population-level variation, scalar diversity, and participant endorsement of false utterances. Section 3 presents our novel quantitative model of the data from that study. Building on insights from the Bayesian probabilistic literature on pragmatic competence (Frank and Goodman, 2012; Goodman and Stuhlmuller, 2013), we model participants as making judgments about a soft-optimal pragmatic speaker whose production choices are a function of utterances' contextual informativeness. On our analysis, participants furthermore expect that the speaker sometimes produce strictly false utterances that are nonetheless somewhat contextually useful. We show that this analysis provides broader empirical coverage over the traditional assumptions discussed above.[1]

Condition    Response Options
Binary       'Right', 'Wrong'
Ternary      'Right', 'Neither', 'Wrong'
Quaternary   'Right', 'Kinda Right', 'Kinda Wrong', 'Wrong'
Quinary      'Right', 'Kinda Right', 'Neither', 'Kinda Wrong', 'Wrong'

Table 1: Response-option conditions of Jasbi et al. (2019)'s TVJT study.

2 TVJT Data

2.1 Experiment Materials, Design and Procedure

Jasbi et al. (2019) report the results of a TVJT designed to test whether linking hypothesis and number of response options modulate the researcher's inferences about scalar implicature rates. In their study, number of response options varied between two and five as a between-subjects manipulation. Conditions are summarized in Table 1. Participants (n = 200) were first shown six cards (Table 2) featuring one or two of the following animals: a cat, a dog, and an elephant. On every trial, participants saw one of the six cards, and a blindfolded cartoon character Bob made guesses as to what animals were on the card. Participants were asked to rate Bob's guesses using the response options available in their particular condition.

Bob made the following guess types: simple declaratives (e.g., There is a cat), conjunctions (e.g., There is a cat and a dog), and disjunctions (e.g., There is a cat or a dog). Card types were crossed with guess types in this study such that a card containing an animal X could be presented with a guess of There is an X, There is an X or a Y (where Y is some animal distinct from X), There is an X and a Y, or There is a Y; cards containing two animals X and Y could be presented with a guess of There is an X, There is an X or a Y, There is an X and a Y, or There is a Z (where Z is some animal distinct from X and Y).

The researchers elicited 3 judgments per participant for each combination of card and guess type.

Table 2: Cards used in Jasbi et al. (2019)'s TVJT.

[1] Data and code for all analyses and graphs are available at http://github.com/bwaldon/tvjt_linking.
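The card-guess crossing described above can be enumerated directly. The sketch below reconstructs the stimulus space; the specific distractor animals (Y, Z) paired with each card are our illustrative assumption, as the pairings used in the study are not specified here.

```python
from itertools import combinations

animals = ["cat", "dog", "elephant"]

# Six cards (Table 2): three single-animal and three two-animal cards.
cards = [{a} for a in animals] + [set(c) for c in combinations(animals, 2)]

def guesses(card):
    """The four guess types crossed with a card, following the design above.
    The particular distractor animals (Y, Z) chosen here are illustrative."""
    if len(card) == 1:
        x = next(iter(card))
        y = next(a for a in animals if a != x)
        return [f"There is a {x}",            # true simple declarative
                f"There is a {x} or a {y}",   # true disjunction
                f"There is a {x} and a {y}",  # strictly false conjunction
                f"There is a {y}"]            # false simple declarative
    x, y = sorted(card)
    z = next(a for a in animals if a not in card)
    return [f"There is a {x}",                # true but underinformative
            f"There is a {x} or a {y}",       # true but underinformative
            f"There is a {x} and a {y}",      # true and exhaustive
            f"There is a {z}"]                # strictly false

trials = [(card, g) for card in cards for g in guesses(card)]
# 6 cards x 4 guess types = 24 combinations, 3 judgments each per participant.
```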

2.2 Results and Discussion

Proportions of responses for each card-guess type in each response-option condition are shown in Figure 1, with rows presenting behavior aggregated across one- and two-card conditions.

The results of the study illustrate the several open empirical issues associated with TVJTs more generally. First, participants routinely reported intermediate judgments between 'Right' and 'Wrong' in those conditions where intermediate response options were available. In the Quaternary and Quinary response-option conditions, for example, the intermediate judgment of 'Kinda Right' was the single most-selected response option in two-animal card conditions where Bob's guess was true but underinformative (i.e. either a simple declarative or a disjunction).

The results also exemplify the issue of population-level variation: for example, although behavioral patterns are otherwise fairly categorical in the Binary condition, participant judgments were roughly split between 'Right' and 'Wrong' for underinformative uses of disjunction on two-animal card conditions. A visual inspection of the results suggests even more variation in the population as number of response options increases. The authors furthermore reported individual-level variation: qualitatively similar trials (e.g. two trials involving underinformative disjunction) sometimes received different responses from the same participant.

Figure 1: Model predictions (light bars) plotted against empirical results (dark bars) from Jasbi et al.'s (2019) TVJT study. Error bars indicate 95% multinomial confidence intervals. Red and green bars indicate false and true trials, respectively; blue bars indicate implicature trials.

Comparison of judgments of true but underinformative simple declaratives (i.e. There is an X) to judgments of true but underinformative disjunctions (i.e. There is an X or a Y) on two-animal card conditions revealed some amount of scalar diversity. Following Horn (1972), exposure to the disjunctive connective or canonically activates an informationally-stronger scalemate and as a pragmatic alternative to give rise to an exclusive interpretation. In contrast, the pragmatic scale in the case of the simple declarative is constructed in a more context-dependent manner. To illustrate, in a two-animal card context where the card features both a cat and a dog, the listener considers a partially-ordered pragmatic scale of cat and dog, cat, and dog, where the conjunction outranks its scalemates in terms of informational strength. Thus, an utterance of cat activates cat and dog as an alternative to give rise to the exhaustive interpretation (There is only a cat on the card).

In the Binary and Ternary conditions, underinformative uses of or resulted in substantially higher rates of 'Wrong' responses than did underinformative simple declaratives, suggesting that at the population level, or was interpreted more exhaustively than the simple declarative. However, this pattern was reversed in the Quaternary and Quinary conditions, in which underinformative simple declaratives were more likely to be considered only 'Kinda Right' and less likely to be considered simply 'Right' compared to underinformative disjunctions. This pattern suggests that in the Quaternary and Quinary conditions, simple declaratives were interpreted more exhaustively than disjunctions.

Finally, the data in the Quaternary and Quinary conditions also reveal substantial participant endorsement of false utterances. Note specifically that in one-animal card trials, the conjunctive guess (e.g. cat and dog) is strictly false; thus, we might naïvely expect a priori that participants categorically judge these utterances to be 'Wrong' in all conditions. Yet when given the option to rate this sentence 'Kinda Right' or 'Kinda Wrong', participants often did so. In all other conditions where the utterance was strictly false (e.g. a guess of elephant for a card containing a cat or a cat and dog), behavior was effectively categorical. That is, rates of endorsement of false utterances varied according to the particular way in which the sentence was false in context.

In sum, the data collected by Jasbi et al. (2019) reflect a range of behavioral patterns unaccounted for by the traditional categorical view of scalar inference and corresponding standard linking assumptions. Below, we report an analysis of their data that aims to predict these phenomena.

3 Analysis

3.1 Cognitive model

Our analysis implements a proposal outlined by Jasbi et al. (2019), couched in the Rational Speech Act (RSA) framework (Frank and Goodman, 2012; Goodman and Stuhlmuller, 2013). RSA provides a Bayesian, probabilistic account of pragmatic competence. In RSA, the pragmatic inferences drawn by listeners are represented as probability distributions over meanings which the speaker plausibly intended to convey with a given observed utterance. The probability of this listener (L1) attributing an intended meaning m to a speaker who produces an utterance u is calculated from a prior probability distribution over potential world states P_w as well as from L1's expectations about the linguistic behavior of the speaker S1.

P_L1(m | u) ∝ P_S1(u | m) · P_w(m)

P_S1 is modeled as a probability distribution over possible utterances given the speaker's communicative intentions m. This speaker produces utterances that soft-maximize utility, where utility is defined via a tradeoff between an utterance's cost C and its contextual informativeness, calculated from the representation of a literal listener L0 whose interpretation of an utterance u is in turn a function of the truth conditional meaning [[u]](m) and of her prior expectations P_w(m) regarding the likelihood of possible world states. The extent to which the speaker maximizes utility is modulated by a parameter α - the greater α, the more the speaker produces utterances that maximize utility.

P_S1(u | m) ∝ exp(α · (ln P_L0(m | u) - C(u)))

P_L0(m | u) ∝ [[u]](m) · P_w(m)

In RSA (and contra the traditional view), pragmatic inferences are not categorical computations of enriched meanings over the semantic denotations of utterances. For example, exclusive interpretations of or are represented in RSA as a positive shift in the posterior probability of an exclusive meaning, relative to its prior probability.

In other words, 'implicature' is not a theoretical construct in the RSA framework, absent additional stipulations regarding how to go from probability distributions to binary, categorical inferences. This is an advantage: providing a probabilistic representation of both the speaker's utterance choices and the listener's resulting posterior beliefs after observing an utterance puts us one step closer to accounting for the quantitative behavioral patterns observed in tasks such as TVJTs.
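For concreteness, the L0 and S1 definitions above can be sketched in a few lines of Python (the authors' own implementation uses WebPPL; the three-world domain, zero costs, and the α = 1 default here are illustrative simplifications of the full six-card setting):

```python
import math

# Worlds: which animals are on the card. An illustrative three-world
# subset of the six cards, not the full inventory.
worlds = ["cat", "dog", "cat+dog"]

# Utterances and their truth conditions [[u]](m); costs C(u) are uniformly 0,
# matching the no-cost-asymmetry assumption in the text.
meanings = {
    "cat":         {"cat", "cat+dog"},
    "dog":         {"dog", "cat+dog"},
    "cat and dog": {"cat+dog"},
    "cat or dog":  {"cat", "dog", "cat+dog"},
}
utterances = list(meanings)

def L0(u):
    """Literal listener: P_L0(m | u) proportional to [[u]](m) * P_w(m), uniform P_w."""
    true_in = [m for m in worlds if m in meanings[u]]
    return {m: (1 / len(true_in) if m in true_in else 0.0) for m in worlds}

def S1(m, alpha=1.0):
    """Pragmatic speaker: P_S1(u | m) proportional to exp(alpha * ln P_L0(m | u))."""
    scores = {}
    for u in utterances:
        p = L0(u)[m]
        scores[u] = math.exp(alpha * math.log(p)) if p > 0 else 0.0
    Z = sum(scores.values())
    return {u: s / Z for u, s in scores.items()}

# In the two-animal world, the conjunction is the most probable production,
# and the disjunction is less probable than the simple declarative:
probs = S1("cat+dog")
```

Running the speaker forward reproduces the qualitative pattern discussed in Section 3.2: in the two-animal world, P_S1(cat and dog) > P_S1(cat) > P_S1(cat or dog), and the strictly false conjunction has zero probability in the single-cat world.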

3.2 Behavioral model

Jasbi et al. (2019) proposed but did not systematically test a simple linking hypothesis: rather than providing one response if an implicature is computed and another if it isn't, a participant in a TVJT experiment provides a particular response to an utterance u if the probability of u given a meaning represented by m lies within a particular probability interval on the distribution P_S1(u | m).[2] The participant is modeled as a responder R, who in a binary forced-choice task between 'Right' and 'Wrong' responds 'Right' to an utterance u in world m just in case P_S1(u | m) meets or exceeds some probability threshold θ:

R(u, m, θ) = 'Right'  iff P_S1(u | m) ≥ θ
             'Wrong'  otherwise

The model is extended straightforwardly to an experiment in which participants have a third response option (e.g. 'Neither'), as in the Ternary condition. In this case, the model specifies two probability thresholds: θ1, the minimum standard for an utterance in a given world state to count as 'Right', and θ2, the minimum standard for 'Neither'. Thus, in the Ternary condition:

R(u, m, θ) = 'Right'    iff P_S1(u | m) ≥ θ1
             'Neither'  iff θ1 > P_S1(u | m) ≥ θ2
             'Wrong'    otherwise

Applying a similar logic allows for the specification of linking hypotheses for TVJTs with an arbitrary number of response options.

The intuition behind the threshold model is as follows: participants should disprefer utterances that are relatively unexpected. Thus, high S1 production probability for a given utterance in context makes it more likely that the utterance receives a positive evaluation in the TVJT - expressed by ordered response options above 'Wrong'. Conversely, the more unexpected an utterance is, the more likely it is to be judged as 'Wrong'. Underinformative utterances of the sort that have traditionally been used to assess 'implicature rates' are precisely the kinds of utterances that are unexpected from informative speakers and are therefore likely to be rated as 'Wrong'.

[2] Following Degen and Goodman (2014), the authors argue that conceptually, behavior on TVJTs is better modeled as a function of an agent's representation of a pragmatic speaker rather than of a pragmatic listener.
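A minimal sketch of this threshold responder in Python (the function name and the example θ values are ours for illustration, not the inferred estimates; Section 3.3 additionally samples thresholds from Gaussians):

```python
def respond(p_s1, thresholds, options):
    """Map a speaker production probability P_S1(u | m) to a response option.

    thresholds: descending cutoffs theta1 > theta2 > ..., one fewer than options;
    options: response options ordered from 'Right' down to 'Wrong'.
    """
    for theta, option in zip(thresholds, options):
        if p_s1 >= theta:
            return option
    return options[-1]

# Ternary condition: 'Right' iff P >= theta1; 'Neither' iff theta1 > P >= theta2;
# 'Wrong' otherwise. Threshold values here are illustrative.
ternary = ["Right", "Neither", "Wrong"]
assert respond(0.40, [0.3, 0.1], ternary) == "Right"
assert respond(0.15, [0.3, 0.1], ternary) == "Neither"
assert respond(0.05, [0.3, 0.1], ternary) == "Wrong"
```

The same function covers the Binary through Quinary conditions by passing one to four thresholds, which is exactly the sense in which the model extends to an arbitrary number of response options.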

Here, we assess the quality of this linking hypothesis on the dataset from Jasbi et al. (2019). To that end, we first specify the space of possible meanings and utterances that inform a participant's pragmatic competence in this task. We assume that participants have uniform prior expectations of seeing any of the six possible cards in the experiment. We further assume that participants have uniform prior expectations of a speaker producing any of the four utterance types with which a card may have been crossed. For example, if the card featured either just a cat or both a cat and a dog, we represent the participant as having uniform prior expectations of a speaker producing the guesses elephant, cat, dog, cat and dog, or cat or dog (that is, we do not posit a cost asymmetry between possible utterances).[3]

For illustrative purposes, the 'Simple Bayesian' bars in Figure 2 display marginal distributions over possible utterances produced by S1 given these assumptions for the utterance and meanings priors, as well as an arbitrary value of 1 for the optimality parameter α, and given that the speaker intends either to communicate the meaning that (just) a cat is on the card or that both a cat and a dog are. The speaker distributions reveal two conceptual issues for the threshold response model proposed by Jasbi et al. (2019).

First, the probability of S1 producing the strictly false guess of cat and dog should be zero if the card contains just a cat. This is because the literal listener probability P_L0 of inferring the 'only cat' meaning given cat and dog is zero by virtue of the fact that the utterance is strictly false in this world state. Thus, any model of response that is a function of P_S1 as specified predicts that participants categorically rate the cat and dog guess as 'Wrong' in this context, contrary to what is observed in the Quaternary and Quinary conditions.

[3] We include dog as a possible guess because we posit that participants have no reason a priori to expect the other true and underinformative simple declarative - cat - over this equally informative guess in two-animal card conditions.

Second, the probability of producing disjunctions is lower than the probability of producing simple declarative guesses in two-animal card contexts. This asymmetry is advantageous in the case of the Binary and Ternary response data: assuming a threshold for 'Right' positioned between P_S1(cat or dog | cat and dog) and P_S1(cat | cat and dog), we predict correctly that underinformative simple declaratives should be judged 'Right' more often than underinformative disjunctions. But the asymmetry in S1 probabilities therefore predicts the wrong pattern of responses on corresponding trials in the Quaternary and Quinary conditions.

We argue that these two seemingly disparate issues can be mediated by a common solution. In particular, we propose a revision to the simple Bayesian inference story above, whereby pragmatically-competent listeners either expect speaker productions as directly sampled from the P_S1 distribution, or that those utterance production probabilities inform a second conditional probability distribution of utterances given utterances, the 'Partial Truth' utterance distribution P_SPT:

P_SPT(u' | u) ∝ Σ_{m ∈ [[u]]} P_S1(u' | m)  [4]

The 'Partial Truth' distribution is a generalized way of modeling a speaker who makes assertions that are sometimes strictly false in light of her intended meaning. Recall that the semantic content of any possible utterance choice made by S1 is a set of possible worlds and is therefore consistent with meanings unintended by the speaker. S_PT models the speaker's soft-optimal production probabilities given these unintended meanings, renormalizing the pragmatic speaker's production probabilities over all possible worlds consistent with utterance choices sampled from P_S1.

[4] For our implementation of S_PT, we restrict the distribution such that u' must entail (or be entailed by) u in order to have probability above 0. Without this restriction, S_PT could in principle assign high probability to utterances which have no relevance to the question under discussion (i.e. "What animals are on the card?"), by virtue of those utterances' assertability in worlds consistent with u. A systematic exploration of the linguistic alternatives available to S1 (as well as S_PT) is a question we must leave to future work.
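The S_PT renormalization, including the entailment restriction from footnote [4], can be sketched as follows. The hard-coded S1 table is an illustrative α = 1 RSA speaker over a three-world cat/dog domain (values rounded); it stands in for whatever P_S1 the full model computes.

```python
# Toy table of P_S1(u' | m) for an alpha = 1 RSA speaker over a three-world
# cat/dog domain (rows: worlds; values rounded). Illustrative, not the
# full six-card model.
S1 = {
    "cat":     {"cat": 0.600, "dog": 0.000, "cat and dog": 0.000, "cat or dog": 0.400},
    "dog":     {"cat": 0.000, "dog": 0.600, "cat and dog": 0.000, "cat or dog": 0.400},
    "cat+dog": {"cat": 0.214, "dog": 0.214, "cat and dog": 0.429, "cat or dog": 0.143},
}
meanings = {
    "cat":         {"cat", "cat+dog"},
    "dog":         {"dog", "cat+dog"},
    "cat and dog": {"cat+dog"},
    "cat or dog":  {"cat", "dog", "cat+dog"},
}

def S_PT(u):
    """P_SPT(u' | u) proportional to the sum over m in [[u]] of P_S1(u' | m),
    restricted (per footnote [4]) to u' in an entailment relation with u."""
    related = [u2 for u2 in meanings
               if meanings[u2] <= meanings[u] or meanings[u] <= meanings[u2]]
    scores = {u2: sum(S1[m][u2] for m in meanings[u]) for u2 in related}
    Z = sum(scores.values())
    return {u2: s / Z for u2, s in scores.items()}

# Even when the S1-sampled utterance is "cat", the Partial Truth speaker
# assigns nonzero probability to the strictly false "cat and dog", because
# "cat" is consistent with the two-animal world where the conjunction is optimal:
pt = S_PT("cat")
```

Note that dog drops out of S_PT("cat") entirely: it stands in no entailment relation with cat, which is what the footnote [4] restriction is designed to enforce.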



Figure 2: Simulated S1 production probabilities.

To illustrate: suppose a speaker intends to com-municate that many (but not all) of the X are Y,and has quantifier choices many and all. The onlypossible utterance choice for the simple BayesianS1 speaker is many, which is semantically consis-tent with the intended meaning. But the lower-bounded quantifier many is also semantically con-sistent with an ‘all of the X are Y’ meaning, whichin turn is consistent with the utterance choice all.By SPT , we have some nonzero expectation thatthe speaker will use all to communicate the ‘many(but not all) of the X are Y’ meaning.5 Thus, apragmatic listener who hears all from the ‘Par-tial Truth’ speaker will have a nonzero expectationthat all should receive an imprecise, non-maximalinterpretation. In other words, SPT provides ageneralized way of formalizing ‘loose-talk’ pro-duction behavior (Lasersohn, 1999).6

The ‘Partial Truth’ bars in Figure 2 visual-ize marginal distributions over utterances givenan arbitrary 0.6 probability that the speaker sam-ples from the PSPT

distribution after samplingfrom PS1 . The ‘Partial Truth’ speaker assignsnonzero probability to a guess of cat and dogeven when the speaker’s intended meaning is thesingle-animal cat card, largely due to the fact thatthe optimal guess in this context (cat) is truth-conditionally consistent with a two-animal cardthat makes cat and dog both true and pragmat-ically optimal.7 Moreover, this speaker assigns

5The effect of this is similar to the use of QUD projectionfunctions for hyperbolic interpretations (Kao et al., 2014).

6Formalizing this production behavior is different fromanalyzing why imprecision exists (indeed, is pervasive) in lin-guistic communication. For the time being, we present this‘loose-talk’ speaker model without a thorough assessment ofits explanatory power.

7Because cat or dog is a possible S1 production, and thischoice lies in an entailment relation with the simple declar-

greater probability to a guess of cat or dog in two-animal contexts and down-weights the probabilityof producing simply cat: the optimal utterance inthis context (cat and dog) is consistent with sev-eral world states in which the disjunction cat ordog is assertable and with relativley fewer worldsin which cat is assertable.

3.3 Quantitative model evaluation

We now turn to a quantitative assessment of thethreshold model of response, having addressedtwo ways in which the unenriched S1 represen-tation would fail to qualitatively capture behav-ioral patterns in Jasbi et al (2019)’s TVJT study.Additionally, following Jasbi et al., we recognizethat if threshold values were made to be com-pletely invariant across trials of the experiment,then the model would make the undesirable pre-diction that every participant should have exactlythe same response in a given trial type. To allowfor population-level variation, the model respon-der makes a response by comparing the speakerprobability against thresholds that are generatedfrom sampling from Gaussian distributions. Wethus allow for both population-level and individuallevel-variation, on the assumption that this sam-pling procedure takes place whenever a participantis asked to evaluate an utterance in the TVJT.8

In order to evaluate the RSA-based thresh-old model, we conducted a Bayesian data analy-sis. This allowed us to simultaneously generatemodel predictions and infer likely parameter val-ues, by conditioning on the TVJT data from Jasbiet al. (separately for each of the four response-option conditions of the experiment) and integrat-ing over the free parameters. Each model assumesuniform priors over utterances and world states asabove. We infer the Gaussian threshold distribu-tion parameters and alpha optimality parametersfrom uniform priors over parameter values usingMCMC sampling (observing - for every sampleof possible parameter values the expected propor-tion of responses in that trial type and comparingthat distribution to the empirically-observed pat-tern of response).9 Additionally, for the Quater-

ative guess dog, we also assign some probability to dog as a guess in this context, albeit lower probability than is assigned to the conjunctive guess cat and dog.

8We also introduce a random noise term in the parameter estimation such that the simulated responder makes random guesses on 1% of trials. This noise term is removed when running the model forward to make predictive estimations.

9We used WebPPL (Goodman and Stuhlmuller, 2014) for



Binary condition
α      σ      μθ1
1.22   0.125  0.073

Ternary condition
α      σ      μθ1    μθ2
1.38   0.076  0.061  0.011

Quaternary condition
α      σ      μθ1    μθ2    μθ3    PT
2.75   0.159  0.277  0.101  0.048  0.797

Quinary condition
α      σ      μθ1    μθ2    μθ3    μθ4    PT
4.38   0.099  0.184  0.042  0.005  0.002  0.437

Table 3: MAP estimates obtained from Bayesian data analysis, where α is the optimality parameter, σ and μ are Gaussian threshold distribution parameters, and PT is the probability with which the speaker samples from P_SPT rather than directly from P_S1.

nary and Quinary conditions, we infer from a uniform prior the probability with which the speaker samples from P_SPT after sampling from P_S1. The intuition for restricting the 'Partial Truth' manipulation to these conditions is that the behavioral patterns which this manipulation is intended to cover are only observed in these conditions.10
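Concretely, the enriched speaker is a mixture: with probability PT an utterance is drawn from the 'Partial Truth' distribution, and otherwise from the S1 distribution. A minimal sketch of this mixture follows; the helper name and the toy distributions over three utterances are ours, purely for illustration.

```python
def mixture_speaker(p_s1, p_spt, pt):
    """Mix the 'Partial Truth' speaker distribution into the S1 speaker
    distribution: P(u) = pt * P_SPT(u) + (1 - pt) * P_S1(u)."""
    utterances = set(p_s1) | set(p_spt)
    return {u: pt * p_spt.get(u, 0.0) + (1 - pt) * p_s1.get(u, 0.0)
            for u in utterances}

# Toy example: P_SPT shifts mass toward the partially true conjunction,
# so the mixture assigns it nonzero production probability even though
# S1 alone would never produce it.
p_s1 = {"cat": 0.7, "cat or dog": 0.3, "cat and dog": 0.0}
p_spt = {"cat": 0.4, "cat or dog": 0.3, "cat and dog": 0.3}
mixed = mixture_speaker(p_s1, p_spt, pt=0.797)  # Quaternary MAP value of PT
```

Setting pt to zero recovers the unenriched S1 speaker, which is why the manipulation can be restricted to the Quaternary and Quinary conditions without changing the other two models.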

Posterior distributions over the parameter values are displayed in Figure 3, and model predictions using maximum a posteriori (MAP) estimates of the parameter values (Table 3) are plotted against Jasbi et al. (2019)'s results in Figure 1. Qualitatively, the model addresses each of the desiderata for an empirically adequate linking function discussed above. In all conditions, the model makes predictions for the full range of response options available to participants, thus addressing the issue of intermediate judgments. At the same time, the model addresses the issue of population-level variation: sampling threshold values from Gaussian distributions allows different judgments in the population for a given utterance (while keeping the speaker production probability of that utterance constant).

Recall that in the Quaternary and Quinary conditions, there was an asymmetry in the judgment of underinformative disjunctions versus underin-

MCMC inference, with 5000 samples (plus a lag of 10 iterations between samples) and a burn-in time of 20,000 iterations. We computed maximum a posteriori values from the marginal posterior distributions over parameter values using the density function in R.
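The sampling scheme described in this footnote can be sketched in outline. The following is an illustrative random-walk Metropolis-Hastings sampler applied to a toy one-parameter problem (inferring a Gaussian mean under a uniform prior), not the WebPPL program we actually ran; it shows only the burn-in and lag (thinning) bookkeeping, with smaller settings than those reported above.

```python
import math
import random

def metropolis(log_post, init, n_samples, burn_in, lag, prop_sd, rng):
    """Random-walk Metropolis-Hastings: discard `burn_in` initial steps,
    then keep every `lag`-th state until `n_samples` are collected."""
    x, lp = init, log_post(init)
    samples = []
    step = 0
    while len(samples) < n_samples:
        cand = x + rng.gauss(0.0, prop_sd)
        lp_cand = log_post(cand)
        # Accept with probability min(1, exp(lp_cand - lp))
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        step += 1
        if step > burn_in and step % lag == 0:
            samples.append(x)
    return samples

# Toy posterior: mu ~ Uniform(0, 5), data ~ Normal(mu, 1)
rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(50)]

def log_post(mu):
    if not 0.0 <= mu <= 5.0:
        return -math.inf  # outside the uniform prior's support
    return -0.5 * sum((d - mu) ** 2 for d in data)

samples = metropolis(log_post, init=2.5, n_samples=2000,
                     burn_in=1000, lag=5, prop_sd=0.5, rng=rng)
post_mean = sum(samples) / len(samples)
```

For the simulation-based models in the paper, `log_post` would instead score the observed response counts against the model-predicted response proportions for a candidate parameter setting.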

10We speculate that there may be a link between increasing the number of response options and participants' increased expectation of Partial Truth speaker behavior, which may have been strengthened by the fact that the Quaternary and Quinary conditions explicitly made reference to gradient levels of correctness (i.e. 'Kinda Right' / 'Kinda Wrong'). But this speculation warrants future investigation.

formative simple declaratives. The model makes use of the 'Partial Truth' speaker function in order to adjust the underlying speaker production probabilities, and hence the distribution of predicted response options, for these utterances. The 'Partial Truth' function also boosts the production probability of strictly false conjunctions, allowing the model to predict responses other than 'Wrong' for this trial type. Thus, the 'Partial Truth' enrichment helps to address both scalar diversity and endorsement of false utterances.11

The correlation between empirical observations and model predictions is high (Adj. R² > 0.9 in all conditions), suggesting that the threshold responder model is a good model of TVJT behavior overall. Nevertheless, the model makes some undesirable predictions. For example, it over-predicts rates of 'Neither' responses in the Quinary condition. Empirically, this response tended to be disfavored relative to positive and negative response options, for example in the case of strictly false cat and dog guesses. The model assumes that the labeling of the response options should have no particular effect on selection, but future work should engage with this assumption.
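For reference, the evaluation metric can be computed as follows. This is a generic sketch of adjusted R² for a least-squares linear fit of model predictions to observed response proportions; the function name is ours and the numbers used below are hypothetical, not values from the experiment.

```python
def adjusted_r2(observed, predicted, n_predictors=1):
    """Adjusted R-squared of a least-squares line relating predicted
    to observed proportions: the fraction of variance explained,
    penalized for the number of predictors."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Squared Pearson correlation gives R^2 for a simple linear fit
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_obs * var_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
```

A perfect linear relation between predictions and observations yields an adjusted R² of 1; mismatched proportions pull the value below 1.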

4 Discussion and Conclusion

Based on a single underlying probabilistic model of pragmatic competence, the presented threshold responder model provides a level of empirical coverage for TVJT data unavailable to existing linking models rooted in the categorical view of scalar implicature. The contribution of this paper is twofold: methodologically, we present this analysis as a proof-of-concept approach to modeling TVJT data for researchers in experimental semantics/pragmatics. We see the presented behavioral model as a starting point for future quantitative analytic work in the TVJT domain – a model against which future models may be assessed.12

On the theoretical side, the cognitive model that forms the basis for the behavioral model is non-neutral in its assumptions. In particular, it assumes that TVJT behavior is the result of reasoning about probabilistic utterance choices that

11We leave further investigation of the 'Partial Truth' function, in particular its extension to an analysis of linguistic imprecision as sketched above, to future work.

12For example, one could in principle link the threshold model to pragmatic listener probabilities of meanings given utterances rather than to speaker production probabilities given intended meanings (as we do in this paper).



Figure 3: Normalized marginal posterior distributions over parameter values for the threshold responder model in each experimental condition. Note that the posterior distribution for the optimality parameter α has been rescaled for the purposes of this visualization.

are the result of trading off (contextual) utterance informativeness and cost. Under this view, not only does TVJT behavior not quantify implicature rates; the very notion of an implicature evaporates. Rather than finding this undesirable, we believe that this framework allows for more rigorous engagement with the complexities of linking theoretical constructs to behavior (see also Franke 2016), an area that remains underexplored in experimental semantics/pragmatics.

References

Lewis Bott and Ira Noveck. 2004. Some utterances are underinformative: The onset and time course of scalar inferences. Journal of Memory and Language, 51(3):437–457.

Emmanuel Chemla and Benjamin Spector. 2011. Experimental evidence for embedded scalar implicatures. Journal of Semantics, 28(3):359–400.

Wim De Neys and Walter Schaeken. 2007. When people are more logical under cognitive load – dual task impact on scalar implicature. Experimental Psychology, 54(2):128–133.

Judith Degen and Noah D. Goodman. 2014. Lost your marbles? The puzzle of dependent measures in experimental pragmatics. In Proceedings of the 36th Annual Conference of the Cognitive Science Society, pages 397–402.

Judith Degen and Michael K. Tanenhaus. 2015. Processing scalar implicature: A constraint-based approach. Cognitive Science, 39(4):667–710.

Ryan Doran, Gregory Ward, Meredith Larson, Yaron McNabb, and Rachel E. Baker. 2012. A novel experimental paradigm for distinguishing between what is said and what is implicated. Language, 88:124–154.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336:998.

Michael Franke. 2016. Task types, link functions & probabilistic modeling in experimental pragmatics. In Preproceedings of Trends in Experimental Pragmatics.

Bart Geurts and Nausicaa Pouscoulous. 2009. Embedded implicatures?!? Semantics and Pragmatics, 2:1–34.

Noah D. Goodman and Andreas Stuhlmuller. 2013. Knowledge and implicature: modeling language understanding as social cognition. Topics in Cognitive Science, 5(1):173–84.



Noah D. Goodman and Andreas Stuhlmuller. 2014. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org. Accessed: 2019-8-8.

Herbert Paul Grice. 1975. Logic and conversation. Syntax and Semantics, 3:41–58.

Laurence Horn. 1972. On the Semantic Properties of the Logical Operators in English. Ph.D. thesis, UCLA.

Masoud Jasbi, Brandon Waldon, and Judith Degen. 2019. Linking hypothesis and number of response options modulate inferred scalar implicature rate. Frontiers in Psychology, 10:189.

Justine Kao, J. Wu, Leon Bergen, and Noah D. Goodman. 2014. Nonliteral understanding of number words. Proceedings of the National Academy of Sciences of the United States of America, 111(33):12002–12007.

Napoleon Katsos and Dorothy V. M. Bishop. 2011. Pragmatic tolerance: implications for the acquisition of informativeness and implicature. Cognition, 120(1):67–81.

Peter Lasersohn. 1999. Pragmatic halos. Language, pages 522–551.

Ira Noveck. 2001. When children are more logical than adults: experimental investigations of scalar implicature. Cognition, 78(2):165–188.

Ira Noveck and Andres Posada. 2003. Characterizing the time course of an implicature: an evoked potentials study. Brain and Language, 85(2):203–210.

Anna Papafragou and Julien Musolino. 2003. Scalar implicatures: experiments at the semantics–pragmatics interface. Cognition, 86:253–282.

Christopher Potts, Daniel Lassiter, Roger Levy, and Michael C. Frank. 2015. Embedded implicatures as pragmatic inferences under compositional lexical uncertainty. Journal of Semantics, 33:755–802.

Michael K. Tanenhaus. 2004. On-line sentence processing: past, present and future. In Manuel Carreiras and Charles Clifton, editors, On-line Sentence Processing: ERPs, Eye Movements and Beyond, pages 371–392. Psychology Press, London, UK.

Bob van Tiel, Emiel van Miltenburg, Natalia Zevakhina, and Bart Geurts. 2014. Scalar diversity. Journal of Semantics.
