

True Few-Shot Learning with Prompts – A Real-World Perspective

Timo Schick and Hinrich Schütze

Center for Information and Language Processing (CIS), LMU Munich, Germany

[email protected], [email protected]

Abstract

Prompt-based approaches are strong at few-shot learning. However, Perez et al. (2021) have recently cast doubt on their performance because they had difficulty getting good results in a “true” few-shot setting in which prompts and hyperparameters cannot be tuned on a dev set. In view of this, we conduct an extensive study of PET, a method that combines textual instructions with example-based finetuning. We show that, if correctly configured, PET performs strongly in a true few-shot setting, i.e., without a dev set. Crucial for this strong performance is PET’s ability to intelligently handle multiple prompts. We then put our findings to a real-world test by running PET on RAFT, a benchmark of tasks taken directly from realistic NLP applications for which no labeled dev or test sets are available. PET achieves a new state of the art on RAFT and performs close to non-expert humans for 7 out of 11 tasks. These results demonstrate that prompt-based learners like PET excel at true few-shot learning and underpin our belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.

1 Introduction

With pretrained language models (LMs) getting ever larger (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Fedus et al., 2021), instruction-based learning has emerged as a powerful method for few-shot text classification (e.g., Jiang et al., 2020; Schick and Schütze, 2021a,c; Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021). The key idea is to give an LM access to descriptive names for all possible outputs and to short prompts explaining the task to be solved. In settings where at most a few dozen examples are available, this simple idea leads to substantial improvements over

Figure 1: PET achieves near-human performance for 7 out of 11 tasks of the RAFT benchmark (Alex et al., 2021), for which labeled development and test sets are not available (the chart compares per-task scores of PET and non-expert humans on ADE, B77, NIS, OSE, Over, SOT, SRI, TAI, ToS, TEH and TC). This demonstrates that prompt-based learners like PET, if correctly configured, excel at true few-shot learning, i.e., without any tuning of instructions or hyperparameters on a development set.

various baselines (Schick and Schütze, 2021a,c; Gao et al., 2021; Tam et al., 2021).

However, recent work has questioned the strong few-shot performance of instruction-based approaches, arguing in particular that the considered settings are often not true few-shot settings (Perez et al., 2021; Logan IV et al., 2021), mainly for two reasons: For one, some approaches (e.g., Xie et al., 2019; Zhang et al., 2020; Chen et al., 2020; Tam et al., 2021) make use of large development sets to optimize hyperparameters. Beyond that, it is argued that manually designed instructions require manual tuning on development sets to achieve strong performance (Perez et al., 2021; Logan IV et al., 2021). Indeed, performance can vary largely – and in mostly unpredictable ways – across different instructions (Jiang et al., 2020; Schick and Schütze, 2021a); this issue even persists after finetuning a model on hundreds of instructions (Sanh et al., 2021). Even separate from this problem, the need for human involvement is generally seen as a huge drawback of manually designed instructions (Shin et al., 2020; Lester et al., 2021). Thus, several recent works abandon them in favor of


automatically generated prompts (Shin et al., 2020; Gao et al., 2021; Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021).

Contrary to this trend, we argue that when correctly configured, prompt-based approaches achieve strong performance even in true few-shot settings and that there is no problem in using manually designed instructions per se. On the contrary, such instructions are often relatively easy to specify if one is familiar with the task to be solved, they provide an intuitive interface to convey task-specific knowledge, and if properly used, they consistently improve model performance in few-shot settings. To provide empirical support for these claims, we revisit PET (Schick and Schütze, 2021a) – a method for combining instructions with example-based finetuning whose key feature is that it allows users to specify multiple instructions for a single task – and thoroughly examine its performance with human-made instructions in true few-shot settings. In order to simulate a real-world scenario as best as possible, we proceed in two steps: First, we conduct an extensive study of PET using three English academic datasets to analyze its ability to perform true few-shot learning in a controlled environment and to derive best practices regarding the choice of instructions and other hyperparameters. We then put our findings to the test and evaluate PET on a large variety of different real-world tasks from the RAFT benchmark (Alex et al., 2021), for which no labeled development or test sets are available, enforcing a true few-shot setting (Perez et al., 2021). On average, PET clearly outperforms all baselines on this dataset and comes surprisingly close to the performance of non-expert humans (see Figure 1), demonstrating that instruction-based learning can successfully be applied to real-world tasks in true few-shot settings.

In summary, the main contributions of this work are as follows:

• We investigate the performance of PET for various models, tasks and training set sizes, its ability to cope with different instructions and its robustness to hyperparameter choices in true few-shot settings.

• We show how PET can be used when no unlabeled data is available and propose a variant for efficient classification in scenarios with many different classes.

• We apply PET to RAFT (Alex et al., 2021), a benchmark of real-world tasks where it obtains a new state of the art and achieves near-human performance for 7 out of 11 tasks in true few-shot settings.

2 Related Work

As a precursor to instruction-based learning, some works have investigated ways of informing classifiers about the meaning of different output classes both for text (Chang et al., 2008; Veeranna et al., 2016; Zhou et al., 2018) and image classification (Norouzi et al., 2014; Romera-Paredes and Torr, 2015); actually providing instructions in the form of short prompts was first proposed by Radford et al. (2019). This idea has since been applied to solve a wide range of different NLP tasks without any task-specific training data (Puri and Catanzaro, 2019; Opitz, 2019; Davison et al., 2019; Schick et al., 2021; Schick and Schütze, 2021; Wei et al., 2021; Sanh et al., 2021). While most approaches rephrase tasks as a language modeling problem, some use prompts to reformulate them as different tasks for which large amounts of training data are available (Levy et al., 2017; McCann et al., 2018; Yin et al., 2019; Sun et al., 2021; Sainz et al., 2021). Instruction-based learning has also been used in few-shot settings; popular variants include in-context learning, where the model’s parameters are fixed and examples are provided as additional context (Brown et al., 2020; Lu et al., 2021; Kumar and Talukdar, 2021; Min et al., 2021), finetuning the entire model (Schick and Schütze, 2021a,c; Gao et al., 2021; Tam et al., 2021), and prompt tuning, where only the instruction itself is optimized (Shin et al., 2020; Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021).

Several works investigating the limitations and drawbacks of instruction-based few-shot approaches find that current LMs are mostly unable to understand complex instructions that go beyond short prompts or simple questions (Efrat and Levy, 2020; Weller et al., 2020; Webson and Pavlick, 2021) and that they are highly sensitive to the exact wording of the instructions provided (Jiang et al., 2020; Schick and Schütze, 2021a; Elazar et al., 2021). In a similar vein, Perez et al. (2021) and Logan IV et al. (2021) argue that prior work overestimates few-shot performance as manual prompt tuning is required to achieve good performance. Accordingly, some works attempt to obtain either prompts (Shin et al., 2020; Gao et al., 2021; Li and Liang, 2021; Lester et al., 2021) or meaningful names for output classes (Schick et al., 2020; Gao et al., 2021) without any human involvement.

Figure 2: Different choices of patterns and corresponding verbalizers for classifying movie reviews as positive (+) or negative (−). For the input “I really enjoyed this movie.”, example patterns are “I really enjoyed this movie. Question: Is this a positive movie review? Answer: [MASK].”, “I really enjoyed this movie. It was [MASK].” and “( [MASK] ) I really enjoyed this movie.”; example verbalizers are + = good / − = bad, + = Yes / − = No, and + = positive / − = negative. The input is first converted into a cloze question using the pattern; classification is done by computing the output whose verbalization is the most likely substitute for the mask according to the MLM.

Finally, many benchmarks have been proposed for comparing few-shot approaches in a standardized way (e.g., Mishra et al., 2021; Bragg et al., 2021; Xu et al., 2021; Ye et al., 2021; Alex et al., 2021). As our focus is on the real-world applicability of few-shot methods, we evaluate PET on the RAFT benchmark (Alex et al., 2021), which measures performance in applied settings.

3 Pattern-Exploiting Training

We briefly review pattern-exploiting training (PET) (Schick and Schütze, 2021a,c), the method we use for instruction-based text classification. At its core, PET combines textual instructions with regular finetuning using labeled examples. To that end, users must specify one or more patterns which convert an input example x into a cloze question so that it can readily be processed by a masked language model (MLM) (Devlin et al., 2019).1 These patterns can take on very different forms; some examples are shown in Figure 2. In addition, users must inform the model about the meaning of all output classes; this is done with a verbalizer that assigns a natural language expression to each output y (see Figure 2, right). We refer to the combination of a pattern and verbalizer as a pattern-verbalizer pair (PVP).

1 We use the term prompt to refer to a short sequence of tokens that typically contains some form of instruction; the term pattern is used to denote the function that adds a prompt to an input.

Given a single PVP, let p(y | x) be the probability that an MLM assigns to y’s verbalization in the cloze question obtained by applying the pattern to x, normalized over all y. The MLM is finetuned on labeled examples (x, y) by minimizing the cross-entropy loss between p(y | x) and y.
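To make these definitions concrete, the following minimal sketch (not the PET library's own code; the model name, pattern and verbalizer are illustrative assumptions) shows how p(y | x) can be computed for a single PVP with an off-the-shelf MLM; finetuning then minimizes −log p(y | x) on the labeled examples.

```python
# Sketch: scoring a single pattern-verbalizer pair (PVP) with a masked LM.
# Hypothetical pattern/verbalizer for binary sentiment; "roberta-base" is an
# illustrative choice, not necessarily the exact model used in the experiments.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def pattern(text: str) -> str:
    # A PROMPT-style pattern: append a short cloze prompt with a mask token.
    return f"{text} It was {tokenizer.mask_token}."

# Verbalizer: one (single-token) natural language expression per output class.
verbalizer = {"positive": " great", "negative": " terrible"}

def pvp_probs(text: str) -> dict:
    enc = tokenizer(pattern(text), return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]            # scores over the vocabulary
    ids = [tokenizer.encode(v, add_special_tokens=False)[0] for v in verbalizer.values()]
    probs = torch.softmax(logits[ids], dim=0)                # normalize over all outputs y
    return dict(zip(verbalizer.keys(), probs.tolist()))

print(pvp_probs("I really enjoyed this movie."))
# Finetuning on a labeled example (x, y) would minimize -log p(y | x) computed
# from the same logits, i.e., a cross-entropy loss restricted to the verbalizer tokens.
```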

If a user specifies multiple PVPs, individual models are trained for each pair. Similar to knowledge distillation (Hinton et al., 2015), they are then used to annotate unlabeled examples for training a final classifier with a regular sequence classification head (Devlin et al., 2019). We use the weighted variant of PET without auxiliary language modeling; see Schick and Schütze (2021a) for details.
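A rough sketch of this distillation step, under simplifying assumptions, is given below: each per-PVP model is treated as a callable returning a distribution over classes, the ensemble's weighted average distribution serves as a soft target, and the final classifier is trained with a cross-entropy loss against these soft targets. Names such as pvp_models and weights are placeholders, not the library's API.

```python
# Sketch of PET's distillation step: soft-label unlabeled examples with the
# ensemble of PVP-specific models, then train a regular classifier on the result.
import torch
import torch.nn.functional as F

def soft_labels(unlabeled_texts, pvp_models, weights):
    """pvp_models: callables mapping a text to a tensor of class probabilities."""
    w = torch.tensor(weights, dtype=torch.float)
    targets = []
    for x in unlabeled_texts:
        dists = torch.stack([m(x) for m in pvp_models])      # [num_pvps, num_classes]
        targets.append((w[:, None] * dists).sum(0) / w.sum())
    return torch.stack(targets)                              # [num_examples, num_classes]

def distillation_loss(classifier_logits, targets):
    # Cross-entropy between the classifier's prediction and the ensemble's soft target.
    return -(targets * F.log_softmax(classifier_logits, dim=-1)).sum(-1).mean()
```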

Multi-Token Verbalizers When using encoder-only MLMs (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020), one limitation of PET is that the verbalization of each output class must correspond to a single token. Schick and Schütze (2021c) propose a fix for this that uses multiple mask tokens, but their approach is very inefficient. As discussed by Schick and Schütze (2021b), an alternative solution is to instead use an encoder-decoder LM (Lewis et al., 2020; Raffel et al., 2020). In Section 5, we propose yet another solution that allows us to stick with encoder-only MLMs while being much more efficient than the approach of Schick and Schütze (2021c).

Iterative PET We also experiment with iPET (Schick and Schütze, 2021a), an iterative variant of PET that employs self-training (e.g., Scudder, 1965; Yarowsky, 1995; Brin, 1999; McClosky et al., 2006) to train several generations of models on datasets of increasing size. To this end, an ensemble of MLMs is trained as in regular PET and then used to assign labels to unlabeled examples; a new ensemble is trained on the so-obtained data. This process is repeated for multiple iterations, where the number of annotated examples is increased in each iteration. We refer to Schick and Schütze (2021a) for more details.

4 True Few-Shot Learning with PET

We conduct a variety of experiments to answer six important research questions (Q1–Q6) regarding


the extent to which true few-shot learning is possible with PET. Beyond that, the purpose of our experiments is to establish a set of best practices for our real-world experiments on RAFT (Alex et al., 2021). Before discussing individual experiments, we describe the underlying setup.

Tasks and Datasets While they are heavily used in prior work (e.g., Brown et al., 2020; Schick and Schütze, 2021c; Logan IV et al., 2021; Webson and Pavlick, 2021), we decide against tasks and datasets from the GLUE (Wang et al., 2018) and SuperGLUE benchmarks (Wang et al., 2019) as they are very different from what we expect to see in real-world applications. Instead, we experiment with AG’s News, Yelp Reviews Full Star and Yahoo Questions (Zhang et al., 2015) as these datasets represent classification tasks in three different domains that resemble real-world settings. Further, it is relatively easy to come up with a variety of instructions for each of these tasks, making it more straightforward to experiment with a large number of different patterns.

We consider settings with n = 10 and n = 100 training examples. For each n, we generate five different training sets per task by randomly sampling examples from the original training sets while ensuring that the number of examples is about the same for each possible output class. In addition, for both n = 10 and n = 100, we sample 1,000 unlabeled examples from the original training sets. We repeat all of our experiments for all five training sets and, by default, report average performance.
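The sampling scheme can be sketched as follows (a simplified illustration, not the exact script used for our experiments); it draws a roughly class-balanced training set of size n and a disjoint pool of unlabeled examples.

```python
# Sketch: build a roughly class-balanced few-shot training set plus an unlabeled pool.
import random
from collections import defaultdict

def sample_few_shot(examples, n, n_unlabeled=1000, seed=0):
    """examples: list of (text, label) pairs from the original training set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    per_class = max(1, n // len(by_label))          # about the same number per class
    train = []
    for items in by_label.values():
        train.extend(rng.sample(items, per_class))
    # Assumes the original training set is large enough for the unlabeled pool.
    remaining = [ex for ex in examples if ex not in train]
    unlabeled = [text for text, _ in rng.sample(remaining, n_unlabeled)]
    return train, unlabeled
```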

PVPs We manually write a total of 23 patterns per task, all of which can be categorized into one of the following groups:2

• NULL: Following Logan IV et al. (2021), these patterns simply insert a mask token.

• PUNC: Similar to some patterns of Schick and Schütze (2021a), these patterns only add punctuation characters and a mask token.

• PROMPT: Patterns in this group add short prompts – typically consisting of no more than three words – to the input, similar to Radford et al. (2019) and Schick and Schütze (2021a).

• Q&A: These patterns rephrase the task as a question q and append “Question: q Answer: [MASK].” to the input, similar to Brown et al. (2020) and Schick et al. (2021).

2 The full set of PVPs can be found in Appendix A.

For all patterns, we use only a single verbalizer which we adopt from Schick and Schütze (2021a). While varying the verbalizer may also lead to interesting insights, finding a large number of reasonable verbalizers is challenging for some tasks as there is often a single, natural choice (e.g., the actual category names for AG’s News and Yahoo Questions).
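As an illustration of the four groups, the following hypothetical patterns (for a topic classification task such as AG's News; the exact wordings of our 23 patterns may differ) show how each group modifies the input before the masked token is predicted:

```python
# Hypothetical examples of the four pattern groups; [MASK] marks the token whose
# most likely verbalization determines the predicted class.
MASK = "[MASK]"

def null_pattern(text):    # NULL: only a mask token is inserted
    return f"{text} {MASK}"

def punc_pattern(text):    # PUNC: punctuation characters and a mask token
    return f"( {MASK} ) {text}"

def prompt_pattern(text):  # PROMPT: a short prompt of at most a few words
    return f"News: {text} Topic: {MASK}"

def qa_pattern(text):      # Q&A: the task rephrased as a question
    return f"{text} Question: What is the topic of this article? Answer: {MASK}."

# Single verbalizer (illustrative): one natural-language name per output class.
verbalizer = {"World": "World", "Sports": "Sports", "Business": "Business", "Sci/Tech": "Tech"}
```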

Hyperparameters We consider a setting similar to that of Schick and Schütze (2021a,c) and, unless otherwise specified, use the default settings of the PET library.3 As our experiments require training hundreds of models, we make a few changes to reduce environmental impact (Strubell et al., 2019) and computational cost: We use the base variant of RoBERTa (Liu et al., 2019) as underlying LM, we train only one model per PVP, and we reduce the training steps for all individual models and the final classifier to 100 and 1,000, respectively.

Monitoring Finetuning pretrained LMs can be unstable on small datasets (Devlin et al., 2019; Dodge et al., 2020), sometimes leading to very poor performance. Luckily, we can detect such finetuning issues to some extent even without a labeled test set using the following two checks:

• TRAIN SET UNDERFITTING: We check for training runs that result in less than 50% accuracy on the training set. As finetuning on up to 100 examples typically leads to perfect predictions on the training set, this is a clear indicator of a failed finetuning run.

• CONSTANT PREDICTIONS: Another strong indicator of unsuccessful training is when the finetuned model predicts the same output class for all inputs; we check this both on the training set and on the unlabeled set.

Whenever one of these events occurs, we restart training using a different seed for initializing all random number generators.
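A possible implementation of these two checks is sketched below; it only assumes access to the model's predictions on the training set and on the unlabeled set.

```python
# Sketch of the two monitoring checks used to detect failed finetuning runs.
def train_set_underfitting(train_preds, train_labels, threshold=0.5):
    # Failed run if training accuracy is below 50%: finetuning on <= 100 examples
    # normally yields (near-)perfect predictions on the training set.
    acc = sum(p == y for p, y in zip(train_preds, train_labels)) / len(train_labels)
    return acc < threshold

def constant_predictions(train_preds, unlabeled_preds):
    # Failed run if the model predicts a single class everywhere,
    # checked both on the training set and on the unlabeled set.
    return len(set(train_preds)) == 1 or len(set(unlabeled_preds)) == 1
```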

Q1: How can we find the best pattern – or do we even need to?

Even slightly different patterns can lead to very different performance (Jiang et al., 2020; Schick and Schütze, 2021a; Schick et al., 2021; Webson and Pavlick, 2021; Sanh et al., 2021, i.a.) and popular model selection criteria cannot reliably identify patterns that achieve similar performance to the best one (Perez et al., 2021). We thus investigate to what extent PET can eliminate the need to find the best instruction even in extreme settings where there are dozens of candidates to choose from.

3 See https://github.com/timoschick/pet

Figure 3: Performance of individual patterns, PET and iPET on all tasks considered (AG’s News, Yelp Full and Yahoo Questions, each for n = 10 and n = 100). Accuracy is shown on the y-axis; the x-axis shows individual pattern ids where color is used to distinguish the different pattern categories (NULL, PUNC, PROMPT, Q&A). Small bullets correspond to individual training sets, large bullets correspond to average performance. Average performance across all patterns is shown as a dashed gray line.

Setup Using our default setup, we train individual models for each PVP and a final PET model; we also train an iPET model for 3 iterations.

Results Performance of individual models for each pattern and of the distilled models obtained using PET and iPET is shown in Figure 3. Interestingly, sorting all pattern groups by their average performance gives the exact same order for each task and training set size: NULL patterns clearly perform worst, followed by PUNC and PROMPT; Q&A gives the best average results. Contrary to findings of Logan IV et al. (2021), this shows that LMs can benefit a lot from manually written instructions even if combined with finetuning.

Crucially, PET’s performance is much higher than the average performance of individual patterns; further, it consistently outperforms even the best pattern, verifying that PET indeed removes the need to find the best pattern. While iPET gives clear improvements for n = 10, it performs worse than PET for n = 100. The reason for this may be that we use a much smaller set of unlabeled examples than prior work (Schick and Schütze, 2021a,c).

Q2: Does performance of different patterns transfer across models?

While our results for Q1 show a consistent order of pattern groups for different training set sizes and tasks, an important question for real-world applications is whether the same finding also holds for different model sizes, and even entirely different models.

Setup We consider BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020) as underlying language models;4 we experiment with the base and large variants. For each model and size, we repeat the exact same experiment as for Q1.

4 For Yahoo, we do not consider BERT as it uses a vocabulary that does not assign a single token to each verbalization.


Figure 4: Relative performance of individual pattern groups (NULL, PUNC, PROMPT, Q&A) and PET for different models (RoBERTa, BERT and ALBERT in their base and large variants) on AG’s News, Yelp Full and Yahoo Questions with n = 10 and n = 100. Scores are normalized so that the best performance for each task, number of examples, model and size is 1.0.

Results Figure 4 shows the performance of each pattern group (i.e., average performance of all individual patterns in this group) and PET; scores are normalized so that the best-performing approach for each task, training set size and model gets a score of 1.0. With very few exceptions, our findings from Q1 regarding the relative performance of pattern groups and PET (NULL < PUNC < PROMPT < Q&A < PET) also hold for different models and sizes. In general, the performance of individual patterns strongly correlates between different models and sizes (Spearman’s ρ ≥ 0.7 except in one case).

Q3: Does PET still work if some PVPs are not well understood?

While Q1 and Q2 have shown that PET performs even better than the best PVP if a large set of high-quality PVPs is given, a potential concern is that performance may be much worse if the LM fails to understand a fair amount of patterns and verbalizers (e.g., because they are in a very different style from the model’s pretraining data). For real-world scenarios, it is thus relevant to know how such “bad” PVPs affect the performance of PET.

Figure 5: Performance of PET with three randomly selected patterns when adding noise PVPs (shown for AG’s, Yelp and Yahoo with n = 10 and n = 100); the x-axis shows the number of noise PVPs added. We also show performance of using only noise PVPs with PET (NP+P) and their average performance (NP).

Setup It is difficult to obtain large quantities of bad instructions as they might occur in real-world scenarios. As a proxy, we resort to noise patterns that add random tokens to the input, serving as a lower bound for truly bad patterns. In concrete terms, we add up to three randomly sampled tokens before and after the input.5 We also create noise verbalizers by assigning uniformly selected tokens to each output class. Using this process, we obtain 20 different intentionally bad PVPs per task. For each task, we start with 3 randomly selected, high-quality patterns from our original set of manually designed instructions, add noise PVPs one by one and investigate the effect on performance.

5 If there are multiple input texts, we shuffle their order and additionally add 0–3 tokens in between them.
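A minimal sketch of this noise-PVP construction (assuming some token vocabulary vocab to sample from; the exact sampling procedure used in our experiments may differ in detail):

```python
# Sketch: build an intentionally "bad" PVP from randomly sampled tokens.
import random

def make_noise_pvp(vocab, labels, seed=0):
    rng = random.Random(seed)
    before = rng.sample(vocab, rng.randint(0, 3))   # up to three tokens before the input
    after = rng.sample(vocab, rng.randint(0, 3))    # ... and up to three after it
    def pattern(text):
        return " ".join(before + [text] + after + ["[MASK]"])
    # Noise verbalizer: a uniformly sampled token for each output class.
    verbalizer = {y: rng.choice(vocab) for y in labels}
    return pattern, verbalizer
```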

Results The effect of adding noise PVPs is shown in Figure 5. Interestingly, for both n = 10 and n = 100, performance remains almost constant even if more than half of the used PVPs are noise PVPs, demonstrating that PET is very robust to “bad” instructions. Figure 5 also shows performance when using only noise PVPs; except for AG’s News with n = 100, this leads to substantially worse performance than PET with a small amount of manually designed instructions.

Figure 6: Relative performance of PET with only a subset of patterns compared to that achieved using all 23 manually designed patterns. The x-axis shows the number of patterns used.

Q4: How many patterns are required for good performance?

Orthogonal to Q3, it is also important to know how many high-quality prompts are required at a minimum to achieve satisfactory performance. This is of great practical importance because coming up with dozens of different PVPs for a task may take a significant amount of time that could otherwise be spent annotating further examples.

Setup We generate 10 random permutations of all 23 patterns per task. For each permutation and training set, we use the same setup as in Q1 to compute the average performance obtained with PET when using only the first i patterns, where i ranges from 1 to 5.

Results Average performance of PET trained with the first i patterns is shown in Figure 6, relative to the performance of PET trained with all 23 patterns. For all tasks and training set sizes, as few as four different patterns are already sufficient to achieve performance very close to that of PET trained with all 23 patterns. Surprisingly, PET’s performance is much higher than the average performance of a model trained on individual patterns even with i = 1. This indicates that the process of knowledge distillation using unlabeled data is also beneficial when using only a single instruction.

Q5: How important are the values of other hyperparameters?

For true few-shot settings, the same set of hyperparameter values should ideally achieve good performance across different tasks; this enables us to adopt these values for new tasks without requiring manual tuning on task-specific validation sets. We therefore investigate how different choices for common hyperparameters affect the performance of PET; in concrete terms, we consider learning rate, training steps and batch size.

Setup Based on choices found in previous work, we consider different learning rates from the set {10−4, 5 · 10−4, 10−5, 5 · 10−5, 10−6}, training steps from {10, 25, 50, 100, 250, 500, 1000} and batch sizes from {1, 2, 4, 8, 16, 32}. Learning rate and batch size are changed for the individual models and the final classifier simultaneously; the number of training steps is varied only for individual models. We modify each hyperparameter independently, keeping all other parameters at their default values (i.e., a learning rate of 10−5, 100 steps and a batch size of 4).
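Written out, the grid and the one-at-a-time variation look as follows (a sketch for reference; the hyperparameter names are our own, not the library's):

```python
# Hyperparameter grid from the setup above; each value is varied independently
# while all other hyperparameters stay at their default values.
GRID = {
    "learning_rate": [1e-4, 5e-4, 1e-5, 5e-5, 1e-6],
    "train_steps":   [10, 25, 50, 100, 250, 500, 1000],
    "batch_size":    [1, 2, 4, 8, 16, 32],
}
DEFAULTS = {"learning_rate": 1e-5, "train_steps": 100, "batch_size": 4}

def configurations(grid=GRID, defaults=DEFAULTS):
    # Yield one configuration per grid value, deviating from the defaults
    # in exactly one hyperparameter.
    for name, values in grid.items():
        for value in values:
            yield {**defaults, name: value}
```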

Results All results are shown in Figure 7. For training steps and batch size, performance is relatively stable across a wide range of different values, with more steps and larger batch sizes typically leading to slightly better performance (especially for n = 100). Learning rate clearly has the strongest impact on performance, but values of 10−5 and 5 · 10−5 consistently give the best results across all tasks considered; those are also the values typically used for finetuning in prior work (Devlin et al., 2019; Liu et al., 2019).

Q6: Do we really need large amounts of unlabeled data?

One drawback of PET compared to using individual PVPs is the additional need for unlabeled data; while such data is often available in real-world settings, there are cases in which even unlabeled examples are hard to obtain. As recent work shows that pretrained LMs are capable of generating data of reasonable quality themselves (Anaby-Tavor et al., 2020; Papanikolaou and Pierleoni, 2020; Yang et al., 2020; Mohapatra et al., 2020; Kumar et al., 2021; Schick and Schütze, 2021), we investigate whether unlabeled data can simply be replaced by synthetic examples.


Figure 7: Performance of PET (solid lines) and average performance of individual models (dotted lines) for different learning rates (LR), training steps (Steps), and batch sizes, shown for AG’s, Yelp and Yahoo with n = 10 and n = 100.

Setup We use GPT2-XL (Radford et al., 2019) to generate synthetic unlabeled data. To this end, we provide one or two randomly chosen training examples without labels as in-context examples (Brown et al., 2020) and let the model generate one additional example. For two inputs x1 and x2, the input given to the model is

Example 1: x1 \n\n Example 2: x2 \n\n Example 3:

where \n\n denotes two consecutive line breaks. If an input consists of two texts, we simply concatenate them using the sequence +++ as a separator.

We generate 10,000 examples for n = 10 and 30,000 examples for n = 100 using top-p sampling (Holtzman et al., 2020) with p = 0.9. For each input, we stop the generation process as soon as the model generates two consecutive line breaks.

We discard all examples for which the model does not generate two consecutive line breaks within 128 tokens; for datasets with text pairs, we also discard examples where the model fails to generate the sequence separator (+++).
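A rough sketch of this generation step (not our exact script; GPT2-XL via the transformers library is assumed, and the discard logic follows the description above):

```python
# Sketch: generate one synthetic unlabeled example with GPT2-XL and top-p sampling.
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

def generate_example(train_texts, seed=0, max_new_tokens=128):
    rng = random.Random(seed)
    shots = rng.sample(train_texts, rng.choice([1, 2]))      # one or two in-context examples
    prompt = "".join(f"Example {i + 1}: {t}\n\n" for i, t in enumerate(shots))
    prompt += f"Example {len(shots) + 1}:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(out[0, inputs.input_ids.shape[1]:])
    if "\n\n" not in text:
        return None          # discard: no two consecutive line breaks within the budget
    return text.split("\n\n")[0].strip()
```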

As the datasets obtained with this method may be highly imbalanced regarding the distribution of (unknown) labels, we also experiment with a balanced variant: We use the ensemble of models trained on individual PVPs to assign labels to each example and keep only as many examples per label as needed for the resulting dataset – which is used for training the final classifier – to be balanced.

Results Figure 8 shows the performance of individual patterns as well as PET and iPET both with real and synthetic unlabeled data. Except for iPET on Yahoo Questions with n = 10, using synthetic data consistently achieves accuracies within one point of using real data, with our balanced version performing slightly better. Moreover, for n = 10 using synthetic data even improves accuracy in some cases. This shows that in the absence of unlabeled examples, synthetic data obtained from generative language models can serve as a drop-in replacement without substantially degrading performance.

5 PET for Real-World Tasks

We use our insights from Section 4 to apply PET to the RAFT benchmark, a collection of 11 diverse real-world tasks whose automated solution has inherent value to someone (Alex et al., 2021). These tasks pose various challenges to few-shot approaches as they require some amount of domain expertise and the ability to understand detailed instructions, to process long inputs and to handle a large number of output classes.

Tasks and Datasets The RAFT benchmark includes 11 tasks from different domains: ADE Corpus V2 (ADE), Banking77 (B77), NeurIPS impact statement risks (NIS), OneStopEnglish (OSE), Overruling (Over), Semiconductor org types (SOT), Systematic review inclusion (SRI), TAI safety research (TAI), Terms of Service (ToS), TweetEval Hate (TEH) and Twitter complaints (TC). For a detailed overview of all tasks, we refer to Alex et al. (2021). Each task comes with 50 labeled training examples; in accordance with the RAFT rules, we additionally make use of the unlabeled data (ranging from 150 to 5,000 examples) for PET’s distillation step. Note that the unlabeled set for RAFT is the same as the test set, meaning that unlike in Section 4, our final classifier is directly trained on (unlabeled) test examples.

Figure 8: Minimum, average and maximum performance of individual patterns compared to regular PET and iPET as well as PET and iPET with synthetic data (+SD) on AG’s News, Yelp Full and Yahoo Questions. Accuracies are scaled by a factor of 100 for better readability.

PVPs Based on our findings from Q1 and Q2, we exclusively specify Q&A prompts for each task. To obtain the question q, we make minimal changes to the original instructions of Alex et al. (2021); we rephrase all binary classification tasks as yes/no questions to facilitate finding a suitable verbalizer. For example, we rephrase the instruction “Label the sentence based on whether it is related to an adverse drug effect (ADE).” as the question “Is this sentence related to an adverse drug effect (ADE)?” Following our results from Q4, we specify 4 PVPs per task. In case of binary classification, we use two different patterns that either include or omit the full task specification of Alex et al. (2021)6 and combine them with both a yes/no verbalizer and a true/false verbalizer. The full set of PVPs for all tasks can be found in Appendix B.

6 For a full list of all task specifications, see https://github.com/oughtinc/raft-baselines.
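For illustration, a hypothetical sketch of such a Q&A-style PVP for a binary RAFT task (the ADE question follows the example above; the exact pattern wordings are listed in Appendix B):

```python
# Sketch: Q&A-style PVPs for a binary RAFT task, with and without the full
# task specification, combined with yes/no and true/false verbalizers.
MASK = "[MASK]"

def qa_pattern(text, question, specification=None):
    prefix = f"{specification} " if specification else ""
    return f"{prefix}{text} Question: {question} Answer: {MASK}."

ade_question = "Is this sentence related to an adverse drug effect (ADE)?"
verbalizers = [
    {True: "Yes", False: "No"},
    {True: "true", False: "false"},
]
```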

Hyperparameters We mostly keep the default values for hyperparameters used throughout Section 4 as our experiments for Q5 show that these perform well for all tasks considered. However, we make the following changes:

• We replace RoBERTa (base) with ALBERT (xxlarge, v2). While being much slower to train, ALBERT was shown to outperform RoBERTa both in regular and few-shot settings (Lan et al., 2020; Schick and Schütze, 2021c; Logan IV et al., 2021).

• For tasks with more than 4,000 unlabeled examples, we finetune the distilled model for 2,000 steps instead of 1,000 steps. Otherwise, some examples would not be seen at all during training.7

• Following Schick and Schütze (2021a,c), we train three individual models per PVP. This improves robustness as performance can vary largely between individual finetuning runs.

Handling Many Labels The B77 dataset contains online banking customer service queries annotated with one of 77 possible intents. This large amount of different outputs leads to several issues for PET: First, it is impossible to specify a meaningful verbalizer that maps each intent to a single token. We initially experimented with the multi-mask version of PET (Schick and Schütze, 2021c), but found it to be too inefficient to get results in a reasonable amount of time. Therefore, we tried the following solution: We rephrase the task as binary classification, where for each pair of query x and intent y, the task is to decide whether y is the correct intent for x. For each original training example (x, y), we generate one example (x, y, True) and four examples (x, y′, False) with randomly sampled, wrong intents y′. As this increases the amount of data fivefold, we finetune each individual model for 500 steps instead of 100 steps.
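A minimal sketch of this reformulation (function and variable names are ours, not the library's):

```python
# Sketch: turn 77-way intent classification into binary (query, intent) decisions.
import random

def binarize(train_examples, all_intents, n_negatives=4, seed=0):
    """train_examples: (query, intent) pairs -> (query, intent, is_correct) triples."""
    rng = random.Random(seed)
    data = []
    for query, intent in train_examples:
        data.append((query, intent, True))                      # one positive example
        wrong = rng.sample([i for i in all_intents if i != intent], n_negatives)
        data.extend((query, w, False) for w in wrong)            # four negative examples
    return data
```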

While this approach solves our problem, it is still not particularly efficient: Reframing the task as binary classification means that for each input, 77 forward passes are required to find the correct intent. We thus train the final model as a regular classifier with 77 different output classes; for training this classifier on an input x, we set the target probability of each output y proportional to the probability of True being the correct output for (x, y) according to our ensemble of binary classifiers.

7 Training for 1,000 steps with a batch size of 4 means that at most 4,000 examples can be processed.

Method    ADE   B77   NIS   OSE   Over  SOT   SRI   TAI   ToS   TEH   TC    Avg
GPT-2     60.0  12.1  56.1  24.5  49.8  38.0  49.2  61.2  49.8  31.1  72.3  45.8
GPT-Neo   45.2  14.9  40.8  34.3  68.1  40.6  49.3  60.5  56.5  55.4  63.6  48.1
AdaBoost  54.3  02.3  62.6  47.5  83.8  45.5  50.6  55.6  56.0  44.3  62.5  51.4
snlt      60.3  24.8  58.5  30.2  83.1  33.6  49.2  62.6  54.0  44.9  79.1  52.8
GPT-3     68.6  29.9  67.9  43.1  93.7  76.9  51.6  65.6  57.4  52.6  82.1  62.7
SetFit    72.6  53.8  87.2  52.1  90.7  68.2  49.3  62.8  62.0  53.2  83.7  66.9
PET       82.2  59.3  85.7  64.6  90.8  81.6  49.3  63.8  57.6  48.3  82.4  69.6
Human     83.0  60.7  85.7  64.6  91.7  90.8  46.8  60.9  62.7  72.2  89.7  73.5

Table 1: Performance of various baselines and PET on the RAFT benchmark (Alex et al., 2021); shown numbers are macro F1 scores multiplied by 100. The final column shows average performance across all 11 tasks.

Finally, another issue is that with 50 labeled examples, at least 27 labels are not covered in the training set at all; this may bias a model to never predict any of these labels. To alleviate this issue, we train two generations of models using iPET. For training the second generation, we obtain training data covering all possible labels as follows: For each label, we pick the two examples from our set of unlabeled data for which this label is most likely according to the first generation; the same approach was used by Schick and Schütze (2021a) for applying iPET in zero-shot settings.

Of course, the nature of RAFT makes it impossible to measure the impact of any of these choices. While we could conduct experiments similar to those in Section 4, none of the datasets considered therein has a similar structure to B77; as our modifications affect only one out of 11 tasks, we thus decided to not perform any further analysis.

Monitoring We checked for TRAIN SET UNDERFITTING and CONSTANT PREDICTIONS as in Section 4 to detect finetuning issues. Unlike for our experiments in Section 4, on RAFT we encountered some issues that could not be resolved simply by retraining with a different seed:

• We observed TRAIN SET UNDERFITTING for the final classifier on B77. This may be due to the classification head for 77 classes introducing many new parameters; we tried training the final model for 5,000 steps instead of 2,000 steps, which fixed this issue.

• We observed CONSTANT PREDICTIONS for the ToS training set. Doubling the number of training steps resolved this problem.

• Finally, we also observed CONSTANT PREDICTIONS on the unlabeled data of SRI. Upon manually inspecting the training set, we observed that all but one out of 50 examples have the same label. As all models already classified the training set perfectly, we left the setup for our SRI submission unchanged.

Results For all 11 tasks, results of PET and various baselines are shown in Table 1.8 As can be seen, PET performs better than all other approaches on average, achieving near-human performance for 7 out of 11 tasks. Note, however, that non-expert humans perform worse than a majority baseline on SRI, so results on this task should be taken with a grain of salt. PET also clearly outperforms a GPT-3 model (Brown et al., 2020) by almost 7 points, despite the latter being larger by several orders of magnitude.9 While PET is particularly successful on ADE, B77 and OSE (where it outperforms GPT-3 by 13.6, 21.5 and 29.4 points, respectively), it performs comparably badly on datasets in the law (Over, ToS) and social media (TEH, TC) domains. Our approach for handling many labels performs surprisingly well on B77 without any tuning of its parameters. Due to the nature of the RAFT benchmark, we cannot perform further analysis or ablation studies.

8 All results are taken directly from the leaderboard at https://huggingface.co/spaces/ought/raft-leaderboard.

9 We were unable to find any information on SetFit, the second-best performing method (e.g., which underlying LM is used or whether it makes use of any additional data), so we cannot make a fair comparison.

6 Discussion

Our experimental results in Sections 4 and 5 show that strong performance in few-shot settings is clearly possible without manual prompt tuning or hyperparameter optimization on large development sets; in other words, PET can successfully be applied in true few-shot settings. While we believe that it should be an important goal of future work to make LMs more robust to different instructions, even with current models it is relatively easy to successfully apply PET when following a few simple principles – such as rephrasing the task in a Q&A format, using simple vocabulary and single-token verbalizers where possible, and specifying at least a handful of different patterns. In light of these findings, we also hope that future work will not view human involvement in prompt design as a drawback of instruction-based approaches, but rather as an exciting possibility to communicate with models in ways other than exclusively through examples.

There are various limitations to our study. First, a major obstacle to actually applying PET in real-world applications is that we do not know a priori how well it performs for a given task; we therefore believe an important next step is to investigate methods for estimating performance without access to large test sets – for example, through model calibration (Desai and Durrett, 2020; Jiang et al., 2021) – in real-world settings. In addition, we did not fully explore the capabilities of PET; for example, we did not investigate domain-adaptive pretraining (Gururangan et al., 2020) and auxiliary language modeling (Chronopoulou et al., 2019), both of which were shown to be helpful by Schick and Schütze (2021a). We also did not quantify the impact of our decisions regarding B77 and the effectiveness of our monitoring, and only considered English models and datasets. Finally, we did not examine PET’s performance beyond aggregate scores. While this is not feasible on RAFT due to the nature of this dataset, performing such analysis either with other datasets or with methods such as the ones proposed by Ribeiro et al. (2020) would be relevant future work to understand real-world capabilities of instruction-based approaches more comprehensively.

7 Conclusion

In light of recent work casting doubt on the performance of prompt-based approaches in true few-shot settings (Perez et al., 2021), we have conducted an extensive study of PET. In a controlled environment, we found that manually designed instructions consistently outperform null prompts, with Q&A-style prompts performing best (Q1, Q2). Across different tasks, models and training set sizes, PET consistently outperforms even the best individual prompt (Q1, Q2). We have also shown that PET is robust to uninformative prompts and to different choices of hyperparameters (Q3, Q5), that as few as four prompts are sufficient to reach good performance (Q4), and that synthetic examples can be used to replace large amounts of unlabeled data (Q6). On the basis of these insights, we applied PET to a benchmark of real-world tasks, where it achieves near-human performance for 7 out of 11 tasks without any tuning on a development set, demonstrating the power of instruction-based approaches in true few-shot settings.

References

Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, and Andreas Stuhlmüller. 2021. RAFT: A real-world few-shot text classification benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7383–7390.

Jonathan Bragg, Arman Cohan, Kyle Lo, and Iz Beltagy. 2021. FLEX: Unifying evaluation for few-shot NLP. In Thirty-Fifth Conference on Neural Information Processing Systems.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases, pages 172–183, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Ming-Wei Chang, Lev Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI'08, page 830–835. AAAI Press.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157, Online. Association for Computational Linguistics.

Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095, Minneapolis, Minnesota. Association for Computational Linguistics.

Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, Hong Kong, China. Association for Computational Linguistics.

Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295–302, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. Computing Research Repository, arXiv:2002.06305.

Avia Efrat and Omer Levy. 2020. The turking test: Can language models understand instructions? Computing Research Repository, arXiv:2010.11982.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Computing Research Repository, arXiv:2102.01017.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Computing Research Repository, arXiv:2101.03961.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.

Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Computing Research Repository, arXiv:1503.02531.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Sawan Kumar and Partha Talukdar. 2021. Reordering examples helps during priming-based few-shot learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4507–4518, Online. Association for Computational Linguistics.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2021. Data augmentation using pre-trained transformer models. Computing Research Repository, arXiv:2003.02245.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. Computing Research Repository, arXiv:2104.08691.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.

Robert L. Logan IV, Ivana Balaževic, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. 2021. Cutting down on prompts and parameters: Simple few-shot learning with language models. Computing Research Repository, arXiv:2106.13353.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Computing Research Repository, arXiv:2104.08786.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. Computing Research Repository, arXiv:1806.08730.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, New York City, USA. Association for Computational Linguistics.

Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2021. Noisy channel language model prompting for few-shot text classification. Computing Research Repository, arXiv:2108.04106.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. Computing Research Repository, arXiv:2104.08773.

Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. Computing Research Repository, arXiv:2010.10216.

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean. 2014. Zero-shot learning by convex combination of semantic embeddings.

Juri Opitz. 2019. Argumentative relation classification as plausibility ranking. In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 193–202, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Yannis Papanikolaou and Andrea Pierleoni. 2020. DARE: Data augmented relation extraction with GPT-2. Computing Research Repository, arXiv:2004.13845.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. In Thirty-Fifth Conference on Neural Information Processing Systems.

Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models. Computing Research Repository, arXiv:1912.10165.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161.

Oscar Sainz, Oier Lopez de Lacalle, Gorka Labaka, Ander Barrena, and Eneko Agirre. 2021. Label verbalization and entailment for effective zero and few-shot relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1199–1212, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. Computing Research Repository, arXiv:2110.08207.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5569–5578, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze questions for few shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Kyiv, Ukraine (Online). International Committee on Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. Few-shot text generation with pattern-exploiting training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021c. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics.

H Scudder. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Yi Sun, Yu Zheng, Chao Hao, and Hangping Qiu. 2021. NSP-BERT: A prompt-based zero-shot learner through an original pre-training task – next sentence prediction. Computing Research Repository, arXiv:2109.03564.

Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. Computing Research Repository, arXiv:2103.11955.

Sappadla Prateek Veeranna, Jinseok Nam, Eneldo Loza Mencía, and Johannes Fürnkranz. 2016. Using semantic similarity for multi-label zero-shot classification of text documents. In Proceeding of European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 423–428, Bruges, Belgium. Elsevier.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Albert Webson and Ellie Pavlick. 2021. Do prompt-based models really understand the meaning of their prompts? Computing Research Repository, arXiv:2109.01247.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners.

Orion Weller, Nicholas Lourie, Matt Gardner, and Matthew E. Peters. 2020. Learning from task descriptions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1361–1375, Online. Association for Computational Linguistics.

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation for consistency training. Computing Research Repository, arXiv:1904.12848.

Liang Xu, Xiaojing Lu, Chenyang Yuan, Xuanwei Zhang, Huilin Xu, Hu Yuan, Guoao Wei, Xiang Pan, Xin Tian, Libo Qin, and Hu Hai. 2021. FewCLUE: A Chinese few-shot learning evaluation benchmark. Computing Research Repository, arXiv:2107.07498.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. Generative data augmentation for commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online. Association for Computational Linguistics.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. 2021. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. Computing Research Repository, arXiv:2104.08835.

Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, Hong Kong, China. Association for Computational Linguistics.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339, Virtual. PMLR.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Ben Zhou, Daniel Khashabi, Chen-Tse Tsai, and Dan Roth. 2018. Zero-shot open entity typing as type-compatible grounding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2065–2076, Brussels, Belgium. Association for Computational Linguistics.

A PVPs for AG’s, Yelp, Yahoo

Below, we list the 23 patterns and the single verbalizer used for each task throughout Section 4. We group all patterns by their category.

A.1 AG's News

Inputs

• x1: The headline of the news article to be classified.

• x2: The text body of the news article to be classified.

NULL Patterns

• [MASK] x1 x2

• x1 x2 [MASK]

• x1 [MASK] x2


PUNC Patterns

• [MASK] : x1 x2

• [MASK] - x1 x2

• [MASK] . x1 x2

• ( [MASK] ) x1 x2

• x1 ( [MASK] ) x2

• x1 x2 ( [MASK] )

PROMPT Patterns

• x1 x2 Category: [MASK].

• x1 x2 Class: [MASK].

• x1 x2 Topic: [MASK].

• x1 x2 Theme: [MASK].

• x1 x2 Category: [MASK]

• x1 x2 Class: [MASK]

• x1 x2 Topic: [MASK]

• x1 x2 Theme: [MASK]

• [MASK] News: x1 x2

• [MASK] NEWS: x1 x2

Q&A Patterns

• x1 x2 Question: What is the topic of this article? Answer: [MASK].

• x1 x2 Question: What is the category of this article? Answer: [MASK].

• x1 x2 Question: What is the topic of this article? Answer: [MASK]

• x1 x2 Question: What is the category of this article? Answer: [MASK]

Verbalizer World ↦ World, Sports ↦ Sports, Business ↦ Business, Science/Tech ↦ Tech
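To make the PVP notation above concrete, the following is a minimal sketch of how one of the PROMPT patterns and the verbalizer for AG's News could be scored zero-shot with an off-the-shelf masked language model. This is not PET's actual implementation (PET additionally finetunes the model and combines multiple PVPs); the checkpoint name, example texts and helper variables are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative checkpoint; an assumption, not the setup used in the paper.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

x1 = "Stocks rally after strong earnings"                          # hypothetical headline
x2 = "Shares of several large companies rose sharply on Tuesday."  # hypothetical body

# PROMPT pattern "x1 x2 Topic: [MASK]." with the model's own mask token.
text = f"{x1} {x2} Topic: {tokenizer.mask_token}."

# Verbalizer from A.1: each label is mapped to a single word.
verbalizer = {"World": "World", "Sports": "Sports",
              "Business": "Business", "Science/Tech": "Tech"}

enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]

# Score each label by the logit of (the first subword of) its verbalization
# at the [MASK] position; the highest-scoring label is predicted.
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
scores = {
    label: logits[mask_pos, tokenizer(" " + word, add_special_tokens=False)["input_ids"][0]].item()
    for label, word in verbalizer.items()
}
print(max(scores, key=scores.get))
```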

A.2 Yelp Reviews Full Star

Inputs

• x: The review to be classified.

NULL Patterns

• [MASK] x

• x [MASK]

PUNC Patterns

• [MASK] : x

• [MASK] - x

• [MASK] . x

• ( [MASK] ) x

• x ( [MASK] )

PROMPT Patterns

• x It was [MASK]

• x All in all, it was [MASK]

• x In summary, it was [MASK]

• x It was [MASK].

• x All in all, it was [MASK].

• x In summary, it was [MASK].

• x The restaurant was [MASK]

• x All in all, the restaurant was [MASK]

• x In summary, the restaurant was [MASK]

• x The restaurant was [MASK].

• x All in all, the restaurant was [MASK].

• x In summary, the restaurant was [MASK].

Q&A Patterns

• x Question: What does the customer think of this restaurant? Answer: It is [MASK].

• x Question: What does the customer think of this place? Answer: It is [MASK].

• x Question: What does the customer think of this restaurant? Answer: It is [MASK]

• x Question: What does the customer think of this place? Answer: It is [MASK]

Verbalizer 1 ↦ terrible, 2 ↦ bad, 3 ↦ okay, 4 ↦ good, 5 ↦ great

A.3 Yahoo Questions

Inputs

• x1: The question to be classified.

• x2: The top answer to the question.


NULL Patterns

• [MASK] x1 x2

• x1 x2 [MASK]

• x1 [MASK] x2

PUNC Patterns

• [MASK] : x1 x2

• [MASK] - x1 x2

• [MASK] . x1 x2

• ( [MASK] ) x1 x2

• x1 ( [MASK] ) x2

• x1 x2 ( [MASK] )

PROMPT Patterns

• x1 x2 Category: [MASK].

• x1 x2 Class: [MASK].

• x1 x2 Topic: [MASK].

• x1 x2 Theme: [MASK].

• x1 x2 Category: [MASK]

• x1 x2 Class: [MASK]

• x1 x2 Topic: [MASK]

• x1 x2 Theme: [MASK]

• [MASK] Question: x1 x2

• [MASK] QUESTION: x1 x2

Q&A Patterns

• x1 x2 Question: What is the topic of this question? Answer: [MASK].

• x1 x2 Question: What is the category of this question? Answer: [MASK].

• x1 x2 Question: What is the topic of this question? Answer: [MASK]

• x1 x2 Question: What is the category of this question? Answer: [MASK]

Verbalizer Society & Culture ↦ Society, Science & Mathematics ↦ Science, Health ↦ Health, Education & Reference ↦ Education, Computers & Internet ↦ Computer, Sports ↦ Sports, Business & Finance ↦ Business, Entertainment & Music ↦ Entertainment, Family & Relationships ↦ Relationship, Politics & Government ↦ Politics

B PVPs for RAFT

Below, we list the PVPs used for all tasks in the RAFT benchmark. For each task t, we make use of the original task description Dt that we copy verbatim from Alex et al. (2021); see https://github.com/oughtinc/raft-baselines/tree/master/example_prompts. We use two vertical bars (||) to mark boundaries between text segments. If multiple patterns and verbalizers are specified, each verbalizer is used in combination with each pattern.
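As a small illustration of this convention, the sketch below enumerates the PVP combinations for a task with two patterns and two verbalizers (the ADE task of Section B.1); the Python representation, the variable names and the abbreviated task description are assumptions made purely for illustration.

```python
from itertools import product

# Abbreviated stand-in for the task description D_t (copied verbatim from Alex et al. (2021) in the paper).
task_description = "Label the sentence based on whether it is related to an adverse drug effect (ADE). ..."

# Each pattern turns the input text into a cloze question; "||" marks a text-segment boundary.
patterns = [
    lambda x: f"{task_description} || {x} Question: Is this sentence related to "
              f"an adverse drug effect (ADE)? Answer: [MASK].",
    lambda x: f"{x} Question: Is this sentence related to an adverse drug effect (ADE)? Answer: [MASK].",
]

# Each verbalizer maps the task's labels to words that can fill the [MASK] slot.
verbalizers = [
    {"not ADE-related": "No", "ADE-related": "Yes"},
    {"not ADE-related": "False", "ADE-related": "True"},
]

# Every verbalizer is combined with every pattern: 2 patterns x 2 verbalizers = 4 PVPs.
pvps = list(product(patterns, verbalizers))
print(len(pvps))  # 4
```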

B.1 ADE

Task Description Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:
Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).
Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.

Inputs

• x: The text to be classified.

Patterns

• Dt || x Question: Is this sentence related to an adverse drug effect (ADE)? Answer: [MASK].

• x Question: Is this sentence related to an adverse drug effect (ADE)? Answer: [MASK].



Verbalizers

• not ADE-related ↦ No, ADE-related ↦ Yes

• not ADE-related ↦ False, ADE-related ↦ True

B.2 B77

Task Description The following is a banking customer service query. Classify the query into one of the 77 categories available.

Inputs

• x: The text to be classified.

• y: The correct intent for the given text.

Patterns

• Dt || x Question: Is y the correct category for this query? Answer: [MASK].

• x Question: Is y the correct category for this query? Answer: [MASK].

Verbalizers

• False ↦ No, True ↦ Yes

• False ↦ False, True ↦ True
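Because B77 has 77 possible intents, this PVP pairs the query x with a candidate intent y and reduces the task to a binary decision. One plausible way to use such a PVP at inference time, sketched below, is to score every candidate intent in turn and predict the one whose "Yes" verbalization receives the highest score; the inference procedure, the checkpoint name and all identifiers are assumptions for illustration, as the listing above only defines the PVP itself.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative checkpoint; an assumption, not the setup used in the paper.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def score_yes(text: str) -> float:
    """Return the logit of the token ' Yes' at the [MASK] position of `text`."""
    text = text.replace("[MASK]", tokenizer.mask_token)
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    yes_id = tokenizer(" Yes", add_special_tokens=False)["input_ids"][0]
    return logits[mask_pos, yes_id].item()

def classify_query(query: str, candidate_intents: list[str]) -> str:
    """Apply the binary PVP once per candidate intent and return the intent
    whose 'Yes' answer is scored highest by the masked LM."""
    pattern = "{x} Question: Is {y} the correct category for this query? Answer: [MASK]."
    return max(candidate_intents,
               key=lambda y: score_yes(pattern.format(x=query, y=y)))
```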

B.3 NIS

Task Description Label the impact statement based on whether it mentions a harmful application of the research done in the paper. Make sure the statement is sufficient to conclude there are harmful applications of the research being done, not a past risk that this research is solving.

Inputs

• x: The text to be classified.

Patterns

• Dt || x Question: Does this impact statement mention a harmful application? Answer: [MASK].

• x Question: Does this impact statement mention a harmful application? Answer: [MASK].

Verbalizers

• doesn't mention a harmful application ↦ No, mentions a harmful application ↦ Yes

• doesn't mention a harmful application ↦ False, mentions a harmful application ↦ True

B.4 OSE

Task Description The following is an article sourced from The Guardian newspaper, and rewritten by teachers to suit three levels of adult English as Second Language (ESL) learners: elementary, intermediate, and advanced. Predict the level of the article.

Inputs

• x: The text to be classified.

Patterns

• Dt || x Question: What is the level of this article? Answer: [MASK].

• x Question: What is the level of this article? Answer: [MASK].

• Dt || x Question: Is the level of this article “elementary”, “intermediate” or “advanced”? Answer: [MASK].

• x Question: Is the level of this article “elementary”, “intermediate” or “advanced”? Answer: [MASK].

Verbalizers

• elementary ↦ elementary, intermediate ↦ intermediate, advanced ↦ advanced

B.5 Over

Task Description In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. Label the sentence based on whether it is overruling or not.

Inputs

• x: The text to be classified.


Patterns

• Dt || x Question: Is this sentence overruling? Answer: [MASK].

• x Question: Is this sentence overruling? Answer: [MASK].

Verbalizers

• not overruling ↦ No, overruling ↦ Yes

• not overruling ↦ False, overruling ↦ True

B.6 SOT

Task Description The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly. The goal is to classify the institutions into one of three categories: "university", "company" or "research institute".

Inputs

• x1: The title of the paper to be classified.

• x2: The name of the organization to be classified.

Patterns

• Dt || Organization name: x2 Paper title: x1 Question: What is the category of this institution? Answer: [MASK].

• Organization name: x2 Paper title: x1 Question: What is the category of this institution? Answer: [MASK].

• Dt || Paper title: x1 Organization name: x2 Question: What is the category of this institution? Answer: [MASK].

• Paper title: x1 Organization name: x2 Question: What is the category of this institution? Answer: [MASK].

Verbalizers

• company ↦ company, research institute ↦ institute, university ↦ university

B.7 SRI

Task Description Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations.
Included reviews should describe monetary charitable donations, assess any population of participants in any context, and be peer reviewed and written in English.
They should not report new data, be non-systematic reviews, consider cause-related marketing or other kinds of prosocial behaviour.

Inputs

• x1: The title of the paper to be classified.

• x2: The abstract of the paper to be classified.

• x3: The journal of the paper to be classified.

Patterns

• Dt || Title: x1 Abstract: x2 Journal: x3 Question: Should this paper be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations? Answer: [MASK].

• Title: x1 Abstract: x2 Journal: x3 Question: Should this paper be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations? Answer: [MASK].

Verbalizers

• not included ↦ No, included ↦ Yes

• not included ↦ False, included ↦ True

B.8 TAI

Task Description Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. Label a paper as "TAI safety research" if:
1. The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI,
2. There is substantive content on AI safety, not just AI capabilities,
3. The intended audience is the community of researchers,
4. It meets a subjective threshold of seriousness/quality,
5. Peer review is not required.

Inputs

• x1: The title of the paper to be classified.

• x2: The abstract of the paper to be classified.

Patterns

• Dt || Title: x1 Abstract: x2 Question: Is this paper a TAI safety research paper? Answer: [MASK].

• Title: x1 Abstract: x2 Question: Is this paper a TAI safety research paper? Answer: [MASK].

Verbalizers

• not TAI safety research ↦ No, TAI safety research ↦ Yes

• not TAI safety research ↦ False, TAI safety research ↦ True

B.9 ToS

Task Description Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair.
According to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer.

Inputs

• x: The text to be classified.

Patterns

• Dt || x Question: Is this sentence potentially unfair? Answer: [MASK].

• x Question: Is this sentence potentially unfair? Answer: [MASK].

Verbalizers

• not potentially unfair ↦ No, potentially unfair ↦ Yes

• not potentially unfair ↦ False, potentially unfair ↦ True

B.10 TEH

Task Description Label whether the following tweet contains hate speech against either immigrants or women. Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics.

Inputs

• x: The text to be classified.

Patterns

• Dt || x Question: Does this tweet contain hate speech against either immigrants or women? Answer: [MASK].

• x Question: Does this tweet contain hate speech against either immigrants or women? Answer: [MASK].

Verbalizers

• not hate speech ↦ No, hate speech ↦ Yes

• not hate speech ↦ False, hate speech ↦ True

B.11 TC

Task Description A complaint presents a state of affairs which breaches the writer's favorable expectation. Label the tweet text based on whether it contains a complaint.

Inputs

• x: The text to be classified.

Patterns

• Dt || x Question: Does this tweet text contain a complaint? Answer: [MASK].

• x Question: Does this tweet text contain a complaint? Answer: [MASK].

Verbalizers

• no complaint ↦ No, complaint ↦ Yes

• no complaint ↦ False, complaint ↦ True