
GPT-3: Few-Shot Learning with a Giant Language Model

Melanie Subbiah
OpenAI · Columbia University

What is the goal?

Humans learn new tasks through demonstrations and instructions.

We’d like general-purpose agents that can do the same.

Typical Approach


Disadvantages of Fine-tuning

• Creates a task-specific model

• Requires large, high-quality supervised datasets

• More likely to exploit spurious correlations

Few-shot transfer is more similar to how humans learn a new task.

Yogatama, et al. Learning and Evaluating General Linguistic Intelligence. 2019

What is an alternative?

Radford, et al. Language Models are Unsupervised Multitask Learners. 2019

Can we further improve on this level of generation and generalization?

GPT-3: 175 billion parameters

Critical Aspects of GPT-3

• Model Size

• Training Objective


Model Size

Transformers scale well!

[Figure: from Kaplan et al., language modeling loss falls smoothly as a power law as parameters, data, and compute increase]

Kaplan, et al. Scaling Laws for Neural Language Models. 2020

Motivating the Training Objective

Predict the next word in a sequence.

P(“The cat sat on the mat.”) = ???

“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” - Noam Chomsky, 1969

Yet comparing such probabilities captures real abilities:

Grammar: P(“The cat sat on the mat.”) > P(“The cat sats on the mat.”)

World Knowledge: P(“The cat sat on the mat.”) > P(“The whale sat on the mat.”)

Addition: P(“4” | “2 + 2 =”) > P(“5” | “2 + 2 =”)

Sentiment Analysis: P(“1 star” | “That movie was terrible. I’d give it”) > P(“5 stars” | “That movie was terrible. I’d give it”)

Alec Radford @ Berkeley 4/15/20
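These comparisons can be checked directly with any autoregressive LM that exposes token likelihoods. A minimal sketch, using the public GPT-2 model through the Hugging Face transformers library as a stand-in (GPT-3's weights are not public; sentence_logprob is an illustrative helper):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    """Total log P(text) = sum over tokens of log P(token_t | tokens_<t)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean next-token
        # cross-entropy; multiplying by the number of predicted tokens
        # recovers the summed log-probability (negated).
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# A reasonably trained LM should reproduce the slide's orderings:
print(sentence_logprob("The cat sat on the mat.") >
      sentence_logprob("The cat sats on the mat."))   # grammar
print(sentence_logprob("The cat sat on the mat.") >
      sentence_logprob("The whale sat on the mat."))  # world knowledge
```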

Approach

Model

[Figure: the Transformer decoder architecture, from http://jalammar.github.io/illustrated-gpt2/]

Model

“The trophy didn’t fit in the suitcase because” -> “the trophy was too big.”
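A minimal sketch of producing such a completion, again with the public GPT-2 via transformers as a stand-in for the API-only GPT-3:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The trophy didn't fit in the suitcase because",
                   max_new_tokens=10, do_sample=False)  # greedy decoding
print(result[0]["generated_text"])
```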

Model Sizes

[Bar charts: parameter count (in millions) for the eight GPT-3 variants (Small, Medium, Large, XL, 2.7B, 6.7B, 13B, 175B), with BERT-Large, T5-Large, Megatron, and T-NLG shown for comparison; plotted on both linear and log scales, alongside each variant's batch size and learning rate]

Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018
Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2019
Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2019
Microsoft. Turing-NLG: A 17-Billion Parameter Language Model by Microsoft. 2020

Compute

[Figure: total compute used during training]

Dataset

• Common Crawl (filtered) - a general web crawl, filtered for similarity to high-quality reference corpora and de-duplicated

• WebText2 - an expanded version of the GPT-2 training data: a scrape of outbound links from Reddit posts with reasonably high ratings

• Books1 & Books2 - internet-based books corpora

• Wikipedia - English-language Wikipedia

https://commoncrawl.org/the-data
Radford, et al. Language Models are Unsupervised Multitask Learners. 2019

Dataset Mix

[Bar charts per dataset (Common Crawl, WebText2, Books1, Books2, Wikipedia): tokens (in billions), weight in the training mix (percent), and epochs elapsed when training for 300B tokens]
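The mixture weights are decoupled from dataset size: higher-quality sources are repeated for multiple epochs while Common Crawl is seen less than once. A minimal sketch of sampling a training mix this way (the weights are illustrative approximations, and sample_source is a hypothetical helper):

```python
import random

# Approximate per-batch mixture weights (illustrative, not exact).
MIXTURE = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source(rng: random.Random) -> str:
    """Choose which dataset the next training document is drawn from."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(8)])
```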

Evaluations


Let’s try it!

tdaeef = ?

Zero-Shot:
Please unscramble the letters into a word and write that word.
tdaeef = ?

One-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
tdaeef = ?

Few-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
elapac = palace
tdaeef = ?
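Few-shot prompting needs no special machinery: the demonstrations are simply concatenated into the context. A minimal sketch of assembling such prompts (build_prompt is an illustrative helper; the task text mirrors the slide):

```python
def build_prompt(description: str, examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate a task description, K solved examples, and the query.
    K = 0 gives the zero-shot prompt, K = 1 one-shot, and so on."""
    lines = [description]
    lines += [f"{x} = {y}" for x, y in examples]  # in-context demonstrations
    lines.append(f"{query} = ")                   # the model completes this line
    return "\n".join(lines)

demos = [("pcirlaroc", "reciprocal"), ("elapac", "palace")]
print(build_prompt("Please unscramble the letters into a word and write that word.",
                   demos, "tdaeef"))  # few-shot; pass demos[:0] for zero-shot
```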

[Figures: prompting a language model with zero, one, or a few examples vs. traditional fine-tuning]

Metalearning


Sample Output

“The trophy didn’t fit in the suitcase because” -> “the trophy was too big.”

LM-Likelihood

[Diagram: the model assigns a likelihood to each token of “The trophy didn’t fit in the suitcase because the trophy was too big.”; example per-token scores: 2.78, 3.45, 10.00, 0.50, 25.12]

Methods of Evaluation

Randomly select K examples from the training dataset to build the context.

Multiple-choice: feed each context + possible completion through the model separately, normalize the LM likelihood over the completion, and select the completion with the highest likelihood.

Free-form: sample from the model up to a newline; score with BLEU, F1, or exact-match accuracy.
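For the multiple-choice path, a minimal scoring sketch. It assumes a score_logprob(context, completion) helper returning the summed log-likelihood of the completion tokens given the context; per-token length normalization is one way to keep short answers from being favored by default:

```python
def pick_completion(context: str, completions: list[str], score_logprob) -> str:
    """Return the completion whose length-normalized likelihood is highest."""
    def normalized(completion: str) -> float:
        n = max(len(completion.split()), 1)  # crude token count for the sketch
        return score_logprob(context, completion) / n
    return max(completions, key=normalized)
```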

Complete List of Tasks

Reading Comprehension: QuAC, SQuADv2, DROP, CoQA, RACE

Language Modeling: PTB

Cloze and Completion: ROC Stories, HellaSwag, LAMBADA

Trivia-style Questions: NaturalQs, WebQs, TriviaQA

Translation: En <-> Fr, En <-> De, En <-> Ro

Winograd-style: Winograd, Winogrande

Commonsense Reasoning: PIQA, ARC, OpenBookQA

Comprehensive Benchmarks: SuperGLUE

Inference: ANLI, RTE

Synthetic and Qualitative: Arithmetic, Word scrambling, Character-level manipulation, SAT analogies, Article generation, Learning and using novel words, Correcting English grammar

Summary of Performance

Few-shot performance by task class:

Cloze, Completion, and Language Modeling: Very Good

Question Answering / Knowledge Base: Very Good

Translation: Good

Winograd / Winogrande: Good

Commonsense Reasoning: Mixed

Reading Comprehension: Mixed

SuperGLUE: Mixed

NLI: Poor

Bias Probes: Poor

Dario Amodei @ NeurIPS 12/7/20

Strengths

[Figures: TriviaQA and LAMBADA accuracy improving steadily with model size and number of shots]

Joshi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. 2017
Paperno, et al. The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context. 2016

Strengths

Adding commas to numbers (3456 -> 3,456) dramatically improves arithmetic accuracy:

4 Digit Addition: 25.5% without commas, 91.1% with commas

4 Digit Subtraction: 26.9% without, 89.7% with

5 Digit Addition: 9.3% without, 90.2% with

5 Digit Subtraction: 9.9% without, 82.2% with

6 Digit Addition: 3% without, 78.5% with

6 Digit Subtraction: 3% without, 73.9% with

Dario Amodei @ NeurIPS 12/7/20, and Gwern Branwen!
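The comma trick is purely a prompt-formatting change; a plausible explanation (not stated on the slide) is that commas make the BPE tokenizer split numbers into more regular pieces. A sketch of formatting arithmetic prompts this way:

```python
def arithmetic_prompt(a: int, b: int, op: str = "+") -> str:
    """Format an arithmetic problem with comma-grouped digits (3456 -> 3,456)."""
    return f"{a:,} {op} {b:,} ="

print(arithmetic_prompt(3456, 789))  # -> "3,456 + 789 ="
```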

Limitations

[Figures: few-shot GPT-3 lags fine-tuned baselines on ANLI, QuAC, and DROP]

Nie, et al. Adversarial NLI: A New Benchmark for Natural Language Understanding. 2019
Choi, et al. QuAC: Question Answering in Context. 2018
Dua, et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. 2019

Key Insights


Few-shot transfer to new tasks is possible without any gradient updates, and it presents a flexible framework for specifying new tasks to a model.


Bigger models can learn more from context


Bigger models have more emergent abilities


More context helps up to a point

Wang, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. 2019

Performance continues to scale with compute


Lingering Questions

• Methods of Evaluation

• Training Datasets and Memorization

• Real-World Applications

Methods of Evaluation

• What would it take to feel confident that a model possessed a complex ability?

• Can we build comprehensive benchmarks that identify the full set of abilities a model possesses?

• How do we evaluate one of the model’s biggest strengths: creative generation?

Training Datasets and Memorization


• Quality of Data

• Duplication of Benchmarks

Training Datasets and Memorization - Data Quality

CommonCrawl filtering:

1. Train a classifier to distinguish unfiltered CommonCrawl from WebText/Books/Wikipedia.

2. Sample from filtered CommonCrawl with each document's probability of selection weighted by the classifier's quality score.
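A minimal sketch of the two steps, assuming the classifier from step 1 is available as quality_score(doc) in [0, 1]; the keep rule shown is illustrative (the paper draws a Pareto-distributed threshold over the score rather than a plain coin flip):

```python
import random

def filter_common_crawl(docs, quality_score, rng=random.Random(0)):
    """Yield documents kept with probability tied to their classifier score.

    quality_score(doc) -> float in [0, 1] from a classifier trained to
    separate raw CommonCrawl (negatives) from WebText/Books/Wikipedia
    (positives), so high-scoring documents survive more often.
    """
    for doc in docs:
        if rng.random() < quality_score(doc):  # illustrative keep rule
            yield doc
```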

How can we better define and identify high-quality data?

Training Datasets and Memorization - Harmful Data

• Gender

• Race

• Religion

“The detective was a _______” -> 83% male

“The competent detective was a _______”

“The incompetent detective was a _______”

Male-biased Descriptive Words: Large, Mostly, Lazy, Fantastic, Eccentric, Protect, Jolly, Stable, Personable, Survive

Female-biased Descriptive Words: Optimistic, Bubbly, Naughty, Easy-going, Petite, Tight, Pregnant, Gorgeous, Sucked, Beautiful


Training Datasets and Memorization - Eval Memorization

How do we make sure models trained on huge amounts of web data don’t get the chance to memorize eval benchmarks?

Removing benchmarks from training data:

1. Look for overlap in phrases between benchmarks and training documents.

2. Found that a quarter of benchmarks had over 50% overlap with the training dataset!

3. Remove training documents that overlap with eval benchmarks.

4. Compare performance on benchmarks between the full dataset and only the test examples that don't appear in the training data.
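A minimal sketch of the overlap check in steps 1 and 3, using word n-gram collisions (the 13-gram window is an assumption roughly in line with the paper's methodology; ngrams and has_overlap are illustrative helpers):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams of a document or benchmark example."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_overlap(train_doc: str, benchmark_examples: list[str], n: int = 13) -> bool:
    """True if any n-gram from a benchmark example also appears in the document."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(example, n) for example in benchmark_examples)

# Step 3 removes training documents for which has_overlap(...) is True;
# step 4 re-evaluates on benchmark examples with no such collision.
```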

Real-World Applications

Important considerations

1. Potential for harmful outputs

2. Reliability of performance


Real-World Applications


• Semantic search

• Turn a script into a novel

• Turn a sentence into an email

• Smart formatting and code generation

• Emoji storytelling

https://andrewmayneblog.wordpress.com

Real-World Applications - Emoji Storytelling

[Examples: stories retold as emoji sequences, from https://andrewmayneblog.wordpress.com]

Real-World Applications

• What are the useful applications of a model like GPT-3?

• Are there times when GPT-3 can be convincing enough, even if not perfectly reliable?

Real-World Applications - Writing News

Conclusion

• Language modeling performance appears to continue to scale with compute

• Large models can transfer few-shot to new tasks without any fine-tuning

• There are many complexities to evaluations, training datasets, and applications for large models


Questions?
