
GPT-3: Few-Shot Learning with a Giant Language Model

Melanie Subbiah
OpenAI · Columbia University

What is the goal?

Humans learn new tasks through demonstrations and instructions.

We’d like general-purpose agents that can do the same.

Typical Approach


Disadvantages of Fine-tuning

• Creates a task-specific model

• Requires large, high-quality supervised datasets

• More likely to exploit spurious correlations

Few-shot transfer is more similar to how humans learn a new task.

Yogatama, et al. Learning and Evaluating General Linguistic Intelligence. 2019

What is an alternative?

Radford, et al. Language Models are Unsupervised Multitask Learners. 2019

Can we further improve on this level of generation and generalization?

GPT-3: 175 billion parameters

Critical Aspects of GPT-3

• Model Size

• Training Objective


Model Size

Transformers scale well!

[Figure: from Kaplan et al., language modeling loss falls smoothly as a power law as parameters, data, and compute increase]

Kaplan, et al. Scaling Laws for Neural Language Models. 2020

Motivating the Training Objective

Predict the next word in a sequence.

P(“The cat sat on the mat.”) = ???

“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” - Noam Chomsky, 1969

Yet comparing such probabilities captures real abilities:

Grammar: P(“The cat sat on the mat.”) > P(“The cat sats on the mat.”)

World Knowledge: P(“The cat sat on the mat.”) > P(“The whale sat on the mat.”)

Addition: P(“4” | “2 + 2 =”) > P(“5” | “2 + 2 =”)

Sentiment Analysis: P(“1 star” | “That movie was terrible. I’d give it”) > P(“5 stars” | “That movie was terrible. I’d give it”)

Alec Radford @ Berkeley 4/15/20
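These comparisons can be checked directly with any autoregressive LM that exposes token likelihoods. A minimal sketch, using the public GPT-2 model through the Hugging Face transformers library as a stand-in (GPT-3's weights are not public; sentence_logprob is an illustrative helper):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    """Total log P(text) = sum over tokens of log P(token_t | tokens_<t)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean next-token
        # cross-entropy; multiplying by the number of predicted tokens
        # recovers the summed log-probability (negated).
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# A reasonably trained LM should reproduce the slide's orderings:
print(sentence_logprob("The cat sat on the mat.") >
      sentence_logprob("The cat sats on the mat."))   # grammar
print(sentence_logprob("The cat sat on the mat.") >
      sentence_logprob("The whale sat on the mat."))  # world knowledge
```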

Approach

Model

[Figure: the Transformer decoder architecture, from http://jalammar.github.io/illustrated-gpt2/]

Model

“The trophy didn’t fit in the suitcase because” -> “the trophy was too big.”
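A minimal sketch of producing such a completion, again with the public GPT-2 via transformers as a stand-in for the API-only GPT-3:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The trophy didn't fit in the suitcase because",
                   max_new_tokens=10, do_sample=False)  # greedy decoding
print(result[0]["generated_text"])
```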

Model Sizes

[Bar charts: parameter count (in millions) for the eight GPT-3 variants (Small, Medium, Large, XL, 2.7B, 6.7B, 13B, 175B), with BERT-Large, T5-Large, Megatron, and T-NLG shown for comparison; plotted on both linear and log scales, alongside each variant's batch size and learning rate]

Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018
Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2019
Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2019
Microsoft. Turing-NLG: A 17-Billion Parameter Language Model by Microsoft. 2020

Compute

[Figure: total compute used during training]

Dataset

• Common Crawl (filtered) - a general web crawl, filtered for similarity to high-quality reference corpora and de-duplicated

• WebText2 - an expanded version of the GPT-2 training data: a scrape of outbound links from Reddit posts with reasonably high ratings

• Books1 & Books2 - internet-based books corpora

• Wikipedia - English-language Wikipedia

https://commoncrawl.org/the-data
Radford, et al. Language Models are Unsupervised Multitask Learners. 2019

Dataset Mix

[Bar charts per dataset (Common Crawl, WebText2, Books1, Books2, Wikipedia): tokens (in billions), weight in the training mix (percent), and epochs elapsed when training for 300B tokens]
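The mixture weights are decoupled from dataset size: higher-quality sources are repeated for multiple epochs while Common Crawl is seen less than once. A minimal sketch of sampling a training mix this way (the weights are illustrative approximations, and sample_source is a hypothetical helper):

```python
import random

# Approximate per-batch mixture weights (illustrative, not exact).
MIXTURE = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source(rng: random.Random) -> str:
    """Choose which dataset the next training document is drawn from."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(8)])
```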

Evaluations


Let’s try it!

tdaeef = ?

Zero-Shot:
Please unscramble the letters into a word and write that word.
tdaeef = ?

One-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
tdaeef = ?

Few-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
elapac = palace
tdaeef = ?
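Few-shot prompting needs no special machinery: the demonstrations are simply concatenated into the context. A minimal sketch of assembling such prompts (build_prompt is an illustrative helper; the task text mirrors the slide):

```python
def build_prompt(description: str, examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate a task description, K solved examples, and the query.
    K = 0 gives the zero-shot prompt, K = 1 one-shot, and so on."""
    lines = [description]
    lines += [f"{x} = {y}" for x, y in examples]  # in-context demonstrations
    lines.append(f"{query} = ")                   # the model completes this line
    return "\n".join(lines)

demos = [("pcirlaroc", "reciprocal"), ("elapac", "palace")]
print(build_prompt("Please unscramble the letters into a word and write that word.",
                   demos, "tdaeef"))  # few-shot; pass demos[:0] for zero-shot
```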

[Figures: prompting a language model with zero, one, or a few examples vs. traditional fine-tuning]

Metalearning


Sample Output

“The trophy didn’t fit in the suitcase because” -> “the trophy was too big.”

LM-Likelihood

[Diagram: the model assigns a likelihood to each token of “The trophy didn’t fit in the suitcase because the trophy was too big.”; example per-token scores: 2.78, 3.45, 10.00, 0.50, 25.12]

Methods of Evaluation

Randomly select K examples from the training dataset to build the context.

Multiple-choice: feed each context + possible completion through the model separately, normalize the LM likelihood over the completion, and select the completion with the highest likelihood.

Free-form: sample from the model up to a newline; score with BLEU, F1, or exact-match accuracy.
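For the multiple-choice path, a minimal scoring sketch. It assumes a score_logprob(context, completion) helper returning the summed log-likelihood of the completion tokens given the context; per-token length normalization is one way to keep short answers from being favored by default:

```python
def pick_completion(context: str, completions: list[str], score_logprob) -> str:
    """Return the completion whose length-normalized likelihood is highest."""
    def normalized(completion: str) -> float:
        n = max(len(completion.split()), 1)  # crude token count for the sketch
        return score_logprob(context, completion) / n
    return max(completions, key=normalized)
```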

Complete List of Tasks

Reading Comprehension: QuAC, SQuADv2, DROP, CoQA, RACE

Language Modeling: PTB

Cloze and Completion: ROC Stories, HellaSwag, LAMBADA

Trivia-style Questions: NaturalQs, WebQs, TriviaQA

Translation: En <-> Fr, En <-> De, En <-> Ro

Winograd-style: Winograd, Winogrande

Commonsense Reasoning: PIQA, ARC, OpenBookQA

Comprehensive Benchmarks: SuperGLUE

Inference: ANLI, RTE

Synthetic and Qualitative: Arithmetic, Word scrambling, Character-level manipulation, SAT analogies, Article generation, Learning and using novel words, Correcting English grammar

Summary of Performance

Few-shot performance by task class:

Cloze, Completion, and Language Modeling: Very Good

Question Answering / Knowledge Base: Very Good

Translation: Good

Winograd / Winogrande: Good

Commonsense Reasoning: Mixed

Reading Comprehension: Mixed

SuperGLUE: Mixed

NLI: Poor

Bias Probes: Poor

Dario Amodei @ NeurIPS 12/7/20

Strengths

[Figures: TriviaQA and LAMBADA accuracy improving steadily with model size and number of shots]

Joshi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. 2017
Paperno, et al. The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context. 2016

Strengths

Adding commas to numbers (3456 -> 3,456) dramatically improves arithmetic accuracy:

4 Digit Addition: 25.5% without commas, 91.1% with commas

4 Digit Subtraction: 26.9% without, 89.7% with

5 Digit Addition: 9.3% without, 90.2% with

5 Digit Subtraction: 9.9% without, 82.2% with

6 Digit Addition: 3% without, 78.5% with

6 Digit Subtraction: 3% without, 73.9% with

Dario Amodei @ NeurIPS 12/7/20, and Gwern Branwen!
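The comma trick is purely a prompt-formatting change; a plausible explanation (not stated on the slide) is that commas make the BPE tokenizer split numbers into more regular pieces. A sketch of formatting arithmetic prompts this way:

```python
def arithmetic_prompt(a: int, b: int, op: str = "+") -> str:
    """Format an arithmetic problem with comma-grouped digits (3456 -> 3,456)."""
    return f"{a:,} {op} {b:,} ="

print(arithmetic_prompt(3456, 789))  # -> "3,456 + 789 ="
```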

Limitations

[Figures: few-shot GPT-3 lags fine-tuned baselines on ANLI, QuAC, and DROP]

Nie, et al. Adversarial NLI: A New Benchmark for Natural Language Understanding. 2019
Choi, et al. QuAC: Question Answering in Context. 2018
Dua, et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. 2019

Key Insights


Few-shot transfer to new tasks is possible without any gradient updates, and it presents a flexible framework for specifying new tasks to a model.


Bigger models can learn more from context


Bigger models have more emergent abilities


More context helps up to a point

Wang, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. 2019

Performance continues to scale with compute


Lingering Questions

• Methods of Evaluation

• Training Datasets and Memorization

• Real-World Applications

Methods of Evaluation

• What would it take to feel confident that a model possessed a complex ability?

• Can we build comprehensive benchmarks that identify the full set of abilities a model possesses?

• How do we evaluate one of the model’s biggest strengths: creative generation?

Training Datasets and Memorization


• Quality of Data

• Duplication of Benchmarks

Training Datasets and Memorization - Data Quality

CommonCrawl filtering:

1. Train a classifier to distinguish unfiltered CommonCrawl from WebText/Books/Wikipedia.

2. Sample from filtered CommonCrawl with each document's probability of selection weighted by the classifier's quality score.
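A minimal sketch of the two steps, assuming the classifier from step 1 is available as quality_score(doc) in [0, 1]; the keep rule shown is illustrative (the paper draws a Pareto-distributed threshold over the score rather than a plain coin flip):

```python
import random

def filter_common_crawl(docs, quality_score, rng=random.Random(0)):
    """Yield documents kept with probability tied to their classifier score.

    quality_score(doc) -> float in [0, 1] from a classifier trained to
    separate raw CommonCrawl (negatives) from WebText/Books/Wikipedia
    (positives), so high-scoring documents survive more often.
    """
    for doc in docs:
        if rng.random() < quality_score(doc):  # illustrative keep rule
            yield doc
```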

How can we better define and identify high-quality data?

Training Datasets and Memorization - Harmful Data

• Gender

• Race

• Religion

“The detective was a _______” -> 83% male

“The competent detective was a _______”

“The incompetent detective was a _______”

Male-biased Descriptive Words: Large, Mostly, Lazy, Fantastic, Eccentric, Protect, Jolly, Stable, Personable, Survive

Female-biased Descriptive Words: Optimistic, Bubbly, Naughty, Easy-going, Petite, Tight, Pregnant, Gorgeous, Sucked, Beautiful


Training Datasets and Memorization - Eval Memorization

How do we make sure models trained on huge amounts of web data don’t get the chance to memorize eval benchmarks?

Removing benchmarks from training data:

1. Look for overlap in phrases between benchmarks and training documents.

2. Found that a quarter of benchmarks had over 50% overlap with the training dataset!

3. Remove training documents that overlap with eval benchmarks.

4. Compare performance on benchmarks between the full dataset and only the test examples that don't appear in the training data.
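A minimal sketch of the overlap check in steps 1 and 3, using word n-gram collisions (the 13-gram window is an assumption roughly in line with the paper's methodology; ngrams and has_overlap are illustrative helpers):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams of a document or benchmark example."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_overlap(train_doc: str, benchmark_examples: list[str], n: int = 13) -> bool:
    """True if any n-gram from a benchmark example also appears in the document."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(example, n) for example in benchmark_examples)

# Step 3 removes training documents for which has_overlap(...) is True;
# step 4 re-evaluates on benchmark examples with no such collision.
```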

Real-World Applications

Important considerations

1. Potential for harmful outputs

2. Reliability of performance


Real-World Applications


• Semantic search

• Turn a script into a novel

• Turn a sentence into an email

• Smart formatting and code generation

• Emoji storytelling

https://andrewmayneblog.wordpress.com

Real-World Applications - Emoji Storytelling

[Examples: stories retold as emoji sequences, from https://andrewmayneblog.wordpress.com]

Real-World Applications

• What are the useful applications of a model like GPT-3?

• Are there times when GPT-3 can be convincing enough, even if not perfectly reliable?

Real-World Applications - Writing News

Conclusion

• Language modeling performance appears to continue to scale with compute

• Large models can transfer few-shot to new tasks without any fine-tuning

• There are many complexities to evaluations, training datasets, and applications for large models


Questions?
