

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5527–5539, November 7–11, 2021. ©2021 Association for Computational Linguistics


Comparing Text Representations: A Theory-Driven Approach

Gregory Yauney, Cornell University

[email protected]

David Mimno, Cornell University

[email protected]

Abstract

Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.

1 Introduction

A common theme in contemporary machine learning is representation learning: a task that is complicated and difficult can be transformed into a simple classification task by filtering the input through a deep neural network. For example, we know empirically that it is difficult to train a classifier for natural language inference (NLI)—determining whether a sentence logically entails another—using bag-of-words features as inputs, but training the same type of classifier on the output of a pre-trained masked language model (MLM) results in much better performance (Liu et al., 2019b). Fine-tuned representations do even better. But why is switching from raw text features to MLM contextual embeddings so successful for downstream classification tasks? Probing strategies can map the syntactic and semantic information encoded in contextual embeddings (Rogers et al., 2020), but it remains difficult to compare embeddings for classification beyond simply measuring differences in task accuracy. What makes a given input representation easier or harder to map to a specific set of labels?

In this work we adapt a tool from computational learning theory, data-dependent complexity (DDC) (Arora et al., 2019), to analyze the properties of a given text representation for a classification task. Given input vectors and an output labeling, DDC provides theoretical bounds on the performance of an idealized two-layer ReLU network. At first, this method may not seem applicable to contemporary NLP: this network is simple enough to prove bounds about, but does not even begin to match the complexity of current Transformer-based models. Although there has been work to extend the analysis of Arora et al. (2019) to more complicated networks (Allen-Zhu et al., 2019), the simple network is a good approximation for a task-specific classifier head. We therefore take a different approach, and use DDC to measure the properties of representations learned by networks, not the networks themselves. This approach does not require training any actual classification models, and is therefore not dependent on hyperparameter settings, initializations, or stochastic gradient descent.

Quantifying the relationship between representations and labels has important practical impacts for NLP. Text data has long been known to differ from other kinds of data in its high dimensionality and sparsity (Joachims, 2001). We analyze the difficulty of NLP tasks with respect to two distinct factors: the complexity of patterns in the dataset, and the alignment of those patterns with the labels.


[Figure 1 panels, left to right: raw pixels / MNIST (0 vs. 1); bag-of-words / bicycles vs. cstheory; bag-of-words / cs vs. cstheory; bag-of-words / SST; bag-of-words / MNLI; pre-trained MLM embeddings / MNLI; fine-tuned MLM embeddings / MNLI. Accuracy: 99.8%, 99.4%, 75.4%, 79.6%, 52.8%, 67.1%, 96.6%. Variance explained by the first 100 eigenvectors: 81.8%, 43.8%, 44.1%, 23.9%, 31.9%, 98.4%, 88.1%. DDC as a proportion of DDC for random labels: 5.9%, 34.0%, 81.0%, 66.6%, 99.6%, 85.6%, 25.2%.]

Figure 1: DDC shows how fine-tuning MLM embeddings turns a hard problem (NLI) into an easy problem. At the top, we show 2D PCA plots of the DDC Gram matrix for five classification problems, from easy (MNIST) to hard (MNLI), along with two MLM-based representations of MNLI. We then show empirical dev-set accuracy and the fraction of variance explained by the first 100 eigenvectors. DDC measures the projection of the labels onto each eigenvector (red histogram), scaled by the inverse of the eigenvalue. The bottom row shows DDC as a proportion of DDC for random labels (gray histogram). Fine-tuned embeddings turn MNLI from a task that is indistinguishable from random guessing into one that is as easy as telling if a post is about bicycles or CS theory.

Better ways to analyze relationships between representations and labels may enable us to better handle problems with datasets, such as "shortcut" features that are spuriously correlated with labels (Gururangan et al., 2018; Thompson and Mimno, 2018; Geirhos et al., 2020; Le Bras et al., 2020).

Our contributions are the following. First, we identify and address several practical issues in applying DDC for data-label alignment in text classification problems, including better comparisons to "null" distributions to handle harder classification problems and enable comparisons across distinct representations. Second, we define three evaluation patterns that provide calibrated feedback for data curation and modeling choices: For a given representation (such as MLM embeddings), are some labelings more or less compatible with that representation? For a given target labeling, is one or another representation more effective? How can we measure and explain the difficulty of text classification problems between datasets? Third, we provide case studies for each of these usages. In particular, we use our method to quantify the difference between various localist and neural representations of NLI datasets for classification, identifying differences between datasets and explaining the difference between MLM embeddings and simpler representations.1

2 Data-Dependent Complexity

Data-dependent complexity (Arora et al., 2019) combines measurements of two properties of a binary-labeled dataset: the strength of patterns in the input data and the alignment of the output labels with those patterns. Patterns in data are captured by a pairwise document-similarity (Gram) matrix. An eigendecomposition is a representation of a matrix in terms of a set of basis vectors (the eigenvectors) and the relative importance of those vectors (the eigenvalues). If we can reconstruct the original matrix with high accuracy using only a few eigenvectors, their corresponding eigenvalues will be large relative to the remaining eigenvalues. A matrix with more complicated structure will have a more uniform sequence of eigenvalues. DDC measures the projections of the label vector onto each eigenvector, scaled by the inverse of the corresponding eigenvalue. A label vector that can be reconstructed with high accuracy using only the eigenvectors with the largest eigenvalues will therefore have low DDC, while a label vector that can only be reconstructed using many eigenvectors with small eigenvalues will have high DDC.

1 Code is available at: https://github.com/gyauney/data-label-alignment

Motivating examples. Figure 1 shows PCA plots of Gram matrices for five datasets. Each point represents a document, colored by its label. As an informal intuition, if we can linearly separate the classes using this 2D projection, the dataset will definitely have low DDC. DDC can provide a perspective on difficulty beyond just comparing accuracy, especially when using a powerful classifier, where accuracies can be nearly perfect even for complicated problems. The MNIST digit classification dataset (LeCun et al., 1998) and an intentionally easy text dataset (distinguishing posts from Stack Exchange forums on bicycles and CS theory) are two tasks on which the simple network studied by Arora et al. (2019) achieves high accuracy: 99.8% and 99.4%, respectively.


MNIST is relatively simple: 81.8% of the variance is explained by the first 100 eigenvectors. DDC is low for MNIST because the dominant pattern of the dataset aligns with the labels. Since the eigenvalues decay quickly, their inverses increase quickly, but the label vector projects only onto the top few eigenvectors. Any other label vector would likely have much higher DDC. Bicycles vs. CS theory, while also simple, is more complicated from an eigenvector perspective, with only 43.8% of variance explained. Even though the eigenvalues decay more slowly, the labels project onto enough lower-ranked eigenvectors that DDC is higher than in MNIST. Both MNIST and Bicycles vs. CS theory are easy in this operational sense, but DDC nevertheless shows there is a meaningful difference when accuracy saturates: more complicated patterns must be fit in order to learn the Bicycles vs. CS theory task to high accuracy.

MNLI (Williams et al., 2018) is a much harder text classification dataset. We get lower accuracy for simple networks trained on two representations, bag-of-words and pre-trained MLM embeddings. In this case differences in accuracy are more informative, but DDC still provides additional information, provided that we contextualize it. Comparing raw DDC values seems to contradict accuracy: the task with a bag-of-words representation has a DDC of 2.8 while pre-trained embeddings produce a DDC of 15.9. In this case, DDC is higher not because it is putting more weight on lower-ranked eigenvectors (the opposite is true), but because the eigenvalues for the pre-trained embeddings drop more quickly: 98.4% of variance is explained by the first 100 eigenvectors. To account for this difference, it is necessary to calibrate DDC by normalizing relative to the DDC of random labelings. The relative gap between the DDC of a real labeling and DDC for a random labeling is much larger for MNLI under pre-trained MLM embeddings: BOW is indistinguishable from random labels (as the near-50% accuracy suggests) while pre-trained embeddings distinguish MNLI labels from random labels at above-random performance. MNLI using fine-tuned MLM embeddings, finally, has both low eigenvector complexity (88.1% variance explained) and allows for almost perfect classification accuracy with low relative DDC.

From dataset to data-dependent complexity. The complexity of classification tasks is studied in computational learning theory. Rademacher complexity goes beyond the worst-case characterization of VC-Dimension to measure the gap between how well a family of classifiers can fit arbitrary labels for a fixed set of inputs and how well the classifier fits the given real labels of those inputs (Shalev-Shwartz and Ben-David, 2014).2 In this work, we turn this around and compare the capacity of a fixed classifier to fit arbitrary labels for different input representations, including multiple representations of the same data. Downsides of calculating Rademacher complexity directly are 1) in the general setting it requires taking a supremum over the family of classifiers and 2) it will trivially saturate if the dataset is smaller than the classifier's VC-Dimension. Arora et al. (2019) show that for large-width two-layer ReLU networks, the projections of labels onto the eigenvectors of a Gram matrix govern both generalization error and the rate of convergence of SGD. Similar spectral analysis of Gram matrices has long been used in kernel learning (Cristianini et al., 2001).

The foundation of the Arora et al. (2019) data-dependent complexity measure is the Gram matrix that measures similarity between documents. As a model we use an overparameterized two-layer ReLU network to ensure comparability with prior work. Let $x_i$ be the $\ell_2$-normalized representation of the $i$th document out of $n$, and $y_i$ be the label for that document. We construct this matrix of pairwise document similarities under a ReLU kernel, where the similarity between documents $x_i$ and $x_j$ is

$$H^\infty_{ij} = \frac{x_i^\top x_j\,\bigl(\pi - \arccos(x_i^\top x_j)\bigr)}{2\pi}. \qquad (1)$$

This kernel is discussed in more detail in Arora et al. (2019) and Xie et al. (2017). Letting $Q\Lambda Q^\top$ be the eigendecomposition of $H^\infty$ and using the identity that $(Q\Lambda Q^\top)^{-1} = Q\Lambda^{-1}Q^\top$, DDC is

$$\mathrm{DDC} = \sqrt{\frac{2\,y^\top (H^\infty)^{-1} y}{n}} \qquad (2)$$
$$= \sqrt{\frac{2\,(y^\top Q)\,\Lambda^{-1}\,(Q^\top y)}{n}}. \qquad (3)$$

We refer to $y^\top Q$ as the projections of a label vector onto the eigenvectors of the Gram matrix.

2 Note that we are not referring to linguistic complexity of text, as in Bentz et al. (2016, 2017); Gutierrez-Vasques and Mijangos (2020).
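To make equations (1)–(3) concrete, here is a minimal NumPy sketch of the DDC computation. This is not the released implementation; the function name `ddc`, the dense eigendecomposition, and the clipping for numerical stability are our own illustrative choices.

```python
import numpy as np

def ddc(X, y):
    """Data-dependent complexity for l2-normalized document vectors X (n x d)
    and a label vector y with entries in {-1, +1}, following equations (1)-(3)."""
    n = X.shape[0]
    # Pairwise inner products of l2-normalized documents, clipped so arccos is defined.
    S = np.clip(X @ X.T, -1.0, 1.0)
    # ReLU-kernel Gram matrix H_infinity from equation (1).
    H = S * (np.pi - np.arccos(S)) / (2.0 * np.pi)
    # Eigendecomposition H = Q diag(eigvals) Q^T; H is symmetric.
    eigvals, Q = np.linalg.eigh(H)
    # Projections of the labels onto each eigenvector, scaled by inverse eigenvalues.
    proj = Q.T @ y
    return np.sqrt(2.0 * np.sum(proj ** 2 / eigvals) / n)
```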


Figure 2: Intuition: data-label alignment as a way to compare representations. For this labeling of these points, the x-only representation is well-aligned with the labeling, as there is a large gap between the DDCs of the real labeling and random labelings. The y-only representation does not distinguish between real and random labelings.

We call a task's labeling $y$ the real labeling. Arora et al. (2019) show that lower DDC implies faster convergence and lower generalization error.

3 Making Data-Dependent Complexity a Practical Tool for Text Data

Unlike previous work, our goal is not to prove theoretical bounds on neural networks, but to evaluate the theoretical properties of different representations of datasets. Rather than compare DDC across representations directly, data-label alignment takes inspiration from Rademacher complexity and compares the gap between DDC of real and random labelings to account for different embedding spaces. Figure 2 provides a simplified view of this approach, showing the DDC of real labels relative to DDC for random labelings for two trivial representations of a synthetic dataset. We also find that several additional adaptations from Arora et al. (2019) are required to make DDC an effective tool for text datasets. Subsampling large datasets can provide good approximations with reduced computational cost, and we show that duplicate documents, which are more likely in text than in images, can significantly affect DDC if not handled.

Difficult datasets require distributions of random labels. We recommend comparing the DDC of the real labeling to the distribution of DDCs from many random labelings.

Figure 3: The DDC of nearly 30% of sampled random labelings is less than that of the real labeling for a sample of the MNLI dataset represented by bags-of-words.

Figure 4: DDC (left) and ratio of real DDC to average random DDC (middle) are relatively stable across different-sized samples of large datasets, though larger samples incur increased running time (right). [Sample sizes shown: 500, 5000, 10000, 20000; series: MNLI (pre-trained MLM embeddings), MNLI (bag-of-words), Bicycles vs. CS theory (bag-of-words), All.]

For easier tasks, there is wide separation between DDC values for real and random labels, as Arora et al. (2019) show for the MNIST 0 vs. 1 task. For more difficult tasks, comparing the real labeling to only one random labeling could result in wildly different answers. In a sample from the MNLI dataset with text represented as bags-of-words, for example, 30% of random labelings had lower complexity than the real labeling (Figure 3).3 Appendix B gives a bound on the number of random labelings required to get an accurate estimate of the expected DDC of a random labeling that is mainly determined by the gap between the inverses of the largest and smallest eigenvalues of the Gram matrix. In our experiments, the number of random labelings ranges from a few hundred to several thousand; once eigenvectors have been calculated these are easily evaluated. We refer to the sampled estimate of expected DDC over random labelings as E[DDC].
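As a sketch, E[DDC] can be estimated by reusing a single eigendecomposition across many sampled labelings. The helper names and the choice of uniformly random ±1 labels below are our assumptions; the released code may sample labelings differently.

```python
import numpy as np

def ddc_from_eigendecomposition(eigvals, Q, y):
    """DDC of labeling y given a precomputed eigendecomposition of the Gram matrix."""
    n = len(y)
    proj = Q.T @ y
    return np.sqrt(2.0 * np.sum(proj ** 2 / eigvals) / n)

def random_labeling_ddcs(eigvals, Q, n, num_labelings=1000, seed=0):
    """DDCs of uniformly random +/-1 labelings; each draw reuses the eigendecomposition."""
    rng = np.random.default_rng(seed)
    return np.array([
        ddc_from_eigendecomposition(eigvals, Q, rng.choice([-1.0, 1.0], size=n))
        for _ in range(num_labelings)
    ])
```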

Subsampling is effective for large datasets. Calculating DDC requires matrix operations that scale more than quadratically in the number of data points (Pan and Chen, 1999), which are prohibitive to compute exactly for large datasets. Truncated eigendecompositions are tempting but may underestimate complexity for extremely difficult datasets; we leave exploration of truncated approximations to future work. We recommend instead calculating DDC for a random subsample.

3 Every MNLI subsample we examined had at least some random labelings with lower DDC than the real labeling.


Figure 5: Two documents with nearly-identical representations entirely determine DDC of random labelings for a sample of the MNLI dataset as bags-of-words.

For experiments in this work we fix n = 20,000; eigendecompositions complete in a few hours. We find that smaller values of n can be much more efficient, are relatively accurate, and do not change the relative ordering for comparisons (Figure 4). We have not yet evaluated the impact of unbalanced classes.

DDC distributions identify pathological cases. DDC can reveal potentially problematic characteristics of datasets that may not be evident under typical use. Figure 5 shows results for an MNLI subset with bag-of-words representations that is almost identical to the one used in Figure 3, but the histogram of random labeling DDCs is bimodal. This dataset contains two documents that have identical bag-of-words representations but different labels. When these documents are randomly assigned the same label, DDC is just under 3.0, as in the other subset. But when they are assigned opposite labels (as in the real labeling) the documents by themselves are enough to increase DDC to 4.2, because they add weight to a low-ranked eigenvector with a very large inverse eigenvalue. We see this sensitivity as a feature in a setting where we are using DDC as a diagnostic tool. In our experiments we filter out duplicates, e.g., fewer than 0.2% in SNLI.

4 Experiments

DDC supports comparisons between datasets and alternative labelings. We begin by demonstrating that our method reveals the relationship between data and labels by evaluating multiple labelings for two simple classification datasets with Stack Exchange posts represented as bags of words. Our goal is to determine the extent to which each labeling is aligned with the data. Both datasets comprise documents from two English-language Stack Exchange communities released in the Stack Exchange Data Dump (Stack Exchange Network, 2021). First, we choose two communities we expect to be easily distinguishable based on vocabulary: Bicycles and CS theory. Second, we choose two communities we expect to be more difficult to distinguish: CS and CS theory.

Figure 6: Comparing different valid labelings of Stack Exchange documents: labeling by community is the most different from random labelings (blue histogram). Labeling by year is still salient, but AM/PM labels are not aligned with patterns in the documents. [Top panel: bicycles and cstheory; bottom panel: cs and cstheory; labelings shown: Community, Year, AM/PM.]

For both datasets, we consider three valid ways to partition the data, which we also expect to be from easier to harder: 1) Community: each document is labeled with the community to which it was posted. 2) Year: each document is labeled with whether it was posted in the years 2010-2015 or 2016-2021 (both ranges inclusive). 3) AM/PM: each document is labeled with whether its timestamp records that it was posted in the hours 00:00-11:59 or 12:00-23:59. For both datasets, we sample 20,000 documents so that each labeling assigns half the dataset to each class. See Appendix A for more details.

Figure 6 shows DDC for both datasets using the three valid labelings and the distribution over DDC for random labelings. As hypothesized, the DDC of the community labeling is much lower than that of random labelings for both tasks. What's new, however, is that our method quantifies the differences in difficulty without training any classifiers: when comparing Bicycles posts to CS theory posts, the real labeling is 375 standard deviations of the random distribution below the average random DDC, but the same distance is 98 standard deviations when comparing CS and CS theory. Surprisingly, for both tasks the AM/PM labeling in fact has higher DDC than all of the random labelings we sampled. It is more than 10 standard deviations from the average DDC of random labelings for both. We hypothesize that this labeling is unusually well balanced relative to the actual differences in documents.

NLI experiment details. In the previous experiment we kept the data fixed and compared different labelings. Here we keep labelings fixed and compare alternative data representations. We aim to disentangle how pre-training, fine-tuning, and the final classification step contribute to performance on natural language inference (NLI) tasks.


Our protocol for measuring data-label alignment of a dataset is: 1) choose a set of representations to compare and remove any examples that are identical under any representation, 2) for each representation: sample up to 20,000 examples from the dataset and construct the Gram matrix, 3) calculate DDC of the real labeling and DDC of random labelings, 4) compare the gap between DDC of real and random labelings across representations. This process can be repeated across subsamples of the dataset.

We analyze training data from three English-language datasets from the GLUE benchmark (Wang et al., 2019b): MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and WNLI (Levesque et al., 2011); along with an additional fourth dataset: SNLI (Bowman et al., 2015). We use the entailment and contradiction classes. Pre-processing is in Appendix A. For baseline data representations, we use localist bag-of-words and GloVe embeddings after concatenating the two sentences in each NLI example. For GloVe, each word is represented by a static vector, and word vectors are averaged to produce one vector for the entire sentence. We use pre-trained contextual embeddings from BERT (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019b). We also use contextual embeddings from RoBERTa-large fine-tuned on each of MNLI, QNLI, and SNLI. For each fine-tuning, we follow Le Bras et al. (2020): pick a random 10% of the training dataset, fine-tune with that sample, and then discard the sample from future analyses. This allows us to evaluate the effects of fine-tuning without trivially examining data used for fine-tuning. For all MLM representations, each document is represented by the final hidden layer of the [CLS] token, as is standard. Models were implemented using Hugging Face's Transformers library (Wolf et al., 2020) with NumPy (Harris et al., 2020) and PyTorch (Paszke et al., 2019).
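A minimal sketch of extracting final-layer [CLS] embeddings for sentence pairs with the Transformers library follows; the function name, batching, and truncation settings are illustrative assumptions rather than the exact pipeline used in the paper. The resulting vectors would then be l2-normalized before constructing the Gram matrix in equation (1).

```python
import torch
from transformers import AutoModel, AutoTokenizer

def cls_embeddings(premises, hypotheses, model_name="roberta-large",
                   batch_size=32, device="cpu"):
    """Encode each premise-hypothesis pair and keep the final hidden state of the
    first ([CLS]/<s>) token as the document representation."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    outputs = []
    with torch.no_grad():
        for i in range(0, len(premises), batch_size):
            batch = tokenizer(premises[i:i + batch_size], hypotheses[i:i + batch_size],
                              padding=True, truncation=True, return_tensors="pt").to(device)
            hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)
            outputs.append(hidden[:, 0, :].cpu())        # first token = [CLS]/<s>
    return torch.cat(outputs).numpy()
```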

To compare the complexity of a real labeling with the distribution of complexities from random labelings, we focus on two metrics: 1) the ratio of the real labeling's DDC to the average DDC over random labelings, $\frac{\mathrm{DDC}}{\mathbb{E}[\mathrm{DDC}]}$, and 2) the number of standard deviations of the random DDC distribution that the real DDC is from the average DDC over random labelings (z-score), $\frac{\mathrm{DDC} - \mathbb{E}[\mathrm{DDC}]}{\sigma}$. The first compares how far the real DDC is from the average random DDC in terms of percentage, and the second compares how far the real DDC is from the average random DDC in terms of the distribution.

[Figure 7 panels, top to bottom: bag-of-words, GloVe, BERT, RoBERTa-large, RoBERTa-large-mnli, each showing the real-label DDC against the histogram of random-label DDCs.]

Figure 7: Comparisons of real and random DDCs for one subsample of MNLI: pre-trained and fine-tuned MLM representations distinguish between real and random labelings. Summary results for (a) real values, (b) mean of random values, (c) ratio of real to average random, and (d) z-score of real from average random.

DDC supports comparisons of representations for NLI. Figure 7 compares the DDC of real labels to the distributions of DDCs for random labelings for a subsample of 20,000 documents from MNLI. MNLI was specifically designed to thwart lexical approaches, so we expect bag-of-words-based methods to do poorly. Indeed, the two baseline representations—bag-of-words and GloVe—do not distinguish between real and random labelings (the top row is identical to Figure 3 but at a different scale). In fact, for GloVe embeddings, the real labeling is more complex than all sampled random labelings. For both pre-trained MLM embeddings, the value of DDC is greater than that of the bag-of-words representation, but the DDC of random labelings increases even more, leading to a much wider gap between real and random labelings and lower proportional DDC.4 As shown in Figure 1, while pre-trained RoBERTa-large embeddings enable training classifiers with greater accuracy, they also have a much faster drop-off in eigenvalues. DDC is therefore higher for this representation because the labels project onto "noise" eigenvectors with small eigenvalues. But random labelings have even greater projections onto noise eigenvectors, so there is a large gap between real and random DDCs.

4 We also found that pre-trained RoBERTa embeddings had near-identical results as RoBERTa-large embeddings.


Figure 8: Comparisons between real and random labelings across representations and datasets. (a) DDC of real labelings varies by representation, but (b) average DDC for random labelings also varies in similar ways—mostly due to eigenvalue concentrations. (c) The ratio of observed real DDC to average random DDC gives a more accurate perspective. (d) The z-score shows even more difference when measuring the real labeling against the distribution of random values. Error bars are across four replications. WNLI does not have error bars because it is small enough that it does not need to be subsampled. MNLI, QNLI, and SNLI are similar. WNLI is not well-aligned with any representation we consider. [Representations: baselines (bag-of-words, GloVe); pre-trained (BERT, RoBERTa-large); fine-tuned (RoBERTa-large-mnli, RoBERTa-large-qnli, RoBERTa-large-snli).]

Fine-tuned contextual embeddings show the largest drop in relative DDC, as well as the lowest absolute DDC for real labels. This result is consistent with empirical dev-set accuracies for pre-trained RoBERTa-large and fine-tuned RoBERTa-large-mnli embeddings of 67.1% and 96.6%, respectively, but it provides additional quantitative perspective that does not rely on stochastic gradient descent algorithms.

Comparing to distributions of DDC values for random labelings, rather than just the mean, also provides a new perspective. If we were to just consider the ratio of real DDC to average random DDC (Figure 7c), we would conclude, for example, that real labels for BERT have 80% of the DDC of random labelings, on average. But when we reconsider the distance between the real DDC and average random DDC in terms of the standard deviation of the distribution (Figure 7d), we see that the real DDC is 71 standard deviations away from the average. Additionally, BERT has much lower DDC for real labels than RoBERTa-large. When viewed in the context of the distribution, however, we can see that BERT and RoBERTa-large representations are nearly equal at distinguishing real from random labelings. Surprisingly, for pre-trained and fine-tuned representations, every sampled random labeling had higher complexity than the real labeling; the probability that a random labeling has a lower DDC than the real labeling is at most 0.001.

Comparisons between representations across datasets. In addition to comparing alternative representations for a single dataset, we can compare representations between datasets. This method measures the ability of a neural network to produce representations that are aligned with multiple, slightly different tasks. We repeat the previous experiment for all four NLI datasets, this time calculating complexities of multiple 20,000-sample subsets of each dataset. For a given sample of a dataset, we compare the complexities across all data representations. Figure 8 shows that the trends that we saw in one sample of MNLI are borne out across multiple samples of MNLI and across QNLI and SNLI. We use the same baseline and pre-trained representations as before, but also include the output of RoBERTa-large models fine-tuned on a discarded subset of MNLI, QNLI, and SNLI.

Our first result is that MNLI is unusually difficult for purely lexical methods. While the complexity of real labels is closest to the average random DDC for bag-of-words and GloVe representations, for QNLI and especially SNLI, we are still able to distinguish between real and random labels. For QNLI, the real DDCs are separated a small amount from the average random DDC in absolute terms, but this is nearly 10 standard deviations. The separation is even more pronounced for SNLI, where the real complexity is separated from the average random complexity by 43 and 50 standard deviations for bag-of-words and GloVe, respectively. This result is surprising because the NLI entailment and contradiction classes should not a priori be associated with lexical patterns. It provides further evidence for previous findings that lexical and hypothesis-only approaches can achieve high accuracy on NLI datasets (Gururangan et al., 2018).


DDC for pre-trained representations appears different between BERT and RoBERTa-large, with BERT closer to GloVe, but this difference disappears when comparing to random labelings. While BERT and RoBERTa-large differ in their eigenvalue distributions, we cannot reliably distinguish them in terms of relative alignment with task labels.

Fine-tuned representations are significantly better aligned with labels than pre-trained embeddings. For MNLI, QNLI, and SNLI, the pre-trained embeddings separate real and random DDCs more than the baseline representations, even when the baseline representations already achieve some separation. As expected, fine-tuned representations distinguish the most between real and random labels. What's new is that our method quantifies the extent of the increased alignment. Fine-tuned embeddings more than double the gap between real and random labelings beyond that of pre-trained embeddings, when measured by either ratio or standard deviations. In addition, we see some evidence of transfer learning: representations from networks fine-tuned on one NLI dataset have greater alignment with labels on the other datasets than pre-trained representations do. But we find that MNLI and SNLI are more able to transfer to each other, while QNLI appears significantly different.

We were surprised to find how unlike WNLI is to the other datasets we consider: even contextual embeddings are not significantly more aligned with the real labeling than the baseline representations. There are many alternative labelings that are more aligned with the structure of the data, which accords with WNLI's purpose as a hand-crafted challenge dataset (Levesque et al., 2011). Our experiments suggest that fine-tuned RoBERTa's 91.3% accuracy on WNLI (Liu et al., 2019b) comes from updating the representations during their multi-task fine-tuning and the high capacity of the classification head. Pre-training alone is not enough.

DDC helps guide MLM embedding choices. Finally, we present a case study in which data-label alignment provides guidance in modeling choices. MLM-based embeddings have become standard in NLP, but there remain many practical questions about how users should apply them. For example, users may be concerned about how the output of a network should be fed to subsequent layers. For BERT-based models we often use the embedding of the [CLS] token as a single representation of a document, but Miaschi and Dell'Orletta (2020) empirically evaluate contextual embeddings from averaging hidden token representations.

Figure 9: We find no consistent advantage between 1) the hidden embedding of just the [CLS] token, 2) averaging the final hidden embedding of all tokens, and 3) concatenating the final hidden embedding of all tokens. [Panels: MNLI, QNLI, SNLI, each comparing [CLS], Mean, and Concat. for pre-trained (BERT, RoBERTa, RoBERTa-large) and fine-tuned (RoBERTa-large-mnli, RoBERTa-large-qnli, RoBERTa-large-snli) models; y-axis: # of std devs DDC is above or below E[DDC].]

In Figure 9, we compare the alignment with NLI task labels of 1) the [CLS] token, 2) taking the mean of the final hidden layer, and 3) concatenating the final hidden layer across tokens. We find no difference that is consistent across models and datasets, though [CLS] is modestly better at distinguishing real and random labelings for MNLI and SNLI. Any observed difference is less significant than the choice to fine-tune and would not change the relative order of models. In this case, users should feel confident that there is no strong alignment advantage for NLI classification to choosing one representation over another.
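The three pooling strategies compared in Figure 9 amount to different reductions of the final hidden states. The following is a sketch under the assumption of a fixed padded sequence length for the concatenation case; the function name is ours.

```python
import numpy as np

def pool_final_hidden_states(hidden, strategy="cls"):
    """hidden: (seq_len, dim) final-layer token embeddings for one padded document."""
    if strategy == "cls":
        return hidden[0]             # embedding of the first ([CLS]/<s>) token
    if strategy == "mean":
        return hidden.mean(axis=0)   # average over all token embeddings
    if strategy == "concat":
        return hidden.reshape(-1)    # concatenate token embeddings into one long vector
    raise ValueError(f"unknown strategy: {strategy}")
```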

5 Related Work

We see this work as part of a more general increase in quantitative evaluations of datasets. Swayamdipta et al. (2020) identify subsets of NLI datasets that are hard to learn by analyzing how a classifier's predictions change over the course of training. Sakaguchi et al. (2020) and Le Bras et al. (2020) similarly filter NLI datasets by finding examples frequently misclassified by simple linear classifiers. Rather than find difficult examples within a dataset, we seek to understand how different data representations impact task difficulty. Our results complement Le Bras et al. (2020) in finding that categories containing more dissimilar examples are more complex. They find that filtering data under the RoBERTa representation leads to the most difficult reduced dataset; our work suggests this might be because contextual embeddings better differentiate the entailment and contradiction classes in NLI tasks than baseline representations.

Minimum description length (MDL) treats a classifier as a method for data compression and has recently been used to measure the extent to which representations learned by MLMs capture linguistic structure (Voita and Titov, 2020).


Perez et al. (2021) use MDL to determine if additional textual features are expected to lead to easier tasks, finding, for instance, that providing answers to sub-questions in question-answering tasks decreases dataset complexity. In contrast, our work investigates the relationship between different representations of fixed data and labelings. Mielke et al. (2019) also use information-theoretic tools (surprisal/negative log-likelihood) to empirically evaluate differences in language model effectiveness across languages.

This work also relates to prior work on evaluating the capabilities of embeddings. Non-contextual dense vector word representations encode syntax and semantics (Pennington et al., 2014; Mikolov et al., 2013) as well as world information like gender biases (Bolukbasi et al., 2016). Linguistic probing and comparison of MLM layers and representations has identified specific capabilities of MLMs (Tenney et al., 2019; Liu et al., 2019a; Rogers et al., 2020). Many works have identified and proposed antidotes to the anisotropy of non-contextual word embeddings (Arora et al., 2016; Mimno and Thompson, 2017; Mu et al., 2018; Li et al., 2020). Conneau et al. (2020) use a kernel approach to compare embeddings learned by MLMs pre-trained on different languages. While analyzing the geometry of contextual embedding vectors remains an active line of work (Reif et al., 2019; Ethayarajh, 2019), we instead analyze the relationship between embeddings and the labels of downstream tasks.

6 Conclusion

We present a method for quantifying the alignment between a representation of data and a set of associated labels. We argue that the difficulty of a dataset is a function of the alignment between the chosen representation and a labeling, relative to the distribution of alternative labelings, and we demonstrate how this method can be used to compare different text representations for classification. We used NLI datasets as well-understood case studies in order to demonstrate our method, which replicates results from less general methods that were surprising when introduced, such as Gururangan et al. (2018) on lexical mutual information. Our method supplements traditional held-out test set accuracy: while accuracy answers which representations enable high performance for a task, our approach offers more explanation of why.

We hope that future work can study novel datasets and settings. Our method can be readily applied to new datasets, and it could especially be used to quantify the difficulty of adversarially constructed datasets like WinoGrande (Sakaguchi et al., 2020) and Adversarial NLI (Nie et al., 2020). Our method could be used to measure data-label alignment while changing the information present in each document, as in the recent work of Perez et al. (2021). The method can also be extended both to other classification models, by analyzing Gram matrices produced by different similarity kernels, and to multi-class, imbalanced, and noisy settings where uniformly-random labelings are not as applicable. More theoretical work could provide further generalization guarantees.

Our method provides a new lens that can be used in multiple ways. NLP practitioners want to design models that achieve high accuracy on specific tasks, and our method can identify which representation most aligns with the task's labeling and whether certain processing steps are useful. Dataset designers, on the other hand, often seek to provide challenging datasets in order to spur new modeling advances (Wang et al., 2019b,a; Sakaguchi et al., 2020). Our method helps diagnose when current text representations do not capture the variation in a dataset, as we showed for the case of WNLI, necessitating richer embeddings. It can also indicate when datasets are not robust to patterns in existing embeddings, as we found with SNLI and QNLI aligning with baseline lexical features.

Finally, while empirical exploration has been an effective strategy in NLP, better theoretical analysis may reveal simpler yet more powerful and more explainable solutions (Saunshi et al., 2021). Applying theory-based analysis to representations rather than to NLP models as a whole offers a way to benefit immediately from such perspectives without requiring full theoretical analyses of deep networks.

Acknowledgments

We thank Maria Antoniak, Katherine Lee, Rosamond Thalken, Melanie Walsh, and Matthew Wilkens for thoughtful feedback. We also thank Benjamin Hoffman for fruitful discussions of analytic approaches. This work was supported by NSF #1652536.


Ethical Considerations

We do not anticipate any harms from our use of publicly available Stack Exchange datasets and NLI datasets. While this work analyzes the text representations produced by large masked language models, we do not anticipate any harms from the method presented in this work. We fine-tune these large models in order to calibrate our method, but our method can be used on already-trained models and does not necessitate additional training. We believe that more analytical and interpretive work like ours can better guide empirical computation-intensive research.

References

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. In NeurIPS.
Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. 2019. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In ICML.
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. ICLR.
Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i Cancho. 2017. The entropy of words—learnability and expressivity across more than 1000 languages. Entropy, 19(6):275.
Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardzic. 2016. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153.
Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Emerging cross-lingual structure in pretrained language models. In ACL.
Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. 2001. On kernel target alignment. In NeurIPS.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In EMNLP.
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665–673.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
Ximena Gutierrez-Vasques and Victor Mijangos. 2020. Productivity and predictability for measuring morphological complexity. Entropy, 22(1):48.
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, et al. 2020. Array programming with NumPy. Nature, 585(7825):357–362.
Thorsten Joachims. 2001. A statistical learning model of text classification for support vector machines. In ACM SIGIR.
Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. In ICML.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Hector Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In EMNLP.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In NAACL.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Alessio Miaschi and Felice Dell'Orletta. 2020. Contextual and non-contextual word embeddings: An in-depth linguistic investigation. In Workshop on Representation Learning for NLP.
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? ACL.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NeurIPS.
David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In EMNLP.
Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. ICLR.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In ACL.
Victor Y. Pan and Zhao Q. Chen. 1999. The complexity of the matrix eigenproblem. In STOC.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. NeurIPS.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. Rissanen data analysis: Examining dataset characteristics via description length. In ICML.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. NeurIPS.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. TACL, 8:842–866.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In AAAI.
Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. 2021. A mathematical exploration of why language models help solve downstream tasks. In ICLR.
Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Stack Exchange Network. 2021. Stack Exchange Data Dump.
Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In EMNLP.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. ICLR.
Laure Thompson and David Mimno. 2018. Authorless topic models: Biasing models away from known structure. In COLING.
Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In EMNLP.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. NeurIPS.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. ICLR.
Larry Wasserman. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In EMNLP: System Demonstrations.
Bo Xie, Yingyu Liang, and Le Song. 2017. Diverse neural network learns true target functions. In AISTATS.


A Dataset Pre-Processing

A.1 Natural Language Inference

We compare representations of English-language natural language inference datasets from the GLUE benchmark (Wang et al., 2019b): MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and WNLI (Levesque et al., 2011). We also use SNLI (Bowman et al., 2015). Table 1 shows summary statistics. Starting from documents labeled entailment or contradiction, we exclude any documents that are string identical (surprisingly, these datasets do contain a few duplicates) and any that are identical under any representation that we consider. Practically, this means excluding documents that have identical bag-of-words representations (because of different word order) or identical GloVe embeddings (because of words/numbers not in the GloVe vocabulary). We fine-tune on 10% of each training set and exclude those documents as well, as in Le Bras et al. (2020).
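A sketch of this deduplication step follows, dropping every document that collides with another under any of the representations being compared. The function name is ours, and whether all copies or all but one are dropped is our assumption here.

```python
import numpy as np

def unique_under_all_representations(representations):
    """Boolean mask of documents that are unique under every representation.

    representations: list of (n, d) arrays, e.g. bag-of-words counts and GloVe averages."""
    n = representations[0].shape[0]
    keep = np.ones(n, dtype=bool)
    for X in representations:
        # Group identical rows (rounded to tolerate floating-point noise).
        _, inverse, counts = np.unique(np.round(X, 8), axis=0,
                                       return_inverse=True, return_counts=True)
        inverse = np.asarray(inverse).reshape(-1)
        keep &= counts[inverse] == 1
    return keep
```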

A.2 Stack Exchange

We construct text classification tasks from English-language posts on Stack Exchange communities (Stack Exchange Network, 2021). We construct two datasets, the first with posts from Bicycles and CS Theory and the second with posts from CS and CS Theory. Table 2 shows summary statistics for each community. For each dataset, we sample 2,500 posts from each community-year-AM/PM combination, resulting in 10,000 documents from each community. This results in three balanced classification tasks for each of the datasets, allowing us to compare their difficulty without label imbalance.

Table 1: Summary statistics for NLI datasets.

                          MNLI     QNLI     SNLI    WNLI
Duplicates removed         721      198      748       4
10% used in fine-tuning 39 270   10 474   54 936     n/a
Unique docs excluded    39 922   10 653   55 586       4
Total examples used    234 122   92 649  330 080     631

Table 2: Label breakdowns for the three Stack Exchange communities we consider.

Year          AM/PM          Bicycles      CS   CS Theory
[2010, 2015]  [00:00-11:59]     9 872  11 596       7 661
[2010, 2015]  [12:00-23:59]    17 038  17 371      11 888
[2016, 2021]  [00:00-11:59]    11 370  22 264       2 890
[2016, 2021]  [12:00-23:59]    18 415  34 408       4 434
Total across all years and AM/PM  56 695  85 639   26 873

B Estimating E[DDC]

The expectation of DDC over uniformly random labelings can be accurately estimated by averaging the DDC of sampled labelings. The following claim shows that the number of samples required is determined by the difference between the inverses of the largest and smallest eigenvalues of the Gram matrix. Recall that $H^\infty$ is this Gram matrix with number of examples $n$. Note that Jensen's inequality can be used to upper-bound the expectation by $\sqrt{\left(\frac{2}{n}\right)\mathrm{Tr}\left[(H^\infty)^{-1}\right]}$, but it is not readily apparent how to compute the expectation exactly.

Claim 1. To estimate the expected DDC over random labelings to within $\varepsilon$ with probability at least $1 - \delta$, averaging DDC of the following number of sampled uniformly random labelings is sufficient:

$$m \ge \frac{\Delta^2}{2\varepsilon^2}\ln\left(\frac{2}{\delta}\right)$$

where $\Delta$ is the difference between the maximum and minimum DDC values.

Proof. First, note that the DDC of a random labeling is bounded. Using the eigendecomposition interpretation of DDC, we find that the maximum DDC can be no greater than when the label vector $y$ projects entirely on the Gram matrix's final eigenvector with smallest eigenvalue $\lambda_{\min}$:

$$\mathrm{DDC}_{\max} \le \sqrt{\frac{2\,\|y\|\,\|y\|}{\lambda_{\min}}\cdot\frac{1}{n}} \qquad (4)$$
$$= \sqrt{\frac{2\sqrt{n}\sqrt{n}}{\lambda_{\min}}\cdot\frac{1}{n}} = \sqrt{\frac{2}{\lambda_{\min}}} \qquad (5)$$

Similarly, the minimum DDC can be no less than when the label vector projects entirely on the Gram matrix's initial eigenvector with largest eigenvalue: $\mathrm{DDC}_{\min} \ge \sqrt{\frac{2}{\lambda_{\max}}}$. Let $\Delta$ be the magnitude of this difference:

$$\Delta = \left|\mathrm{DDC}_{\max} - \mathrm{DDC}_{\min}\right| \qquad (6)$$
$$\le \sqrt{\frac{2}{\lambda_{\min}}} - \sqrt{\frac{2}{\lambda_{\max}}} \qquad (7)$$

Let $X_i$ be the DDC of the $i$th random labeling, and let $\bar{X} = \frac{1}{m}(X_1 + X_2 + \ldots + X_m)$ be the empirical mean of the first $m$ random DDCs. $\mathbb{E}[\bar{X}]$ is then the true expectation of DDC over random labelings. Because $X_i \in [\mathrm{DDC}_{\min}, \mathrm{DDC}_{\max}]$ is a bounded random variable, $\frac{1}{m}X_i$ is also bounded: $\frac{1}{m}X_i \in \left[\frac{\mathrm{DDC}_{\min}}{m}, \frac{\mathrm{DDC}_{\max}}{m}\right]$ with difference between maximum and minimum values no greater than $\frac{\Delta}{m}$. By the Hoeffding bound:

$$\Pr\left[\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \ge \varepsilon\right] \le 2\exp\left(\frac{-2\varepsilon^2}{\sum_{i\in[m]}\left(\frac{\Delta}{m}\right)^2}\right) \qquad (8)$$

Setting the right-hand side to be at most $\delta$ and solving for $m$ yields the stated bound on the number of required samples.

How unlikely is the real labeling's DDC? Additionally, the probability that a random labeling has as low a complexity as the real labeling is given by the distribution function of random-label DDCs evaluated at the real DDC, denoted $F(\mathrm{DDC})$.5 Let $\hat{F}(\mathrm{DDC})$ be the empirical distribution function evaluated at the real DDC: the fraction of sampled random labelings with complexities less than that of the real labeling. Wasserman (2013) shows that the DKW Inequality can be used to bound the value of $F(\mathrm{DDC})$ for the above number of samples $m$. With probability $1 - \delta$, $F(\mathrm{DDC}) \in [\hat{F}(\mathrm{DDC}) - \gamma, \hat{F}(\mathrm{DDC}) + \gamma]$ where:

$$\gamma = \sqrt{\frac{1}{2m}\ln\left(\frac{2}{\delta}\right)}.$$
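The bound in Claim 1 and the DKW half-width are easy to evaluate numerically. The following is a sketch using the eigenvalue-based upper bound on $\Delta$ from the proof; the function names are ours.

```python
import numpy as np

def num_random_labelings(eig_min, eig_max, eps, delta):
    """Number of sampled random labelings sufficient to estimate E[DDC] to within eps
    with probability at least 1 - delta, per Claim 1, with Delta upper-bounded via
    the smallest and largest Gram-matrix eigenvalues (equations (4)-(7))."""
    delta_range = np.sqrt(2.0 / eig_min) - np.sqrt(2.0 / eig_max)
    return int(np.ceil(delta_range ** 2 / (2.0 * eps ** 2) * np.log(2.0 / delta)))

def dkw_half_width(m, delta):
    """Half-width gamma of the DKW confidence band around the empirical CDF
    evaluated at the real labeling's DDC."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * m))
```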

C Experimental Setup and Computing Infrastructure

Classification results in Figure 1. We used two-layer fully-connected ReLU networks with 10,000 hidden units, as in Arora et al. (2019). We trained each network to convergence on the pictured 10,000-document subset and evaluated on the standard dev split. We used an Intel Xeon CPU @ 2.00GHz with 27.3GB of RAM and an NVIDIA Tesla V100-SXM2 GPU. Running times for reported runs varied between datasets, from 7.9 seconds for MNIST to 190 seconds for MNLI represented as bags-of-words. We report the best dev accuracy for learning rate $lr \in \{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$.

5 Note that this isn't the same as any labeling having as low a complexity, which could be found by union bounding over random labelings.

Fine-tuning RoBERTa-large. We fine-tuned RoBERTa-large for MNLI, QNLI, and SNLI on ten percent of each dataset's training data. We used an Intel Core i7-5820K CPU @ 3.30GHz with two NVIDIA GeForce GTX TITAN X GPUs and 64 GB of RAM. We train for three epochs and hyperparameter search over initial learning rate $lr \in \{1\mathrm{e}{-5}, 2\mathrm{e}{-5}, 3\mathrm{e}{-5}\}$, as in Liu et al. (2019b), with a fixed per-GPU batch-size of 16. For the fine-tuned contextual embeddings, we use the networks which attained highest accuracy on the dev splits. In all three cases, this was the network with $lr = 2\mathrm{e}{-5}$. Training these models took 6076.9 seconds for MNLI, 1624.58 seconds for QNLI, and 8459.5 seconds for SNLI.
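A hedged sketch of this fine-tuning setup using the Transformers Trainer API follows; dataset preparation is elided, `tokenized_train` and `tokenized_dev` are hypothetical pre-tokenized datasets, and the paper's exact training script may differ.

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def finetune_roberta_large(tokenized_train, tokenized_dev, output_dir, lr=2e-5):
    """Three-epoch fine-tuning with a fixed per-GPU batch size of 16, as in Appendix C."""
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large",
                                                               num_labels=2)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=3,
                             learning_rate=lr,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized_train, eval_dataset=tokenized_dev)
    trainer.train()
    return trainer
```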

Comparing NLI representations. We compare baseline representations (bag-of-words and GloVe embeddings) and MLM contextual embeddings. For the MLM embeddings, we use BERT ($110 \times 10^6$ parameters), RoBERTa ($125 \times 10^6$ parameters), and RoBERTa-large ($355 \times 10^6$ parameters).

For calculating contextual embeddings from pre-trained and check-pointed fine-tuned MLMs, we used an Intel Core i7-5820K CPU @ 3.30GHz with an NVIDIA GeForce GTX TITAN X and 64 GB of RAM. For eigendecompositions, we used an Intel Xeon Gold 6134 CPU @ 3.20GHz with 528 GB of RAM. It took 215.44 ± 98.84 seconds to calculate contextual embeddings, 7530.95 ± 2399.05 seconds to construct the Gram matrix and perform the eigendecomposition, and 2954.48 ± 1449.11 seconds to sample random labels.