
SPECTER: Document-level Representation Learning usingCitation-informed Transformers

Arman Cohan†∗ Sergey Feldman†∗ Iz Beltagy† Doug Downey† Daniel S. Weld†,‡

†Allen Institute for Artificial Intelligence   ‡Paul G. Allen School of Computer Science & Engineering, University of Washington

{armanc,sergey,beltagy,dougd,danw}@allenai.org

Abstract

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embeddings of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SCIDOCS, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.1

1 Introduction

As the pace of scientific publication continues to increase, Natural Language Processing (NLP) tools that help users to search, discover and understand the scientific literature have become critical. In recent years, substantial improvements in NLP tools have been brought about by pretrained neural language models (LMs) (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019). While such models are widely used for representing individual words

∗ Equal contribution
1 https://github.com/allenai/specter

or sentences, extensions to whole-document embeddings are relatively underexplored. Likewise, methods that do use inter-document signals to produce whole-document embeddings (Tu et al., 2017; Chen et al., 2019) have yet to incorporate state-of-the-art pretrained LMs. Here, we study how to leverage the power of pretrained language models to learn embeddings for scientific documents.

A paper’s title and abstract provide rich semantic content about the paper, but, as we show in this work, simply passing these textual fields to an “off-the-shelf” pretrained language model—even a state-of-the-art model tailored to scientific text like the recent SciBERT (Beltagy et al., 2019)—does not result in accurate paper representations. The language modeling objectives used to pretrain the model do not lead it to output representations that are helpful for document-level tasks such as topic classification or recommendation.

In this paper, we introduce a new method for learning general-purpose vector representations of scientific documents. Our system, SPECTER,2 incorporates inter-document context into the Transformer (Vaswani et al., 2017) language models (e.g., SciBERT (Beltagy et al., 2019)) to learn document representations that are effective across a wide variety of downstream tasks, without the need for any task-specific fine-tuning of the pretrained language model. We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective. Unlike many prior works, at inference time, our model does not require any citation information. This is critical for embedding new papers that have not yet been cited. In experiments, we show that SPECTER's representations substantially outperform the

2 SPECTER: Scientific Paper Embeddings using Citation-informed TransformERs


state-of-the-art on a variety of document-level tasks, including topic classification, citation prediction, and recommendation.

As an additional contribution of this work, we introduce and release SCIDOCS,3 a novel collection of data sets and an evaluation suite for document-level embeddings in the scientific domain. SCIDOCS covers seven tasks, and includes tens of thousands of examples of anonymized user signals of document relatedness. We also release our training set (hundreds of thousands of paper titles, abstracts and citations), along with our trained embedding model and its associated code base.

2 Model

2.1 Overview

Our goal is to learn task-independent representations of academic papers. Inspired by the recent success of pretrained Transformer language models across various NLP tasks, we use the Transformer model architecture as the basis for encoding the input paper. Existing LMs such as BERT, however, are primarily based on a masked language modeling objective, which considers only intra-document context and does not use any inter-document information. This limits their ability to learn optimal document representations. To learn high-quality document-level representations we propose using citations as an inter-document relatedness signal and formulate it as a triplet loss learning objective. We then pretrain the model on a large corpus of citations using this objective, encouraging it to output representations that are more similar for papers that share a citation link than for those that do not. We call our model SPECTER, which learns Scientific Paper Embeddings using Citation-informed TransformERs. With respect to the terminology used by Devlin et al. (2019), unlike most existing LMs that are “fine-tuning based”, our approach results in embeddings that can be applied to downstream tasks in a “feature-based” fashion, meaning the learned paper embeddings can be easily used as features, with no need for further task-specific fine-tuning. In the following, as background information, we briefly describe how pretrained LMs can be applied for document representation and then discuss the details of SPECTER.

3https://github.com/allenai/scidocs

Figure 1: Overview of SPECTER. A Transformer (initialized with SciBERT) encodes the query paper (PQ), a related paper (P+), and an unrelated paper (P−); the model is trained with the triplet loss max{d(PQ, P+) − d(PQ, P−) + m, 0}.

2.2 Background: Pretrained Transformers

Recently, pretrained Transformer networks have demonstrated success on various NLP tasks (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019); we use these models as the foundation for SPECTER. Specifically, we use SciBERT (Beltagy et al., 2019) which is an adaptation of the original BERT (Devlin et al., 2019) architecture to the scientific domain. The BERT model architecture (Devlin et al., 2019) uses multiple layers of Transformers (Vaswani et al., 2017) to encode the tokens in a given input sequence. Each layer consists of a self-attention sublayer followed by a feedforward sublayer. The final hidden state associated with the special [CLS] token is usually called the “pooled output”, and is commonly used as an aggregate representation of the sequence.

Document Representation Our goal is to represent a given paper P as a dense vector v that best represents the paper and can be used in downstream tasks. SPECTER builds embeddings from the title and abstract of a paper. Intuitively, we would expect these fields to be sufficient to produce accurate embeddings, since they are written to provide a succinct and comprehensive summary of the paper.4

As such, we encode the concatenated title and abstract using a Transformer LM (e.g., SciBERT) and take the final representation of the [CLS] token as the output representation of the paper:5

v = Transformer(input)[CLS], (1)

where Transformer is the Transformer’s forward function, and input is the concatenation of the [CLS] token and WordPieces (Wu et al., 2016) of the title and abstract of a paper, separated by

4We also experimented with additional fields such as venues and authors but did not find any empirical advantage in using those (see §6). See §7 for a discussion of using the full text of the paper as input.

5It is also possible to encode title and abstracts individually and then concatenate or combine them to get the final embedding. However, in our experiments this resulted in sub-optimal performance.

Page 3: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

the [SEP] token. We use SciBERT as our model initialization as it is optimized for scientific text, though our formulation is general and any Transformer language model can be used instead of SciBERT. Using the above method with an “off-the-shelf” SciBERT does not take global inter-document information into account. This is because SciBERT, like other pretrained language models, is trained via language modeling objectives, which only predict words or sentences given their in-document, nearby textual context. In contrast, we propose to incorporate citations into the model as a signal of inter-document relatedness, while still leveraging the model's existing strength in modeling language.
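As a concrete illustration of Equation 1, the sketch below encodes a title and abstract with a BERT-style Transformer and takes the final [CLS] hidden state as the paper embedding. It uses the Hugging Face transformers library; the checkpoint name and the helper function are illustrative assumptions, and any SciBERT-compatible checkpoint could be substituted.

```python
# Minimal sketch of Equation 1 (assumed checkpoint id; not the released SPECTER weights).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # assumption: SciBERT checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_paper(title: str, abstract: str) -> torch.Tensor:
    # Build "[CLS] title [SEP] abstract [SEP]" and truncate to the model's limit.
    inputs = tokenizer(title, abstract, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # The final hidden state of the [CLS] token (position 0) is the paper vector v.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

v = embed_paper("SPECTER: Document-level Representation Learning ...",
                "Representation learning is a critical ingredient ...")
print(v.shape)  # e.g., torch.Size([768])
```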

2.3 Citation-Based Pretraining Objective

A citation from one document to another suggests that the documents are related. To encode this relatedness signal into our representations, we design a loss function that trains the Transformer model to learn closer representations for papers when one cites the other, and more distant representations otherwise. The high-level overview of the model is shown in Figure 1.

In particular, each training instance is a triplet of papers: a query paper PQ, a positive paper P+ and a negative paper P−. The positive paper is a paper that the query paper cites, and the negative paper is a paper that is not cited by the query paper (but that may be cited by P+). We then train the model using the following triplet margin loss function:

L = max{d(PQ, P+) − d(PQ, P−) + m, 0}   (2)

where d is a distance function and m is the loss margin hyperparameter (we empirically choose m = 1). Here, we use the L2 norm distance:

d(PA, PB) = ‖vA − vB‖2,

where vA is the vector corresponding to the pooled output of the Transformer run on paper A (Equation 1).6 Starting from the trained SciBERT model, we pretrain the Transformer parameters on the citation objective to learn paper representations that capture document relatedness.
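A minimal sketch of this objective in PyTorch follows; the function name is illustrative, and the built-in torch.nn.TripletMarginLoss(margin=1.0, p=2) computes the same quantity.

```python
import torch

def triplet_margin_loss(v_query, v_pos, v_neg, margin=1.0):
    # Equation 2: max(d(P_Q, P+) - d(P_Q, P-) + m, 0) with L2 distance.
    d_pos = torch.norm(v_query - v_pos, p=2, dim=-1)
    d_neg = torch.norm(v_query - v_neg, p=2, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```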

2.4 Selecting Negative Distractors

The choice of negative example papers P− is important when training the model. We consider two sets of negative examples: the first set simply consists of randomly selected papers from the corpus.

6We also experimented with other distance functions (e.g., normalized cosine), but they underperformed the L2 loss.

Given a query paper, intuitively we would expect the model to be able to distinguish between cited papers, and uncited papers sampled randomly from the entire corpus. This inductive bias has been also found to be effective in content-based citation recommendation applications (Bhagavatula et al., 2018). But, random negatives may be easy for the model to distinguish from the positives. To provide a more nuanced training signal, we augment the randomly drawn negatives with a more challenging second set of negative examples. We denote as “hard negatives” the papers that are not cited by the query paper, but are cited by a paper cited by the query paper, i.e., if P1 cites P2 and P2 cites P3 but P1 does not cite P3, then P3 is a candidate hard negative example for P1. We expect the hard negatives to be somewhat related to the query paper, but typically less related than the cited papers. As we show in our experiments (§6), including hard negatives results in more accurate embeddings compared to using random negatives alone.
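The candidate set of hard negatives can be read directly off the citation graph, as in the sketch below; `citations` (a mapping from paper id to the set of ids it cites) is an assumed data structure.

```python
def hard_negative_candidates(query_id, citations):
    """Papers cited by the query's citations but not by the query itself."""
    cited = citations.get(query_id, set())
    second_hop = set()
    for paper_id in cited:
        second_hop |= citations.get(paper_id, set())
    return second_hop - cited - {query_id}
```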

2.5 Inference

At inference time, the model receives one paper, P, and it outputs SPECTER's Transformer pooled output activation as the paper representation for P (Equation 1). We note that for inference, SPECTER requires only the title and abstract of the given input paper; the model does not need any citation information about the input paper. This means that SPECTER can produce embeddings even for new papers that have yet to be cited, which is critical for applications that target recent scientific papers.

3 SCIDOCS Evaluation Framework

Previous evaluations of scientific document representations in the literature tend to focus on small datasets over a limited set of tasks, and extremely high (99%+) AUC scores are already possible on these data for English documents (Chen et al., 2019; Wang et al., 2019). New, larger and more diverse benchmark datasets are necessary. Here, we introduce a new comprehensive evaluation framework to measure the effectiveness of scientific paper embeddings, which we call SCIDOCS. The framework consists of diverse tasks, ranging from citation prediction, to prediction of user activity, to document classification and paper recommendation. Note that SPECTER will not be further fine-tuned on any of the tasks; we simply plug in the embeddings as features for each task. Below, we describe each of the

Page 4: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

tasks in detail and the evaluation data associated with it. In addition to our training data, we release all the datasets associated with the evaluation tasks.

3.1 Document Classification

An important test of a document-level embedding is whether it is predictive of the class of the document. Here, we consider two classification tasks in the scientific domain:

MeSH Classification In this task, the goal is to classify scientific papers according to their Medical Subject Headings (MeSH) (Lipscomb, 2000).7

We construct a dataset consisting of 23K academic medical papers, where each paper is assigned one of 11 top-level disease classes such as cardiovascular diseases, diabetes, and digestive diseases derived from the MeSH vocabulary. The most populated category is Neoplasms (cancer) with 5.4K instances (23.3% of the total dataset) while the category with the fewest samples is Hepatitis (1.7% of the total dataset). We follow the approach of Feldman et al. (2019) in mapping the MeSH vocabulary to the disease classes.

Paper Topic Classification This task is predicting the topic associated with a paper using the predefined topic categories of the Microsoft Academic Graph (MAG) (Sinha et al., 2015).8 MAG provides a database of papers, each tagged with a list of topics. The topics are organized in a hierarchy of 5 levels, where level 1 is the most general and level 5 is the most specific. For our evaluation, we derive a document classification dataset from the level 1 topics, where a paper is labeled by its corresponding level 1 MAG topic. We construct a dataset of 25K papers, almost evenly split over the 19 different classes of level 1 categories in MAG.

3.2 Citation Prediction

As argued above, citations are a key signal of relatedness between papers. We test how well different paper representations can reproduce this signal through citation prediction tasks. In particular, we focus on two sub-tasks: predicting direct citations, and predicting co-citations. We frame these as ranking tasks and evaluate performance using MAP and nDCG, standard ranking metrics.

7https://www.nlm.nih.gov/mesh/meshhome.html

8https://academic.microsoft.com/

Direct Citations In this task, the model is asked to predict which papers are cited by a given query paper from a given set of candidate papers. The evaluation dataset includes approximately 30K total papers from a held-out pool of papers, consisting of 1K query papers and a candidate set of up to 5 cited papers and 25 (randomly selected) uncited papers. The task is to rank the cited papers higher than the uncited papers. For each embedding method, we require only comparing the L2 distance between the raw embeddings of the query and the candidates, without any additional trainable parameters.

Co-Citations This task is similar to the direct citations but instead of predicting a cited paper, the goal is to predict a highly co-cited paper with a given paper. Intuitively, if papers A and B are cited frequently together by several papers, this shows that the papers are likely highly related and a good paper representation model should be able to identify these papers from a given candidate set. The dataset consists of 30K total papers and is constructed similar to the direct citations task.
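Both citation tasks reduce to ranking a small candidate set by embedding distance and scoring the ranking; the sketch below shows this evaluation loop under assumed NumPy inputs (the ranking-metric implementation here is illustrative, not the exact evaluation code).

```python
import numpy as np

def rank_by_l2(query_vec, candidate_vecs):
    # Rank candidates by ascending L2 distance to the query embedding.
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)

def average_precision(ranked_relevance):
    # ranked_relevance: 1 for cited (relevant), 0 for uncited, in ranked order.
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy example: 3 candidates, the second one is the cited paper.
query = np.zeros(4)
cands = np.array([[2.0, 0, 0, 0], [0.5, 0, 0, 0], [1.0, 0, 0, 0]])
order = rank_by_l2(query, cands)
print(average_precision([1 if i == 1 else 0 for i in order]))  # 1.0
```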

3.3 User Activity

The embeddings for similar papers should be close to each other; we use user activity as a proxy for identifying similar papers and test the model's ability to recover this information. Multiple users consuming the same items as one another is a classic relatedness signal and forms the foundation for recommender systems and other applications (Schafer et al., 2007). In our case, we would expect that when users look for academic papers, the papers they view in a single browsing session tend to be related. Thus, accurate paper embeddings should, all else being equal, be relatively more similar for papers that are frequently viewed in the same session than for other papers. To build benchmark datasets to test embeddings on user activity, we obtained logs of user sessions from a major academic search engine. We define the following two tasks on which we build benchmark datasets to test embeddings:

Co-Views Our co-views dataset consists of approximately 30K papers. To construct it, we take 1K random papers that are not in our train or development set and associate with each one up to 5 frequently co-viewed papers and 25 randomly selected papers (similar to the approach for citations). Then, we require the embedding model to rank the

Page 5: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

co-viewed papers higher than the random papers by comparing the L2 distances of raw embeddings. We evaluate performance using standard ranking metrics, nDCG and MAP.

Co-Reads If the user clicks to access the PDF of a paper from the paper description page, this is a potentially stronger sign of interest in the paper. In such a case we assume the user will read at least parts of the paper and refer to this as a “read” action. Accordingly, we define a “co-reads” task and dataset analogous to the co-views dataset described above. This dataset is also approximately 30K papers.

3.4 Recommendation

In the recommendation task, we evaluate the ability of paper embeddings to boost performance in a production recommendation system. Our recommendation task aims to help users navigate the scientific literature by ranking a set of “similar papers” for a given paper. We use a dataset of user clickthrough data for this task which consists of 22K clickthrough events from a public scholarly search engine. We partitioned the examples temporally into train (20K examples), validation (1K), and test (1K) sets. As is typical in clickthrough data on ranked lists, the clicks are biased toward the top of the original ranking presented to the user. To counteract this effect, we computed propensity scores using a swap experiment (Agarwal et al., 2019). The propensity scores give, for each position in the ranked list, the relative frequency that the position is over-represented in the data due to exposure bias. We can then compute de-biased evaluation metrics by dividing the score for each test example by the propensity score for the clicked position. We report propensity-adjusted versions of the standard ranking metrics Precision@1 (P̂@1) and Normalized Discounted Cumulative Gain (n̂DCG).
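The de-biasing step amounts to dividing each test example's metric by the propensity of its clicked position; the small sketch below shows one way this could look, where the propensity table and the averaging scheme (a plain mean) are assumptions.

```python
def propensity_adjusted_metric(per_example_scores, clicked_positions, propensity):
    # Divide each example's score (e.g., P@1 or nDCG) by the propensity of the
    # position that was clicked in the original ranking, then average.
    adjusted = [score / propensity[pos]
                for score, pos in zip(per_example_scores, clicked_positions)]
    return sum(adjusted) / len(adjusted)

# Toy example: two clicks, one of them at a heavily exposed top position.
print(propensity_adjusted_metric([1.0, 0.0], [0, 3], {0: 2.0, 3: 0.5}))
```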

We test different embeddings on the recommendation task by including cosine embedding distance9 as a feature within an existing recommendation system that includes several other informative features (title/author similarity, reference and citation overlap, etc.). Thus, the recommendation experiments measure whether the embeddings can boost the performance of a strong baseline system on an end task. For SPECTER, we also perform an online A/B test to measure whether its advantages

9Embeddings are L2 normalized and in this case cosine distance is equivalent to L2 distance.

on the offline dataset translate into improvements on the online recommendation task (§5).

4 Experiments

Training Data To train our model, we use a subset of the Semantic Scholar corpus (Ammar et al., 2018) consisting of about 146K query papers (around 26.7M tokens) with their corresponding outgoing citations, and we use an additional 32K papers for validation. For each query paper we construct up to 5 training triples comprised of a query, a positive, and a negative paper. The positive papers are sampled from the direct citations of the query, while negative papers are chosen either randomly or from citations of citations (as discussed in §2.4). We empirically found it helpful to use 2 hard negatives (citations of citations) and 3 easy negatives (randomly selected papers) for each query paper. This process results in about 684K training triples and 145K validation triples.
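Assembling the triples could look roughly like the sketch below, which reuses the hypothetical `hard_negative_candidates` helper from §2.4; the exact sampling and filtering details are assumptions (for instance, a production version would also exclude cited papers from the easy-negative pool).

```python
import random

def build_triples(query_id, citations, corpus_ids, n_hard=2, n_easy=3):
    """Up to n_hard + n_easy (query, positive, negative) triples per query."""
    positives = list(citations.get(query_id, set()))
    if not positives:
        return []
    hard_pool = list(hard_negative_candidates(query_id, citations))
    negatives = (random.sample(hard_pool, min(n_hard, len(hard_pool))) +
                 random.sample(corpus_ids, n_easy))
    return [(query_id, random.choice(positives), neg) for neg in negatives]
```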

Training and Implementation We implement our model in AllenNLP (Gardner et al., 2018). We initialize the model from SciBERT pretrained weights (Beltagy et al., 2019) since it is the state-of-the-art pretrained language model on scientific text. We continue training all model parameters on our training objective (Equation 2). We perform minimal tuning of our model's hyperparameters based on the performance on the validation set, while baselines are extensively tuned. Based on initial experiments, we use a margin m=1 for the triplet loss. For training, we use the Adam optimizer (Kingma and Ba, 2014) following the suggested hyperparameters in Devlin et al. (2019) (LR: 2e-5, Slanted Triangular LR scheduler10 (Howard and Ruder, 2018) with number of train steps equal to training instances and cut fraction of 0.1). We train the model on a single Titan V GPU (12GB memory) for 2 epochs, with batch size of 4 (the maximum that fit in our GPU memory) and use gradient accumulation for an effective batch size of 32. Each training epoch takes approximately 1-2 days to complete on the full dataset. We release our code and data to facilitate reproducibility.11
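The optimization setup described above can be approximated with off-the-shelf components, as in the sketch below; the linear warmup/decay schedule stands in for the slanted triangular scheduler, and the iterator `encoded_triples` and the `triplet_margin_loss` function are assumptions carried over from the earlier sketches.

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

num_train_steps = 684_000                        # roughly one step per training triple
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_train_steps),  # "cut fraction" of 0.1
    num_training_steps=num_train_steps)

accumulation_steps = 8                            # batch size 4 x 8 = effective batch of 32
for step, (q, pos, neg) in enumerate(encoded_triples):   # assumed iterator of tokenized triples
    v_q = model(**q).last_hidden_state[:, 0, :]
    v_pos = model(**pos).last_hidden_state[:, 0, :]
    v_neg = model(**neg).last_hidden_state[:, 0, :]
    loss = triplet_margin_loss(v_q, v_pos, v_neg) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```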

Task-Specific Model Details For the classification tasks, we used a linear SVM where embedding vectors were the only features. The C hyperparameter was tuned via a held-out validation set.
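With fixed embeddings, this classification setup reduces to a few lines of scikit-learn; the sketch below uses toy random data in place of real SPECTER embeddings and labels, and the C grid is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 3, 200)
X_val, y_val = rng.normal(size=(50, 768)), rng.integers(0, 3, 50)

best_c, best_f1 = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:                 # tune only C on the validation set
    clf = LinearSVC(C=c).fit(X_train, y_train)
    f1 = f1_score(y_val, clf.predict(X_val), average="macro")
    if f1 > best_f1:
        best_c, best_f1 = c, f1
print("best C:", best_c, "validation macro F1:", round(best_f1, 3))
```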

10Learning rate linear warmup followed by linear decay.
11https://github.com/allenai/specter

Page 6: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

For the recommendation tasks, we use a feed-forward ranking neural network that takes as input ten features designed to capture the similarity between each query and candidate paper, including the cosine similarity between the query and candidate embeddings and manually-designed features computed from the papers' citations, titles, authors, and publication dates.
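A small feed-forward ranker over such features might look like the following PyTorch sketch; the layer sizes are assumptions, since the paper does not specify the architecture.

```python
import torch.nn as nn

class PairwiseRanker(nn.Module):
    """Scores a (query, candidate) pair from 10 hand-crafted similarity features."""
    def __init__(self, n_features=10, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, features):                 # features: (batch, n_features)
        return self.net(features).squeeze(-1)    # higher score = ranked higher
```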

Baseline Methods Our work falls into the intersection of textual representation, citation mining, and graph learning, and we evaluate against state-of-the-art baselines from each of these areas. We compare with several strong textual models: SIF (Arora et al., 2017), a method for learning document representations by removing the first principal component of aggregated word-level embeddings which we pretrain on scientific text; SciBERT (Beltagy et al., 2019), a state-of-the-art pretrained Transformer LM for scientific text; and Sent-BERT (Reimers and Gurevych, 2019), a model that uses negative sampling to tune BERT for producing optimal sentence embeddings. We also compare with Citeomatic (Bhagavatula et al., 2018), a closely related paper representation model for citation prediction which trains content-based representations with citation graph information via dynamically sampled triplets, and SGC (Wu et al., 2019a), a state-of-the-art graph-convolutional approach. For completeness, additional baselines are also included; due to space constraints we refer to Appendix A for detailed discussion of all baselines. We tune hyperparameters of baselines to maximize performance on a separate validation set.

5 Results

Table 1 presents the main results corresponding to our evaluation tasks (described in §3). Overall, we observe substantial improvements across all tasks with average performance of 80.0 across all metrics on all tasks, which is a 3.1 point absolute improvement over the next-best baseline. We now discuss the results in detail.

For document classification, we report macro F1, a standard classification metric. We observe that the classifier performance when trained on our representations is better than when trained on any other baseline. Particularly, on the MeSH (MAG) dataset, we obtain an 86.4 (82.0) F1 score which is about a ∆ = +2.3 (+1.5) point absolute increase over the best baseline on each dataset respectively. Our evaluation of the learned representations on

predicting user activity is shown in the “User activity” columns of Table 1. SPECTER achieves a MAP

score of 83.8 on the co-view task, and 84.5 on co-read, improving over the best baseline (Citeomatic in this case) by 2.7 and 4.0 points, respectively. We observe similar trends for the “citation” and “co-citation” tasks, with our model outperforming virtually all other baselines except for SGC, which has access to the citation graph at training and test time.12 Note that methods like SGC cannot be used in a real-world setting to embed new papers that are not cited yet. On the other hand, on co-citation data our method is able to achieve the best results with nDCG of 94.8, improving over SGC by 2.3 points. Citeomatic also performs well on the citation tasks, as expected given that its primary design goal was citation prediction. Nevertheless, our method slightly outperforms Citeomatic on the direct citation task, while substantially outperforming it on co-citations (+2.0 nDCG).

Finally, for the recommendation task, we observe that SPECTER outperforms all other models on this task as well, with nDCG of 53.9. On the recommendation task, as opposed to previous experiments, the differences in method scores are generally smaller. This is because for this task the embeddings are used along with several other informative features in the ranking model (described under task-specific models in §4), meaning that embedding variants have less opportunity for impact on overall performance.

We also performed an online study to evaluate whether SPECTER embeddings offer similar advantages in a live application. We performed an online A/B test comparing our SPECTER-based recommender to an existing production recommender system for similar papers that ranks papers by a textual similarity measure. In a dataset of 4,113 clicks, we found that the SPECTER ranker improved clickthrough rate over the baseline by 46.5%, demonstrating its superiority.

We emphasize that our citation-based pretraining objective is critical for the performance of SPECTER; removing this and using a vanilla SciBERT results in decreased performance on all tasks.

12For SGC, we remove development and test set citations and co-citations during training. We also remove incoming citations from development and test set queries as these would not be available at test time in production.

Page 7: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

Model               | MAG F1 | MeSH F1 | Co-View MAP | Co-View nDCG | Co-Read MAP | Co-Read nDCG | Cite MAP | Cite nDCG | Co-Cite MAP | Co-Cite nDCG | Recomm. n̂DCG | Recomm. P̂@1 | Avg.
Random              | 4.8    | 9.4     | 25.2        | 51.6         | 25.6        | 51.9         | 25.1     | 51.5      | 24.9        | 51.4         | 51.3          | 16.8         | 32.5
Doc2vec (2014)      | 66.2   | 69.2    | 67.8        | 82.9         | 64.9        | 81.6         | 65.3     | 82.2      | 67.1        | 83.4         | 51.7          | 16.9         | 66.6
Fasttext-sum (2017) | 78.1   | 84.1    | 76.5        | 87.9         | 75.3        | 87.4         | 74.6     | 88.1      | 77.8        | 89.6         | 52.5          | 18.0         | 74.1
SIF (2017)          | 78.4   | 81.4    | 79.4        | 89.4         | 78.2        | 88.9         | 79.4     | 90.5      | 80.8        | 90.9         | 53.4          | 19.5         | 75.9
ELMo (2018)         | 77.0   | 75.7    | 70.3        | 84.3         | 67.4        | 82.6         | 65.8     | 82.6      | 68.5        | 83.8         | 52.5          | 18.2         | 69.0
Citeomatic (2018)   | 67.1   | 75.7    | 81.1        | 90.2         | 80.5        | 90.2         | 86.3     | 94.1      | 84.4        | 92.8         | 52.5          | 17.3         | 76.0
SGC (2019a)         | 76.8   | 82.7    | 77.2        | 88.0         | 75.7        | 87.5         | 91.6     | 96.2      | 84.1        | 92.5         | 52.7          | 18.2         | 76.9
SciBERT (2019)      | 79.7   | 80.7    | 50.7        | 73.1         | 47.7        | 71.1         | 48.3     | 71.7      | 49.7        | 72.6         | 52.1          | 17.9         | 59.6
Sent-BERT (2019)    | 80.5   | 69.1    | 68.2        | 83.3         | 64.8        | 81.3         | 63.5     | 81.6      | 66.4        | 82.8         | 51.6          | 17.1         | 67.5
SPECTER (Ours)      | 82.0   | 86.4    | 83.6        | 91.5         | 84.5        | 92.4         | 88.3     | 94.9      | 88.1        | 94.8         | 53.9          | 20.0         | 80.0

Table 1: Results on the SCIDOCS evaluation suite consisting of 7 tasks. Classification tasks are scored with macro F1; user activity (Co-View, Co-Read) and citation (Cite, Co-Cite) tasks with MAP and nDCG; the recommendation task with propensity-adjusted n̂DCG and P̂@1; Avg. is the average over all metrics.

6 Analysis

In this section, we analyze several design decisions in SPECTER, provide a visualization of its embedding space, and experimentally compare SPECTER's use of fixed embeddings against a fine-tuning approach.

Ablation Study We start by analyzing how adding or removing metadata fields from the input to SPECTER alters performance. The results are shown in the top four rows of Table 2 (for brevity, here we only report the average of the metrics from each task). We observe that removing the abstract from the textual input and relying only on the title results in a substantial decrease in performance. More surprisingly, adding authors as an input (along with title and abstract) hurts performance.13 One possible explanation is that author names are sparse in the corpus, making it difficult for the model to infer document-level relatedness from them. As another possible reason of this behavior, tokenization using Wordpieces might be suboptimal for author names. Many author names are out-of-vocabulary for SciBERT and thus, they might be split into sub-words and shared across names that are not semantically related, leading to noisy correlation. Finally, we find that adding venues slightly decreases performance,14 except on document classification (which makes sense, as we would expect venues to have high correlation

13We experimented with both concatenating authors with the title and abstract and also considering them as an additional field. Neither were helpful.

14Venue information in our data came directly from publisher provided metadata and thus was not normalized. Venue normalization could help improve results.

Model               | CLS  | USR  | CITE | REC  | Avg.
SPECTER             | 84.2 | 88.4 | 91.5 | 36.9 | 80.0
− abstract          | 82.2 | 72.2 | 73.6 | 34.5 | 68.1
+ venue             | 84.5 | 88.0 | 91.2 | 36.7 | 79.9
+ author            | 82.7 | 72.3 | 71.0 | 34.6 | 67.3
No hard negatives   | 82.4 | 85.8 | 89.8 | 36.8 | 78.4
Start w/ BERT-Large | 81.7 | 85.9 | 87.8 | 36.1 | 77.5

Table 2: Ablations. Numbers are averages of metrics for each evaluation task: CLS: classification, USR: user activity, CITE: citation prediction, REC: recommendation, Avg.: average over all tasks & metrics.

with paper topics). The fact that SPECTER does not require inputs like authors or venues makes it applicable in situations where this metadata is not available, such as matching reviewers with anonymized submissions, or performing recommendations of anonymized preprints (e.g., on OpenReview).

One design decision in SPECTER is to use a set of hard negative distractors in the citation-based fine-tuning objective. The fifth row of Table 2 shows that this is important—using only easy negatives reduces performance on all tasks. While there could be other potential ways to include hard negatives in the model, our simple approach of including citations of citations is effective. The sixth row of the table shows that using a strong general-domain language model (BERT-Large) instead of SciBERT in SPECTER reduces performance considerably. This is reasonable because unlike BERT-Large, SciBERT is pretrained on scientific text.

Visualization Figure 2 shows t-SNE (van der Maaten, 2014) projections of our embeddings (SPECTER) compared with the SciBERT baseline

Page 8: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

Figure 2: t-SNE visualization of paper embeddings and their corresponding MAG topics: (a) SPECTER, (b) SciBERT.

for a random set of papers. When comparing SPECTER embeddings with SciBERT, we observe that our embeddings are better at encoding topical information, as the clusters seem to be more compact. Further, we see some examples of cross-topic relatedness reflected in the embedding space (e.g., Engineering, Mathematics and Computer Science are close to each other, while Business and Economics are also close to each other). To quantify the comparison of visualized embeddings in Figure 2, we use the DBScan clustering algorithm (Ester et al., 1996) on this 2D projection. We use the completeness and homogeneity clustering quality measures introduced by Rosenberg and Hirschberg (2007). For the points corresponding to Figure 2, the homogeneity and completeness values for SPECTER are respectively 0.41 and 0.72 compared with SciBERT's 0.19 and 0.63, a clear improvement on separating topics using the projected embeddings.
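This clustering-quality check can be reproduced with standard scikit-learn components, as sketched below; the random data stands in for real embeddings and topic labels, and the DBSCAN parameters are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import homogeneity_score, completeness_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))     # stand-in for paper embeddings
topics = rng.integers(0, 19, size=500)       # stand-in for level-1 MAG topic labels

points_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
clusters = DBSCAN(eps=3.0, min_samples=5).fit_predict(points_2d)  # eps is a guess

print("homogeneity:", homogeneity_score(topics, clusters))
print("completeness:", completeness_score(topics, clusters))
```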

Comparison with Task-Specific Fine-Tuning While the fact that SPECTER does not require fine-tuning makes its paper embeddings less costly to use, often the best performance from pretrained Transformers is obtained when the models are fine-tuned directly on each end task. We experiment with fine-tuning SciBERT on our tasks, and find this to be generally inferior to using our fixed representations from SPECTER. Specifically, we fine-tune SciBERT directly on task-specific signals instead of citations. To fine-tune on task-specific data (e.g., user activity), we used a dataset of co-views with 65K query papers, co-reads with 14K query papers, and co-citations (instead of direct citations) with 83K query papers. As the end tasks are ranking tasks, for all datasets we construct up to 5 triplets and fine-tune the model using triplet ranking loss. The positive papers are sampled from

Training signal                  | CLS  | USR  | CITE | REC  | All
SPECTER                          | 84.2 | 88.4 | 91.5 | 36.9 | 80.0
SciBERT fine-tune on co-view     | 83.0 | 84.2 | 84.1 | 36.4 | 76.0
SciBERT fine-tune on co-read     | 82.3 | 85.4 | 86.7 | 36.3 | 77.1
SciBERT fine-tune on co-citation | 82.9 | 84.3 | 85.2 | 36.6 | 76.4
SciBERT fine-tune on multitask   | 83.3 | 86.1 | 88.2 | 36.0 | 78.0

Table 3: Comparison with task-specific fine-tuning.

the most co-viewed (co-read, or co-cited) papers corresponding to the query paper. We also include both easy and hard distractors as when training SPECTER (for hard negatives we choose the least non-zero co-viewed (co-read, or co-cited) papers). We also consider training jointly on all task-specific training data sources in a multitask training process, where the model samples training triplets from a distribution over the sources. As illustrated in Table 3, without any additional final task-specific fine-tuning, SPECTER still outperforms a SciBERT model fine-tuned on the end tasks as well as their multitask combination, further demonstrating the effectiveness and versatility of SPECTER embeddings.15

7 Related Work

Recent representation learning methods in NLP rely on training large neural language models on unsupervised data (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Beltagy et al., 2019; Liu et al., 2019). While these models are successful at many sentence- and token-level tasks, our focus is on using them for document-level representation learning, which has remained relatively under-explored.

There have been other efforts in document representation learning such as extensions of word vectors to documents (Le and Mikolov, 2014; Ganesh et al., 2016; Liu et al., 2017; Wu et al., 2018; Gysel et al., 2017), convolution-based methods (Liu et al., 2018; Zamani et al., 2018), and variational autoencoders (Holmer and Marfurt, 2018; Wang et al., 2019). Relevant to document embedding, sentence embedding is a relatively well-studied area of research. Successful approaches include seq2seq models (Kiros et al., 2015), BiLSTM Siamese networks (Williams et al., 2018), leveraging supervised data from other corpora (Conneau et al., 2017), using discourse relations (Nie et al., 2019), and BERT-based methods (Reimers and Gurevych, 2019). Unlike our proposed method,

15We also experimented with further task-specific fine-tuning of our SPECTER on the end tasks but we did not observe additional improvements.

Page 9: SPECTER: Document-level Representation Learning using Citation … · 2020-05-21 · SPECTER: Document-level Representation Learning using Citation-informed Transformers Arman Cohan

the majority of these approaches do not consider any notion of inter-document relatedness when embedding documents.

Other relevant work combines textual features with network structure (Tu et al., 2017; Zhang et al., 2018; Bhagavatula et al., 2018; Shen et al., 2018; Chen et al., 2019; Wang et al., 2019). These works typically do not leverage the recent pretrained contextual representations and with a few exceptions such as the recent work by Wang et al. (2019), they cannot generalize to unseen documents like our SPECTER approach. Context-based citation recommendation is another related application where models rely on citation contexts (Jeong et al., 2019) to make predictions. These works are orthogonal to ours as the input to our model is just paper title and abstract. Another related line of work is graph-based representation learning methods (Bruna et al., 2014; Kipf and Welling, 2017; Hamilton et al., 2017a,b; Wu et al., 2019a,b). Here, we compare to a graph representation learning model, SGC (Simple Graph Convolution) (Wu et al., 2019a), which is a state-of-the-art graph convolution approach for representation learning. SPECTER uses pretrained language models in combination with graph-based citation signals, which enables it to outperform the graph-based approaches in our experiments.

SPECTER embeddings are based on only the title and abstract of the paper. Adding the full text of the paper would provide a more complete picture of the paper's content and could improve accuracy (Cohen et al., 2010; Lin, 2008; Schuemie et al., 2004). However, the full text of many academic papers is not freely available. Further, modern language models have strict memory limits on input size, which means new techniques would be required in order to leverage the entirety of the paper within the models. Exploring how to use the full paper text within SPECTER is an item of future work.

Finally, one pain point in academic paper recommendation research has been a lack of publicly available datasets (Chen and Lee, 2018; Kanakia et al., 2019). To address this challenge, we release SCIDOCS, our evaluation benchmark which includes an anonymized clickthrough dataset from an online recommendations system.

8 Conclusions and Future Work

We present SPECTER, a model for learning representations of scientific papers, based on a Transformer language model that is pretrained on

citations. We achieve substantial improvements over the strongest of a wide variety of baselines, demonstrating the effectiveness of our model. We additionally introduce SCIDOCS, a new evaluation suite consisting of seven document-level tasks and release the corresponding datasets to foster further research in this area.

The landscape of Transformer language models is rapidly changing and newer and larger models are frequently introduced. It would be interesting to initialize our model weights from more recent Transformer models to investigate if additional gains are possible. Another item of future work is to develop better multitask approaches to leverage multiple signals of relatedness information during training. We used citations to build triplets for our loss function; however, there are other metrics with good support in the bibliometrics literature (Klavans and Boyack, 2006) that warrant exploring as a way to create relatedness graphs. Including other information such as outgoing citations as additional input to the model would be yet another area to explore in future work.

Acknowledgements

We thank Kyle Lo, Daniel King and Oren Etzioni for helpful research discussions, Russel Reas for setting up the public API, Field Cady for help in initial data collection and the anonymous reviewers (especially Reviewer 1) for comments and suggestions. This work was supported in part by NSF Convergence Accelerator award 1936940, ONR grant N00014-18-1-2193, and the University of Washington WRF/Cable Professorship.

References

Anant K. Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Yen Li, Marc Najork, and Thorsten Joachims. 2019. Estimating position bias without intrusive interventions. In WSDM.

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler C. Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Christopher Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. In NAACL-HLT.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In EMNLP.

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-Based Citation Recommendation. In NAACL-HLT.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral networks and locally connected networks on graphs. In ICLR.

Liqun Chen, Guoyin Wang, Chenyang Tao, Dinghan Shen, Pengyu Cheng, Xinyuan Zhang, Wenlin Wang, Yizhe Zhang, and Lawrence Carin. 2019. Improving textual network embedding with global attention via optimal transport. In ACL.

Tsung Teng Chen and Maria Lee. 2018. Research Paper Recommender Systems on Big Scholarly Data. In Knowledge Management and Acquisition for Intelligent Systems.

K. Bretonnel Cohen, Helen L. Johnson, Karin M. Verspoor, Christophe Roeder, and Lawrence Hunter. 2010. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics, 11:492.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In EMNLP.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD.

Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, and Oren Etzioni. 2019. Quantifying Sex Bias in Clinical Studies at Scale With Automated Data Extraction. JAMA.

J Ganesh, Manish Gupta, and Vijay K. Varma. 2016. Doc2sent2vec: A novel two-phase approach for learning document representation. In SIGIR.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS).

Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2017. Neural Vector Spaces for Unsupervised Information Retrieval. ACM Trans. Inf. Syst.

Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017a. Inductive Representation Learning on Large Graphs. In NIPS.

William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017b. Inductive representation learning on large graphs. In NIPS.

Erik Holmer and Andreas Marfurt. 2018. Explaining away syntactic structure in semantic document representations. ArXiv, abs/1806.01620.

Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In ACL.

Chanwoo Jeong, Sion Jang, Hyuna Shin, Eunjeong Lucy Park, and Sungchul Choi. 2019. A context-aware citation recommendation model with BERT and graph convolutional networks. ArXiv, abs/1903.06464.

Anshul Kanakia, Zhihong Shen, Darrin Eide, and Kuansan Wang. 2019. A Scalable Hybrid Research Paper Recommender System for Microsoft Academic. In WWW.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. ArXiv, abs/1412.6980.

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.

Richard Klavans and Kevin W. Boyack. 2006. Identifying a better measure of relatedness for mapping science. Journal of the Association for Information Science and Technology, 57:251–263.

Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation. In Rep4NLP@ACL.

Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML.

Jimmy J. Lin. 2008. Is searching full text more effective than searching abstracts? BMC Bioinformatics, 10:46.

Carolyn E Lipscomb. 2000. Medical Subject Headings (MeSH). Bulletin of the Medical Library Association.

Chundi Liu, Shunan Zhao, and Maksims Volkovs. 2018. Unsupervised Document Embedding with CNNs. ArXiv, abs/1711.04168v3.

Pengfei Liu, King Keung Wu, and Helen M. Meng. 2017. A Model of Extended Paragraph Vector for Document Categorization and Trend Analysis. IJCNN.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar S. Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692.

Laurens van der Maaten. 2014. Accelerating t-SNE Using Tree-based Algorithms. Journal of Machine Learning Research.

Allen Nie, Erin Bennett, and Noah Goodman. 2019. DisSent: Learning Sentence Representations from Explicit Discourse Relations. In ACL.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. arXiv.

Radim Rehurek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In LREC.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A Conditional Entropy-based External Cluster Evaluation Measure. In EMNLP.

J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. In The Adaptive Web. Springer.

Martijn J. Schuemie, Marc Weeber, Bob J. A. Schijvenaars, Erik M. van Mulligen, C. Christiaan van der Eijk, Rob Jelier, Barend Mons, and Jan A. Kors. 2004. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics, 20(16):2597–604.

Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Improved semantic-aware network embedding with fine-grained word alignment. In EMNLP.

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Paul Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In WWW.

Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. CANE: Context-aware network embedding for relation modeling. In ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.

Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian Yang, Ricardo Henao, and Lawrence Carin. 2019. Improving textual network learning with variational homophilic embeddings. In Advances in Neural Information Processing Systems, pages 2074–2085.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT.

Felix Wu, Amauri H. Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. 2019a. Simplifying graph convolutional networks. In ICML.

Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J Witbrock. 2018. Word Mover's Embedding: From Word2Vec to Document Embedding. In EMNLP.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. ArXiv, abs/1609.08144.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2019b. A Comprehensive Survey on Graph Neural Networks. ArXiv, abs/1901.00596.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. ArXiv, abs/1906.08237.

Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik G. Learned-Miller, and Jaap Kamps. 2018. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In CIKM.

Xinyuan Zhang, Yitong Li, Dinghan Shen, and Lawrence Carin. 2018. Diffusion maps for textual network embedding. In NeurIPS.


A Appendix A - Baseline Details

1. Random Zero-mean 25-dimensional vectors were used as representations for each document.

2. Doc2Vec Doc2Vec is one of the earlier neural document/paragraph representation methods (Le and Mikolov, 2014), and is a natural comparison. We trained Doc2Vec on our training subset using Gensim (Rehurek and Sojka, 2010), and chose the hyperparameter grid using suggestions from Lau and Baldwin (2016). The hyperparameter grid used:

{'window': [5, 10, 15], 'sample': [0, 10 ** -6, 10 ** -5], 'epochs': [50, 100, 200]},

for a total of 27 models. The other parameters were set as follows: vector_size=300, min_count=3, alpha=0.025, min_alpha=0.0001, negative=5, dm=0, dbow=1, dbow_words=0.
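With Gensim, one configuration from this grid might be trained roughly as sketched below; note that in Gensim's API the DBOW mode is selected via dm=0 (there is no separate dbow flag), so that detail is an interpretation of the settings listed above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice these are the tokenized title+abstract of each paper.
docs = [TaggedDocument(words=["graph", "neural", "network"], tags=["paper0"]),
        TaggedDocument(words=["citation", "recommendation", "model"], tags=["paper1"])]

model = Doc2Vec(docs,
                vector_size=300, window=5, sample=0, epochs=50,
                min_count=1,                     # 3 in the real setup; 1 for the toy corpus
                alpha=0.025, min_alpha=0.0001,
                negative=5, dm=0, dbow_words=0)  # dm=0 selects the DBOW variant
print(model.dv["paper0"].shape)                  # (300,)
```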

3. Fasttext-Sum This simple baseline is a weighted sum of pretrained word vectors. We trained our own 300-dimensional fasttext embeddings (Bojanowski et al., 2017) on a corpus of around 3.1B tokens from scientific papers which is similar in size to the SciBERT corpus (Beltagy et al., 2019). We found that these pretrained embeddings substantially outperform alternative off-the-shelf embeddings. We also use these embeddings in other baselines that require pretrained word vectors (i.e., SIF and SGC that are described below). The summed bag of words representation has a number of weighting options, which are extensively tuned on a validation set for best performance.

4. SIF The SIF method of Arora et al. (2017) is a strong text representation baseline that takes a weighted sum of pretrained word vectors (we use fasttext embeddings described above), then computes the first principal component of the document embedding matrix and subtracts out each document embedding's projection to the first principal component.

We used a held-out validation set to choose a from the range [1.0e-5, 1.0e-3] spaced evenly on a log scale. The word probability p(w) was estimated on the training set only. When computing term-frequency values for SIF, we used scikit-learn's TfidfVectorizer with the same parameters as enumerated in the preceding section. sublinear_tf, binary, use_idf,

smooth_idf were all set to False. Since SIF is a sum of pretrained fasttext vectors, the resulting dimensionality is 300.
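A compact sketch of the SIF construction is given below; the word-vector and word-probability inputs are assumed dictionaries, and a = 1e-3 is just a placeholder for the value tuned on the validation set.

```python
import numpy as np

def sif_embeddings(docs, word_vecs, word_probs, a=1e-3):
    """docs: list of token lists; word_vecs: word -> np.array; word_probs: word -> p(w)."""
    dim = len(next(iter(word_vecs.values())))
    X = np.zeros((len(docs), dim))
    for i, tokens in enumerate(docs):
        vecs = [a / (a + word_probs.get(w, 1e-9)) * word_vecs[w]
                for w in tokens if w in word_vecs]
        if vecs:
            X[i] = np.mean(vecs, axis=0)
    # Remove each document's projection onto the first principal component.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```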

5. ELMo ELMo (Peters et al., 2018) provides contextualized representations of tokens in a document. It can provide paragraph or document embeddings by averaging each token's representation for all 3 LSTM layers. We used the 768-dimensional pretrained ELMo model in AllenNLP (Gardner et al., 2018).

6. Citeomatic The most relevant baseline is Citeomatic (Bhagavatula et al., 2018), which is an academic paper representation model that is trained on the citation graph via sampled triplets. Citeomatic representations are an L2 normalized weighted sum of title and abstract embeddings, which are trained on the citation graph with dynamic negative sampling. Citeomatic embeddings are 75-dimensional.

7. SGC Since our algorithm is trained on data from the citation graph, we also compare to a state-of-the-art graph representation learning model: SGC (Simple Graph Convolution) (Wu et al., 2019a), which is a graph convolution network. An alternative comparison would have been GraphSAGE (Hamilton et al., 2017b), but SGC (with no learning) outperformed an unsupervised variant of GraphSAGE on the Reddit dataset.16 Note that SGC with no learning boils down to graph propagation on node features (in our case nodes are academic documents). Following Hamilton et al. (2017a), we used SIF features as node representations, and applied SGC with a range of parameter k, which is the number of times the normalized adjacency is multiplied by the SIF feature matrix. Our range of k was 1 through 8 (inclusive), and was chosen with a validation set. For the node features, we chose the SIF model with a = 0.0001, as this model was observed to be a high-performing one. This baseline is also 300 dimensional.
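The learning-free propagation underlying this baseline can be written in a few lines of NumPy, as sketched below; the symmetric normalization with self-loops follows the standard SGC formulation, and the dense-matrix representation is an assumption made for brevity.

```python
import numpy as np

def sgc_features(adj, features, k=2):
    """Multiply the normalized adjacency (with self-loops) k times into the features."""
    a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    s = d_inv_sqrt @ a_hat @ d_inv_sqrt                   # D^-1/2 (A+I) D^-1/2
    for _ in range(k):
        features = s @ features
    return features

# Toy example: 3 papers, citation edges 0-1 and 1-2, node features of dimension 2.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(sgc_features(adj, np.eye(3, 2), k=2))
```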

8. SciBERT To isolate the advantage of SPECTER's citation-based fine-tuning objective, we add a controlled comparison with SciBERT (Beltagy et al., 2019). Following Devlin et al. (2019) we take the last layer hidden state corresponding to the [CLS] token as the aggregate document representation.17

16There were no other direct comparisons in Wu et al. (2019a).

17We also tried the alternative of averaging all token representations, but this resulted in a slight performance decrease compared with the [CLS] pooled token.


9. Sentence BERT Sentence BERT (Reimers and Gurevych, 2019) is a general-domain pretrained model aimed at embedding sentences. The authors fine-tuned BERT using a triplet loss, where positive sentences were from the same document section as the seed sentence, and distractor sentences came from other document sections. The model is designed to encode sentences as opposed to paragraphs, so we embed the title and each sentence in the abstract separately, sum the embeddings, and L2 normalize the result to produce a final 768-dimensional paper embedding.18

During hyperparameter optimization we chose how to compute TF and IDF weights by taking the following non-redundant combinations of scikit-learn's TfidfVectorizer (Pedregosa et al., 2011) parameters: sublinear_tf, binary, use_idf, smooth_idf. There were a total of 9 parameter combinations. The IDF values were estimated on the training set. The other parameters were set as follows: min_df=3, max_df=0.75, strip_accents='ascii', stop_words='english', norm=None, lowercase=True. For training of fasttext, we used all default parameters with the exception of setting dimension to 300 and minCount was set to 25 due to the large corpus.

18We used the 'bert-base-wikipedia-sections-mean-tokens' model released by the authors: https://github.com/UKPLab/sentence-transformers