Translational Science

Biophysicochemical Motifs in T-cell Receptor Sequences Distinguish Repertoires from Tumor-Infiltrating Lymphocyte and Adjacent Healthy Tissue

Jared Ostmeyer, Scott Christley, Inimary T. Toby, and Lindsay G. Cowell

Abstract

Immune repertoire deep sequencing allows comprehensive characterization of antigen receptor–encoding genes in a lymphocyte population. We hypothesized that this method could enable a novel approach to diagnose disease by identifying antigen receptor sequence patterns associated with clinical phenotypes. In this study, we developed statistical classifiers of T-cell receptor (TCR) repertoires that distinguish tumor tissue from patient-matched healthy tissue of the same organ. The basis of both classifiers was a biophysicochemical motif in the complementarity determining region 3 (CDR3) of TCRβ chains. To develop each classifier, we extracted 4-mers from every TCRβ CDR3 and represented each 4-mer using biophysicochemical features of its amino acid sequence combined with quantification of 4-mer (or receptor) abundance. This representation was scored using a logistic regression model. Unlike typical logistic regression, the classifier is fitted and validated under the requirement that at least 1 positively labeled 4-mer appears in every tumor repertoire and no positively labeled 4-mers appear in healthy tissue repertoires. We applied our method to publicly available data in which tumor and adjacent healthy tissue were collected from each patient. Using a patient-holdout cross-validation, our method achieved classification accuracy of 93% and 94% for colorectal and breast cancer, respectively. The parameter values for each classifier revealed distinct biophysicochemical properties for tumor-associated 4-mers within each cancer type. We propose that such motifs might be used to develop novel immune-based cancer screening assays.

Significance: This study presents a novel computational approach to identify T-cell repertoire differences between normal and tumor tissue.

See related commentary by Zoete and Coukos, p. 1299

Introduction

The immune system actively responds to solid tumors, resulting in tumor-infiltrating lymphocytes (TIL). Natural immune control is often unsuccessful, however, because the tumor microenvironment contains a mix of immune-activating and immune-suppressing signals (1). But given the right environmental cues, cytotoxic T lymphocytes in the tumor have the capacity to mediate tumor cell killing in virtue of bearing antigen receptors, T-cell receptors (TCR), with specificity for tumor-associated antigens (1, 2). Although there is tremendous heterogeneity between patients' antigen landscapes due to patient-specific tumor neoantigens, there is also overlap (2, 3). Therefore, we reasoned that patients with the same cancer type or subtype may have cytotoxic T-cell responses against a common set of antigens. Indeed, there is evidence for shared immunoreactivity, as well as for shared TCR sequences (4–10). We further reasoned that if these T-cell responses could be detected, particularly early in the disease course, they could serve as an important addition to the suite of methods under development for the early detection of cancer. As a first step in this direction, we designed this study to determine whether antitumor T-cell responses have a cancer-specific signature that can reliably distinguish cancer-associated repertoires from those associated with healthy tissue of the same organ.

We leveraged publicly available TCR deep sequencing data and the Multiple Instance Learning (MIL) machine-learning framework. The genes encoding TCRs are somatically generated through a process that creates essentially unique gene sequences at the relevant loci (11). This results in a tremendously diverse TCR repertoire, in which each TCR has its own distinct profile of antigens it can bind. Immune repertoire deep sequencing has made it possible to comprehensively profile the TCRs of a lymphocyte population and has been widely applied to TILs (12). The technology has enabled novel approaches for diagnosing and prognosticating diseases with a driving immune component by identifying repertoire patterns associated with clinical phenotypes. Most studies have been purely descriptive and looked for shared amino acid sequences among patients with a common phenotype (7, 8), looked for clusters of sequences overrepresented in one phenotype relative to another (13), or compared repertoire-level summary statistics, such as diversity, between phenotypes (reviewed in refs. 12, 14, 15). In the latter case, these features have prognostic value for some cancers and therapies (16–18). We are aware of only a handful of studies developing predictive models (19–23). With the exception of our study in multiple sclerosis (19), the studies were in the context of distinguishing infected from uninfected individuals or immunized from unimmunized or adjuvant-only immunized individuals where large, mono-, or oligoclonal lymphocyte expansion is expected.

Department of Clinical Sciences, UT Southwestern Medical Center, Dallas, Texas.

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

Corresponding Author: Lindsay G. Cowell, UT Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390. Phone: 214-648-2289; E-mail: [email protected]

doi: 10.1158/0008-5472.CAN-18-2292

©2019 American Association for Cancer Research.

Published OnlineFirst January 8, 2019.

Our approach relies on MIL, which provides a rigorous, established framework for relating immune repertoires to phenotypes. Most of the individual receptors in a person are unrelated to any specific phenotype and exist to maintain a diverse set of specificities as a contingency against any possible antigen. Only a small number of receptors are relevant to any specific phenotype. Mapping pertinent receptors in a repertoire to a single phenotype label can formally be described as MIL, which treats problems as bags of instances where the bags are labeled but the instances are not (24). The receptors from a single repertoire can be thought of as the instances, the repertoire as the bag, and the phenotype as the label. The goal is to predict the phenotype from the receptors.

In this study, we applied MIL to publicly available TCR deep sequencing data from tumor and healthy tissue from patients with colorectal or breast cancer (25, 26). The locus encoding the TCR β chain (TCRβ) was sequenced in all samples. In order to capture features of the TCRβs' antigen-binding capabilities, we represented each sequence using biophysicochemical features. We represented the somatically generated portion of each gene, which is the primary determinant of the antigen-binding specificity encoded by the gene. We then developed a statistical classifier for each cancer type and obtained classification accuracies by leave-one-out cross-validation of 93% and 94% for colorectal and breast cancer, respectively. Permutation analyses resulted in classification accuracies of 49% for both datasets. These results demonstrate distinct, biophysicochemical motifs in the TCRβ sequences of TILs that are specific to the cancer type and reliably distinguish cancer-associated repertoires from those associated with healthy tissue of the same organ.

Materials and Methods

Datasets

We used publicly available TCRβ deep sequencing data from 14 colorectal cancer patients (Table 1; ref. 26) and 16 breast cancer patients (Table 2; ref. 25). In both original studies, tumor and adjacent healthy tissue was biopsied from each patient, and genomic DNA was extracted and sent to Adaptive Biotechnologies for sequencing using a proprietary technology that provides accurate measurements of each receptor sequence's abundance (27). The sequences can be downloaded from the immuneACCESS database (https://doi.org/10.21417/B7PP46; https://doi.org/10.21417/B7NW5B; ref. 28).

Representing TCRs

We utilized a representation of TCRβ sequence that captures features relevant to its antigen-binding capabilities. We focused on complementarity determining region 3 (CDR3), because it is the somatically generated portion of the gene and the primary determinant of antigen-binding specificity. CDR3 residues that directly contact peptide in a peptide–MHC complex are expected to make the largest contribution to a TCR's antigen-binding specificity. To determine which TCRβ CDR3 residues contact peptide, we analyzed X-ray crystallographic structures of human TCRs bound to peptide–MHC complex (Fig. 1A) obtained from the Protein Data Bank (29). We extracted the TCRβ CDR3 sequence from each structure. After removing duplicates, 55 were left for analysis (Supplementary Table S1 and Supplementary Fig. S1). We annotated each TCRβ CDR3 residue as being in contact with peptide or not being in contact with peptide. Being in contact was defined as being ≤ 5 Å from a peptide residue. We used the annotations to perform a multiple sequence alignment, aligning contact positions using clustalw (http://www.genome.jp/tools-bin/clustalw; Fig. 1B). The alignment revealed that TCRβ CDR3 residues in contact with peptide tend to lie adjacent to each other, forming a contiguous strip (Fig. 1A and B). The size and relative location of this strip varied, but the average length was four, and it rarely included any of the first or last three TCRβ CDR3 residues (Fig. 1B). For 23 of the structures, the strip was longer than four, and for 33, the strip was shorter than four. In ~1/3 of the structures, an additional one or two residues were also in contact.
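As a rough illustration of the contact annotation, the sketch below marks a CDR3 residue as contacting the peptide when any of its atoms lies within 5 Å of any peptide atom. The atom-level minimum-distance criterion, the function name `contact_mask`, and the coordinate layout are assumptions made for illustration; the text only states the 5 Å cutoff.

```python
import numpy as np

def contact_mask(cdr3_residue_atoms, peptide_atoms, cutoff=5.0):
    """Return one boolean per CDR3 residue: True if within `cutoff` angstroms of the peptide.

    cdr3_residue_atoms: list of (n_atoms, 3) coordinate arrays, one per residue (assumed layout).
    peptide_atoms: (m_atoms, 3) coordinate array for the whole peptide.
    """
    peptide = np.asarray(peptide_atoms, dtype=float)
    mask = []
    for atoms in cdr3_residue_atoms:
        atoms = np.asarray(atoms, dtype=float)
        # Pairwise distances between this residue's atoms and all peptide atoms.
        distances = np.linalg.norm(atoms[:, None, :] - peptide[None, :, :], axis=-1)
        mask.append(bool(distances.min() <= cutoff))
    return mask

# Toy coordinates (hypothetical, not taken from any PDB entry).
peptide = [[0.0, 0.0, 0.0]]
residues = [
    [[3.0, 0.0, 0.0]],                    # 3 angstroms away: contact
    [[10.0, 0.0, 0.0], [0.0, 6.0, 0.0]],  # closest atom is 6 angstroms away: no contact
    [[0.0, 0.0, 5.0]],                    # exactly at the cutoff: contact
]
mask = contact_mask(residues, peptide)
```

A contiguous run of `True` values in such a mask corresponds to the strip of contact residues described above.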

Table 1. Microsatellite instability status and the number of unique TCRβ CDR3 sequences from the tumor and healthy tissue samples are shown for the 14 colorectal cancer patients

Colorectal samples from ref. 26

| Patient # (patient ID) | MSI status | Tumor: unique TCRβs | Healthy: unique TCRβs |
|---|---|---|---|
| 1 (400464) | MSS | 1,836 | 1,466 |
| 2 (400480) | MSS | 2,432 | 1,773 |
| 3 (400488) | MSI-H | 2,090 | 699 |
| 4 (400600) | MSS | 862 | 984 |
| 5 (400712) | MSS | 203 | 667 |
| 6 (400728) | MSS | 41 | 1,110 |
| 7 (401144) | MSS | 1,390 | 1,040 |
| 8 (401176) | MSS | 723 | 883 |
| 9 (401248) | MSS | 391 | 1,844 |
| 10 (401256) | MSS | 1,711 | 1,068 |
| 11 (401264) | MSS | 3,849 | 910 |
| 12 (401304) | MSS | 1,659 | 1,612 |
| 13 (401320) | MSS | 2,933 | 1,667 |
| 14 (401336) | MSI-H | 988 | 1,228 |

Abbreviations: MSI-H, microsatellite instability detected at two or more markers; MSS, no microsatellite instabilities detected.

Table 2. Breast cancer type, receptor status, and the number of unique TCRβ CDR3 sequences from the tumor and healthy tissue samples are shown for the 16 breast cancer patients

Breast samples from ref. 25

| Patient # (patient ID) | Type | ER/PR/HER2 | Tumor: unique TCRβs | Healthy: unique TCRβs |
|---|---|---|---|---|
| 1 (BR01) | IDC | +/+/− | 50,667 | 18,848 |
| 2 (BR05) | IDC | +/+/− | 21,559 | 7,923 |
| 3 (BR07) | IDC | +/+/− | 22,345 | 12,334 |
| 4 (BR13) | IDC | +/+/− | 8,276 | 2,609 |
| 5 (BR14) | ILC | +/+/− | 34,203 | 5,577 |
| 6 (BR15) | IDC | +/+/− | 16,341 | 3,316 |
| 7 (BR16) | IDC | +/+/− | 8,237 | 22,483 |
| 8 (BR17) | IDC | +/+/− | 8,686 | 7,748 |
| 9 (BR18) | IDC | +/+/− | 5,324 | 812 |
| 10 (BR19) | ILC | +/+/− | 8,571 | 8,865 |
| 11 (BR20) | ILC | +/+/− | 15,956 | 13,611 |
| 12 (BR21) | IDC | +/+/− | 18,597 | 10,593 |
| 13 (BR22) | IMC | +/+/− | 51,097 | 22,774 |
| 14 (BR24) | IDC | −/−/− | 45,953 | 10,903 |
| 15 (BR25) | IDC | −/−/+ | 16,004 | 4,276 |
| 16 (BR26) | ILC | +/+/− | 6,250 | 3,397 |

Abbreviations: ER, estrogen receptor; IDC, invasive ductal carcinoma; ILC, invasive lobular carcinoma; IMC, invasive mucinous carcinoma; PR, progesterone receptor.

Cancer Res; 79(7) April 1, 2019

To represent each TCRβ CDR3, we excluded the first and last three residues and partitioned the remaining sequence into every possible contiguous strip of four amino acids (4-mer; Fig. 1C). Thus, every CDR3 was represented by multiple 4-mers. Based on the alignment, we expect 4-mers to include most of the contact residues for the vast majority of TCRβ CDR3 sequences. Our expectation is that, for each TCRβ CDR3, at least one of its 4-mers contacts the peptide component of the receptor's cognate antigen.
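The partitioning step can be sketched in a few lines of Python. The function name `extract_kmers` and the example CDR3 string are hypothetical and chosen only for illustration; the trimming of three residues from each end and the window length of four follow the description above.

```python
def extract_kmers(cdr3, k=4, trim=3):
    """Return every contiguous k-mer of a CDR3 after dropping `trim` residues from each end."""
    # Guard against very short sequences so the slice never wraps around.
    core = cdr3[trim:max(trim, len(cdr3) - trim)]
    return [core[i:i + k] for i in range(len(core) - k + 1)]

# Hypothetical 13-residue CDR3; the trimmed core is "SLGQAYE".
kmers = extract_kmers("CASSLGQAYEQYF")
print(kmers)
```

A CDR3 shorter than ten residues yields no 4-mers under this scheme, because fewer than four residues remain after trimming.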

To identify 4-mers with different amino acid sequences but similar antigen-binding capabilities, we represented each 4-mer using numerical values for the biophysicochemical properties of its component amino acids. There are currently at least 566 amino acid indices one could choose from (https://www.genome.jp/aaindex/). Many are highly correlated and contain redundant information. At least two efforts have applied dimensionality reduction to large numbers of amino acid indices to derive small numbers of orthogonal properties (factors) that maintain most of the information contained in the original set. Kidera and colleagues derived 10 factors from 188 amino acid indices (30), and Atchley and colleagues derived 5 factors from 494 amino acid indices (31). We used Atchley factors, as they were derived from the largest number of indices and would require half as many model parameters as the Kidera factors. The five Atchley factors correspond loosely to polarity, secondary structure, molecular size/volume, codon diversity, and electrostatic charge. For input into our model, each amino acid in a 4-mer is represented by a vector of its five Atchley factor values (Fig. 1D).
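The encoding of a 4-mer as 4 × 5 = 20 numbers might look like the sketch below. The lookup table `TOY_FACTORS` holds placeholder numbers that are not the published Atchley values, and `encode_kmer` is a hypothetical helper; a real analysis would load the five factor values for all 20 amino acids from ref. 31.

```python
import numpy as np

# Placeholder numbers for illustration only; these are NOT the published Atchley factor values.
TOY_FACTORS = {
    "S": [0.1, -0.2, 0.3, -0.4, 0.5],
    "L": [-1.0, 0.5, 0.0, 0.2, -0.3],
    "G": [0.7, 0.7, -0.7, 0.1, 0.0],
    "Q": [0.4, -0.9, 0.6, -0.1, 0.8],
}

def encode_kmer(kmer, factors):
    """Concatenate the per-residue factor vectors of a k-mer into one flat feature vector."""
    return np.concatenate([np.asarray(factors[aa], dtype=float) for aa in kmer])

# A 4-mer with five factors per residue gives a vector of length 20 (f1 through f20).
features = encode_kmer("SLGQ", TOY_FACTORS)
```

Keeping the residue order in the concatenation means each of the 20 positions always refers to the same factor at the same 4-mer position.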

Figure 1.

A, X-ray crystallographic structure of a human TCRβ chain (gray) bound to a peptide (blue) in complex with MHC (not shown). The CDR3 is shown in green, and the portion of the CDR3 in direct contact (≤5 Å) with the peptide is shown in red. The MHC complex and α-chain are omitted for clarity. B, CDR3 sequences extracted from 55 X-ray crystallographic structures of human TCRs bound to peptide:MHC. Residues ≤5 Å from peptide (red) are used to align the sequences. The alignment was created using clustalw. The bar chart shows the proportion of structures in which the corresponding CDR3 position was in direct contact with peptide. C, To profile the specificity of a CDR3 sequence, the CDR3 is cut into every possible 4-mer excluding the first and last three residues. D, Each 4-mer is converted into a biophysicochemical representation. For each residue, there are 5 Atchley factor values describing the residue's biophysicochemical properties, resulting in a 4-mer representation consisting of 20 numeric values.


T cells undergo clonal expansion in response to antigen stimulation. Thus, receptor quantity is an important feature for our statistical classifier. We used the logarithm of the 4-mers' relative abundances as a feature in the model. We considered two approaches. The first we refer to as calculating "the 4-mer relative abundance." First, we identify every TCRβ sequence containing the 4-mer in its CDR3 and sum over its template counts, $C_{\mathrm{TCR}\beta}$.¹ This provides the 4-mer count, $C_{\mathrm{4mer}}$, for the sample. We then divide by the total count of all 4-mers in the sample, $T_{\mathrm{4mer}}$, to get the 4-mer relative abundance, $RA$.

$$C_{\mathrm{4mer}} = \sum_{\mathrm{TCR}\beta\ \text{with 4mer}} C_{\mathrm{TCR}\beta}, \qquad T_{\mathrm{4mer}} = \sum_{\mathrm{4mer}} C_{\mathrm{4mer}}, \qquad RA = \frac{C_{\mathrm{4mer}}}{T_{\mathrm{4mer}}} \tag{A}$$

The second approach is to consider only the most abundant TCRβ sequence containing the 4-mer in its CDR3. This we refer to as calculating "the TCRβ relative abundance." First, we sum over the TCRβ template counts, $C_{\mathrm{TCR}\beta}$, of every TCRβ in a sample to get the total count, $T_{\mathrm{TCR}\beta}$. We then divide $C_{\mathrm{TCR}\beta}$ for the most abundant TCRβ by $T_{\mathrm{TCR}\beta}$ to get the relative abundance, $RA_{\mathrm{TCR}\beta}$. We use the most abundant TCRβ, ignoring all less abundant TCRβs containing the 4-mer.

$$T_{\mathrm{TCR}\beta} = \sum_{\mathrm{TCR}\beta} C_{\mathrm{TCR}\beta}, \qquad RA_{\mathrm{TCR}\beta} = \frac{C_{\mathrm{TCR}\beta}}{T_{\mathrm{TCR}\beta}}, \qquad RA = \max_{\mathrm{TCR}\beta\ \text{with 4mer}} \left( RA_{\mathrm{TCR}\beta} \right) \tag{B}$$

It is unclear a priori which approach is better, so we assessed the performance of our classifiers using both.

It is important to normalize the features of a classifier to be on the same scale. We normalized the Atchley factor values so that each has zero mean and unit variance. It was unclear a priori whether it would be appropriate to normalize the 4-mer abundance term, because its values are potentially unbounded. Therefore, we assessed classifier performance with and without normalizing this term.
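The two abundance definitions (Eqs. A and B) can be compared on a toy repertoire. This is a minimal sketch under stated assumptions: a repertoire is taken to be a dictionary mapping each distinct CDR3 to its template count, a CDR3 that contains the same 4-mer twice contributes its count once, and 4-mers are drawn from the trimmed CDR3 core. The function names and example sequences are hypothetical.

```python
from collections import defaultdict

def core_kmers(cdr3, k=4, trim=3):
    """Distinct k-mers of a CDR3 after dropping `trim` residues from each end."""
    core = cdr3[trim:max(trim, len(cdr3) - trim)]
    return {core[i:i + k] for i in range(len(core) - k + 1)}

def kmer_relative_abundance(repertoire, k=4, trim=3):
    """Eq. A: summed template counts per k-mer divided by the total over all k-mers."""
    counts = defaultdict(float)
    for cdr3, templates in repertoire.items():
        for kmer in core_kmers(cdr3, k, trim):
            counts[kmer] += templates
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

def tcr_relative_abundance(repertoire, k=4, trim=3):
    """Eq. B: relative abundance of the most abundant receptor carrying each k-mer."""
    total = sum(repertoire.values())
    best = {}
    for cdr3, templates in repertoire.items():
        ra = templates / total
        for kmer in core_kmers(cdr3, k, trim):
            best[kmer] = max(best.get(kmer, 0.0), ra)
    return best

# Hypothetical repertoire: two CDR3s with template counts 6 and 2.
repertoire = {"CASSLGQAYEQYF": 6, "CASSLGQAFEQYF": 2}
ra_a = kmer_relative_abundance(repertoire)
ra_b = tcr_relative_abundance(repertoire)
```

In this example the shared 4-mer `SLGQ` has a 4-mer count of 8 out of 32 under Eq. A (0.25), whereas Eq. B assigns it 6/8 = 0.75, the share of the larger clone.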

Logistic regression model

The extracted 4-mers were scored using a logistic regression function that predicts whether a 4-mer is tumor-associated. We used this function because of its widespread use and simplicity, and because it models a binary dependent variable. First, a biased, weighted sum of the 4-mer features (the logit) is computed.

$$\mathrm{logit} = b_0 + W_1 \cdot f_1 + W_2 \cdot f_2 + \ldots + W_{20} \cdot f_{20} + W_{21} \cdot \ln RA \tag{C}$$

$f_1$ through $f_{20}$ represent the five Atchley factor values for the four 4-mer residues. $RA$ represents the 4-mer's relative abundance calculated using either Eq. A or B. The bias term $b_0$ and weights $W_1$ through $W_{21}$ are the model parameters and are fit by maximum likelihood using gradient optimization techniques (described below). The same weights $W_1$ through $W_{21}$ and bias term $b_0$ are used for all 4-mers. Once the logit is computed, the sigmoid function is applied to obtain a value between 0 and 1.

$$\mathrm{score} = \frac{1}{1 + e^{-\mathrm{logit}}} \tag{D}$$

The score represents the probability that the 4-mer is tumor-associated.
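Equations C and D translate directly into code. The sketch below assumes the 20 Atchley-derived features arrive as a flat sequence and that the 21st weight multiplies the natural log of the relative abundance; the function name `kmer_score` is hypothetical.

```python
import numpy as np

def kmer_score(features, relative_abundance, weights, bias):
    """Score one 4-mer: weighted sum of 20 features plus ln(RA) (Eq. C), then sigmoid (Eq. D)."""
    x = np.append(np.asarray(features, dtype=float), np.log(relative_abundance))
    logit = bias + np.dot(weights, x)      # Eq. C
    return 1.0 / (1.0 + np.exp(-logit))    # Eq. D

# With all-zero weights and bias the logit is 0, so the score is 0.5.
neutral = kmer_score([0.0] * 20, 0.25, np.zeros(21), 0.0)

# With a logit of ln(3) the score is 1 / (1 + 1/3) = 0.75.
w = np.zeros(21)
w[20] = 1.0                                # weight on ln(RA); ln(1.0) = 0 here
shifted = kmer_score([0.0] * 20, 1.0, w, np.log(3.0))
```

Because the same weights and bias are shared by every 4-mer, the model has only 22 free parameters regardless of repertoire size.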

Multiple instance learning

The problem of predicting repertoire-level labels from the 4-mers in each repertoire can be formally described as MIL in which the 4-mers are instances, the repertoires are bags, and the bag label is the tissue source of the repertoire (i.e., tumor or healthy; ref. 24). MIL relies on aggregating instance-level scores to assign a bag-level label. Thus, we need to aggregate the scores from all 4-mers in a repertoire into a single value that predicts whether the repertoire came from tumor or healthy tissue. Only a small number of 4-mers are expected to interact with relevant antigens. Accordingly, under the standard MIL assumption, at least one 4-mer per tumor repertoire must have a high score, whereas none from healthy tissue repertoires should have high scores. This was implemented by taking the maximum 4-mer score as the repertoire score. Thus, the probability that a repertoire came from tumor tissue given the 4-mer scores is defined as:

$$P(\mathrm{tumor} \mid \mathrm{4mer}_1, \mathrm{4mer}_2, \mathrm{4mer}_3, \ldots) = \max\{\mathrm{score}_1, \mathrm{score}_2, \mathrm{score}_3, \ldots\} \tag{E}$$

The predicted label is tumor when at least one 4-mer scores ≥ 0.5, whereas the predicted label is healthy when every 4-mer scores < 0.5. The model's parameter values were fit to maximize the assignment of correct labels.
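The max aggregation of Eq. E and the resulting label rule are small enough to show in full. In this sketch the handling of a score of exactly 0.5 (treated as tumor) is an assumption, and the function names are hypothetical.

```python
def repertoire_probability(kmer_scores):
    """Eq. E: the repertoire-level probability is the maximum 4-mer score."""
    return max(kmer_scores)

def predict_label(kmer_scores, threshold=0.5):
    """Tumor if any 4-mer reaches the threshold, otherwise healthy."""
    return "tumor" if repertoire_probability(kmer_scores) >= threshold else "healthy"

# One high-scoring 4-mer is enough to call the whole repertoire tumor.
label_a = predict_label([0.10, 0.70, 0.20])
label_b = predict_label([0.10, 0.30, 0.20])
```

A consequence of taking the maximum is that adding more low-scoring 4-mers to a repertoire never changes its predicted label.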

Gradient optimization

Specific values for $W_1$ through $W_{21}$ and $b_0$ were determined using repertoires with known labels (i.e., tumor vs. healthy). The values were selected to maximize the likelihood that each prediction from Eq. E is correct. To search for optimal values, gradient optimization was used as in ref. 19. The initial values for $b_0$ and $W_1$ through $W_{20}$ were selected as in ref. 19. Two different protocols for initializing $W_{21}$ were tried (Table 3). We ran gradient optimization from 100,000 to 375,000 different initial weight values (Table 3). To save compute cycles, only models with good or better performance were run a large number of times.

Overfitting is a concern with any statistical classifier, especially when using small amounts of labeled data. Because our approach uses the same weights for every 4-mer in each sample, our approach has fewer parameters than labeled data points (Supplementary Tables S2 and S3), which helps alleviate the concern of overfitting. Still, we applied early stopping to regularize the model, and we assessed model generalization using leave-one-out cross-validation (see below). We found that the best performing models generalize best to the holdout data on the last training step (Table 3), indicating that they had not begun to overfit the data. Previously, we applied L1/L2 regularization and dropout to the same models on a different disease and found that both worsened model performance (19). Therefore, we do not apply them here.
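To make the fitting procedure concrete, the sketch below trains the max-aggregated logistic model with plain gradient ascent from several random starting points and keeps the fit with the highest log-likelihood. It is an illustration under stated assumptions, not the optimizer of ref. 19: the step size, step count, restart count, the zero-mean normal initialization with variance 1/N_features, and the two-feature toy bags are all hypothetical choices. The gradient passes only through the top-scoring instance of each bag, which is the subgradient of the max in Eq. E.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_log_likelihood(w, b, bags, labels):
    """Mean Bernoulli log-likelihood of the bag labels under max aggregation."""
    total = 0.0
    for X, y in zip(bags, labels):
        p = np.clip(sigmoid((X @ w + b).max()), 1e-12, 1 - 1e-12)
        total += y * np.log(p) + (1 - y) * np.log(1 - p)
    return total / len(bags)

def fit_once(bags, labels, rng, steps=200, lr=0.5):
    """One run of gradient ascent from a random initialization."""
    n_features = bags[0].shape[1]
    w = rng.normal(0.0, np.sqrt(1.0 / n_features), n_features)
    b = 0.0
    for _ in range(steps):
        grad_w = np.zeros(n_features)
        grad_b = 0.0
        for X, y in zip(bags, labels):
            logits = X @ w + b
            top = int(np.argmax(logits))      # only the max instance receives gradient
            residual = y - sigmoid(logits[top])
            grad_w += residual * X[top]
            grad_b += residual
        w = w + lr * grad_w / len(bags)
        b = b + lr * grad_b / len(bags)
    return w, b

def fit_with_restarts(bags, labels, restarts=20, seed=0):
    """Keep the restart with the best fit to the training bags."""
    rng = np.random.default_rng(seed)
    fits = [fit_once(bags, labels, rng) for _ in range(restarts)]
    return max(fits, key=lambda wb: bag_log_likelihood(wb[0], wb[1], bags, labels))

# Toy bags: the tumor bag has one distinctive instance; the rest overlap with the healthy bag.
tumor = np.array([[2.0, 0.0], [-1.0, 0.5], [-1.0, -0.5]])
healthy = np.array([[-1.0, 0.5], [-1.0, -0.5], [-2.0, 0.0]])
w, b = fit_with_restarts([tumor, healthy], [1, 0])
p_tumor = sigmoid((tumor @ w + b).max())
p_healthy = sigmoid((healthy @ w + b).max())
```

Restarts matter here because the max makes the objective non-convex: some initializations settle where the top-scoring instance is one shared by both bags, and no gradient step can separate them from there.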

¹ We treat TCRβ sequences with identical CDR3 sequences as being the same TCRβ sequence, ignoring differences upstream of CDR3.

Model development and validation

We applied this approach to the colorectal and breast cancer datasets. Each was treated separately, resulting in one model for each dataset. To assess model performance, we performed patient-holdout cross-validation, where the tumor and patient-matched healthy sample for a single patient were simultaneously excluded during parameter fitting and then scored after selection of the best model (Fig. 2). The two samples were scored independently; the model had no knowledge that one was from a tumor and the other was from healthy tissue.

Several variations of the model were considered, including different methods for calculating the relative abundance term, for normalizing the relative abundance term, and for initializing its weight $W_{21}$ (Table 3). Results for the best performing models are described below.
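The patient-holdout split differs from ordinary leave-one-out in that both samples from one patient leave the training set together. A minimal sketch follows, assuming each sample is a `(patient_id, tissue_label, data)` tuple; the tuple layout, the generator name, and the toy patient IDs are hypothetical.

```python
def patient_holdout_folds(samples):
    """Yield (held_out_patient, training_samples, held_out_samples) for every patient."""
    patient_ids = sorted({patient_id for patient_id, _, _ in samples})
    for held_out in patient_ids:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Three hypothetical patients, each with a tumor and a healthy sample.
samples = [
    ("P1", "tumor", None), ("P1", "healthy", None),
    ("P2", "tumor", None), ("P2", "healthy", None),
    ("P3", "tumor", None), ("P3", "healthy", None),
]
folds = list(patient_holdout_folds(samples))
```

Grouping by patient keeps a patient's healthy sample from informing the fit that is later used to score the same patient's tumor sample, and vice versa.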

Results

Colorectal cancer

The number of 4-mers per sample in the colorectal cancer dataset ranged from 186 to 7,112, with an average of 3,789, giving ≈79,566 features per sample (Supplementary Table S2). The best model used 4-mer (rather than TCRβ) relative abundance with this term unnormalized and its weight ($W_{21}$) initialized to 0 (Table 3). This model correctly categorized 93% (26/28) of held-out samples with an average log-likelihood of −0.316 bits. The model always scored the tumor sample above the healthy sample despite having no knowledge that the two samples were from the same patient (Fig. 3A). To estimate the probability of correctly classifying 26 of 28 samples by chance, we performed a permutation analysis. For each permutation, a patient-holdout cross-validation was performed where the labels on the training data were permuted but those on the holdouts were not (Supplementary Table S4). The classification accuracies of all 20 permutations were <93%, allowing us to assign P < 0.05 to the observed accuracy. The average accuracy over all permutations is 49%, and the average log-likelihood fit is −2.66 bits.
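The permutation reasoning can be illustrated with a short sketch. The estimator used here, (exceedances + 1) / (permutations + 1), is a common convention and an assumption on our part, since the text does not state a formula; with 20 permutations and no exceedances it gives 1/21 ≈ 0.048, consistent with P < 0.05. The permuted accuracies below are placeholders, not the values in Supplementary Table S4, and the helper names are hypothetical.

```python
import random

def permute_training_labels(train_labels, seed):
    """Shuffle only the training labels; holdout labels are left untouched by the caller."""
    shuffled = list(train_labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def permutation_p_value(observed_accuracy, permuted_accuracies):
    """Empirical P value with the add-one correction."""
    exceed = sum(1 for acc in permuted_accuracies if acc >= observed_accuracy)
    return (exceed + 1) / (len(permuted_accuracies) + 1)

observed = 26 / 28
placeholder_permuted = [0.5] * 20          # hypothetical accuracies, all below the observed one
p_value = permutation_p_value(observed, placeholder_permuted)

shuffled = permute_training_labels(["tumor", "healthy"] * 13, seed=1)
```

With this convention, 20 permutations is close to the smallest number that can support a claim of P < 0.05, since 19 permutations would give 1/20 = 0.05 exactly.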

To discern the 4-mer biophysicochemical features thatincrease the probability of a tumor categorization, we exam-ined the model weights with parameters fit on all 14 patients.The weights reveal how each Atchley factor contributes to thescore and the relative importance of each 4-mer position(Fig. 3B). We observe negative weights for almost every 4-merposition for Atchley factors II and IV, indicating an increased

Small number of “N” cancer patients

Better fit?

Tumorsample

Normalsample

×N

“N-1” patients Holdout patient

Optimizelog-likelihood

(GRADIENT OPTIMZATION)

Load initial weights

×2,500Steps

×100,000 to 375,000Restarts

Best fit

Randomize weightsW ~ N(0,1/Nfeatures)

×100,000 to 375,000

Figure 2.

Workflow for model selection and parameter fitting. The diagram shows howthe data were used to train and validate each model. The performance ofeach model was assessed by a patient-holdout cross-validation, where thetumor and healthy samples from the same patient were excluded forvalidation. Data from the remaining N-1 patients were used to fit the model.For each model, between 100,000 and 375,000 initial sets of weights weregenerated. To save compute cycles, only models with good or betterperformance were run a large number of times. Each set of weights was usedfor exhaustive leave-one-out cross-validation over all N patients. Each run ofcross-validation with each set of initial weights was run for 2,500 iterations ofgradient optimization. The best fit to the N-1 training samples from among allruns was used to evaluate the excluded validation data.

Table 3. Model variations considered for each cancer type

| Calculation of RA (i.e., relative abundance) | Normalization of "Log RA" | Initial value W21 (the weight term on "Log RA") | Miscellaneous | Number of initializations | Patient holdout cross-validation |
| --- | --- | --- | --- | --- | --- |
| Colorectal cancer, Sherwood et al. (26) | | | | | |
| | | | | 125,000 | 10/28 ≈ 36% |
| Equation A | Unnormalized | W21 = 0 | | 250,000 | **26/28 ≈ 93%** |
| Equation A | Unnormalized | W21 = 0 | Batch normalization | 125,000 | 16/28 ≈ 57% |
| Equation A | Unnormalized | W21 = 0 | Batch normalization and early stopping | 125,000 | 19/28 ≈ 67% |
| Equation A | μ = 0, σ = 1 | W21 ~ N(0, 1/21) | | 250,000 | 18/28 ≈ 64% |
| Equation B | Unnormalized | W21 = 0 | | 375,000 | 21/28 ≈ 75% |
| Equation B | Unnormalized | W21 = 0 | Early stopping | 375,000 | 23/28 ≈ 82% |
| Equation B | Unnormalized | W21 ~ N(0, 1/21) | | 125,000 | 18/28 ≈ 64% |
| Breast cancer, Beausang et al. (25) | | | | | |
| | | | | 250,000 | 21/32 ≈ 67% |
| Equation A | Unnormalized | W21 = 0 | | 125,000 | 14/32 ≈ 44% |
| Equation A | Unnormalized | W21 = 0 | Early stopping | 125,000 | 23/32 ≈ 72% |
| Equation A | Unnormalized | W21 = 0 | Smaller step size | 100,000 | 13/32 ≈ 41% |
| Equation A | Unnormalized | W21 = 0 | Smaller step size and early stopping | 100,000 | 22/32 ≈ 69% |
| Equation B | Unnormalized | W21 = 0 | | 250,000 | **30/32 ≈ 94%** |
| Equation B | Unnormalized | W21 ~ N(0, 1/21) | | 250,000 | 27/32 ≈ 84% |
| Equation B | Unnormalized | W21 ~ N(0, 1/21) | Early stopping | 250,000 | 28/32 ≈ 87% |

NOTE: First column, strategy for computing 4-mer relative abundance; 2nd column, approach to normalization of the relative abundance term; 3rd column, different schemes for initializing the weight W21; 4th column, other variations of the model that were considered; 5th column, the number of initializations that were run; 6th column, the performance of each variation of the model. The performances of the best-performing models are shown in bold font.


probability that a sample is tumor-derived if it contains 4-mers comprising residues with a propensity to participate in α-helical segments and that appear infrequently among the space of all protein sequences. We also observe only positive weights for Atchley factor V, indicating an increased probability that a sample is tumor-derived if it contains 4-mers enriched with positively charged residues. The weight on the abundance term favors 4-mers with a large relative abundance. Analysis of the weights for each holdout model reveals that the model weights are highly consistent across holdout patients and with the weights fit on all 28 samples (Supplementary Fig. S2 and Supplementary Table S5).

The high-scoring 4-mers from each holdout patient also scored high with the model fit to all 14 patients. We aligned all 4-mers that scored highly enough to categorize a sample as being tumor-associated and found that the amino acids vary considerably at each 4-mer position (Fig. 3C). These 4-mers would not have been found by looking for shared amino acid sequences. The 4-mers are detected by our method, because they share similar biophysicochemical properties at key positions, as selected by the weights of the model. Several of the TCRβ CDR3 sequences correspond to large clones, but many of them do not (Fig. 3C). These CDR3 sequences would not have been found by examining only the most abundant clones.

Breast cancer

For the breast cancer dataset, the number of 4-mers ranged from 2,518 to 39,354 with an average of 20,261, giving ~425,487 features per sample (Supplementary Table S3). The best performing model uses the TCRβ (rather than 4-mer) relative abundance (Table 3). It is otherwise like that for colorectal cancer (Table 3). The model correctly categorized 94% (30/32) of held-out samples with an average log-likelihood error of ~0.283 bits (Fig. 4A). As with colorectal cancer, the model always scores the tumor sample above the patient-matched healthy sample (Fig. 4A). Permutation analysis gave a classification accuracy of 49% and an average log-likelihood fit of ~2.71 bits (Supplementary Table S6). The classification accuracies of all 20 permutations were <94%, allowing us to assign P < 0.05 to the observed accuracy.
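The step from "all 20 permutations scored below the observed accuracy" to P < 0.05 can be illustrated with a short calculation. The sketch below assumes the common add-one convention for an empirical P value, which the text does not state explicitly; the function name `permutation_p_value` is hypothetical, and the individual permuted accuracies are placeholders, since only their number and the fact that each is below 94% are given.

```python
def permutation_p_value(observed, permuted):
    """Empirical one-sided P value with the add-one correction (an assumed convention)."""
    at_least = sum(1 for a in permuted if a >= observed)
    return (at_least + 1) / (len(permuted) + 1)

# Placeholder accuracies: the text reports only that all 20 fall below 94%.
permuted = [0.50] * 20
observed = 30 / 32  # breast cancer patient-holdout accuracy

p = permutation_p_value(observed, permuted)
print(f"P = {p:.4f}")  # 1/21, just under 0.05
```

Under this convention, 20 permutations is the smallest number that can yield a P value below 0.05.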

We examined the model weights with parameters fit on all 16 patients (Fig. 4B). The direction and magnitude of the weights differed considerably from those obtained on the colorectal samples, indicating that the model is specific to cancer type. For all Atchley factors, the weights are position-dependent. For example, 4-mer scores are increased for 4-mers with hydrophobic residues at the first two positions and hydrophilic residues at the last two positions (Fig. 4B). The one similarity with the colorectal results is that the model assigns a high score to 4-mers with a high relative abundance. The weights of all models (holdout models and the model fit on all samples) cluster more tightly than for the colorectal cancer models (Supplementary Fig. S3 and Supplementary Table S7), possibly due to the much larger number of features per sample.

Figure 3.

Colorectal cancer results. A, Classification accuracy obtained by patient-holdout cross-validation, where the tumor and healthy tissue from the same patient are excluded for validation. B, Illustration of the classifier weights after fitting the model to all 14 patients. For each of the five Atchley factors, the weights are shown for the four residue positions. The weight for the log-frequency of the 4-mer is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. C, All 4-mers with a score above 0.5 (middle column) shown for each of the 14 patients (leftmost column). Each 4-mer is shown in the context of its respective CDR3. When the 4-mer appears in multiple CDR3 sequences, the CDR3 with the largest relative abundance is shown. The CDR3 sequences are ranked according to their relative abundance in the sample (rightmost column). A rank of 1 indicates the largest relative abundance in the sample. In patient 6, there are two CDR3s that each have two high-scoring 4-mers. MYRE and YREV are both found in the TCRβ CDR3 sequence CASSMYREVEAFF, and the 4-mers ERFY and RERF are both found in the TCRβ CDR3 sequence CASSRERFYEQYF.

The high-scoring 4-mers from each holdout tumor sample also scored high with the model fit to all 16 patients except for the 4-mer GSYN, which is one of two high-scoring 4-mers for patient 3 during cross-validation. We aligned all 4-mers that scored high enough to categorize a sample as tumor and found, as with the colorectal cancer model, that the amino acids vary considerably at each 4-mer position (Fig. 4C). In contrast to what was observed for colorectal cancer, however, we observe that almost every TCRβ CDR3 sequence containing a high-scoring 4-mer corresponds to a top clone (Fig. 4C).

Discussion

Improved methods for cancer early detection are urgently needed. For the vast majority of cancers, there is currently no test that is both sensitive enough to detect early stage disease and specific enough to mitigate overdiagnosis and overtreatment. For those cancers, screening of average-risk populations is not recommended, and cancer is typically detected only after it has progressed enough to cause symptoms. For the small number of cancers for which screening of average-risk populations is advised, there are still significant downsides. Current, guideline-endorsed approaches have sensitivities and specificities lower than ideal and therefore require frequent rescreening and have the potential for overdiagnosis and overtreatment. Furthermore, they primarily involve detecting abnormal tissue changes by imaging or cytology and require follow-up by invasive tissue collection, which has associated risks.

The holy grail of cancer detection is a highly specific, highly sensitive test that detects early stage disease and does not require invasive tissue collection. Many potential blood-borne biomarkers are under investigation, including protein markers and circulating cell-free tumor DNA (32). Some of them have also been found in other tissues accessible by minimally invasive procedures, such as cervical cytology samples (33). The most promising results have been obtained by assaying for combinations of markers (34). Although the results are promising, the sensitivities are highly variable, depend strongly on organ site, and are less promising for early stage disease (33–38). Thus, complementary biomarkers are needed.

Figure 4.

Breast cancer results. A, Classification accuracy obtained by patient-holdout cross-validation, where the tumor and healthy tissue from the same patient are excluded for validation. B, Illustration of the classifier weights after fitting the model to all 16 patients. For each of the five Atchley factors, the weights are shown for the four residue positions. The weight for the log-frequency of the receptor is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. C, All 4-mers with a score above 0.5 (middle column) shown for each of the 16 patients (leftmost column). Each 4-mer is shown in the context of its respective CDR3. When the 4-mer appears in multiple CDR3 sequences, the CDR3 with the largest relative abundance is shown. The CDR3 sequences are ranked according to their relative abundance in the sample (rightmost column). A rank of 1 indicates the largest relative abundance in the sample. As with colorectal cancer, we observed TCRβ CDR3 sequences containing multiple high-scoring 4-mers. In patient 1, LSRS and RSNQ appear in the TCRβ CDR3 sequence CASSLSRSNQPQHF. In patient 10, SSPH, AYNQ, and AAYN appear in the TCRβ CDR3 sequence CASSSPHRAAYNQPQHF.


Given the specificity of adaptive immune responses, lymphocyte recirculation, and the sensitivity of tests for detecting rare lymphocyte clones (39, 40), we find it plausible that antitumor lymphocyte responses could provide one such complementary biomarker. Success of the approach would require that antitumor lymphocyte responses have cancer-specific signatures, that the signatures appear early in the disease, and that they be detectable in readily accessible tissue. We designed the current study to address the first requirement by determining whether tumor-associated T-cell repertoires have a cancer-specific signature that can reliably distinguish cancer-associated TCR repertoires from those associated with healthy tissue of the same organ.

With the results presented here, we have successfully demonstrated that TIL repertoires include TCRs with cancer-specific, biophysicochemical motifs. Specifically, we detected distinct biophysicochemical motifs for colorectal cancer and breast cancer tumor-infiltrating T lymphocytes and demonstrated that these motifs distinguish tumor repertoires from healthy tissue repertoires with classification accuracies of 93% and 94%, respectively, by leave-one-out cross-validation. We further show by permutation analysis that the probability of obtaining these accuracies by chance is <0.05. These results suggest that the first requirement given above, that antitumor lymphocyte responses have cancer-specific signatures, could be met, at least for some cancer types. A definitive answer will require follow-up studies on larger patient cohorts and in additional cancer types.

As in our original approach, we represented each TCRβ's CDR3 by partitioning it into every possible contiguous strip of four amino acids (4-mer) and representing each 4-mer using the Atchley factor values for its component residues. We improved the approach, adding a feature that quantifies the relative abundances of the 4-mers. This feature is critical to model performance, increasing accuracy on the colorectal and breast cancer datasets from 36% to 93% and 67% to 94%, respectively (Table 3). Our approach is not merely identifying highly expanded T-cell clones, however. In the colorectal cancer dataset, the TCRβ CDR3 sequences containing high-scoring 4-mers correspond to a top-ten most abundant clone for only 4 of the 14 patients.
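The representation described above can be sketched as follows: each CDR3 is cut into overlapping 4-mers, and each 4-mer becomes a vector of 4 × 5 Atchley factor values plus one log relative abundance term, 21 features in total (consistent with W21 being the weight on "Log RA"). This is an illustration only. The function names are hypothetical, the template counts are invented, the normalization shown is one plausible reading rather than the paper's Equation A or B, and the factor table is filled with zeros as a placeholder instead of the published Atchley values.

```python
import math
from collections import Counter

def four_mers(cdr3, k=4):
    """Every contiguous strip of k residues in a CDR3 sequence."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]

def kmer_log_relative_abundance(repertoire, k=4):
    """repertoire: {CDR3 sequence: template count} -> {4-mer: log relative abundance}.

    Assumed normalization: each 4-mer inherits the counts of the receptors that
    contain it, divided by the total over all 4-mers in the sample.
    """
    counts = Counter()
    for cdr3, n in repertoire.items():
        for kmer in four_mers(cdr3, k):
            counts[kmer] += n
    total = sum(counts.values())
    return {kmer: math.log(c / total) for kmer, c in counts.items()}

def featurize(kmer, log_ra, factors):
    """Five factor values per residue (20 numbers) followed by the abundance term."""
    vec = [value for residue in kmer for value in factors[residue]]
    vec.append(log_ra)
    return vec

# Placeholder table: zeros stand in for the real Atchley factor values.
PLACEHOLDER_FACTORS = {aa: [0.0] * 5 for aa in "ACDEFGHIKLMNPQRSTVWY"}

# CDR3 sequences are taken from the Figure 3 caption; the counts are invented.
repertoire = {"CASSMYREVEAFF": 3, "CASSRERFYEQYF": 1}
log_ra = kmer_log_relative_abundance(repertoire)
features = {k: featurize(k, v, PLACEHOLDER_FACTORS) for k, v in log_ra.items()}

print(four_mers("CASSMYREVEAFF"))
print(len(features["MYRE"]))
```

Because the abundance term is just one more input alongside the 20 factor values, the model can weigh clone size against sequence properties rather than ranking by abundance alone.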

The biophysicochemical motifs detected by our method are specific to a cancer type. Four-mers with the following properties are classified as tumor-associated by the colorectal cancer model: hydrophilic residues in the 2nd and 3rd 4-mer positions with hydrophobic residues in the 1st and 4th positions; amino acids that tend to form α-helices at all four 4-mer positions; small residues in the 1st, 2nd, and 4th positions with large residues in the 3rd position; and positively charged residues at all four 4-mer positions. In contrast, 4-mers with the following properties are classified as tumor-associated by the breast cancer model: hydrophilic residues in the 3rd and 4th positions with hydrophobic residues in the 1st and 2nd positions; amino acids that tend to form α-helices in the 1st 4-mer position with amino acids that tend to form bends, coils, or turns in the remaining positions; large residues in the 1st, 2nd, and 4th 4-mer positions with small residues in the 3rd position; and negatively charged amino acids in the 2nd and 4th 4-mer positions with positively charged residues in the 1st and 3rd positions. Furthermore, both cancer motifs are quite different from the one reported for multiple sclerosis (19). Note that the Atchley factors have some moderate, interfactor correlation. Thus, two factors could have large weights at a particular position, but only one is important to antigen binding.

Finding biophysicochemical TCRβ motifs that, within a cancer type, are shared across the TILs repertoires of all patients but are largely absent from healthy tissue repertoires is consistent with the hypothesis that we are detecting TCR with specificity for a tumor-associated antigen that is shared between patients. We are not the first to hypothesize that patients may have shared TCRs responding to a common antigen, nor are we the first to look for the corresponding TCRs in colorectal and breast cancer TILs (7, 10, 25, 26, 41, 42). To address this hypothesis, other investigators have searched for CDR3 amino acid sequences that were shared between multiple patient TILs repertoires (7, 10, 25, 26, 41, 42). None of the studies found sequences shared across all patients in a study, and the degree of sharing varied tremendously. In the breast cancer studies, it was concluded that the shared TCRβ CDR3 sequences most likely correspond to public clones rather than receptors responding to common cancer antigens, because the same TCRβ CDR3 sequences were frequently found in databases of TCRβ repertoires from presumed healthy individuals (7, 25, 41, 42). In addition, Beausang and colleagues analyzed the sequence properties of the shared sequences and found that they had many features in common with the TCRβ CDR3 sequences of public clones, such as being shorter in length and having few insertions (25). The Munson and colleagues article stands out, because they sequenced transcripts from both TCR chains (7). They identified 14 TCR α and β pairs that were present in ≥7 of 20 patient TILs repertoires (including one present in 15 TILs repertoires) but not present in peripheral blood repertoires from six presumed healthy individuals (7).

Public clones are typically defined as TCRs with identical amino acid sequences observed across multiple individuals of a given species (43). In our case, the TCRβ CDR3s bearing high-scoring 4-mers have different amino acid sequences across patients, and therefore are not formally considered part of public TCRs. We conducted our analysis using biophysicochemical representations of amino acid sequence in order to find TCRβ CDR3s that could be expected to have similar antigen-binding capabilities, even in the absence of having identical amino acid sequences. Despite their similar biophysicochemical features, however, we cannot say that the TCRβ CDR3s we have identified are capable of binding to the same antigen. This needs to be tested experimentally.
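For contrast with the motif-based approach, the exact-match notion of sharing that underlies the definition of public clones can be written in a few lines. This is a sketch for illustration, not a procedure from any of the cited studies; the function name `shared_cdr3s`, the patient labels, and the sequences are all made up.

```python
from collections import defaultdict

def shared_cdr3s(repertoires, min_individuals=2):
    """repertoires: {individual: iterable of CDR3 amino acid sequences}.

    Returns {sequence: sorted list of carriers} for sequences that occur,
    with identical amino acids, in at least min_individuals repertoires.
    """
    carriers = defaultdict(set)
    for individual, sequences in repertoires.items():
        for seq in set(sequences):
            carriers[seq].add(individual)
    return {seq: sorted(who) for seq, who in carriers.items()
            if len(who) >= min_individuals}

# Made-up repertoires for illustration only.
repertoires = {
    "patient_1": ["CASSLGETQYF", "CASSPGQGNEQFF"],
    "patient_2": ["CASSLGETQYF", "CASSQDRGYEQYF"],
    "patient_3": ["CASSQDRGYEQYF", "CASSLGETQYF"],
}
print(shared_cdr3s(repertoires))
print(shared_cdr3s(repertoires, min_individuals=3))
```

A search of this kind finds only identical sequences, so receptors that differ in amino acids but share biophysicochemical properties at key positions would be missed.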

The motifs we found are tumor-associated, suggesting that, if TCRs bearing the motif can bind the same antigen, the antigen is tumor-associated. The permutation analysis indicates that our method cannot find a shared biophysicochemical motif that uniquely distinguishes any random grouping of TCRβ repertoires. In addition, the motif we have identified in each case is unique to the TILs repertoires. This suggests that the motif is related to the fact that the T cells are found in the tumor. This may mean that they have specificity for a cancer antigen shared across the patients. There are alternative explanations, however. The T cells may be responding to tissue damage in the tumor, or they may be T regulatory cells contributing to immunosuppression. There are likely other interpretations. Again, experimental follow-up studies are needed.

It may seem surprising that our statistical classifier generalizes across patients with presumably different HLA genes. We expect this is because TCR:MHC interaction primarily happens via contacts in CDRs 1 and 2, whereas peptide contacts are primarily via CDR3 (44–47). We further expect that using a 4-mer (rather than a longer k-mer) allows us to isolate the residues of the TCRβ CDR3 that are responsible for peptide interaction. In addition, we speculate that having patient-matched healthy tissue is important, because it provides HLA-matched controls, enabling the model to achieve good performance in the absence of HLA information. It is quite possible, however, that the models' performances would improve if HLA type were included for each patient. This is something we will determine in future studies.

For both colorectal and breast cancers, we observed multiple TCRβ CDR3 sequences that contain multiple high-scoring 4-mers. If these 4-mers are the residues that contact peptide, then this suggests that the corresponding receptors may interact with cognate antigen via multiple 4-mers (Supplementary Fig. S4). The same receptor may bind the same peptide:MHC complex in multiple ways. The same receptor may bind the same peptide in the context of different MHC molecules using different 4-mers. Or the receptor may bind different peptides with the different 4-mers. In this case, we would expect the different peptides to be highly similar in terms of their biophysicochemical properties. In either case, the TCRβ CDR3 loop must exhibit considerable conformational flexibility. This kind of loop flexibility has been demonstrated (48, 49).
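The observation that a single CDR3 can carry several high-scoring 4-mers can be checked directly against the sequences given in the figure captions. The sketch below locates each reported 4-mer within its CDR3; the function name `cdr3s_with_multiple_hits` is hypothetical, positions are 0-based, and the colorectal and breast cancer examples are pooled into one list purely for illustration.

```python
def cdr3s_with_multiple_hits(cdr3s, high_scoring, k=4):
    """Return {CDR3: [(start, 4-mer), ...]} for CDR3s holding more than one hit."""
    result = {}
    for cdr3 in cdr3s:
        found = [(i, cdr3[i:i + k])
                 for i in range(len(cdr3) - k + 1)
                 if cdr3[i:i + k] in high_scoring]
        if len(found) > 1:
            result[cdr3] = found
    return result

# Sequences and 4-mers as listed in the Figure 3 and Figure 4 captions.
cdr3s = ["CASSMYREVEAFF", "CASSRERFYEQYF",       # colorectal, patient 6
         "CASSLSRSNQPQHF", "CASSSPHRAAYNQPQHF"]  # breast, patients 1 and 10
high_scoring = {"MYRE", "YREV", "ERFY", "RERF",
                "LSRS", "RSNQ", "SSPH", "AYNQ", "AAYN"}

multi = cdr3s_with_multiple_hits(cdr3s, high_scoring)
for cdr3, hits in multi.items():
    print(cdr3, hits)
```

In three of the four sequences the hits overlap by three residues (for example MYRE and YREV), while in CASSSPHRAAYNQPQHF the hit SSPH is separated from the AAYN/AYNQ pair, which is the kind of arrangement the alternative binding scenarios above would need to account for.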

Our study has several limitations. First, our approach is designed to detect biophysicochemical motifs in TCRs with specificity for shared antigens. It cannot identify TCRs with specificity for patient-specific tumor neoantigens. Second, the number of crystal structures examined is quite small relative to the space of possible TCR–peptide–MHC interactions. Because we examined all Protein Data Bank structures from which we could extract the CDR3 amino acid sequence, increasing the number will require that additional crystal structures become available. Third, we used TCRβ sequences and thus have only part of the TCR antigen-binding site. Studies incorporating both TCRβ and TCRα are needed. It is not clear whether a model using both would find the same TCRβ biophysicochemical motifs together with a TCRα motif, or whether a novel motif would be detected. Fourth, we do not know the T lymphocyte subset to which the T cells bearing the TCRs with high-scoring 4-mers belong. Thus, we do not know their functional dispositions (e.g., cytotoxic, regulatory) or whether they could participate in tumor cell killing. Fifth, our patient sets are relatively small. Thus, follow-up studies on larger patient cohorts are needed. Finally, although we have successfully used our method on three different diseases and with both B-cell and T-cell receptors, our approach is not yet a turnkey solution for identifying biomarkers in immune repertoires. In each case, we considered several variations of the model to determine which one worked best for each disease. Each time we identify a new biomarker, however, we move one step closer to developing automated approaches that relate immune repertoires to labeled clinical phenotypes.

Our approach also has benefits. It requires only collecting tissue from a small number of patients, sequencing the immune receptors, and fitting the model. No prior knowledge of the antigens or receptor specificities is needed, and no additional experiments are required to enrich disease-specific receptors. The model's predictions are easily interpretable and may elucidate individual immune receptors associated with diseases such as cancer. Our future work will include improving our methodology and applying it to other diseases and tissue types.

Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

Authors' Contributions

Conception and design: J. Ostmeyer, L.G. Cowell
Development of methodology: J. Ostmeyer, L.G. Cowell
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J. Ostmeyer, S. Christley, I.T. Toby, L.G. Cowell
Writing, review, and/or revision of the manuscript: J. Ostmeyer, S. Christley, I.T. Toby, L.G. Cowell
Study supervision: L.G. Cowell

Acknowledgments

We are grateful that the colorectal and breast cancer datasets have been made available online, and we appreciate Adaptive Biotechnologies for hosting the data on immuneACCESS (https://clients.adaptivebiotech.com/immuneaccess). Computing time on the UT Southwestern BioHPC computing cluster was made available through the Harold C. Simmons Comprehensive Cancer Center.

This project was supported by a National Institute of Allergy and Infectious Diseases–funded R01 (AI097403) to L.G. Cowell, a training grant to the Simmons Comprehensive Cancer Center at UT Southwestern from the Cancer Prevention and Research Institute of Texas (RP160157), and funding to L.G. Cowell from UT Southwestern and the Simmons Comprehensive Cancer Center.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Received July 30, 2018; revised November 16, 2018; accepted January 3, 2019; published first January 8, 2019.

References

1. Chen DS, Mellman I. Elements of cancer immunity and the cancer-immune set point. Nature 2017;541:321–30.

2. Kvistborg P, van Buuren MM, Schumacher TN. Human cancer regression antigens. Curr Opin Immunol 2013;25:284–90.

3. Dhodapkar K, Dhodapkar M. Harnessing shared antigens and T-cell receptors in cancer: opportunities and challenges. Proc Natl Acad Sci U S A 2016;113:7944–5.

4. Romero P, Dunbar PR, Valmori D, Pittet M, Ogg GS, Rimoldi D, et al. Ex vivo staining of metastatic lymph nodes by class I major histocompatibility complex tetramers reveals high numbers of antigen-experienced tumor-specific cytolytic T lymphocytes. J Exp Med 1998;188:1641–50.

5. Dhodapkar KM, Gettinger SN, Das R, Zebroski H, Dhodapkar MV. SOX2-specific adaptive immunity and response to immunotherapy in non-small cell lung cancer. Oncoimmunology 2013;2:e25205.

6. Dhodapkar MV, Sexton R, Das R, Dhodapkar KM, Zhang L, Sundaram R, et al. Prospective analysis of antigen-specific immunity, stem-cell antigens, and immune checkpoints in monoclonal gammopathy. Blood 2015;126:2475–8.

7. Munson DJ, Egelston CA, Chiotti KE, Parra ZE, Bruno TC, Moore BL, et al. Identification of shared TCR sequences from T cells in human breast cancer using emulsion RT-PCR. Proc Natl Acad Sci U S A 2016;113:8272–7.

8. Massa C, Robins H, Desmarais C, Riemann D, Fahldieck C, Fornara P, et al. Identification of patient-specific and tumor-shared T cell receptor sequences in renal cell carcinoma patients. Oncotarget 2017;8:21212–28.

9. Bai X, Zhang Q, Wu S, Zhang X, Wang M, He F, et al. Characteristics of tumor infiltrating lymphocyte and circulating lymphocyte repertoires in pancreatic cancer by the sequencing of T cell receptors. Sci Rep 2015;5:13664.

10. Nakanishi K, Kukita Y, Segawa H, Inoue N, Ohue M, Kato K. Characterization of the T-cell receptor beta chain repertoire in tumor-infiltrating lymphocytes. Cancer Med 2016;5:2513–21.


11. Fugmann SD, Lee AI, Shockett PE, Villey IJ, Schatz DG. The RAG proteinsand V(D)J recombination: complexes, ends, and transposition.Annu Rev Immunol 2000;18:495–527.

12. Kirsch I, Vignali M, RobinsH. T-cell receptor profiling in cancer. MolOncol2015;9:2063–70.

13. Galson JD, Tr€uck J, Fowler A, Clutterbuck EA, M€unz M, Cerundolo V, et al.Analysis of B cell repertoire dynamics following hepatitis B vaccination inhumans, and enrichment of vaccine-specific antibody sequences.EBioMedicine 2015;2:2070–9.

14. Miho E, Yermanos A, Weber CR, Berger CT, Reddy ST, Greiff V. Compu-tational strategies for dissecting the high-dimensional complexity of adap-tive immune repertoires. Front Immunol 2018;9:224.

15. Chaudhary N, Wesemann DR. Analyzing immunoglobulin repertoires.Front Immunol 2018;9:462.

16. JiaQ, Zhou J, ChenG, Shi Y, YuH,Guan P, et al. Diversity index ofmucosalresident T lymphocyte repertoire predicts clinical prognosis in gastriccancer. Oncoimmunology 2015;4:e1001230.

17. Postow MA, Manuel M, Wong P, Yuan J, Dong Z, Liu C, et al. Peripheral Tcell receptor diversity is associated with clinical outcomes followingipilimumab treatment in metastatic melanoma. J Immunother Cancer2015;3:23.

18. HosoiA, TakedaK,NagaokaK, Iino T,MatsushitaH,Ueha S, et al. Increaseddiversity with reduced "diversity evenness" of tumor infiltrating T-cells forthe successful cancer immunotherapy. Sci Rep 2018;8:1058.

19. Ostmeyer J, Christley S, Rounds WH, Toby I, Greenberg BM, MonsonNL, et al. Statistical classifiers for diagnosing disease from immunerepertoires: a case study using multiple sclerosis. BMC Bioinformatics2017;18:401.

20. Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, et al.Immunosequencing identifies signatures of cytomegalovirus exposurehistory and HLA-mediated effects on the T cell repertoire. Nat Genet2017;49:659–65.

21. Sun Y, Best K, Cinelli M, Heather JM, Reich-Zeliger S, Shifrut E, et al.Specificity, privacy, and degeneracy in the CD4 T cell receptor repertoirefollowing immunization. Front Immunol 2017;8:430.

22. CinelliM, Sun Y, Best K,Heather JM, Reich-Zeliger S, Shifrut E, et al. Featureselection using a one dimensional naive Bayes' classifier increases theaccuracy of support vector machine classification of CDR3 repertoires.Bioinformatics 2017;33:951–5.

23. ThomasN, Best K,CinelliM, Reich-Zeliger S,GalH, Shifrut E, et al. Trackingglobal changes induced in the CD4 T-cell receptor repertoire by immuni-zation with a complex antigen using short stretches of CDR3 proteinsequence. Bioinformatics 2014;30:3181–8.

24. Carbonneau M-A, Cheplygina V, Granger E, Gagnon G. Multiple instancelearning: a survey of problem characteristics and applications.Pattern Recognition 2018;77:329–53.

25. Beausang JF,Wheeler AJ, Chan NH, Hanft VR, Dirbas FM, Jeffrey SS, et al. Tcell receptor sequencing of early-stage breast cancer tumors identifiesaltered clonal structure of the T cell repertoire. Proc Natl Acad Sci U S A2017;114:E10409–E10417.

26. Sherwood AM, Emerson RO, Scherer D, Habermann N, Buck K, Staffa J,et al. Tumor-infiltrating lymphocytes in colorectal tumors displaya diversity of T cell receptor sequences that differ from the T cellsin adjacent mucosal tissue. Cancer Immunol Immunother 2013;62:1453–61.

27. Carlson CS, Emerson RO, Sherwood AM, Desmarais C, Chung MW,Parsons JM, et al. Using synthetic templates to design an unbiased mul-tiplex PCR assay. Nat Commun 2013;4:2680.

28. DeWitt WS, Lindau P, Snyder TM, Sherwood AM, Vignali M, Carlson CS,et al. A public database of memory and Naive B-cell receptor sequences.PLoS One 2016;11:e0160853.

29. Rose PW, Prli�c A, Bi C, Bluhm WF, Christie CH, Dutta S, et al. The RCSBProtein Data Bank: views of structural biology for basic and appliedresearch and education. Nucleic Acids Res 2015;43:D345–56.

30. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga HA. Statistical-analysis of thephysical-properties of the 20 naturally-occurring amino-acids.J Protein Chem 1985;4:23–55.

31. Atchley WR, Zhao J, Fernandes AD, Dr€uke T. Solving the protein sequencemetric problem. Proc Natl Acad Sci U S A 2005;102:6395–400.

32. Babayan A, Pantel K. Advances in liquid biopsy approaches for earlydetection and monitoring of cancer. Genome Med 2018;10:21.

33. Kinde I, BettegowdaC,Wang Y,Wu J, AgrawalN, Shih I-M, et al. Evaluationof DNA from the Papanicolaou test to detect ovarian and endometrialcancers. Sci Transl Med 2013;5:167ra4.

34. Cohen JD, Li L, Wang Y, Thoburn C, Afsari B, Danilova L, et al. Detectionand localization of surgically resectable cancers with amulti-analyte bloodtest. Science 2018;359:926–30.

35. Krimmel JD, Schmitt MW, Harrell MI, Agnew KJ, Kennedy SR, Emond MJ,et al. Ultra-deep sequencing detects ovarian cancer cells in peritoneal fluidand reveals somatic TP53 mutations in noncancerous tissues.Proc Natl Acad Sci U S A 2016;113:6005–10.

36. Fernandez-Cuesta L, Perdomo S, Avogbe PH, Leblay N, Delhomme TM,Gaborieau V, et al. Identification of circulating tumor DNA for the earlydetection of small-cell lung cancer. EBioMedicine 2016;10:117–23.

37. Bettegowda C, Sausen M, Leary RJ, Kinde I, Wang Y, Agrawal N, et al.Detection of circulating tumor DNA in early- and late-stage humanmalignancies. Sci Transl Med 2014;6:224ra24.

38. NewmanAM, Bratman SV, To J,Wynne JF, EclovNCW,Modlin LA, et al. Anultrasensitive method for quantitating circulating tumor DNA with broadpatient coverage. Nat Med 2014;20:548–54.

39. Korde N, Roschewski M, Zingone A, Kwok M, Manasanch EE, Bhutani M,et al. Treatment with carfilzomib-lenalidomide-dexamethasone with lena-lidomide extension in patients with smoldering or newly diagnosedmultiple myeloma. JAMA Oncol 2015;1:746–54.

40. Wu D, Emerson RO, Sherwood A, Loh ML, Angiolillo A, Howie B, et al.Detection of minimal residual disease in B lymphoblastic leukemia byhigh-throughput sequencing of IGH. Clin Cancer Res 2014;20:4540–8.

41. Levy E, Marty R, Calder�on VG, Woo B, Dow M, Armisen R, et al. ImmuneDNA signature of T-cell infiltration inbreast tumor exomes. Sci Rep2016;6:30064.

42. Wang T, Wang C, Wu J, He C, Zhang W, Liu J, et al. The different T-cellreceptor repertoires in breast cancer tumors, draining lymph nodes, andadjacent tissues. Cancer Immunol Res 2017;5:148–56.

43. Venturi V, Price DA, Douek DC, Davenport MP. The molecular basis forpublic T-cell responses? Nat Rev Immunol 2008;8:231–8.

44. Garcia KC, Adams JJ, Feng D, Ely LK. The molecular basis of TCR germlinebias for MHC is surprisingly simple. Nat Immunol 2009;10:143–7.

45. Rossjohn J, Gras S, Miles JJ, Turner SJ, Godfrey DI, McCluskey J. T cellantigen receptor recognition of antigen-presenting molecules.Annu Rev Immunol 2015;33:169–200.

46. Rudolph MG, Stanfield RL, Wilson IA. How TCRs bind MHCs, peptides,and coreceptors. Annu Rev Immunol 2006;24:419–66.

47. Zhang H, Lim HS, Knapp B, Deane CM, Aleksic M, Dushek O, et al. Thecontribution of major histocompatibility complex contacts to the affinityand kinetics of T cell receptor binding. Sci Rep 2016;6:35326.

48. Reiser JB, Grégoire C, Darnault C, Mosser T, Guimezanes A, Schmitt-Verhulst AM, et al. A T cell receptor CDR3beta loop undergoes conformational changes of unprecedented magnitude upon binding to a peptide/MHC class I complex. Immunity 2002;16:345–54.

49. Ayres CM, Scott DR, Corcelli SA, Baker BM. Differential utilization of binding loop flexibility in T cell receptor ligand selection and cross-reactivity. Sci Rep 2016;6:25070.


Jared Ostmeyer, Scott Christley, Inimary T. Toby, et al. Biophysicochemical Motifs in T-cell Receptor Sequences Distinguish Repertoires from Tumor-Infiltrating Lymphocyte and Adjacent Healthy Tissue. Cancer Res 2019;79:1671–1680. Published OnlineFirst January 8, 2019. doi: 10.1158/0008-5472.CAN-18-2292

Supplementary material: http://cancerres.aacrjournals.org/content/suppl/2019/01/08/0008-5472.CAN-18-2292.DC1
