cab-rep: a database of curated antibody repertoires for exploring antibody diversity … · 2 21...

23
1 cAb-Rep: A Database of Curated Antibody Repertoires for 1 Exploring antibody diversity and Predicting Antibody Prevalence 2 3 Yicheng Guo 1 , Kevin Chen 2 , Peter D. Kwong 3,4 , Lawrence Shapiro 1,3,4 , Zizhang Sheng 1,* 4 5 6 1 Zuckerman Mind Brain Behavior Institute, Columbia University, New York, New York 10027, 7 USA 8 2 College of Arts and Science, Stony Brook University, Stony Brook, New York 11794, USA 9 3 Vaccine Research Center, NIAID, NIH, Bethesda, Maryland 20892, USA 10 4 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New 11 York 10032, USA 12 13 *Correspondence: 14 Zizhang Sheng 15 [email protected] 16 17 Key words: antibody prevalence, antibody repertoire, Antibodyomics, B cell response, 18 gene-specific substitution profile, next-generation sequencing, post-translational 19 modification, somatic hypermutation (SHM) 20 not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765099 doi: bioRxiv preprint

Upload: others

Post on 10-May-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

1

cAb-Rep: A Database of Curated Antibody Repertoires for 1Exploring antibody diversity and Predicting Antibody Prevalence 2 3Yicheng Guo1, Kevin Chen2, Peter D. Kwong3,4, Lawrence Shapiro1,3,4 , Zizhang Sheng1,* 4 5 61Zuckerman Mind Brain Behavior Institute, Columbia University, New York, New York 10027, 7USA 82College of Arts and Science, Stony Brook University, Stony Brook, New York 11794, USA 93Vaccine Research Center, NIAID, NIH, Bethesda, Maryland 20892, USA 104Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New 11York 10032, USA 12 13*Correspondence: 14Zizhang Sheng [email protected] 16 17Key words: antibody prevalence, antibody repertoire, Antibodyomics, B cell response, 18gene-specific substitution profile, next-generation sequencing, post-translational 19modification, somatic hypermutation (SHM) 20

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 2: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

2

Abstract 21 22The diversity of B cell receptors provides a basis for recognizing numerous pathogens. Antibody 23repertoire sequencing has revealed relationships between B cell receptor sequences, their 24diversity, and their function in infection, vaccination, and disease. However, many repertoire 25datasets have been deposited without annotation or quality control, limiting their utility. To 26accelerate investigations of B cell immunoglobulin sequence repertoires and to facilitate 27development of algorithms for their analysis, we constructed a comprehensive public database of 28curated human B cell immunoglobulin sequence repertoires, cAb-Rep (https://cab-29rep.c2b2.columbia.edu), which currently includes 306 immunoglobulin repertoires from 121 30human donors, who were healthy, vaccinated, or had autoimmune disease. The database contains 31a total of 267.9 million V(D)J heavy chain and 72.9 million VJ light chain transcripts. These 32transcripts are full-length or near full-length, have been annotated with gene origin, antibody 33isotype, somatic hypermutations, and other biological characteristics, and are stored in FASTA 34format to facilitate their direct use by most current repertoire-analysis programs. We describe a 35website to search cAb-Rep for similar antibodies along with methods for analysis of the 36prevalence of antibodies with specific genetic signatures, for estimation of reproducibility of 37somatic hypermutation patterns of interest, and for delineating frequencies of somatically 38introduced N-glycosylation. cAb-Rep should be useful for investigating attributes of B cell 39sequence repertoires, for understanding characteristics of affinity maturation, and for identifying 40potential barriers to the elicitation of effective neutralizing antibodies in infection or by 41vaccination. 42 43

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 3: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

3

Introduction 44B cells comprise a crucial component of the adaptive immune response (Murphy, 2014). B cells 45recognize three-dimensional epitopes of antigens through the variable domains of the B cell 46receptor (BCR), or its various secreted forms of antibody. The variable domains of BCRs and 47antibodies are composed of immunoglobulin heavy and light chains, encoded by separate genes. 48Through V(D)J gene recombination and somatic hypermutation (SHM), a high level of sequence 49diversity is introduced to the variable domain (Briney et al., 2019; Elhanati et al., 2015; Murphy, 502014; Soto et al., 2019), allowing B cells to recognize diverse antigens. Thus, an interrogation of 51B cell diversity and function is key to understanding the B cell immune response. The 52application of next-generation sequencing (NGS) to BCR repertoires provides snapshots of BCR 53diversity, and such studies in the past decade have characterized numerous features of B cell 54responses to infection, immunization, and autoimmune disease (Elhanati et al., 2015; Francica et 55al., 2015; Galson et al., 2016a; Galson et al., 2015; Galson et al., 2016b; Joyce et al., 2016; 56Kwong et al., 2017; Nielsen and Boyd, 2018; Rubelt et al., 2016; Sheng et al., 2016; Sheng et al., 572017; Stern et al., 2014; Tipton et al., 2015; Vander Heiden et al., 2017; Wu et al., 2015). 58Databases, programs, and websites have been developed to store, process, and annotate BCR 59repertoire data (Brochet et al., 2008; DeWitt et al., 2016; Margreitter et al., 2018; Miho et al., 602018; Schramm et al., 2016; Vander Heiden et al., 2018; Vander Heiden et al., 2017) (software 61reviewed in (Chaudhary and Wesemann, 2018) ). Nonetheless, most repertoire data are deposited 62in public databases in the format of raw NGS reads, which contain both sequencing duplicates 63and sequencing errors and may be difficult to annotate. A comprehensive database of curated and 64well-annotated BCR transcripts, should therefore accelerate repertoire studies including but not 65limited to characterization of B cell receptor diversity, mechanisms of clonal expansion, 66development of BCR repertoire analysis algorithms, estimation of prevalence of antigen-specific 67antibodies and their precursors, and functions of SHM. In addition, such a database should assist 68researchers in performing repertoire-related data mining. 69

70Designing vaccines that can elicit broadly neutralizing antibodies (bnAbs) is a long-term goal for 71preventing infections from fast evolving and/or highly diversified pathogens such as HIV-1 and 72influenza (Burton and Mascola, 2015; Corti and Lanzavecchia, 2013). Studies have revealed that 73some epitopes could elicit bnAbs in different individuals with similar modes of recognition and 74shared genetic signatures (e.g. V(D)J gene origin, complementarity determining region 3 (CDR3) 75length, CDR3 motifs, SHM) (Andrabi et al., 2015; Ekiert et al., 2009; Joyce et al., 2016; Kwong 76and Mascola, 2018; Wu et al., 2011; Zhou et al., 2015; Zhou et al., 2013), defined as convergent, 77multidonor or public bnAb classes. Such antibody classes often target conserved epitopes; those 78that can activate and mature precursor B cells with bnAb potential might provide templates for 79universal vaccines (Chuang et al., 2019; Kwong and Mascola, 2018; Kwong and Wilson, 2009; 80Pappas et al., 2014). However, not all antibody classes appear with high frequencies in BCR 81repertoires; those that do not may be limited due to disfavored developmental steps or 82“roadblocks”. Thus, the identifications of bnAb classes with precursor-like cells prevalent in 83humans is a critical consideration for determining which template antibodies are good targets for 84elicitation, and thus for immunogen design (Dosenovic et al., 2018; Jardine et al., 2016a; Joyce 85et al., 2016). A comprehensive database of curated B cell repertoire transcripts, for which 86uncertainties and variations in B cell repertoires (e.g. sampling bias, infection history, aging, and 87genetic diversification)(Rubelt et al., 2016) have been minimized, would be helpful for 88predicting antibody class prevalences. 89

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 4: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

4

90A database of curated BCR repertoires could, moreover, be used to investigate SHM preference 91and mechanisms of affinity maturation. Antibodies accumulate mutations with high preference 92determined by both intrinsic gene mutability and functional selection (Murphy, 2014; Odegard 93and Schatz, 2006). We previously developed gene-specific substitution profiles (GSSPs) to 94characterize positional substitution types and frequencies in 69 human V genes (Sheng et al., 952017). By incorporating additional B cell transcripts, GSSPs could be built for more genes, and 96this could have broad application; GSSPs or similar approaches have been applied to examine 97whether bnAbs mature with shared pathways (Bonsignori et al., 2016), to identify highly 98frequent SHMs and common mechanisms of function (Koenig et al., 2017; van de Bovenkamp et 99al., 2018), and to estimate whether rare SHMs (SHMs generated with very low frequency by the 100SHM machinery) in bnAbs could form barriers to re-elicitation by vaccination (Chuang et al., 1012019). 102 103In this study, we constructed a database of curated human B cell immunoglobulin sequence 104repertoires (cAb-Rep) from 306 high quality human repertoires, and developed methods to 105search cAb-Rep using either sequence or sequence signature. We used the database to construct 106GSSPs for 102 human antibody V genes and developed a script to identify rare SHMs in input 107sequences. We evaluated the robustness of query results relative to the number of repertoires, 108using antibody prevalence estimates as a test case. In summary, the database developed here 109should help to investigate B cell repertoire features, to find antibody templates for vaccine design, 110and overall to understand mechanisms of antibody development. 111 112Materials and Methods: 113BCR repertoire dataset 114We assembled a total of 376 BCR repertoire datasets from 108 human donors deposited in the 115NCBI short reads archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra). These repertoires 116were all sequenced using illumina MiSeq or HiSeq and were published in 9 studies (Table 1). In 117addition, annotated full length sequences of three donors (sequenced by AbHelix with high depth 118or HD repertoires, ~108 high quality sequences per donor) (Soto et al., 2019) and the near full 119length V(D)J sequences (sequenced with framework 1 region primers) from ten HD repertoires 120(Briney et al., 2019) were downloaded from the links provided by the authors. 121 122Next-generation sequencing data processing 123The NGS data were analyzed with the SONAR pipeline, version 2.0 124(https://github.com/scharch/sonar/) developed in our lab (Schramm et al., 2016). Briefly, 125USEARCH was used to merge the 2x300 raw reads to single transcripts and to remove 126transcripts potentially containing more than 20 miscalls calculated from sequencing quality score 127(Edgar and Flyvbjerg, 2015). Merged transcripts shorter than 300 nucleotides were removed. 128BLAST (http://www.ncbi.nlm.nih.gov/blast/) was used to assign germline V, D, and J genes to 129each transcript with customized parameters (Altschul et al., 1997; Schramm et al., 2016). CDR3 130was identified from BLAST alignment using the conserved 2nd Cysteine in V region and WGXG 131(heavy chain) or FGXG (light chain) motifs in J region (X represents any of the 20 amino acids). 132To assign an isotype for each heavy chain transcript, we used BLAST with default parameters to 133search the 3’ terminus of each transcript against a database of human heavy chain constant 134domain 1 region obtained from the international ImMunoGeneTics information system (IMGT) 135

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 5: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

5

database. A BLAST E-value threshold of 1E-6 was used to find significant isotype assignments. 136Then, sequences other than the V(D)J region of a transcript were removed and transcripts 137containing frame-shift and/or stop codon were excluded. 138 139We then applied two approaches to remove PCR duplicates and sequencing errors. For datasets 140sequenced with unique molecular identifier (UMI), we first grouped merged raw transcripts 141having identical UMI. Due to PCR crossover and UMI collisions, transcripts in a group may 142originate from different cells (Briney et al., 2019; Rubelt et al., 2016). We therefore clustered the 143raw transcripts in each group using USEARCH with a 97% sequence identity. All transcripts in a 144USEARCH cluster were aligned using CLUSTALO (Sievers et al., 2011) and a consensus 145sequence was generated. To further reduce redundant transcripts derived from multiple mRNA 146molecules of the same cell, we removed duplicates in the consensus sequences. Singletons, 147which have neither UMI duplicates nor consensus duplicates, were removed. SONAR was used 148to annotate the unique transcripts. For datasets don’t contain UMI, similar to our previous studies 149(Schramm et al., 2016; Sheng et al., 2017; Wu et al., 2015), we clustered transcripts of each 150repertoire using USEARCH with sequence identity of 0.99, and only one transcript with the 151highest sequencing depth or numbers of duplicates was selected from each cluster for later 152analyses. To further remove low quality reads, we excluded clusters with size less than 2. Finally, 153a unique dataset of transcripts was generated for each repertoire. To reduce artifacts and effect of 154small sample size on the comparisons of features of antibody repertoires, we excluded repertoires 155containing fewer than 1700 unique transcripts. 156 157For each of the 13 HD repertoires curated in previous studies, we further removed duplicates, 158sequences containing frameshifts or/and stop codons, and corrected annotation errors to form a 159unique dataset. The gene annotation information was retrieved from the annotation files provided 160by the studies. 161 162Germline gene database and prediction of new gene/alleles 163The human germline gene database from IMGT was used to assign germline V(D)J genes. To 164detect new germline genes or alleles, we combined unique reads from 108 repertoires, and used 165IgDiscover v0.9 with default parameters to identify potential new germline genes or alleles that 166are observed in multiple repertoires (Corcoran et al., 2016). While IgDiscover prefers to identify 167new genes or alleles from IgM repertoire, we used repertoires containing all isotypes. 168Nonetheless, identical unique reads were observed for each predicted gene/allele in at least two 169donors, suggesting that the predictions are still reliable. The predicted genes were submitted to 170European Nucleotide Archive (ENA) with project accession numbers: PRJEB31020. For each of 171the 13 HD repertoires, we randomly selected 1 million IgM transcripts for novel gene prediction 172(as recommended by IgDiscover manual) and no new gene or allele was found. 173 174V and J gene usage and antibody position numbering 175The unique dataset for each repertoire was used to calculate the distributions of germline gene 176usage and SHM levels. For each transcript, we used ANARCI to assign each position according 177to the Kabat, Chothia, and IMGT numbering schemes (Dunbar and Deane, 2016). 178 179Signature prevalence and rarefaction analysis 180

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 6: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

6

We developed a python script, Ab_search.py, to search the BCR repertoire database in two 181modes: signature motif and full V(D)J sequence. Briefly, in the signature searching mode, an 182amino acid signature motif from either CDR3 region or other positions of interest (defined with 183the Kabat, Chothia, and IMGT numbering scheme) is converted to python regular expression 184format and searched against all transcripts in the database. For the sequence searching mode, 185BLAST+ is called to find similar transcripts with E-value less than 1E-6 (Camacho et al., 2009). 186Signature frequency in each repertoire is calculated by dividing the number of matched 187transcripts with the total number of unique transcripts originated from the same germline gene or 188allele. 189 190To evaluate the effect of number of repertoires on estimation of signature frequency, we 191performed rarefaction analysis by sampling 𝑖 subset repertoires from PBMC samples (1 to 35). 192For each repertoire size 𝑖, the random sampling was performed N times (we used N=20 in the 193current study) and the mean signature frequency from each sampling 𝑓! was calculated. Then the 194coefficient of variance for each 𝑖, 𝐶𝑉! , was calculated as: 195

𝐶𝑉! =

(𝑓!,! −𝑓!,!

!

!!!𝑁 )!

!

!!!𝑁 − 1

𝑓!,!!

!!!𝑁

, 𝑓! =𝑓!!

!!!

𝑖

196Identification of antibody clones and Construction of GSSPs 197For each repertoire, we first sorted all transcripts to groups based on identical V and J gene 198usages. For each group, transcripts with 90% CDR3 sequence identity and the same CDR3 199length were clustered into clones using USEARCH. One representative sequence was selected in 200each clone and representative transcripts from all repertoires were combined to build GSSPs for 201V genes using mGSSP (Sheng et al., 2017). We exclude GSSPs built with less than 100 clones 202because of lack of information (Sheng et al., 2017). A python script, SHM_freq.py, was 203developed to identify SHMs in an input sequence, to search the GSSP of an assigned V gene, and 204to output the rarity of each mutation, calculated as: Rarity = (1 – Frequency of mutation) * 100%. 205 206Construction of gene-specific N-glycosylation profile 207Glycosylation sites for sequences having SHMs greater than 1% in the unique datasets of healthy 208and vaccination donors were predicted using an artificial neural network method NetNGlyc v1.0 209(http://www.cbs.dtu.dk/services/NetNGlyc/)(R. Gupta et al., 2004). Predictions with high 210specificity (potential greater than 0.5 and jury agreement of 9/9) and do not match Asn-Pro-211Ser/Thr motifs were included. N-glycosylation sites encoded in germline genes were excluded. 212 213Statistical tests 214ANOVA and T tests were performed in R. 215 216Results: 217cAb-Rep contains a diverse compendium of annotated next-generation sequencing datasets 218

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 7: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

7

We first assembled 376 BCR repertoire deep sequencing data sets from NCBI SRA database, 219each sequenced by Illumina MiSeq or HiSeq and with library preparation protocols to cover full 220length V(D)J region (5’ primers at leader regions or 5’ Rapid amplification of cDNA ends 221(RACE), 3’ primers targeting constant region 1 ) or near full length (5’ primers at N-terminus of 222framework 1 region). These datasets contain a total of 247 million raw reads (Table 1 and S1). 223We performed quality control including removing sequencing errors, PCR crossover, PCR 224duplicates, and annotated each sequence using SONAR (See Methods and Fig.1). A total of 293 225repertoires from 108 donors were selected to construct the cAb-Rep database, with 6.6 and 0.9 226million curated full length V(D)J heavy and light chain transcripts. 25541 curated transcripts 227were obtained per repertoire. In addition, annotated sequences from 13 healthy donor repertoires 228sequenced with high depth (more than 20 million unique sequences from about 107 to 108 B cells 229per donor) (Briney et al., 2019; Soto et al., 2019), were curated and incorporated. 230 231These repertoires are from studies including Hepatitis B vaccination, influenza vaccination, 232Tetanus vaccination, Multiple Sclerosis, Myasthenia Gravis, Systemic Lupus Erythematosus, and 233healthy donors (Table 1 and S1) (Galson et al., 2016a; Galson et al., 2015; Galson et al., 2016b; 234Gupta et al., 2017; Joyce et al., 2016; Rubelt et al., 2016; Stern et al., 2014; Tipton et al., 2015; 235Vander Heiden et al., 2017). Among all repertoires, 72 repertoires are from 43 healthy donors 236(35.5% of total donors). The database repertoiores were obtained from three tissues: 2 brain 237samples, 15 cervical lymph node samples, and 289 peripheral blood mononuclear cell (PBMC) 238samples. Among PBMC samples, 16 were from antibody secreting B cells, 27 from Hepatitis B 239surface antigen (HBsAg)+ B cells, 8 from Human Leukocyte Antigen – DR isotype (HLA-DR)+ 240plasma cells, 139 from IgG+ B cells, 20 from memory B cells, 34 from IgM or naïve B cells, and 24152 from whole PBMC (Table S1). Thus, cAb-Rep contains BCR repertoire data in different 242settings, which will enable cross-study comparisons. 243 244For each transcript in cAb-Rep, we numbered positions using Kabat, Chothia, and IMGT 245schemes (Fig. 1), and annotated gene origin, CDR3 sequence and length, somatic hypermutation 246level, isotype, donor, and clonotype. These curated datasets can be used for downstream analyses. 247The transcripts are stored in FASTA format with annotation information in the header line of 248each transcript. Such a format can be easily altered to feed into other programs for repertoire 249analysis. 250 251Because the 13 HD repertoires contains over 330 million unique sequences, to avoid sampling 252bias for profile constructions and improve speed of searching cAb-Rep (see sections below), we 253randomly selected 20,000 unique sequences of each isotype of each repertoire and combined 254with the rest of repertoires for all analyses below. Nonetheless, in case rare events of BCR 255signatures are of interests, we provided an option in our scripts and at cAb-Rep website to search 256the 13 HD repertoires alone. 257 258Ab_search identifies similar antibody transcripts and estimates signature frequency 259To identify antibody transcripts of interest from cAb-Rep, we developed a python script, 260Ab_search.py. The script accepts sequences or amino acid motifs as input and output frequency 261of the signature and matched sequences (See Materials and Methods). Here, as examples, we 262searched transcripts having signature motifs of a selected malaria antibody class in PBMC BCR 263repertoires or of a selected HIV-1 bnAb classes in IgM and IgG repertoires (Fig.2). 264

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 8: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

8

265The selected malaria antibody class contains both heavy (IGHV3-33 gene with a tryptophan at 266position 52) and light chain (IGKV1-5 gene with 8 amino acid CDRL3) signatures (Imkeller et 267al., 2018). The heavy chain signature was observed with a frequency of 3.2% ± 0.94% in 15 268PBMC repertoires in healthy donors (Fig.2A), the light chain signature was found with a 269frequency of 1.24% ± 0.18% in 17 healthy donor PBMC repertoires. All healthy donors we 270searched contain both heavy and light sequences. By assuming random pairing (Jayaram et al., 2712012), heavy-light paired antibodies similar to the malaria antibody class could be generated 272with a high frequency (approximately 40 per million B cells). Thus, vaccine mediated elicitation 273of such antibodies are promising candidates to enable protection. 274 275The HIV-1 VRC01 class contains both heavy and light chain signatures (West et al., 2012; Wu et 276al., 2011; Zhou et al., 2015; Zhou et al., 2013). Its heavy chain uses IGHV1-2*02 allele. The 277light chain is originated from a limited set of genes, including IGKV1-33, IGKV3-15, IGKV3-20, 278or IGLV2-14 and contains a CDR3 of five amino acids matching motif X-X-[AFILMYWV]-279[EQ]-X. Searching cAb-Rep with these signatures showed that the heavy chain allele for VRC01 280appears in IgM and IgG repertories with similar frequencies (2.7%±1.9% and 2.85%±2.13% 281respectively) (Fig. 2B). Further, VRC01 light chain-like transcripts were found in ~33% of 282donors, with overall frequencies of 0.005%±0.0098% in PBMCs (Fig. 2C). By assuming random 283pairing of heavy and light chains, our calculation of the mean frequency of VRC01 class-like 284antibodies was 1.4 per million B cells in healthy donors, close to estimates from previous studies 285(Jardine et al., 2016a; Zhou et al., 2013). 286 287Impact of repertoire diversity on signature prevalence 288Due to diversification in BCR repertoires by antigen selection and other factors, the predicted 289prevalence of transcripts of interest could vary. To evaluate the effect of the number of sampled 290repertoires on prevalence prediction, we performed rarefaction analysis to sample a subset of 291repertoires (ranging from 1 to 35) with each subset randomly sampling 20 times to calculate 292frequencies of the malaria IGHV3-33/IGKV1-5 class and VRC01 class light chain signatures. 293For each sampling size, we calculated the coefficient of variation (standard deviation/mean) to 294measure the degree of variation among sampling repeats (Fig. 2D). For both signatures, our 295analysis revealed that the coefficient of variation decreases dramatically when the sampling size 296increases to 10 or more. This suggests that 10 or more repertoires will be optimal to have a 297consistent estimation of signature frequency. 298 299Gene-specific substitution profiles and substitution frequency analysis 300To investigate substitution preferences in V genes, we predicted new germline genes in the 301database, clustered transcripts in each repertoire into clones, and selected one representative 302sequence per clone to build gene-specific substitution profiles (GSSPs) (see Methods). Overall, 303we identified 5 novel heavy chain alleles, 2 novel lambda chain alleles, and 1 kappa chain allele, 304each of which was found in two or more donors. By incorporating germline gene sequences in 305database and prediction, we built GSSPs for 102 human V genes, compared to GSSPs built for 30669 genes using three repertoires in our previous study. Our analysis further showed that the 307GSSPs of the two studies are highly consistent (r = 0.983 for IGHV1-2 gene, Fig. 3A). 308 309

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 9: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

9

To facilitate exploring substitution preference, we developed a python script, SHM_freq.py, to 310identify mutations in an input sequence, call the GSSP of corresponding V gene, and find the 311frequency of the mutation being generated by the somatic hypermutation machinery. To 312demonstrate how this information can be helpful, we analyzed frequencies of substitutions 313observed in the heavy chain of VRC01 class bnAbs (Fig. 3B). This analysis showed that all 314lineages in this class contain over 30% mutations, with ~30% of the mutations being low 315frequency or rare mutations (frequency <0.5% in IGHV1-2 GSSP). Functional studies have 316shown that some rare mutations are essential for function (Jardine et al., 2016b). However, since 317rare mutations are generated with low frequency, the likelihood of immunogens maturing 318antibodies to have similar mutations could be low or require longer maturation times. In contrast, 319we observed that autoantibodies (e.g. collected from HIV, autoimmune thyroid disease, 320atherosclerosis, Hashimoto disease, and rheumatic carditis (Chazenbalk et al., 1993; Hexham et 321al., 1992; Jeon et al., 2007; Pichurin et al., 2001; Pichurin et al., 2002; Wu et al., 1998)) 322originated from IGHV1-2 genes contain very few rare mutations, suggesting somatic mutations 323may not provide a barrier to elicitation of these lineages. 324 325Gene-specific N-glycosylation profiles (GSNPs) 326Post-translation modifications (PTM) (glycosylation, tyrosine sulfation, etc.), which affects 327antibody functions (Doria-Rose et al., 2014; van de Bovenkamp et al., 2018), can be introduced 328to antibodies by V(D)J recombination and somatic hypermutation processes. To understand the 329frequency and preference of PTMs generated by somatic hypermutation, as an example, we 330predicted V-gene-specific frequency of N-glycosylation sequons at each position using healthy 331and vaccination donor unique sequences that having more than 1% SHM. Overall, consistent 332with previous study (van de Bovenkamp et al., 2018), the predicted N-glycosylation sites are 333enriched in CDR1, CDR2, and framework3 regions, but different genes have different hotspots 334for glycosylation (Fig. 4A). Structural analysis showed that the side chains of these hotspot 335positions are surface-exposed (Fig. 4B), suggesting these sites are spatially accessible for 336modification. We think the GSNPs provide information for further experimental validation and 337investigations of functions of N-glycosylations. 338 339cAb-Rep website to search frequencies of signature motif and SHM 340While we developed scripts to search cAb-Rep, these may not be friendly to users not familiar 341with programing. Therefore, we developed a website for searching cAb-Rep (https://cab-342rep.c2b2.columbia.edu/). The website implements all functions of the scripts we developed 343above, including querying cAb-Rep using the three signature modes (CDR3, position, BLAST) 344with specified isotype, numbering scheme, and VJ recombinations, identifying rare SHMs for an 345input sequence, and showing the GSSP of a V gene (Fig. 5). Users can also query GSNP of an 346input gene as well as download all the datasets in this study. 347 348Discussion: 349To understand the mechanisms of BCR diversity generated by V(D)J recombination and somatic 350hypermutation, high quality BCR repertoire datasets are critical. In this study, we have 351constructed a database of curated BCR transcripts from previous deep sequencing studies of B 352cell immunoglobulin sequence repertoires, and demonstrated how this database can provide 353helpful information for repertoire studies both locally and online. Currently, heavy-light pairing 354information is not obtained in cAb-Rep datasets. As new technologies are applied to repertoire 355

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 10: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

10

sequencing, we believe more paired heavy-light chain transcripts will be incorporated, which 356will greatly advance functional characterization of BCRs. We will continue to update this 357database to include more disease conditions as well as those from animal models. 358 359In cAb-Rep, we incorporated BCR datasets sequenced both with and without UMI. Besides 360filtering out transcripts with low sequencing quality, different approaches were used to identify 361high quality transcripts. For repertoires sequenced using UMI, we used the UMI information to 362generate consensus sequences, which has been proven effective at removing sequencing errors 363(Khan et al., 2016). For BCR repertoires sequenced without UMI, we used a clustering approach, 364which we first sorted transcripts based on redundancy and selected transcripts with the most 365redundancy as seeds to cluster transcripts at a given identity cutoff and only used the seeds as 366high quality transcripts. The assumption is that each B cell contains multiple BCR mRNA 367molecules and the original BCR could be PCR amplified and sequenced with many copies. 368Sequencing errors and PCR crossover are rare and close to random events (Schirmer et al., 2015), 369sequences containing sequencing errors are likely a few hamming distance away from the seed 370transcripts or appear as singletons after clustering. Thus, removing transcripts highly similar to 371the seed transcripts and singletons will remove many sequencing errors, although we may lose a 372portion of biological transcripts with low sequencing coverage. This approach is proven effective 373at finding functional transcripts in previous studies (Bhiman et al., 2015; Krebs et al., 2019). 374 375Compared to other BCR repertoire databases, cAb-Rep provides more flexibility to genetic 376search signatures. Another public database, iReceptor, supported by the adaptive immune 377receptor repertoire community, has been developed to store, query, and analyze immune receptor 378repertoire data (Corrie et al., 2018). While iReceptor includes BCR repertoires amplified with 379various library preparation protocols and sequenced in multiple platforms (illumine, 454, ion 380torrent, etc.), cAb-Rep is limited to include full-length or near-full length sequences with enough 381sample size for repertoire related analyses. For repertoire analysis, iReceptor allows query with 382V(D)J recombination and a CDR3 peptide (no regular expression grammars allowed), while 383cAb-Rep provides BLAST mode and a more flexible CDR3 searching mode. cAb-Rep also 384allows to search motifs across the V(D)J region in the position mode. Although iReceptor 385includes curated repertoire data deposited by researchers, these curated datasets are processed 386using different bioinformatics pipeline and parameters (germline gene database, exclusion of 387redundancy, etc.), which may introduce bias when performing statistical analysis across studies. 388Another B cell repertoire database is available, but only includes repertoires from three donors 389(DeWitt et al., 2016). Moreover, the 13 HD curated BCR repertoires, which provide high depth 390BCR diversity information, haven’t been incorporated in other databases. These datasets increase 391the statistical power to study features as well as rare events of BCR diversity. 392 393The second goal of cAb-Rep is to advance the investigation of functions of somatic 394hypermutation, which, as far as we know, is not available in other B cell repertoire databases. 395We provide a query portal and GSSPs for human antibody V genes to understand gene- and 396positional-specific substitution preferences. We also predicted gene-specific frequencies of N-397glycosylation in human antibody V genes. Such sequence-derived information, together with 398functional study, is critical to identify common mechanism of SHM functioning and understand 399similarities in pathways of antibody affinity maturation (Bonsignori et al., 2016; Koenig et al., 4002017; van de Bovenkamp et al., 2018). Nonetheless, our current knowledge on how SHM 401

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 11: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

11

function is still very limited, and a database to comprehensively annotate functions of SHM will 402accelerate SHM studies. In the future, we will collect literatures to annotate functions of SHM as 403well as develop new tools to predict functions of SHM. A portal in cAb-Rep will be created to 404query functions of SHM, which will elucidate the relations of antibody sequence, structure, and 405function and provide knowledge for antibody design. 406 407In summary, we believe cAb-Rep and the tools developed in this study are complimentary to 408other B cell repertoire databases, and will be helpful to researchers not familiar with repertoire 409annotation to explore features of repertoires and compare datasets across studies and diseases. 410 411Conflict of Interest 412The authors declare that the research was conducted in the absence of any commercial or 413financial relationships that could be construed as a potential conflict of interest. 414 415Author Contributions 416ZS, PK, and LS designed the research; ZS, YG, and KC analyzed data; ZS, YG, LS, and PK 417wrote the paper. All authors reviewed, commented on, and approved the manuscript. 418 419Funding 420This work was supported by National Institute of Allergy and Infectious Disease, National 421Institute of Health [1R21AI138024-01A1] to ZS. 422 423 424References: 425Altschul,S.F.,Madden,T.L.,Schaffer,A.A.,Zhang,J.,Zhang,Z.,Miller,W.,andLipman,D.J.426(1997).GappedBLASTandPSI-BLAST:anewgenerationofproteindatabasesearchprograms.427NucleicAcidsRes25,3389-3402.428Andrabi,R.,Voss,J.E.,Liang,C.H.,Briney,B.,McCoy,L.E.,Wu,C.Y.,Wong,C.H.,Poignard,P.,429andBurton,D.R.(2015).IdentificationofCommonFeaturesinPrototypeBroadlyNeutralizing430AntibodiestoHIVEnvelopeV2ApextoFacilitateVaccineDesign.Immunity43,959-973.431Bhiman,J.N.,Anthony,C.,Doria-Rose,N.A.,Karimanzira,O.,Schramm,C.A.,Khoza,T.,Kitchin,432D.,Botha,G.,Gorman,J.,Garrett,N.J.,etal.(2015).Viralvariantsthatinitiateanddrive433maturationofV1V2-directedHIV-1broadlyneutralizingantibodies.Naturemedicine21,1332-4341336.435Bonsignori,M.,Zhou,T.,Sheng,Z.,Chen,L.,Gao,F.,Joyce,M.G.,Ozorowski,G.,Chuang,G.Y.,436Schramm,C.A.,Wiehe,K.,etal.(2016).MaturationPathwayfromGermlinetoBroadHIV-1437NeutralizerofaCD4-MimicAntibody.Cell165,449-463.438Briney,B.,Inderbitzin,A.,Joyce,C.,andBurton,D.R.(2019).Commonalitydespiteexceptional439diversityinthebaselinehumanantibodyrepertoire.Nature.440Brochet,X.,Lefranc,M.P.,andGiudicelli,V.(2008).IMGT/V-QUEST:thehighlycustomizedand441integratedsystemforIGandTRstandardizedV-JandV-D-Jsequenceanalysis.NucleicAcidsRes44236,W503-508.443Burton,D.R.,andMascola,J.R.(2015).AntibodyresponsestoenvelopeglycoproteinsinHIV-1444infection.Natureimmunology16,571-576.445

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 12: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

12

Camacho,C.,Coulouris,G.,Avagyan,V.,Ma,N.,Papadopoulos,J.,Bealer,K.,andMadden,T.L.446(2009).BLAST+:architectureandapplications.BMCBioinformatics10,421.447Chaudhary,N.,andWesemann,D.R.(2018).AnalyzingImmunoglobulinRepertoires.Frontiersin448immunology9,462.449Chazenbalk,G.D.,Portolano,S.,Russo,D.,Hutchison,J.S.,Rapoport,B.,andMcLachlan,S.450(1993).Humanorgan-specificautoimmunedisease.Molecularcloningandexpressionofan451autoantibodygenerepertoireforamajorautoantigenrevealsanantigenicimmunodominant452regionandrestrictedimmunoglobulingeneusageinthetargetorgan.TheJournalofclinical453investigation92,62-74.454Chuang,G.Y.,Zhou,J.,Acharya,P.,Rawi,R.,Shen,C.H.,Sheng,Z.Z.,Zhang,B.S.,Zhou,T.Q.,455Bailer,R.T.,Dandey,V.P.,etal.(2019).StructuralSurveyofBroadlyNeutralizingAntibodies456TargetingtheHIV-1EnvTrimerDelineatesEpitopeCategoriesandCharacteristicsof457Recognition.Structure27,196-+.458Corcoran,M.M.,Phad,G.E.,VazquezBernat,N.,Stahl-Hennig,C.,Sumida,N.,Persson,M.A.,459Martin,M.,andKarlssonHedestam,G.B.(2016).ProductionofindividualizedVgenedatabases460revealshighlevelsofimmunoglobulingeneticdiversity.Naturecommunications7,13642.461Corrie,B.D.,Marthandan,N.,Zimonja,B.,Jaglale,J.,Zhou,Y.,Barr,E.,Knoetze,N.,Breden,462F.M.W.,Christley,S.,Scott,J.K.,etal.(2018).iReceptor:Aplatformforqueryingandanalyzing463antibody/B-cellandT-cellreceptorrepertoiredataacrossfederatedrepositories.464Immunologicalreviews284,24-41.465Corti,D.,andLanzavecchia,A.(2013).Broadlyneutralizingantiviralantibodies.Annualreview466ofimmunology31,705-742.467DeWitt,W.S.,Lindau,P.,Snyder,T.M.,Sherwood,A.M.,Vignali,M.,Carlson,C.S.,Greenberg,468P.D.,Duerkopp,N.,Emerson,R.O.,andRobins,H.S.(2016).APublicDatabaseofMemoryand469NaiveB-CellReceptorSequences.PLoSOne11,e0160853.470Doria-Rose,N.A.,Schramm,C.A.,Gorman,J.,Moore,P.L.,Bhiman,J.N.,DeKosky,B.J.,Ernandes,471M.J.,Georgiev,I.S.,Kim,H.J.,Pancera,M.,etal.(2014).Developmentalpathwayforpotent472V1V2-directedHIV-neutralizingantibodies.Nature509,55-62.473Dosenovic,P.,Kara,E.E.,Pettersson,A.K.,McGuire,A.T.,Gray,M.,Hartweger,H.,Thientosapol,474E.S.,Stamatatos,L.,andNussenzweig,M.C.(2018).Anti-HIV-1Bcellresponsesaredependent475onBcellprecursorfrequencyandantigen-bindingaffinity.ProcNatlAcadSciUSA115,4743-4764748.477Dunbar,J.,andDeane,C.M.(2016).ANARCI:antigenreceptornumberingandreceptor478classification.Bioinformatics32,298-300.479Edgar,R.C.,andFlyvbjerg,H.(2015).Errorfiltering,pairassemblyanderrorcorrectionfornext-480generationsequencingreads.Bioinformatics31,3476-3482.481Ekiert,D.C.,Bhabha,G.,Elsliger,M.A.,Friesen,R.H.,Jongeneelen,M.,Throsby,M.,Goudsmit,J.,482andWilson,I.A.(2009).Antibodyrecognitionofahighlyconservedinfluenzavirusepitope.483Science324,246-251.484Elhanati,Y.,Sethna,Z.,Marcou,Q.,Callan,C.G.,Jr.,Mora,T.,andWalczak,A.M.(2015).485InferringprocessesunderlyingB-cellrepertoirediversity.PhilosTransRSocLondBBiolSci370.486Francica,J.R.,Sheng,Z.,Zhang,Z.,Nishimura,Y.,Shingai,M.,Ramesh,A.,Keele,B.F.,Schmidt,487S.D.,Flynn,B.J.,Darko,S.,etal.(2015).Analysisofimmunoglobulintranscriptsand488

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 13: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

13

hypermutationfollowingSHIV(AD8)infectionandprotein-plus-adjuvantimmunization.Nature489communications6,6565.490Galson,J.D.,Truck,J.,Clutterbuck,E.A.,Fowler,A.,Cerundolo,V.,Pollard,A.J.,Lunter,G.,and491Kelly,D.F.(2016a).B-cellrepertoiredynamicsaftersequentialhepatitisBvaccinationand492evidenceforcross-reactiveB-cellactivation.GenomeMed8,68.493Galson,J.D.,Truck,J.,Fowler,A.,Clutterbuck,E.A.,Munz,M.,Cerundolo,V.,Reinhard,C.,van494derMost,R.,Pollard,A.J.,Lunter,G.,andKelly,D.F.(2015).AnalysisofBCellRepertoire495DynamicsFollowingHepatitisBVaccinationinHumans,andEnrichmentofVaccine-specific496AntibodySequences.EBioMedicine2,2070-2079.497Galson,J.D.,Truck,J.,Kelly,D.F.,andvanderMost,R.(2016b).InvestigatingtheeffectofAS03498adjuvantontheplasmacellrepertoirefollowingpH1N1influenzavaccination.SciRep6,37229.499Gupta,N.T.,Adams,K.D.,Briggs,A.W.,Timberlake,S.C.,Vigneault,F.,andKleinstein,S.H.500(2017).HierarchicalClusteringCanIdentifyBCellCloneswithHighConfidenceinIgRepertoire501SequencingData.JournalofImmunology198,2489-2499.502Hexham,J.M.,Furmaniak,J.,Pegg,C.,Burton,D.R.,andSmith,B.R.(1992).Cloningofahuman503autoimmuneresponse:preparationandsequencingofahumananti-thyroglobulin504autoantibodyusingacombinatorialapproach.Autoimmunity12,135-141.505Imkeller,K.,Scally,S.W.,Bosch,A.,Marti,G.P.,Costa,G.,Triller,G.,Murugan,R.,Renna,V.,506Jumaa,H.,Kremsner,P.G.,etal.(2018).AntihomotypicaffinitymaturationimproveshumanB507cellresponsesagainstarepetitiveepitope.Science360,1358-1362.508Jardine,J.G.,Kulp,D.W.,Havenar-Daughton,C.,Sarkar,A.,Briney,B.,Sok,D.,Sesterhenn,F.,509Ereno-Orbea,J.,Kalyuzhniy,O.,Deresa,I.,etal.(2016a).HIV-1broadlyneutralizingantibody510precursorBcellsrevealedbygermline-targetingimmunogen.Science351,1458-1463.511Jardine,J.G.,Sok,D.,Julien,J.P.,Briney,B.,Sarkar,A.,Liang,C.H.,Scherer,E.A.,HenryDunand,512C.J.,Adachi,Y.,Diwanji,D.,etal.(2016b).MinimallyMutatedHIV-1BroadlyNeutralizing513AntibodiestoGuideReductionistVaccineDesign.PLoSpathogens12,e1005815.514Jayaram,N.,Bhowmick,P.,andMartin,A.C.(2012).GermlineVH/VLpairinginantibodies.515ProteinEngDesSel25,523-529.516Jeon,Y.E.,Seo,C.W.,Yu,E.S.,Lee,C.J.,Park,S.G.,andJang,Y.J.(2007).Characterizationof517humanmonoclonalautoantibodyFabfragmentsagainstoxidizedLDL.MolImmunol44,827-518836.519Joyce,M.G.,Wheatley,A.K.,Thomas,P.V.,Chuang,G.Y.,Soto,C.,Bailer,R.T.,Druz,A.,Georgiev,520I.S.,Gillespie,R.A.,Kanekiyo,M.,etal.(2016).Vaccine-InducedAntibodiesthatNeutralize521Group1andGroup2InfluenzaAViruses.Cell166,609-623.522Khan,T.A.,Friedensohn,S.,deVries,A.R.,Straszewski,J.,Ruscheweyh,H.J.,andReddy,S.T.523(2016).Accurateandpredictiveantibodyrepertoireprofilingbymolecularamplification524fingerprinting.SciAdv2,e1501371.525Koenig,P.,Lee,C.V.,Walters,B.T.,Janakiraman,V.,Stinson,J.,Patapoff,T.W.,andFuh,G.526(2017).Mutationallandscapeofantibodyvariabledomainsrevealsaswitchmodulatingthe527interdomainconformationaldynamicsandantigenbinding.ProcNatlAcadSciUSA114,E486-528E495.529Krebs,S.J.,Kwon,Y.D.,Schramm,C.A.,Law,W.H.,Donofrio,G.,Zhou,K.H.,Gift,S.,Dussupt,V.,530Georgiev,I.S.,Schätzle,S.,etal.(2019).LongitudinalAnalysisRevealsEarlyDevelopmentof531

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 14: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

14

ThreeMPER-DirectedNeutralizingAntibodyLineagesfromanHIV-1-InfectedIndividual.532Immunity50,677-691.e613.533Kwong,P.D.,Chuang,G.Y.,DeKosky,B.J.,Gindin,T.,Georgiev,I.S.,Lemmin,T.,Schramm,C.A.,534Sheng,Z.,Soto,C.,Yang,A.S.,etal.(2017).Antibodyomics:bioinformaticstechnologiesfor535understandingB-cellimmunitytoHIV-1.Immunologicalreviews275,108-128.536Kwong,P.D.,andMascola,J.R.(2018).HIV-1VaccinesBasedonAntibodyIdentification,BCell537Ontogeny,andEpitopeStructure.Immunity48,855-871.538Kwong,P.D.,andWilson,I.A.(2009).HIV-1andinfluenzaantibodies:seeingantigensinnew539ways.Natureimmunology10,573-578.540Margreitter,C.,Lu,H.C.,Townsend,C.,Stewart,A.,Dunn-Walters,D.K.,andFraternali,F.(2018).541BRepertoire:auser-friendlywebserverforanalysingantibodyrepertoiredata.NucleicAcids542Res46,W264-W270.543Miho,E.,Yermanos,A.,Weber,C.R.,Berger,C.T.,Reddy,S.T.,andGreiff,V.(2018).544ComputationalStrategiesforDissectingtheHigh-DimensionalComplexityofAdaptiveImmune545Repertoires.Frontiersinimmunology9,224.546Murphy,K.(2014).Janeway'sImmunobiology(GarlandScience).547Nielsen,S.C.A.,andBoyd,S.D.(2018).Humanadaptiveimmunereceptorrepertoireanalysis-548Past,present,andfuture.Immunologicalreviews284,9-23.549Odegard,V.H.,andSchatz,D.G.(2006).Targetingofsomatichypermutation.Naturereviews.550Immunology6,573-583.551Pappas,L.,Foglierini,M.,Piccoli,L.,Kallewaard,N.L.,Turrini,F.,Silacci,C.,Fernandez-Rodriguez,552B.,Agatic,G.,Giacchetto-Sasselli,I.,Pellicciotta,G.,etal.(2014).Rapiddevelopmentofbroadly553influenzaneutralizingantibodiesthroughredundantmutations.Nature516,418-+.554Pichurin,P.,Guo,J.,Yan,X.,Rapoport,B.,andMcLachlan,S.M.(2001).Humanmonoclonal555autoantibodiestoB-cellepitopesoutsidethethyroidperoxidaseautoantibody556immunodominantregion.Thyroid11,301-313.557Pichurin,P.N.,Guo,J.,Estienne,V.,Carayon,P.,Ruf,J.,Rapoport,B.,andMcLachlan,S.M.558(2002).Evidencethatthecomplementcontrolprotein-epidermalgrowthfactor-likedomainof559thyroidperoxidaseliesonthefringeoftheimmunodominantregionrecognizedby560autoantibodies.Thyroid12,1085-1095.561R.Gupta,E.Jung,andBrunak,S.(2004).PredictionofN-glycosylationsitesinhumanproteins.562InPreparation.563Rubelt,F.,Bolen,C.R.,McGuire,H.M.,VanderHeiden,J.A.,Gadala-Maria,D.,Levin,M.,564Euskirchen,G.M.,Mamedov,M.R.,Swan,G.E.,Dekker,C.L.,etal.(2016).Individualheritable565differencesresultinuniquecelllymphocytereceptorrepertoiresofnaiveandantigen-566experiencedcells.Naturecommunications7,11112.567Schirmer,M.,Ijaz,U.Z.,D'Amore,R.,Hall,N.,Sloan,W.T.,andQuince,C.(2015).Insightinto568biasesandsequencingerrorsforampliconsequencingwiththeIlluminaMiSeqplatform.569NucleicAcidsResearch43,e37.570Schramm,C.A.,Sheng,Z.,Zhang,Z.,Mascola,J.R.,Kwong,P.D.,andShapiro,L.(2016).SONAR:571AHigh-ThroughputPipelineforInferringAntibodyOntogeniesfromLongitudinalSequencingof572BCellTranscripts.Frontiersinimmunology7,372.573

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 15: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

15

Sheng,Z.,Schramm,C.A.,Connors,M.,Morris,L.,Mascola,J.R.,Kwong,P.D.,andShapiro,L.574(2016).EffectsofDarwinianSelectionandMutabilityonRateofBroadlyNeutralizingAntibody575EvolutionduringHIV-1Infection.PLoSComputBiol12,e1004940.576Sheng,Z.,Schramm,C.A.,Kong,R.,Program,N.C.S.,Mullikin,J.C.,Mascola,J.R.,Kwong,P.D.,577Shapiro,L.,Benjamin,B.,Bouffard,G.,etal.(2017).Gene-SpecificSubstitutionProfilesDescribe578theTypesandFrequenciesofAminoAcidChangesduringAntibodySomaticHypermutation.579Frontiersinimmunology8,537.580Sievers,F.,Wilm,A.,Dineen,D.,Gibson,T.J.,Karplus,K.,Li,W.,Lopez,R.,McWilliam,H.,581Remmert,M.,Soding,J.,etal.(2011).Fast,scalablegenerationofhigh-qualityproteinmultiple582sequencealignmentsusingClustalOmega.Molecularsystemsbiology7,539.583Soto,C.,Bombardi,R.G.,Branchizio,A.,Kose,N.,Matta,P.,Sevy,A.M.,Sinkovits,R.S.,Gilchuk,584P.,Finn,J.A.,andCrowe,J.E.,Jr.(2019).HighfrequencyofsharedclonotypesinhumanBcell585receptorrepertoires.Nature566,398-402.586Stern,J.N.,Yaari,G.,VanderHeiden,J.A.,Church,G.,Donahue,W.F.,Hintzen,R.Q.,Huttner,A.J.,587Laman,J.D.,Nagra,R.M.,Nylander,A.,etal.(2014).Bcellspopulatingthemultiplesclerosis588brainmatureinthedrainingcervicallymphnodes.Sciencetranslationalmedicine6,248ra107.589Tipton,C.M.,Fucile,C.F.,Darce,J.,Chida,A.,Ichikawa,T.,Gregoretti,I.,Schieferl,S.,Hom,J.,590Jenks,S.,Feldman,R.J.,etal.(2015).Diversity,cellularoriginandautoreactivityofantibody-591secretingcellpopulationexpansionsinacutesystemiclupuserythematosus.Nature592immunology16,755-765.593vandeBovenkamp,F.S.,Derksen,N.I.L.,Ooijevaar-deHeer,P.,vanSchie,K.A.,Kruithof,S.,594Berkowska,M.A.,vanderSchoot,C.E.,IJspeert,H.,vanderBurg,M.,Gils,A.,etal.(2018).595AdaptiveantibodydiversificationthroughN-linkedglycosylationoftheimmunoglobulin596variableregion.PNatlAcadSciUSA115,1901-1906.597VanderHeiden,J.A.,Marquez,S.,Marthandan,N.,Bukhari,S.A.C.,Busse,C.E.,Corrie,B.,598Hershberg,U.,Kleinstein,S.H.,MatsenIv,F.A.,Ralph,D.K.,etal.(2018).AIRRCommunity599StandardizedRepresentationsforAnnotatedImmuneRepertoires.Frontiersinimmunology9,6002206.601VanderHeiden,J.A.,Stathopoulos,P.,Zhou,J.Q.,Chen,L.,Gilbert,T.J.,Bolen,C.R.,Barohn,R.J.,602Dimachkie,M.M.,Ciafaloni,E.,Broering,T.J.,etal.(2017).DysregulationofBCellRepertoire603FormationinMyastheniaGravisPatientsRevealedthroughDeepSequencing.Journalof604Immunology198,1460-1473.605West,A.P.,Jr.,Diskin,R.,Nussenzweig,M.C.,andBjorkman,P.J.(2012).Structuralbasisfor606germ-linegeneusageofapotentclassofantibodiestargetingtheCD4-bindingsiteofHIV-1607gp120.ProcNatlAcadSciUSA109,E2083-2090.608Wu,X.,Liu,B.,VanderMerwe,P.L.,Kalis,N.N.,Berney,S.M.,andYoung,D.C.(1998).Myosin-609reactiveautoantibodiesinrheumaticcarditisandnormalfetus.Clinicalimmunologyand610immunopathology87,184-192.611Wu,X.,Zhang,Z.,Schramm,C.A.,Joyce,M.G.,Kwon,Y.D.,Zhou,T.,Sheng,Z.,Zhang,B.,O'Dell,612S.,McKee,K.,etal.(2015).MaturationandDiversityoftheVRC01-AntibodyLineageover15613YearsofChronicHIV-1Infection.Cell161,470-485.614Wu,X.,Zhou,T.,Zhu,J.,Zhang,B.,Georgiev,I.,Wang,C.,Chen,X.,Longo,N.S.,Louder,M.,615McKee,K.,etal.(2011).FocusedevolutionofHIV-1neutralizingantibodiesrevealedby616structuresanddeepsequencing.Science333,1593-1602.617

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 16: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

16

Zhou,T.,Lynch,R.M.,Chen,L.,Acharya,P.,Wu,X.,Doria-Rose,N.A.,Joyce,M.G.,Lingwood,D.,618Soto,C.,Bailer,R.T.,etal.(2015).StructuralRepertoireofHIV-1-NeutralizingAntibodies619TargetingtheCD4Supersitein14Donors.Cell161,1280-1292.620Zhou,T.,Zhu,J.,Wu,X.,Moquin,S.,Zhang,B.,Acharya,P.,Georgiev,I.S.,Altae-Tran,H.R.,621Chuang,G.Y.,Joyce,M.G.,etal.(2013).Multidonoranalysisrevealsstructuralelements,genetic622determinants,andmaturationpathwayforHIV-1neutralizationbyVRC01-classantibodies.623Immunity39,245-258.624 625 626 627Figure Legend: 628 629Figure 1. Flowchart for the processing of repertoire data and developed tools. Next-630generation sequencing data was processed and annotated using SONAR. Other published 631programs used were highlighted in bold font. Scripts developed in this study were in italic font.632 633Figure 2. Frequencies of malaria and HIV-1 broadly neutralizing antibody classes-like 634transcripts. (A) Frequencies of influenza IGHV3-33/IGKV1-5 class heavy and light chain 635signatures in 15 heavy and 17 light chain PBMC repertoires of healthy donors. Signatures are: 636IGHV3-33 gene and W52 for heavy chain and IGKV1-5 and 8 amino acids CDRL3 for light 637chain. (B) VRC01 class heavy chain and light chain signatures were showed in (B) and (C). The 638searched signatures were IGHV1-2*02 origin and either IGKV1-33, IGKV3-15, IGKV3-20, or 639IGLV2-14 origin plus a CDR3 of five amino acids matching motif X-X-[AFILMYWV]-[EQ]-X 640for heavy and light chains respectively. (D) Rarefaction analysis of the malaria heavy and 641VRC01 light signatures from 35 PBMC repertoires showed the variations of signature 642frequencies between random samplings (measured with coefficient of variation) reduce when 643more than 10 repertoires were sampled. 644 645Figure 3. Comparison of gene-specific substitution profiles and usage of a substitution 646profile for investigating substitution preference. (A) Comparison of substitution frequencies 647of all amino acid types at all IGHV1-2 positions estimated using cAb-Rep dataset and previous 648dataset. A Pearson correlation coefficient of 0.982 suggested that the substitution profiles of 649IGHV1-2 are highly consistent. (B) The gene-specific substitution profile of IGHV1-2 and rarity 650of somatic hypermutations in HIV-1 bnAbs and autoantibodies. Rare mutations, colored red, are 651observed frequently in HIV-1 bnAbs but not in autoantibodies, suggesting the mutation patterns 652in HIV-1 bnAbs may be generated with low frequency. For each antibody sequence, residues 653identical to IGHV1-2*02 germline gene were shown with dots. Missing residues were showed 654with minus sign. The disease and antigen were labeled on the right side of each sequence. 655 656Figure 4. Predicted glycosylation sites in V genes and structural location. (A) Hotspots of 657glycosylation sites in IgHV1-69, IgHV3-11, and IgHV4-39 genes. (B) A structural demo 658(PDBID: 1dn0) to show that the predicted glycosylation hotspots are surface-exposed, suggesting 659they are accessible for post-translational modification. 660 661Figure 5. Querying frequencies of signature and somatic hypermutation at cAb-Rep 662website. (A) cAb-Rep interfaces to estimate frequency of an input signature, gene-specific rarity 663

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 17: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

17

of somatic hypermutation, and gene-specific glycosylation frequency. (B) Result for VRC01 664bnAb-like light chain sequences using the CDR3 mode. The full results including sequences and 665donor information can be obtained from the download link, and the frequency of the motif in 666each donor is shown online. The parameters used are: IGKV3-20 gene and a [A-667Z]{2}[AFILMYWV][EQ][A-Z] motif. 668

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 18: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

Table1:Summaryofdatasets,typeofdonors, experimentalmethods,uniquereads,rawreads,numberofclones,andreferences

Typeofdonors Experimentalmethods Total repertoires Uniquedonors AverageNO. rawreads

AverageNO.uniquereads

AverageNO.clones

NO.ofStudies

Healthy RACE/FR1primerMiseq 59 31 1247695 13432 7142 5High-depth healthy donor RACE/FR1 primer Miseq 13 13 43984663 25632108 N.D. 2

HepatitisBprimevaccination FR1primerMiseq 88 9 291249 7677 5648 1HepatitisBBoostvaccination FR1primerMiseq 18 9 338467 26570 21936 1

Influenzavaccination RACE/FR1primerMiseq 81 46 742531 16029 7016 3Multiplesclerosisbrain RACE 17 4 1654572 34326 11130 1

Myastheniagravis RACE 18 9 862543 17512 10332 1SLE RACE 8 7 3741944 410286 41705 1

Tetanusvaccination RACE 4 4 1999230 58787 17309 1Total RACE,FR1primerMiseq 306 121 818868037 340841950 2655416 11

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 19: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

B cell repertoires

Unique transcripts

Kabat, Chothia, IMGT numbering B cell clones

Gene-specific substitution profiles

SHM rarity

Signature search

Comparison ofrepertoire features

SONAR

ANARCI

Frequency and sequences

SONAR

mGSSP

Ab_search.py SHM_freq.py

Fig.1

N-glycosylation prediction

NetNGlyc

Gene-specific N-glycosylation profile

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 20: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

0 5 10 15 20 25 30 35

0.0

0.2

0.4

0.6

0.8

1.0

Number of samples

CV

IGHV3−33_W52VRC01 LightIGKV1−5_CDR3:8

0.00

0.01

0.02

0.03

0.04

Fig. 2. (A) Malaria VH3-33/KV1-5 class (C) VRC01-class light

(D)

(B) VRC0-class heavy

0.00

0.01

0.02

0.03

0.04

PBMC

Freq

uenc

y (%

)

0.01

0.02

0.03

0.04

0

Number of samples

Cov

aria

tion

of v

aria

nce

0

0.2

0.6

0.4

0.8

0 10 20 30

1

2

3

4

5Fr

eque

ncy (

%)

1

2

3

4

5

IGHV3-33_W52 IGKV1-5_CDR3:8

0.0

2.5

5.0

7.5

0.0

2.5

5.0

7.5

Freq

uenc

y (%

)

IGM IGG0

2.5

5

7.5

1.0

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 21: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

Q V Q L VQ S G AE V K KP G AS V K VS C K AS G Y TF T G YY M HW V R QA P G QG L EW M G WI N P NS G GT N Y AQ K F QG R V TM T RD T S IS T A YM E LS R L RS D D TA V Y YC. . . . .. . . GQ M . .. . E. M R I. . R .. . . E. I D CT L N. I . L. . . KR P .. . . .L K . RG . AV . . .R P L .. . . .. . .. V Y SD . . FL . .R S . TV . . .. . . F.S Q R . .. . . PQ . R .. . S. . R I. . E T. . . .. N A .I L .. F . .. . . RS F .. . . .. K . KF . AV . . .H S . .. . I .L . .. I Y RE . . FL D .T G . .F . . .. . . ... . . . .. . . SG . . .. . .. . R .. . W T. E D I. R T EL I .. . . .. . . .. . .. I . .V K T VT . AV . F GP D . RQ . . SL . .. R D LF . . H. D IR G . TQ G . .. T . F.. . . . L. . . .A . T .. . .. . R .. . E .. . . NI R D .F I .. W . .. . . .. . Q. V . .. . . KT . QP . N PR Q . .. . . SL . .. F D TF S F .. D .K A . .. . . .. . . F.S Q H . .. . . TQ . . .. . .. . R .. . Q .. . . .. . N .I L .. W . .. . . .. . .. . . L. K . VF . AV . . .R Q . .. . I QL . .. I Y RE I . FL D .. G . .. . . .. . . ... . R . .. . . P. . . .. . S. . T .. . Q .. . G .. S S .A F T. . . .. . . .. . .. L . MV T . IF . EA K . S. R . E. . . .I . A. E . T. . T SI . .R G . T. E . .. I . ... . R . .. . . NQ . R .. . .. . R I. . E .. . . K. I D HF I .. . . .V . . H. . .. L . .. . . RG . .V . . SR S . .. K L .. . .. N F EE . . .L D .. K . NP G . .. . . F.E . . F .. . . .. . . .. . .. . R .. . E .. . . S. . D . L Q. I . .. . . .R P .. . . .. K . ER . AV S . .P Q . .. . L .L . .. L Y TE . . .. H FK N . .. . . .. I . ... . . .. . .. . A .R . .. . .. T .. . ED D D. S A PE H FV I .F L .. . .. . Q .. . LA . M. . TN . AV . .. W YL N .. . . A. . .R . MT . .F L .V K S. . .. . .. . . ..- - - - -- . . .. . . .. . .. . E .. . . .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. . . ..- - - - -- - - -- - - -- - -- - - -. . . L. . . S. . . .. I .. . . .. . . .. P .. . . .. K . S. - .. . . .. . . .D . . .. . .. . . .. . . .. . .. . . .. . . .. . . F.- - L . E. . . .. L . .. . .. . . .. . . .. . . .. . A .. I .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. . . .. . .. . . S. . . .. D .. . . T. . . .. . . ..- - . . .. . . .. . . .. . .. . . .. . . .. . . P. S . .. . .. . . .. R . R. . .. . . .. . . D. . .. K . E. . . .. . . .V . .. . . .. . T .. . .. K . .. . . .. . . ..- - - . E. . . .. . . .. . .. . . .. . . .. . . .. . . H. . .. . . .. . . .. . .. I . .. S . .R . A. R F .. . . .. . . .. . S. . . .N . V .. . .. G . .F . . .. . . F.- - - . EE . . .. L . .. . .. . R .. . . .. . . .. . D .. I .. . . .. . . .. . .. V . .. . . KN A .. R . SE . . .. . . .. . .. . A .. A V .. . .T S . K. . . .. . . ..- - - . EE . . .. L . .. . .. . R .. . . P. . . .. S D .. I .. . . .. . . .A V .. V . .. . . KN A .. R . SE . . .. . . .. . .. . A .. A V .. . .T S . K. . . .. . . ... . K . LE . . .. L . .. . .. . R .. . . P. . . S. S D .. I .. . . .. . . .. . .. V . .. . . KN A .. R . SE . . .. . . .. . .. . A .. A V .. . .T S . K. . . .. . . ..- . . . .E . . .. . . .. . .. . . .. . . .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. W T. . . .. . . .. K . .. . K. . . .. . . .. . .. . . .. . . .. . . ..- . . . .E . . .. . . .. . .. . . .. . . .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. . .. . . .. . V .. . . .. . .. . T .. . . .. . .. . . .. . . .. . . ..- - L . EE . . .. L . .. . .. . . .. . . .. . . .. . A .. I .. . . .. . . .. . .. . . .. . . .. . .. . . .. . . .. . . .. . .. . . S. . . .. D .. . . T. . . .. . . ..

Fig.3

(B)

IGHV1-2*02VRC01VRC27

VRCPG04 3BNC117

12A21 PG20

VRC18 VRC23

VRC-CH31Autoantibody_40Autoantibody_45

Autoantibody_103Autoantibody_131Autoantibody_159Autoantibody_167Autoantibody_169Autoantibody_180Autoantibody_257Autoantibody_260Autoantibody_309

CDR1 CDR2

MutationFrequency(%)

IGHV1-2

1 11 20 30 40 50 60 70 80 900

10

20

30

40

50

60

70

0

10

20

30

40

50

60

70

80

Subs

titutio

n freq

uenc

y

1H 2IL

MGEA

3

LRH

4MV

5

I

AMEL

6 7P 8AE

9GV

SPT

10

G

AD

11

I

ALM

12

MTQER

13

S

MTQNER

14

A

TS

15 16

SD

T

17

A

18

I

ML

19

E

MQNTR

20

LI

21

P

A

22 23

M

TEQR

24

GVPST

25

Y

P

A

F

26

RAE

27

SHNDF

28

L

V

KMRAPNIS

29

I

VL

30

GVRPANIS

31

H

Y

ETVSNAD

32

C

QDSNFH

33

L

A

WDNSHF

34

T

FVLI

35

D

SPQNY

36 37

AIML

38 39

H

RL

40

TV

41 42

R

43

L

KERH

44

A

R

45

FP

46D

Q47

L

CY

48ILV

49

A

50

Q

F

GLSYC

51

S

T

FLMV

52

RITYKHSD

52A

F

V

CLATS

53

V

L

QGAEITRYHSDK

54

IDRGNT

55

R

S

AD

56

I

Y

TRNESVAD

57

V

K

RPISA

58

M

A

GFQLVEYIRTSHDK

59

T

V

D

NCIHLSF

60

LETGPVS

61

D

KLHRPE

62

G

I

M

DSEQTRN

63

V

Y

L

64

LKHER

65

AD

66

G

KS

67

AFLI

68

NIAS

69

IVL

70

I

AS

71

AMKWGTS

72

G

N

AE

73

PKRSMA

74

TPA

75

A

N

FMSLVT

76

KADGRTN

77

ISA

78

F

LISGTV

79

D

N

SHF

80

IVL

81

V

G

AQD

82

F

IMV

82A

Y

HADIKGRNT

82B

DKWTNGS

82C

V

83

M

Q

A

ENSGIKT

84

V

LTAYPF

85

AENG

86 87

MAS

88

G

89

FTMLI

90

F

91

H

S

F

92 93

SGTV

94

N

VALGISTK

(A)

0

10

20

30

40

50

0 10 20 30 40 50Substitution frequency of previous GSSPs (%)

Subs

titut

ion

frequ

ency

of u

pdat

ed G

SSPs

(%)

r = 0.983p=2.2e-16

Subs

titut

ion

frequ

ency

of u

pdat

ed

GSS

Ps (

%)

Substitution frequency of previous GSSPs (%)0 10 20 30 40

50

50

010

2030

40

DiseaseHIV-1infectionHIV-1infectionHIV-1infectionHIV-1infectionHIV-1infectionHIV-1infectionHIV-1infectionHIV-1infectionHIV-1infectionAtherosclerosisHashimoto diseaseAuto-thyroid diseaseAtherosclerosisAuto-thyroid diseaseAuto-thyroid diseaseAuto-thyroid diseaseAuto-thyroid diseaseRheumaticcarditisRheumaticcarditisAuto-thyroid disease

AntigenEnvEnvEnvEnvEnvEnvEnvEnvEnvOxidizedLDLThyroglobulinThyroidperoxidaseOxidizedLDLThyroidperoxidaseThyroidperoxidaseThyroidperoxidaseThyroidperoxidaseCardiacmyosinCardiacmyosinThyroidperoxidase

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 22: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

0

5

10

15

20

25

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 52A

53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 82A

82B

82C

83 84 85 86 87 88 89 90 91 92

Freq

uenc

y pe

r 100

0 se

quen

ces

V gene

IGHV1-69

IGHV3-11

IGHV4-39

Fig.4

(A)

(B)

Q V Q L V Q S G A E V K K P G S S V K V S C K A S G G T F S S Y A I S W V R Q A P G Q G L E W M G G I I P I F G T A N Y A Q K F Q G R V T I T A D E S T S T A Y M E L S S L R S E D T A V Y Y CQ V Q L V E S G G G L V K P G G S L R L S C A A S G F T F S D Y Y M S W I R Q A P G K G L E W V S Y I S S S G S T I Y Y A D S V K G R F T I S R D N A K N S L Y L Q M N S L R A E D T A V Y Y CQ L Q L Q E S G P G L V K P S E T L S L T C T V S G G S I S S S S Y Y W I R Q P P G K G L E W I G S I Y - Y S G S T Y Y N P S L K S R V T I S V D T S K N Q F S L K L S S V T A A D T A V Y Y C

IGHV1-69*01IGHV3-11*01IGHV4-39*01

HeavyLightGlycosylationsitesCDRH1CDRH2

21

2373

81

82B

6858

5455

31 30 28

CDRH1 CDRH2

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint

Page 23: cAb-Rep: A Database of Curated Antibody Repertoires for Exploring antibody diversity … · 2 21 Abstract 22 23 The diversity of B cell receptors provides a basis for recognizing

Fig.5

Home Download About Help Contact

INTRODUCTION

cAb-Repisadatabaseofcuratedan�bodyrepertoires.Itcurrentlyincludes306Bcellrepertoirescollectedfrom121humanindividuals,includinghealthy,vaccinated,andautoimmunediseasedonors.Thedatabasehasatotalof269.88and73.05millioncuratedfull-lengthV(D)Jheavyand lightchaintranscriptsrespec�vely,whichweregeneratedusing illuminasequencingtechnology inprevioiusstudies.Weremovedredundancyandsequencingerrors in therawdatasets,andannotatedeachtranscriptincludinggeneorigin,an�bodysubtype,muta�ons,post-transla�onalmodifica�on,andotherinforma�on.

ThediversityofBcell receptors,generatedbyV(D)J recombina�onandsoma�chypermuta�on,providesabasis for recognizingnumerouspathogens.Toaccelerateinves�ga�onsofBcellresponseandan�bodydiversity,weprovidetoolstosearchcAb-Reptoes�amteprevelenceofan�bodyrecombina�onandtoinves�gatesoma�chypermuta�onpreference.WehopethecAb-Repdatabaseandcomputa�onaltoolsareusefulforinves�ga�ngfeaturesofBcellrepertoires,understandingmechanismsofaffinitymatura�on,andiden�fyingpoten�albarriersinelici�ngbroadlyneutralizingan�bodiesinvaccinedesign.

Es�mateprevalenceofan�bodywithsignatureofinterest?:

SearchMode: Chain: Isotype: Scheme: Onlyforhigh-depthdataset:

Sequencesignature(orfastasequenceforBlastmode):

CDR3length: Vgene: Jgene:

Es�materarityofsoma�chypermuta�on?:

GermlineVgene: Rarityforiden�fyingrareSHM(%):

I.Inputasequenceandes�matetherarityofthesubs�tu�on:Sequencetype: Selectafastafile:

II.Inputaminoacidsandfindthesubs�tu�onrarityataVgenepos�on(Kabatscheme):Pos�on: AminoAcid:

Gene-specificfrequencyofglycosyla�on?:

GermlineVgene:

PleaseinputaVgenepos�on(Kabatscheme):Pos�on:

Cita�on:

Gene-specificsubsitu�onprofile:ZizhangSheng,ChaimA.Schramm,RuiKong,NISCCompara�veSequencingProgram,JamesC.Mullikin,JohnR.Mascola,PeterD.Kwong,andLawrenceShapiro.Gene-specific subs�tu�on profiles describe the types and frequencies of amino acid changes during an�body soma�c hypermuta�on. Front. Immunol. 8:537. doi:10.3389/mmu.2017.00537(2017).

cAb-Repdatabase:YichengGuo,KevinChen,PeterD.Kwong,LawrenceShaprio,andZizhangSheng.cAb-Rep:adatabaseofcuratedan�bodyrepertoiresforexploringBcellresponseandpredic�ngan�bodyprevalence.(Insubmission)2019

(A)

(B)

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 11, 2019. ; https://doi.org/10.1101/765099doi: bioRxiv preprint