

Proc. Natl. Acad. Sci. USA
Vol. 88, pp. 5897-5901, July 1991
Evolution

Heterozygosity at individual amino acid sites: Extremely high levels for HLA-A and -B genes

(evolution/histocompatibility/polymorphism)

PHILIP W. HEDRICK*†, THOMAS S. WHITTAM*, AND PETER PARHAM‡

*Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, University Park, PA 16802; and ‡Department of Cell Biology, Stanford University, Stanford, CA 94305

Communicated by Hugh O. McDevitt, March 4, 1991

ABSTRACT The amino acid heterozygosities per site for HLA-A and -B loci are determined to be extremely high by combining population serotypic frequencies with amino acid sequences. For the 54 amino acid sites thought to have functional importance, the average heterozygosity per site is 0.301. Sixteen positions have heterozygosities >0.5 at one or both loci and the frequencies of amino acids at a given position are very even, resulting in nearly the maximum heterozygosity possible. Furthermore, the high heterozygosity is concentrated in the peptide-interacting sites, whereas the sites that interact with the T-cell receptor have lower heterozygosity. Overall, these results indicate the importance of some form of balancing selection operating at HLA loci, maybe even at the individual amino acid level.

Fundamental to understanding evolutionary genetics is knowledge of the magnitude of genetic variation within populations and the pattern of variation within and among genes. In the past decade, the development of fast DNA sequencing techniques has made it possible to assay the extent of genetic variation at the most basic level for a variety of genetic loci. Here we describe and evaluate the patterns of molecular variation at two of the most polymorphic loci known, illustrating the usefulness of information obtained with this technology in addressing basic evolutionary questions.

The class I and II genes of the major histocompatibility complex (MHC), or HLA region, are among the most variable loci known in human populations (e.g., refs. 1-3). Based on the detection of serotypically distinct allelic variants, human populations typically have 10-20 alleles segregating at the HLA-A and -B loci. The distribution of the allelic frequencies at HLA-A and -B loci departs from the predictions of the neutral mutation theory, suggesting that some form of balancing selection contributes to the maintenance of allelic variation (e.g., refs. 4 and 5).

The function of the products of HLA genes is to control the recognition of foreign and self proteins by T lymphocytes. Within cells, class I and II molecules bind short peptides that they transport to the cell surface (6). There the complexes of peptides and HLA molecules provide the ligands for the T-cell antigen receptors (7). HLA alleles have been associated with various diseases (8, 9), often autoimmune in character and thought to be caused by T cells specific for self-peptides bound by the relevant HLA molecules (10).

In recent years, nucleotide and amino acid sequences for many class I HLA alleles and proteins have been determined, particularly for alleles at the HLA-A and -B loci, two genes 0.8 map unit apart on chromosome 6 (reviewed in ref. 11). In general, the serotypically distinct alleles at these loci are quite different from one another, with on the average only 92.7% or 93.6% amino acid identity among serotypic alleles at the HLA-A or -B loci, respectively. Determination of the three-dimensional structure of the HLA-A2 protein identified the peptide-binding site of MHC molecules and a postulated face of interaction with the T-cell receptor (TcR) (12, 13). Particular amino acid residues could be assigned as contacting either bound peptide, the TcR, or both, and it is precisely at these positions that many of the allelic differences are found.

Previous analyses of HLA structural variation have given equal weight to all alleles irrespective of the allelic frequencies within human populations (14, 15). This has the effect of overemphasizing the importance of "rare variants" at the expense of the common alleles. Here, by combining knowledge from population surveys of serotypic variation with that of amino acid sequence for various HLA polypeptides, we have determined population heterozygosity for particular amino acid positions. Because a substantial proportion of the alleles at the HLA-A and -B loci has now been sequenced, examination of the amino acid variation with population-genetic techniques is now possible.

Abbreviations: MHC, major histocompatibility complex; TcR, T-cell receptor.
†To whom reprint requests should be addressed.
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

MATERIALS AND METHODS

Amino Acid Sequences. Amino acid sequences for the HLA proteins were inferred from DNA sequences obtained from ref. 11 or sources therein. The 12 different HLA-A sequences are A1, A2, A3, A11, A24, A25, A26, A29, A30, A31, A32, and Aw33, representing, on average, 85% of the HLA-A serotypic alleles in the three population samples (see below). The 20 HLA-B sequences are B7, B8, Bw13, B18, B27, B35, B37, Bw41, Bw42, B44, Bw46, Bw47, B49, B51, Bw52, Bw57, Bw58, Bw60, Bw62, and Bw65, representing, on average, 77% of the HLA-B serotypic alleles in the population samples.

Many studies employing either isoelectric focusing, T-cell-mediated cytolysis, or nucleotide sequencing have shown that the officially designated serotypes often encompass the products of multiple, closely related alleles. For example, six different A2 nucleotide sequences, three A11 sequences, and two sequences for each of B13, B27, and B44 are available. When all 21 pairwise sequence comparisons within serotypes are made, the average number of amino acid differences is small (2.9 residues or 0.80% of the total) compared with that obtained for pairwise comparisons between serotypes (21.3 residues or 6.9% of the total), which justifies our assumption in the present analysis of amino acid homogeneity within serotypes.

We have designated amino acid sites in two different ways. The first classification (after ref. 12, as modified by ref. 11) is based on the putative function of specific amino acid residues in the HLA molecule and consists of four categories: the 29 amino acid residues postulated to interact with peptides ("peptide"), the 18 residues postulated to be in contact with the TcR ("TcR"), the 7 residues that are postulated to interact with peptides and/or the TcR ("peptide-TcR"), and the 312 (HLA-A) or 309 (HLA-B) amino acids that are thought not to be directly involved in any of these functions ("other").

The second classification is based on the functional domains of the HLA molecule that are encoded by different exons. The domains include the leader peptide with 24 amino acids encoded by exon 1; the α1, α2, and α3 domains of the molecule with 90, 92, and 92 amino acids, respectively, which are encoded by exons 2, 3, and 4; and the combination of the membrane-spanning region and the cytoplasmic tail specified by exons 5-8, which includes 68 (HLA-A) or 65 (HLA-B) amino acids that together form the carboxyl-terminal part of the molecule. The amino acid positions identified as peptide, TcR, and peptide-TcR sites occur only in the α1 and α2 domains.

Allelic Frequencies. Our analysis utilizes the frequencies of HLA serotypic alleles in the Caucasian, Asian, and African samples compiled in the Ninth International Histocompatibility Workshop and Conference published in ref. 1. The sample sizes from these populations are quite large: 2163 for the HLA-A locus and 2132 for the HLA-B locus in the Caucasian sample, 976 for HLA-A and 968 for HLA-B in the Asian sample, and 311 for both loci in the African sample. Because not all serotypic alleles have been sequenced, the frequencies for the sequenced alleles have been normalized to sum to unity. The frequencies of blank alleles in these surveys are quite low, 2.37% for HLA-A and 1.1% for HLA-B (1), and should have little effect on our analysis.
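The normalization of the sequenced-allele frequencies to unity can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `normalize_frequencies` is hypothetical, and the frequencies in the example are invented placeholders, not values from ref. 1.

```python
def normalize_frequencies(serotype_freqs, sequenced):
    """Rescale the frequencies of sequenced alleles so that they sum to 1.

    serotype_freqs: dict mapping serotype name -> population frequency
    sequenced: collection of serotype names that have a known sequence
    """
    # Drop alleles without a sequence, then divide by the remaining total.
    kept = {name: f for name, f in serotype_freqs.items() if name in sequenced}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no sequenced allele has a positive frequency")
    return {name: f / total for name, f in kept.items()}


# Hypothetical frequencies, for illustration only (not taken from ref. 1).
example_freqs = {"A1": 0.15, "A2": 0.30, "A3": 0.12, "unsequenced": 0.43}
normalized = normalize_frequencies(example_freqs, {"A1", "A2", "A3"})
print({name: round(f, 3) for name, f in normalized.items()})
```

Rescaling preserves the relative frequencies of the sequenced alleles; it implicitly treats the unsequenced alleles as if they resembled the sequenced ones in proportion to frequency.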

Statistical Analysis. To analyze the amino acid variation for HLA-A and -B proteins with a population-genetic approach, we calculated the heterozygosity at a given position within the sequence as

$$H = 1 - \sum_{i=1}^{k} p_i^2,$$

in which $p_i$ is the population frequency of a given amino acid at the position under consideration and $k$ is the number of different amino acids present at that position. $H$ gives the proportion of individuals heterozygous at the amino acid position in question, assuming the population is in Hardy-Weinberg proportions. For example, if seven alleles comprising 0.7 of the population have arginine at a given position and the remaining five alleles, comprising 0.3 of the population, have valine, then $H = 1 - (0.7)^2 - (0.3)^2 = 0.42$, and thus 42% of individuals in the population will be heterozygous at this position.

Because the heterozygosities are not normally distributed, we tested for differences in average heterozygosity per amino acid position among three or more groups with the Kruskal-Wallis test, a nonparametric analogue of a single-classification analysis of variance (16). For large samples, the test statistic, $H_{KW}$, is approximately distributed as $\chi^2$ (df = number of groups minus 1) when the null hypothesis is true (16). We employed the nonparametric Mann-Whitney U test for comparisons of heterozygosities between two groups (17).
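The per-position heterozygosity defined in this section can be illustrated with a short Python sketch. The function names, the two-residue toy sequences, and the allele labels are assumptions made for the example; only the formula and the arginine/valine frequencies (0.7 and 0.3) come from the text.

```python
from collections import defaultdict


def site_heterozygosity(residue_freqs):
    """H = 1 - sum(p_i^2) for one position (dict: amino acid -> frequency)."""
    return 1.0 - sum(p * p for p in residue_freqs.values())


def heterozygosity_per_site(sequences, allele_freqs):
    """Return H for every position of a set of aligned allele sequences.

    sequences: dict allele -> aligned amino acid string (equal lengths)
    allele_freqs: dict allele -> frequency, normalized to sum to 1
    """
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    values = []
    for pos in range(lengths.pop()):
        # Pool the frequencies of all alleles that share a residue here.
        residue_freqs = defaultdict(float)
        for allele, seq in sequences.items():
            residue_freqs[seq[pos]] += allele_freqs[allele]
        values.append(site_heterozygosity(residue_freqs))
    return values


# Toy data: position 0 has arginine (R) at 0.7 and valine (V) at 0.3,
# as in the worked example; position 1 is invariant.
toy_sequences = {"allele_1": "RK", "allele_2": "RK", "allele_3": "VK"}
toy_freqs = {"allele_1": 0.4, "allele_2": 0.3, "allele_3": 0.3}
print([round(h, 3) for h in heterozygosity_per_site(toy_sequences, toy_freqs)])
```

The first position reproduces the value of 0.42 from the worked example, and the invariant position gives H = 0.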

RESULTS

Heterozygosity per Amino Acid Position. The extent of the variability at amino acid positions, as measured by the average heterozygosity per position, in the HLA-A and -B sequences for the Caucasian, Asian, and African populations is plotted in Fig. 1. The domains of the proteins and the locations of the 54 peptide, TcR, or peptide-TcR sites are marked by vertical lines at the base of Fig. 1. In general, the positions of highest heterozygosity are concentrated at these sites, although a number of the peptide or TcR sites are invariant and some of the highly variable positions are not peptide or TcR sites.

FIG. 1. Average heterozygosity for the 366 (HLA-A) or 363 (HLA-B) amino acid positions. HLA-A heterozygosity is indicated by the bars above the horizontal axis and HLA-B heterozygosity is indicated by the bars below. Indicated by vertical bars along the bottom are the 54 amino acid sites postulated to interact with peptides or TcR and the domains of the molecules. TM, membrane-spanning region; CYT, cytoplasmic tail; L, leader peptide.

To begin more detailed analysis of the patterns of variation, we examined the average heterozygosity for the three population samples (see left side of Table 1). First, within each functional category the heterozygosity values across the population samples are not significantly different ($H_{KW}$ < 1, df = 2, P > 0.05 for eight comparisons). As a result, in the subsequent analysis, we used the unweighted average heterozygosity for the three samples. Second, the amino acid heterozygosities differ significantly among functional categories within each population for the HLA-A and -B molecules ($H_{KW}$ > 80.0, df = 3, P < 0.001 for six comparisons). The peptide sites have the highest heterozygosity per position (0.264 for HLA-A and 0.337 for HLA-B), and the other sites have the lowest heterozygosity. In comparison to the peptide sites, the TcR sites have lower values of heterozygosity (0.171 for HLA-A and 0.057 for HLA-B). This is especially marked for HLA-B, at which the TcR sites are no more heterozygous than the average value obtained from all residues.

Table 1. Average heterozygosity per amino acid position of HLA-A and HLA-B molecules for sites in five different categories

                              Population                 Domain†
Category         n     Cau    Asi    Afr    Ave       α1       α2
HLA-A
  Peptide        29    0.26   0.28   0.25   0.264    0.212    0.348
  TcR            18    0.17   0.18   0.16   0.171    0.157    0.184
  Peptide-TcR     7    0.22   0.23   0.22   0.217    0.346    0.165
  Other         312    0.04   0.04   0.04   0.036    0.014    0.042
  Average       366    0.06   0.07   0.06   0.064    0.075    0.099
HLA-B
  Peptide        29    0.35   0.33   0.34   0.337    0.318    0.368
  TcR            18    0.05   0.06   0.06   0.057    0.105    0.008
  Peptide-TcR     7    0.14   0.09   0.14   0.121    0.035    0.156
  Other         309    0.03   0.03   0.03   0.031    0.041    0.039
  Average       363    0.06   0.05   0.06   0.058    0.103    0.082

Values are given for three human populations (and their average) and for the two domains that have peptide and TcR sites. n, Number of sites in each category. Cau, Caucasian; Asi, Asian; Afr, African; Ave, average.
†The numbers of sites in the peptide, TcR, peptide-TcR, other, and average categories for α1 are 18, 9, 2, 61, and 90, respectively, and for α2 are 11, 9, 5, 67, and 92, respectively.

Comparison of the α1 and α2 domains reveals significant differences in the average heterozygosity between the different functional groups in both domains ($H_{KW}$ > 29, df = 3, P < 0.05 for four comparisons) with the highest heterozygosity occurring at the peptide sites (Table 1). Note that the differences between HLA-A and -B in TcR-site heterozygosity are primarily a function of the α2 domain: the nine TcR sites of the HLA-B α2 domain are nearly invariant (H = 0.008) and significantly less variable than those sites in α2 of HLA-A (Mann-Whitney U test, U = 10, P < 0.05).
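For readers who wish to run comparisons of this kind, the Kruskal-Wallis statistic can be computed as in the following Python sketch. It is not the authors' code: it is a plain implementation of the textbook statistic with midranks and the usual correction for ties (ties are common here because many sites have H = 0). The function name and the toy heterozygosity values are invented for illustration, and the lookup of the P value in the chi-square distribution is omitted.

```python
def kruskal_wallis_h(groups):
    """Return (H_KW, df) for a list of samples, using midranks for ties."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    # Pool all observations, remembering which group each one came from.
    pooled = sorted((value, idx) for idx, g in enumerate(groups) for value in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for _, idx in pooled[i:j]:
            rank_sums[idx] += midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    if correction == 0.0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


# Invented per-site heterozygosities for three categories of sites.
toy_groups = [
    [0.57, 0.61, 0.00, 0.40],
    [0.05, 0.00, 0.12],
    [0.00, 0.00, 0.03, 0.00],
]
h_kw, df = kruskal_wallis_h(toy_groups)
print(round(h_kw, 3), df)
```

The returned statistic would then be compared with the chi-square distribution with the returned degrees of freedom, as described in Materials and Methods.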

Surprisingly, the average amino acid heterozygosity for HLA-A molecules (0.064) is slightly greater than it is for HLA-B molecules (0.058), an observation that contrasts with previous assessment that HLA-B is the most variable class I locus [based on the number of serotypically defined alleles (1) and sequence differences between alleles (16)]. It is also noteworthy that the average heterozygosity per position in the leader peptide (0.052 for HLA-A and 0.103 for HLA-B), the α3 domain (0.032 for HLA-A and 0.006 for HLA-B), and the transmembrane and cytoplasmic domains (0.050 for HLA-A and 0.221 for HLA-B) slightly exceeds that of the other sites in the α1 and α2 domains (0.029 for HLA-A and 0.040 for HLA-B).

Most Variable Positions. To identify the amino acid positions with greatest variability, we compared the level of amino acid heterozygosity at homologous positions in the HLA-A and -B molecules (Fig. 2). Twenty-five of the 182 residues (14%) of the α1 and α2 domains were polymorphic (H > 0) for both loci and all of these polymorphic positions were at functionally important sites. (Only four positions not in α1 or α2 were polymorphic for HLA-A and -B.)

The distributions of polymorphic positions at the HLA-A and -B loci are summarized in Table 2. Particularly striking is that of peptide sites that are heterozygous for one of the two loci, 78% are heterozygous for both loci. In contrast, none of the 128 α1 and α2 sites in the other class is heterozygous for both loci. For the other remaining domains, only 2.2% of the positions are heterozygous for both loci.
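The tabulation summarized in Table 2 amounts to sorting each site into one of four classes according to whether H > 0 at each locus. A minimal Python sketch is given below; the function names and class labels are hypothetical. The site counts used in the example (18, 1, 4, and 6 of the 29 peptide sites) are not stated in the text: they are inferred here from the percentages in the peptide row of Table 2.

```python
from collections import Counter

LABELS = ("both", "A only", "B only", "neither")


def classify_site(h_a, h_b):
    """Classify one homologous position by polymorphism at each locus."""
    if h_a > 0 and h_b > 0:
        return "both"
    if h_a > 0:
        return "A only"
    if h_b > 0:
        return "B only"
    return "neither"


def category_percentages(h_a_values, h_b_values):
    """Percentage of sites in each class, given paired heterozygosities."""
    if len(h_a_values) != len(h_b_values) or not h_a_values:
        raise ValueError("need two non-empty lists of equal length")
    counts = Counter(classify_site(a, b) for a, b in zip(h_a_values, h_b_values))
    n = len(h_a_values)
    return {label: 100.0 * counts[label] / n for label in LABELS}


# Placeholder H values chosen only to give the inferred counts 18/1/4/6.
h_a = [0.5] * 18 + [0.5] * 1 + [0.0] * 4 + [0.0] * 6
h_b = [0.5] * 18 + [0.0] * 1 + [0.5] * 4 + [0.0] * 6
pct = category_percentages(h_a, h_b)
print({label: round(value, 1) for label, value in pct.items()})

# Share of sites polymorphic at one or both loci that are polymorphic at both.
print(round(100.0 * 18 / (18 + 1 + 4)))
```

With these inferred counts, 18 of the 23 peptide sites that are polymorphic at one or both loci are polymorphic at both, which is consistent with the 78% quoted in the text.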

Fifteen of the 16 positions with heterozygosities that exceed 0.5 for either HLA-A or -B belong to one of the functionally important categories of the α1 and α2 domains (Table 3). It is striking that only a single TcR site (position 76) is included, again emphasizing the relative conservation at TcR compared to peptide sites.

FIG. 2. Distribution of heterozygosity per amino acid site at the HLA-A and HLA-B loci (scatter plot; horizontal axis, H for HLA-A). Four types of sites are distinguished: P, peptide-interacting sites; T, TcR-interacting sites; B, sites that interact with peptides and TcRs; O, other sites not in the previous three categories. The dotted lines separate sites with H > 0.5 from those with H < 0.5.

The 15 α1 and α2 positions are nearly equally divided between the two domains. Five of the eight positions in α1 (24, 45, 62, 67, and 76) show striking differences in heterozygosity between HLA-A and -B (Table 3). HLA-A has four α1 positions with heterozygosity >0.5 and HLA-B has five, but only one position (77) has heterozygosity >0.5 in HLA-A and -B. In the α2 domain, on the other hand, there is a very different picture with comparable heterozygosities observed at both loci. At five of the seven α2 positions, the heterozygosity exceeds 0.5 for both loci and at the other two positions the heterozygosity is close to 0.5 in the locus with the lower value. The five positions with the greatest average heterozygosity over both loci, 95 (0.587), 97 (0.620), 114 (0.578), 116 (0.610), and 156 (0.580), are all peptide sites in the α2 domain (four of these, 95, 97, 114, and 116, are β-sheet residues). The two positions with highest heterozygosity for a single locus, 45 in HLA-B (0.710) and 62 in HLA-A (0.692), are, however, in the α1 domain.

The locations of the 15 highly variable positions in α1 or α2 on the three-dimensional structure of the HLA molecule are depicted in Fig. 3. Within the antigen-recognition site, positions with high heterozygosity are found in both α-helices and in the β-strands of the floor (Fig. 3). Residue 45, although appearing outside the peptide-binding groove in Fig. 3, is at the end of a cavity, "the 45 pocket," that extends from the groove under the α-helix of the α1 domain (18). The distribution of residues in the β-strands with high heterozygosity is noticeably asymmetric such that the floor at one end of the groove (right end of Fig. 3) is a focus for variation, whereas the other end is almost constant.

Table 2. Percentage of amino acid sites in the α1 and α2 domains or the other domains categorized as to whether they are heterozygous or not for HLA-A and HLA-B

                                     Heterozygosity > 0.0
                            HLA-A      HLA-A but    HLA-B but    Neither HLA-A
Potential contact     n     and HLA-B  not HLA-B    not HLA-A    nor HLA-B
α1 and α2 domains
  Peptide             29    62.1        3.4         13.8         20.7
  TcR                 18    22.2       38.9          5.6         33.3
  Peptide-TcR          7    42.9       14.3          0           42.9
  Other              128     0         10.9         12.5         76.6
Other domains
  Leader              24     4.2       16.7         20.8         58.3
  α3                  92     1.1        6.5          2.2         90.2
  TM + CYT            65     3.1       13.8          4.6         78.5
  Average            181     2.2       10.5          5.5         81.8

TM, membrane-spanning region; CYT, cytoplasmic tail; n, number of sites in each category.

Table 3. Amino acid heterozygosity (H) and number of different amino acids (n) for the 16 sites with H > 0.5 at either HLA-A or HLA-B

                                         HLA-A         HLA-B       Average
Domain   Position   Potential contact    H       n     H       n   H
α1          9       Peptide              0.571   4     0.399   3   0.485
           24       Peptide              0.000   1     0.609   3   0.304
           45       Peptide              0.000   1     0.710   4   0.355
           62       Peptide-TcR          0.692   5     0.070   2   0.356
           67       Peptide              0.164   2     0.671   5   0.406
           76       TcR                  0.541   3     0.035   2   0.286
           77       Peptide              0.501   3     0.503   3   0.502
           80       Peptide              0.309   2     0.543   3   0.426
α2         95       Peptide              0.567   3     0.607   3   0.587
           97       Peptide              0.643   3     0.597   6   0.620
          114       Peptide              0.639   4     0.517   3   0.578
          116       Peptide              0.541   3     0.678   5   0.610
          152       Peptide              0.560   4     0.468   2   0.514
          156       Peptide              0.537   4     0.624   4   0.580
          163       Peptide-TcR          0.358   2     0.570   3   0.464
CYT       321       Other                0.508   3     0.000   1   0.254

CYT, cytoplasmic tail.
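The evenness of amino acid frequencies at the highly variable sites can be checked directly from Table 3. With k equally frequent amino acids the heterozygosity reaches its maximum, 1 - 1/k, so the ratio of observed H to this maximum measures how even the frequencies are. The Python sketch below is an illustration, not the authors' calculation: the variable and function names are hypothetical, and the numbers are the α1 and α2 rows of Table 3 as transcribed here.

```python
# (position, H for HLA-A, n for HLA-A, H for HLA-B, n for HLA-B),
# transcribed from the alpha-1 and alpha-2 rows of Table 3.
TABLE3_A1_A2 = [
    (9, 0.571, 4, 0.399, 3),
    (24, 0.000, 1, 0.609, 3),
    (45, 0.000, 1, 0.710, 4),
    (62, 0.692, 5, 0.070, 2),
    (67, 0.164, 2, 0.671, 5),
    (76, 0.541, 3, 0.035, 2),
    (77, 0.501, 3, 0.503, 3),
    (80, 0.309, 2, 0.543, 3),
    (95, 0.567, 3, 0.607, 3),
    (97, 0.643, 3, 0.597, 6),
    (114, 0.639, 4, 0.517, 3),
    (116, 0.541, 3, 0.678, 5),
    (152, 0.560, 4, 0.468, 2),
    (156, 0.537, 4, 0.624, 4),
    (163, 0.358, 2, 0.570, 3),
]


def max_heterozygosity(k):
    """Maximum H for k amino acids, reached when all have frequency 1/k."""
    return 1.0 - 1.0 / k


# One ratio per locus-by-position entry with H > 0.5; monomorphic entries
# (k = 1, H = 0) are filtered out before any division takes place.
ratios = [
    h / max_heterozygosity(k)
    for _, h_a, n_a, h_b, n_b in TABLE3_A1_A2
    for h, k in ((h_a, n_a), (h_b, n_b))
    if h > 0.5
]
mean_ratio = sum(ratios) / len(ratios)
print(len(ratios), round(mean_ratio, 3))
```

There are 21 such entries, and their mean ratio comes out close to 0.825; for example, position 45 in HLA-B has H = 0.710 against a maximum of 0.75 for four amino acids.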

DISCUSSION

Population Genetic Implications. Many insights into the evolution and population genetics of variation in the HLA region have been gained through the analysis of antigen frequencies in human populations. Early studies revealed extensive allelic diversity underlying HLA variation and uncovered a number of statistical associations with specific diseases, suggesting that selective forces have had substantial effects on the frequencies of HLA alleles. Recent comparisons have shown that the single-locus heterozygosity, based on population surveys, for the HLA-A and -B loci is significantly greater than that predicted by the neutral-mutation hypothesis (3).

FIG. 3. Spatial location of the highly variable sites in the model of the three-dimensional structure of the HLA molecule, indicating the highly variable sites by position (solid part of structure) and number. The α-helices are represented as helical ribbons that form the sides of the pocket in which the peptide is found. The β-sheets are represented as broad bands across the bottom of the pocket.

The amino acid heterozygosity documented here gives deeper insights into the evolutionary processes for HLA-A and -B. The level of heterozygosity is far greater than that documented for any other genes to date (e.g., refs. 19-21) and will probably only be exceeded by that for some self-incompatibility loci in plants (e.g., ref. 22). Incredibly, the 29 amino acid sites that interact with peptides have average heterozygosities per site of 0.264 for HLA-A and 0.337 for HLA-B. Sixteen positions have heterozygosity values >0.5 at either or both loci, the maximum being 0.710 for amino acid site 45 on HLA-B. The other sites, those not in one of the functionally important categories, have heterozygosities an order of magnitude lower, 0.036 for HLA-A and 0.031 for HLA-B.

Earlier studies (4, 5) have shown that serotypic alleles at HLA loci have very even frequency distributions. When examined for single amino acid positions, the same phenomenon is observed. To illustrate, we can compare the observed heterozygosity to the maximum possible for a given number (k) of amino acids [the maximum heterozygosity is 1 - 1/k when all amino acids have the same frequency (23)]. For example, the observed heterozygosity at position 45 in HLA-B with four amino acids is 0.710, whereas the maximum possible is 0.75. For the α1 or α2 sites in Table 3 with H > 0.5 (21 in total), the average ratio of the observed heterozygosity to the maximum possible is 0.825. In other words, the level of heterozygosity on the amino acid level is very close to the maximum that is possible and indicates that the evenness of serotypic frequencies carries over to an extreme evenness of amino acid frequencies at the highly polymorphic positions.

The high level of heterozygosity for functionally important sites suggests that some form of strong balancing selection has been acting on variation at these HLA loci. Furthermore, the rate of nonsynonymous substitution at these functionally important amino acid sites exceeds the rate of synonymous substitution for five HLA-A and four HLA-B sequences (24, 25), a result that suggests that natural selection has favored amino acid replacements at these sites. Various balancing selection models have been used to explain the high variation for HLA genes (3), including heterozygote advantage (24), frequency-dependent selection due to host-pathogen interactions (2), maternal-fetal interactions (26), and nonrandom mating (27). Because a basic function of MHC is to present foreign peptides degraded from viruses or bacteria, we feel that resistance to pathogens is probably the primary reason for the extensive variation at HLA and that other modes of selection such as maternal-fetal interactions or nonrandom mating, although possibly significant, may have acted only to enhance these effects.

Amino Acid Heterozygosity and HLA Function. Interpretation of the crystal structure of HLA-A2 led to the identification of the peptide-binding site and a putative face of interaction with the TcR. Residues at the tops of the two helices with side chains pointing up were assigned as interacting with the TcR; residues of the β strands and the helices with side chains pointing into the binding groove were assigned as interacting with peptide. The population analysis described here now clearly demonstrates that variation at the peptide-interacting sites is quantitatively much greater than that at the TcR-interacting sites by several criteria.

Davis and Bjorkman (7) have proposed that the highly variable CDR3 regions of the TcR are primarily involved in interactions with MHC-bound peptide, whereas the less variable CDR1 and CDR2 regions contact the top faces of the MHC helices. Our finding that the TcR-interacting residues of HLA-A and -B molecules are not that variable is consistent with models that postulate the preservation of a fundamental TcR-MHC interaction. The difference in variability between peptide and TcR-interacting sites is particularly pronounced for the HLA-B locus and especially for the α2 helix, where TcR-interacting residues are virtually homozygous. Functional experiments have for many years implicated substitutions in the α2-helix as being particularly effective at disrupting recognition of class I HLA and H-2 molecules by T cells (28).

Correlating with these locus differences in the TcR-interacting residues of the α1 and α2 domains are HLA-A/B differences in the α3 domain. This domain is the site of interaction with CD8, another functionally important molecule of the TcR complex (29, 30). Within the α3 domain, HLA-B alleles are almost homogeneous in sequence. In contrast, HLA-A alleles exhibit significant variation, which in the case of HLA-Aw68 has a profound effect on function through loss of affinity for CD8 (31). Thus there appears to be relative conservation of TcR and CD8 interaction sites among HLA-B molecules compared to HLA-A molecules.

There are also striking differences in variation of HLA-A and -B molecules at peptide residues and especially at those of the α1 domain. High heterozygosity at one locus is frequently associated with low heterozygosity at the other, and there are fewer positions of variation as well as less variation for HLA-A compared to HLA-B. In conclusion, we observe greater diversity of peptide-interacting sites combined with greater conservation of TcR- and CD8-interacting sites for HLA-B compared to HLA-A.

We appreciate comments by A. Clark, A. Hughes, W. Klitz, and M. Kuhner. This research was supported by Public Health Service Grants GM35326 (P.W.H.), K04-AI00964 (T.S.W.), and AI17892 (P.P.).

1. Baur, M. P., Neugebauer, M., Deppe, H., Sigmund, M., Luton, T., Mayer, W. R. & Albert, E. D. (1984) in Histocompatibility Testing 1984, eds. Albert, E., Baur, J. J. & Mayr, W. (Springer, Berlin), pp. 333-341.
2. Bodmer, W. F. & Bodmer, J. G. (1989) in Mathematical Evolutionary Theory, ed. Feldman, M. (Princeton Univ. Press, Princeton, NJ), pp. 315-334.
3. Hedrick, P. W., Klitz, W., Robinson, W. P., Kuhner, M. K. & Thomson, G. (1990) in Evolution at the Molecular Level, eds. Selander, R. K., Clark, A. G. & Whittam, T. S. (Sinauer, Sunderland, MA), pp. 248-271.
4. Hedrick, P. W. & Thomson, G. (1983) Genetics 104, 449-456.
5. Klitz, W., Thomson, G. & Baur, M. P. (1986) Am. J. Hum. Genet. 39, 340-349.
6. Brodsky, F. M. & Guagliardi, L. E. (1991) Annu. Rev. Immunol. 9, 707-744.
7. Davis, M. M. & Bjorkman, P. J. (1988) Nature (London) 334, 395-402.
8. Tiwari, J. L. & Terasaki, P. I. (1985) HLA and Disease Associations (Springer, New York).
9. Thomson, G. (1988) Annu. Rev. Genet. 22, 31-50.
10. Sinha, A. A., Lopez, M. T. & McDevitt, H. O. (1990) Science 248, 1380-1388.
11. Bjorkman, P. J. & Parham, P. (1990) Annu. Rev. Biochem. 59, 253-288.
12. Bjorkman, P. J., Saper, M. A., Samraoui, B., Bennett, W. S., Strominger, J. L. & Wiley, D. C. (1987) Nature (London) 329, 506-512.
13. Bjorkman, P. J., Saper, M. A., Samraoui, B., Bennett, W. S., Strominger, J. L. & Wiley, D. C. (1987) Nature (London) 329, 512-518.
14. Parham, P., Lomen, C. E., Lawlor, D. A., Ways, J. P., Holmes, N., Coppin, H. L., Salter, R. D., Wan, A. M. & Ennis, P. D. (1988) Proc. Natl. Acad. Sci. USA 85, 4005-4009.
15. Todd, J. A., Bell, J. I. & McDevitt, H. O. (1987) Nature (London) 329, 599-604.
16. Sokal, R. R. & Rohlf, F. J. (1981) Biometry (Freeman, New York).
17. Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences (McGraw-Hill, New York).
18. Garrett, T. P. J., Saper, M. A., Bjorkman, P. J., Strominger, J. L. & Wiley, D. C. (1989) Nature (London) 342, 692-696.
19. Nei, M. (1987) Molecular Evolutionary Genetics (Columbia Univ. Press, New York).
20. Kreitman, M. (1983) Nature (London) 304, 412-417.
21. Dubose, R. F., Dykhuizen, D. E. & Hartl, D. L. (1988) Proc. Natl. Acad. Sci. USA 85, 7036-7040.
22. Ioerger, T. R., Clark, A. G. & Kao, T.-H. (1990) Proc. Natl. Acad. Sci. USA 87, 9732-9735.
23. Hedrick, P. W. (1983) Genetics of Populations (Jones and Bartlett, Boston).
24. Hughes, A. L. & Nei, M. (1988) Nature (London) 335, 167-170.
25. Hughes, A. L. & Nei, M. (1989) Proc. Natl. Acad. Sci. USA 86, 958-962.
26. Hedrick, P. W. & Thomson, G. (1988) Genetics 119, 205-212.
27. Yamazaki, K., Beauchamp, G. K., Wysock, J., Bard, J., Thomas, L. & Boyse, E. A. (1983) Science 221, 186-188.
28. Townsend, A. R. M. & McMichael, A. J. (1985) Prog. Allergy 36, 10-43.
29. Salter, R. D., Benjamin, R. J., Wesley, P. K., Buxton, S. E., Garrett, T. P. J., Clayberger, C., Krensky, A. M., Norment, A. M., Littman, D. R. & Parham, P. (1990) Nature (London) 345, 41-46.
30. Connolly, J. M., Hansen, T. H., Ingold, A. L. & Potter, T. A. (1990) Proc. Natl. Acad. Sci. USA 87, 2137-2141.
31. Salter, R. D., Norment, A. M., Chen, B. P., Clayberger, C., Krensky, A. M., Littman, D. R. & Parham, P. (1989) Nature (London) 338, 345-347.
