
Ann. Hum. Genet., Lond. (1976), 40, 183. Printed in Great Britain

The testing of dominants for heterozygosity

BY C. C. LI

Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15261

Consider a pair of autosomal genes with gene G dominant over gene g. Animal and plant breeders have always used some type of 'progeny tests' to distinguish the homozygous dominants (GG) from the heterozygous dominants (Gg). The plant breeders usually have no difficulty in doing this, as each test-mating yields a large number of offspring. For mammals, especially large animals, however, the results of test mating are not always clear-cut because of the limited number of offspring produced by each mating. A heterozygous dominant producing all dominant offspring may be mistaken for a homozygous dominant. A fundamental paper on the subject of test mating, misclassification, and correction is the pioneer work of Haldane (1938). Unfortunately, there are a number of obscurities in that paper that have puzzled human genetics students who tried to apply the method. The present communication attempts to clarify some of the obscure points, and I am sure that the late Professor Haldane would have liked to see this done. In this note the original notation has been followed without any modification. To make the note self-contained, some of the original expressions are reproduced and numbered for ease of reference.

Suppose that a sample of the population includes $a$ dominants and $b$ recessives and that a number of the dominants have been tested by crossing to recessives. Some of these dominant parents will yield at least one recessive offspring and will thus be certainly proved to be heterozygous. Some of the others, producing all dominant offspring, may in spite of this be heterozygous. Consequently there will be a positive correction to the number of heterozygotes as ascertained by test mating.

Following Haldane, let $p$ be the true frequency of heterozygous individuals among the dominants. Note that the $p$ here is not a gene frequency (as it usually is in current literature). Rather, it is the fraction [Gg]/[GG + Gg], where the square brackets read 'frequency of'. The value of this fraction is the basic parameter we wish to estimate from test-mating results, without assuming any system of mating in the population.

Of the tested dominant individuals which have given $s$ progeny, let $\gamma_s$ have given all dominant offspring and $\delta_s$ have given at least one recessive offspring. Let $\sum \gamma_s = \gamma$ and $\sum \delta_s = \delta$, where the summation is understood to be with respect to $s$ throughout the paper. The total number of tested dominant individuals, $\gamma + \delta$, is of course equal to or less than $a$, the number of dominants in the sample.

Consider a random individual among the dominants. The probabilities of its being a homozygote and a heterozygote are respectively $1 - p$ and $p$. In the latter event, the probability that it is ascertained as a heterozygote by test mating is $1 - 2^{-s}$. Then the expected values of $\gamma_s$ and $\delta_s$ are

$$E(\gamma_s) = (1 - p + 2^{-s}p)(\gamma_s + \delta_s), \tag{1}$$

$$E(\delta_s) = p(1 - 2^{-s})(\gamma_s + \delta_s). \tag{2}$$
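The two expectations can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `expected_counts` and the input values are invented here, and `n` stands for the number of tested dominants with $s$ offspring each ($\gamma_s + \delta_s$ in the text).

```python
def expected_counts(p: float, s: int, n: int) -> tuple[float, float]:
    """Expected split of n tested dominants, each with s offspring from a recessive mate.

    Returns (expected all-dominant families, expected families with a recessive).
    """
    escape = 0.5 ** s                      # chance a heterozygote shows no recessive in s offspring
    e_gamma = (1 - p + escape * p) * n     # homozygotes plus undetected heterozygotes
    e_delta = p * (1 - escape) * n         # heterozygotes proved by at least one recessive
    return e_gamma, e_delta


# Illustrative values only: half the dominants heterozygous, two offspring each.
print(expected_counts(0.5, 2, 100))        # (62.5, 37.5)
```

The two parts always add up to `n`, which is a quick arithmetic check on any hand calculation.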


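These results are simple and clear. Then, letting $L$ be the logarithm of the likelihood of observing the set of numbers $\gamma_s$ and $\delta_s$, Haldane gives the following six lines without further explanation:

$$L = \sum \gamma_s \log(1 - p + 2^{-s}p) + \sum \delta_s \log p + \text{constant}, \tag{3}$$

$$\frac{dL}{dp} = -\sum \frac{(1 - 2^{-s})\gamma_s}{1 - p + 2^{-s}p} + \frac{\delta}{p}, \tag{4}$$

$$\frac{\delta}{p} = \sum \frac{(1 - 2^{-s})\gamma_s}{1 - p + 2^{-s}p}, \tag{5}$$

$$\frac{\delta}{p} = \gamma + \delta - \sum \frac{2^{-s}\gamma_s}{1 - p + 2^{-s}p}, \tag{6}$$

$$\frac{\delta}{p} = \gamma + \delta - \frac{1}{1-p}\sum 2^{-s}\gamma_s\left[1 - \frac{2^{-s}p}{1-p} + \frac{2^{-2s}p^2}{(1-p)^2} - \cdots\right], \tag{7}$$

$$\frac{1}{p} = \frac{\gamma + \delta}{\delta}\left[1 - \frac{1}{\gamma}\sum 2^{-s}\gamma_s\right] \tag{8}$$

approximately, 'provided $s$ is never very small', an important qualification. It is this qualification that prompted us to re-examine the derivations cited above. The smaller the number of progeny produced by a mating, the greater the probability of misclassification and the greater the need for correction. What is the use of a method that works for large $s$ but not for small ones?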

The first two expressions, (3) and (4), are straightforward and the third expression, (5), is the estimation equation for $p$. In the next three expressions, (6), (7) and (8), however, several approximations have been introduced. From (5) to (6), I guess that Haldane has split the right-hand numerator of (5) into two terms, $\gamma_s$ and $-2^{-s}\gamma_s$, and then taken the first term as

$$\gamma_s = (1 - p + 2^{-s}p)(\gamma_s + \delta_s), \tag{i}$$

which is consistent with (1). After division by $1 - p + 2^{-s}p$, summation of such quantities gives $\gamma + \delta$. Whether this is what Haldane actually did or not, this substitution will yield precisely (6). The step from (6) to (7) involves an expansion of the type

$$\frac{1}{A + \Delta} = \frac{1}{A}\left(1 - \frac{\Delta}{A} + \frac{\Delta^2}{A^2} - \cdots\right), \tag{ii}$$

which converges rapidly when $\Delta$ is much smaller than $A$. In the present context, $A = 1 - p$ and $\Delta = 2^{-s}p$. The expansion (7) converges rapidly only when $2^{-s}p$ is much smaller than $1 - p$; that is, when $s$ is not very small. The approximate expression (8) amounts to using only the first term of the expansion series. Not only that, Haldane has also substituted the observed $\delta/\gamma$ for $p/(1 - p)$ in passing from (7) to (8). The latter substitution has ignored the very nature of the problem: the observed numbers $\delta$ and $\gamma$ are not proportional to $p$ and $1 - p$ on account of the misclassifications.
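How slowly the expansion settles for small families can be seen with a few lines of arithmetic. The sketch below is an illustration only: `partial_sums` is a name invented here, `d` plays the role of the small term $2^{-s}p$, and the value $p = 0.48$ is chosen merely as a plausible example.

```python
def partial_sums(A: float, d: float, terms: int) -> list[float]:
    """Partial sums of (1/A) * (1 - d/A + (d/A)**2 - ...), which tend to 1/(A + d) when d < A."""
    ratio = d / A
    sums, total = [], 0.0
    for k in range(terms):
        total += (-ratio) ** k / A
        sums.append(total)
    return sums


p = 0.48  # illustrative value only
for s in (1, 4):
    A, d = 1 - p, 0.5 ** s * p
    exact = 1 / (A + d)
    first_term_only = partial_sums(A, d, 1)[0]
    # The first term alone overstates 1/(A + d) by the factor 1 + d/A.
    print(s, round(d / A, 3), round(first_term_only / exact, 3))
```

With these inputs the one-term approximation is off by nearly half for $s = 1$ but only by a few per cent for $s = 4$, which is the sense in which the method needs $s$ not to be very small.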

SIMPLE ESTIMATE

We shall now first suggest a simple estimate which is consistent but not fully efficient. In the next section we shall show how to obtain a maximum likelihood estimate without solving the equation (5) and at the same time obtain the variance of the estimate. Assuming the relationship (i) holds for every $s$ and substituting in the right-hand side of the maximum likelihood equation (5) we obtain the simple result:

$$\frac{\delta}{p} = \sum (1 - 2^{-s})(\gamma_s + \delta_s), \tag{9}$$

$$p = \frac{\delta}{\gamma + \delta - \sum 2^{-s}(\gamma_s + \delta_s)}. \tag{10}$$

This formula is applicable to all values of $s$, including the case $s = 1$. In this respect, the problem of test mating is different from that of segregation analysis, which requires $s \geq 2$. We also note that the estimate (10) is equivalent to summing the expectations (2) with respect to $s$. Thus,

$$\delta = \sum \delta_s = p \sum (1 - 2^{-s})(\gamma_s + \delta_s). \tag{9', 10'}$$

Anticipating an improved estimate to be made in the next section, we shall now, as an illustration, consider the following data from testing the dominants by crossing to recessives:

                                                 Number of offspring produced
                                                 s = 1   s = 2   s = 3   s = 4   Total
$\gamma_s$ (homo + hetero)                         11       6      13      10    $\gamma$ = 40
$\delta_s$ (certain hetero)                         4       5       9       6    $\delta$ = 24
$\gamma_s + \delta_s$ (no. of dominants tested)    15      11      22      16    $\gamma + \delta$ = 64

where, for brevity, homo = homozygotes and hetero = heterozygotes. The denominator of (10) is

$$64 - (\tfrac{1}{2})\,15 - (\tfrac{1}{2})^2\,11 - (\tfrac{1}{2})^3\,22 - (\tfrac{1}{2})^4\,16 = 50$$

so that

$$p = 24/50 = 0.480 \tag{11}$$

which is considerably higher than the uncorrected fraction 24/64 = 0.375. We shall refer to the estimate $p = 0.480$ again in the next section.
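The arithmetic of the simple estimate can be reproduced in a few lines. This is a minimal sketch: the function name `simple_estimate` and the parameter `escape` (the chance per offspring that a heterozygous parent's offspring is dominant, 1/2 for a cross to a recessive) are labels introduced here, not the author's.

```python
from fractions import Fraction

# Data of the worked example, keyed by the number of offspring s.
gamma = {1: 11, 2: 6, 3: 13, 4: 10}   # families with all offspring dominant
delta = {1: 4, 2: 5, 3: 9, 4: 6}      # families with at least one recessive


def simple_estimate(gamma, delta, escape=Fraction(1, 2)):
    """Formula (10): total delta divided by sum over s of (1 - escape**s)(gamma_s + delta_s)."""
    delta_total = sum(delta.values())
    denominator = sum((1 - escape ** s) * (gamma[s] + delta[s]) for s in gamma)
    return delta_total / denominator


p_simple = simple_estimate(gamma, delta)
print(p_simple, float(p_simple))       # 12/25 0.48
```

Exact fractions are used so that the denominator comes out as exactly 50, as in the hand calculation.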

For samples of plants, if the dominant plants are tested by selfing, everything said above still applies by simply substituting $(\tfrac{3}{4})^s$ for $(\tfrac{1}{2})^s = 2^{-s}$.

A different situation arises when the dominant individuals of the sample are tested by matings among themselves. This case has also been considered by Haldane in the same paper. As before let $p$ be the frequency of heterozygotes among the dominants. When two dominants are paired to mate, they are capable of producing recessive offspring only when both dominant parents are heterozygous; the probability of such a mating is $p^2$. For such matings the probability of producing at least one recessive offspring is $1 - (\tfrac{3}{4})^s$, where $s$ is the number of progeny produced by the mating. Of the pairs producing $s$ progeny, let $\epsilon_s$ have given all dominants and $\zeta_s$ have given at least one recessive. Further, let $\epsilon = \sum \epsilon_s$ and $\zeta = \sum \zeta_s$. Then

$$E(\epsilon_s) = [1 - p^2 + (\tfrac{3}{4})^s p^2](\epsilon_s + \zeta_s), \tag{12}$$

$$E(\zeta_s) = p^2[1 - (\tfrac{3}{4})^s](\epsilon_s + \zeta_s). \tag{13}$$

These results are correct for any value of $s$. Subsequently, however, Haldane resorts to a procedure similar to that shown in (3)-(8) which we shall not reproduce here. The approximation is probably worse than in the preceding case, as $(\tfrac{3}{4})^s$ is much larger than $(\tfrac{1}{2})^s$, unless additional terms involving $(\tfrac{3}{4})^s$ are included. Similar to the previous case, a simple estimate may be obtained by summing the expressions (13) over all values of $s$. Thus

$$\zeta = \sum \zeta_s = p^2 \sum [1 - (\tfrac{3}{4})^s](\epsilon_s + \zeta_s), \tag{14}$$

$$p^2 = \frac{\zeta}{\epsilon + \zeta - \sum (\tfrac{3}{4})^s(\epsilon_s + \zeta_s)}. \tag{15}$$

This is an estimate of $p^2$, on account of the fact that $\epsilon_s$ and $\zeta_s$ are numbers of pairs of dominant individuals.
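The same summation can be sketched for dominant-by-dominant matings. The counts below are invented purely for illustration (the note gives no data for this case), and taking the square root to return to $p$ is a convenience added here rather than a step stated in the text.

```python
import math

# Hypothetical numbers of dominant x dominant pairs, keyed by family size s.
# These are NOT data from the paper.
epsilon = {2: 20, 3: 14, 4: 9}   # pairs with all offspring dominant
zeta = {2: 2, 3: 3, 4: 3}        # pairs with at least one recessive offspring


def simple_estimate_pairs(epsilon, zeta):
    """Estimate p**2 by summing the expectation of zeta_s over s, then take the root."""
    zeta_total = sum(zeta.values())
    denominator = sum((1 - 0.75 ** s) * (epsilon[s] + zeta[s]) for s in epsilon)
    p_squared = zeta_total / denominator
    return p_squared, math.sqrt(p_squared)


p_squared, p_hat = simple_estimate_pairs(epsilon, zeta)
print(round(p_squared, 4), round(p_hat, 4))
```

Replacing `0.75` by `0.5` and the pair counts by counts of single tested dominants gives back the estimate for crosses to recessives, except that the result is then $p$ itself and no square root is needed.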

ESTIMATE BY ITERATIVE COUNTING

The formal maximum likelihood procedure requires the solution of equation (4) to obtain the estimate and the evaluation of $E(d^2L/dp^2)$ to obtain the variance of the estimate. In the present case, this procedure is obviously very cumbersome. Fortunately the estimate of $p$ in our problem may be readily obtained by the iterative counting method developed by Ceppellini, Siniscalco & Smith (1955) and Smith (1957). Further, these authors show that the estimate obtained by the counting method is the maximum likelihood estimate with full efficiency. Although these authors are primarily interested in estimating gene frequencies and some other genetical parameters, they have also given some non-genetical applications. In other words, the counting method is a general one; its applicability is not limited to genetical problems. The following calculations illustrate the general idea of iterative counting.

The number $\delta = \sum \delta_s$ is the number of 'sure' heterozygous dominants; but the numbers $\gamma_s$ contain both homozygous (homo) and heterozygous (hetero) dominants in the ratio:

$$(1 - p) \text{ homo} : 2^{-s}p \text{ hetero}. \tag{16}$$

Since $p$ is unknown, we adopt a first provisional value of $p$ such as that given by (10) to initiate the subdivision of each $\gamma_s$ into two parts:

$$\frac{(1 - p)\gamma_s}{1 - p + 2^{-s}p} \text{ homo} \quad \text{and} \quad \frac{2^{-s}p\,\gamma_s}{1 - p + 2^{-s}p} \text{ hetero}. \tag{17}$$

This is to be done for each $s$ separately. Summing the results of such subdivisions, we get the corrected number of homozygous and heterozygous dominants:

$$c' = \sum \frac{(1 - p)\gamma_s}{1 - p + 2^{-s}p}, \qquad d' = \delta + \sum \frac{2^{-s}p\,\gamma_s}{1 - p + 2^{-s}p}, \tag{18}$$

so that $c' + d' = \gamma + \delta$, the total number of dominants tested. Then a new and improved estimate of $p$ is taken as

$$p' = \frac{d'}{c' + d'} = \frac{d'}{\gamma + \delta}. \tag{19}$$

In practice we need only calculate the hetero part of (17) and obtain $d'$ of (18), unless we desire to calculate $c'$ as an arithmetic check. Then the value of $p'$ of (19), or some other value near it, is taken as the second provisional value of $p$, with which to repeat the calculations (16)-(19) to obtain another improved estimate. The iteration continues until the provisional value and the corresponding improved value are equal (i.e. $p = p'$) to a certain desired degree of accuracy. This stationary value of $p$ is the maximum likelihood estimate.
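That the stationary value satisfies the likelihood equation can be checked in a few lines. The derivation below is a sketch added for the reader; it uses only the subdivision and re-estimation steps described above, and the shorthand $D_s = 1 - p + 2^{-s}p$ is a symbol introduced here for convenience.

```latex
\begin{align*}
p = p' \;&\Longleftrightarrow\;
  p(\gamma + \delta) = \delta + p \sum \frac{2^{-s}\gamma_s}{D_s},
  \qquad D_s = 1 - p + 2^{-s}p, \\
\gamma = \sum \gamma_s \frac{D_s}{D_s}
  &= (1 - p) \sum \frac{\gamma_s}{D_s} + p \sum \frac{2^{-s}\gamma_s}{D_s}, \\
\text{hence} \quad
  p(1 - p) \sum \frac{\gamma_s}{D_s} - p(1 - p) \sum \frac{2^{-s}\gamma_s}{D_s}
  &= (1 - p)\,\delta, \\
\frac{\delta}{p} &= \sum \frac{(1 - 2^{-s})\gamma_s}{D_s}.
\end{align*}
```

The last line is the equation obtained by setting the derivative of the log likelihood (3) to zero, so for $0 < p < 1$ the fixed point of the counting cycle and the maximum likelihood estimate coincide.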

Now we shall illustrate the counting procedure by considering the numerical example given in the previous section and employing $p(1) = 0.48$ given in (11) as the first provisional value of $p$ to initiate the calculation shown in the following:

$s$ =                    1        2        3        4       Total
$\gamma_s$ =            11        6       13       10       $\gamma$ = 40
$2^{-s}p$ =             0.24     0.12     0.06     0.03       -
$1 - p + 2^{-s}p$ =     0.76     0.64     0.58     0.55       -
(17) hetero =           3.4737   1.1250   1.3448   0.5455   6.4890

The first two rows give the observed data, reproduced here for arithmetic convenience. The third row, $2^{-s}p$, is obtained by dividing $p(1) = 0.48$ by 2, and then by 2 again, etc. The fourth row is obtained by adding $1 - p = 0.52$ to each term of the preceding row; thus 0.52 + 0.24 = 0.76; and 0.52 + 0.12 = 0.64, etc. The last row gives the calculated numbers of heterozygotes among the $\gamma_s$ dominants, as given by the hetero part of (17). Thus,

$$11\left(\frac{0.24}{0.76}\right) = 3.4737, \quad 6\left(\frac{0.12}{0.64}\right) = 1.1250, \text{ etc.}$$

The sum of these numbers is 6.4890. That means, among the $\gamma = 40$ dominants who produced only dominant offspring, there are 6.4890 heterozygotes and $c' = 40 - 6.4890 = 33.5110$ homozygotes. The total number of heterozygotes among the tested dominants is then $d' = 6.4890 + 24 = 30.4890$, where $\delta = 24$ is the number of 'sure' heterozygotes observed. The improved estimate is then, remembering that $c' + d' = \gamma + \delta = 64$,

$$p'(1) = \frac{30.4890}{64} = 0.47639.$$

We may of course take 0.47639 as the second provisional value of $p$ to repeat the calculations but the iteration procedure does not require it. Any value close to it will serve the purpose. For arithmetic convenience, we take $p(2) = 0.476$ as the second provisional value to generate the following calculations. The values of $s$ and $\gamma_s$ remain the same as before and are omitted. Here, $1 - p = 0.524$.

$2^{-s}p$ =             0.238    0.119    0.0595   0.02975    -
$1 - p + 2^{-s}p$ =     0.762    0.643    0.5835   0.55375    -
(17) hetero =           3.4357   1.1104   1.3256   0.5372   6.4089

Proceeding the same way, we obtain further improved estimates. The results are summarized as follows:

provisional            improved estimate
$p(1) = 0.480$         $p'(1) = 0.47639$
$p(2) = 0.476$         $p'(2) = 0.47514$
$p(3) = 0.475$         $p'(3) = 0.47483$          (20)
$p(4) = 0.4748$        $p'(4) = 0.47477$
$p(5) = 0.47475$       $p'(5) = 0.47475$

The stationary value $p = 0.47475$ is the maximum likelihood estimate as it satisfies equation (4) or (5). In most problems it is unnecessary to be accurate to the fifth decimal place. In our present problem we should have stopped at $p'(3) = 0.47483$ and accepted $p = 0.475$ as the final estimate, being accurate to the third decimal place.
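The counting cycle is easy to mechanise. The following is a minimal sketch of the iteration on the example data; the function names `improve` and `counting_estimate`, the tolerance and the iteration cap are choices made here for illustration and are not part of the method as published.

```python
gamma = {1: 11, 2: 6, 3: 13, 4: 10}          # all-dominant families, keyed by family size s
delta_total = 24                              # 'sure' heterozygotes
tested = sum(gamma.values()) + delta_total    # 64 tested dominants in all


def improve(p: float) -> float:
    """One counting cycle: split each gamma_s as in (17), add the sure heterozygotes, re-estimate p."""
    hidden_hetero = sum(
        g * (0.5 ** s * p) / (1 - p + 0.5 ** s * p) for s, g in gamma.items()
    )
    return (hidden_hetero + delta_total) / tested


def counting_estimate(p: float, tol: float = 1e-10, max_iter: int = 1000) -> float:
    """Repeat the cycle, feeding each improved value back in, until it stops changing."""
    for _ in range(max_iter):
        p_new = improve(p)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p


print(round(improve(0.48), 5))     # 0.47639, the first improved value
print(round(improve(0.476), 5))    # 0.47514, the second improved value
print(counting_estimate(0.48))     # settles near the stationary value quoted in the text
```

Unlike the hand calculation, the sketch feeds the unrounded improved value straight back in; as the text notes, any nearby provisional value leads to the same stationary point.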

Let $p(i)$ be the $i$th provisional value and $p'(i)$ be the corresponding improved estimate. In order to find the variance of the estimate, we need first to calculate the quantity

$$A = \frac{p'(i) - p'(j)}{p(i) - p(j)}, \tag{21}$$

where both $p(i)$ and $p(j)$ are close to the stationary value. Our stationary value 0.47475 is not exact and $p'(5)$ should have more decimal places. Since we are assumed to have stopped at $p'(3)$, we will use

$$A = \frac{p'(2) - p'(3)}{p(2) - p(3)} = \frac{0.00031}{0.001} = 0.31.$$

Then the variance of the estimate is (Ceppellini et al. 1955; Smith, 1957)

$$V(p) = \frac{p(1 - p)}{T(1 - A)} = \frac{(0.475)(0.525)}{64(0.69)} = 0.00565, \tag{22}$$

where $T = \gamma + \delta$, the total number of individuals under consideration. The standard error is S.E.$(p) = 0.075$.
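The standard error can be recomputed from the iteration table. The sketch below assumes that the variance takes the form $p(1 - p)/[T(1 - A)]$, that is, the binomial variance for fully classified individuals inflated by $1/(1 - A)$; this form is an assumption made here, chosen because it reproduces the standard error of 0.075 quoted in the text. The function name `convergence_rate` is likewise invented for illustration.

```python
import math


def convergence_rate(prov_i, impr_i, prov_j, impr_j):
    """A: change in the improved value per unit change in the provisional value."""
    return (impr_i - impr_j) / (prov_i - prov_j)


# Second and third lines of the iteration table.
A = convergence_rate(0.476, 0.47514, 0.475, 0.47483)

p, T = 0.475, 64                            # accepted estimate and number of tested dominants
variance = p * (1 - p) / (T * (1 - A))      # assumed form, see the note above
standard_error = math.sqrt(variance)

print(round(A, 2), round(standard_error, 3))    # 0.31 0.075
```

A value of A near zero would mean that almost nothing is lost through misclassification; here about a third of the information is lost, so the variance is roughly 1.45 times the value for fully classified individuals.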

Once the stationary value of $p$ has been found, the final corrected numbers of homozygotes and heterozygotes among the tested dominants are, respectively,

$$\text{homo: } c = (1 - p)(\gamma + \delta), \qquad \text{hetero: } d = p(\gamma + \delta), \tag{23}$$

so that $c + d = \gamma + \delta \leq a$, the total number of dominants in the sample. In the following section it is assumed that this has been done.

GENE FREQUENCY RATIO AND INBREEDING COEFFICIENT

The last section of Haldane's paper deals with the problem of estimating the gene frequency ratio and the inbreeding coefficient from the sample. Much of Haldane's early work on population genetics is in terms of the gene frequency ratio, $u = \text{freq}(G)/\text{freq}(g)$. Let the inbreeding coefficient be $\alpha$, which is the same as $F$ in American literature. Then the genetic composition of the population will be

GG                  Gg                   gg                 Total
$u^2 + \alpha u$    $2(1 - \alpha)u$     $\alpha u + 1$     $(u + 1)^2$          (24)

Suppose that the sample consists of $a$ dominants and $b$ recessives, and that $p$ has been calculated from information on test mating to recessives so that the numbers of homozygotes and heterozygotes among the tested dominants are calculated to be $c$ and $d$, respectively. Hence the following two equations:

$$\frac{\alpha u + 1}{(u + 1)^2} = \frac{b}{a + b}, \tag{25}$$

$$\frac{2(1 - \alpha)}{u + \alpha} = \frac{d}{c}, \tag{26}$$

from which the values of $u$ and $\alpha$ may be readily found. Inadvertently Haldane gives equation (26) as

$$\frac{2(1 - \alpha)}{u + \alpha} = \frac{d}{c + d} \tag{26x}$$

(the $\alpha$ in the denominator was misprinted as $a$ in the original paper). His subsequent solution for $\alpha$ is incorrect. The expression (25) combined with (26) will yield a complicated quadratic equation in $u$, a procedure that Haldane has adopted. However, if we make use of the information $p = d/(c + d)$ and $1 - p = c/(c + d)$, we may subdivide the $a$ dominants of the sample into homozygotes and heterozygotes and obtain the value of $u$ at a glance.

GG: $a(1 - p) = ac/(c + d)$,

Gg: $ap = ad/(c + d)$,

gg: $b = b(c + d)/(c + d)$.

Hence,

$$u = \frac{2ac + ad}{ad + 2b(c + d)} = \frac{2c + d}{d + 2b\left(\frac{c + d}{a}\right)}, \tag{27}$$

and, solving (26) for $\alpha$,

$$\alpha = \frac{2c - du}{2c + d} = \frac{4bc(c + d) - ad^2}{(2c + d)[ad + 2b(c + d)]}. \tag{28}$$

In the rare event that all dominant individuals have been tested, $a = c + d$, then (27) and (28) will reduce to the familiar expressions for no dominance:

$$u = \frac{2c + d}{2b + d}, \qquad \alpha = \frac{4bc - d^2}{(2c + d)(2b + d)}. \tag{27', 28'}$$

For those who prefer to use gene frequencies, we have from (27):

$$\text{freq}(g) = \frac{1}{u + 1} = \frac{a}{a + b}\left[\frac{d}{2(c + d)}\right] + \frac{b}{a + b}, \tag{29}$$

$$\text{freq}(G) = \frac{u}{u + 1} = \frac{a}{a + b}\left[\frac{2c + d}{2(c + d)}\right]. \tag{30}$$

The meaning of these expressions should be clear. Consider the frequency of the recessive gene first. When the individuals belong to the dominant group, an event with probability $a/(a + b)$, the frequency of g is $d/2(c + d)$. When they belong to the recessive group, an event with probability $b/(a + b)$, the frequency of g is unity; hence (29). The interpretation of (30) for the frequency of G is the same, noting that when the individuals belong to the recessive group, its frequency is zero.
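These formulas can be put together in a short calculation. The sketch below is illustrative only: the function names are invented here, the sample sizes `a` and `b` are made-up numbers (the note gives none), and the inbreeding coefficient is obtained by solving the heterozygote-to-homozygote ratio equation (26) for $\alpha$ once $u$ is known.

```python
def ratio_and_inbreeding(a, b, c, d):
    """Return (u, alpha) from a dominants, b recessives and corrected counts c (homo), d (hetero)."""
    u = (2 * a * c + a * d) / (a * d + 2 * b * (c + d))    # gene frequency ratio, formula (27)
    alpha = (2 * c - d * u) / (2 * c + d)                   # equation (26) solved for alpha
    return u, alpha


def gene_frequencies(a, b, c, d):
    """Return (freq_G, freq_g) by weighting the dominant and recessive groups."""
    share_dominant = a / (a + b)
    freq_g = share_dominant * d / (2 * (c + d)) + b / (a + b)
    freq_G = share_dominant * (2 * c + d) / (2 * (c + d))
    return freq_G, freq_g


# c and d follow from p = 0.475 among the 64 tested dominants of the example;
# a = 100 dominants and b = 30 recessives are hypothetical sample sizes.
c, d = 0.525 * 64, 0.475 * 64
a, b = 100, 30
u, alpha = ratio_and_inbreeding(a, b, c, d)
freq_G, freq_g = gene_frequencies(a, b, c, d)
print(round(u, 4), round(alpha, 4), round(freq_G, 4), round(freq_g, 4))
```

Two checks are worth making on any such calculation: the two gene frequencies should add to one with ratio $u$, and setting `a` equal to `c + d` should give back the no-dominance expressions.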

In closing I wish to warn the reader not to think that Haldane's 1938 paper is full of errors. It is not. Much of the material in that pioneer paper remains good and valid, constituting a fundamental treatise on the subject. But I do hope the present note will facilitate the application of the method in human genetics.

SUMMARY

A simple method of estimating the frequency of heterozygotes among the dominants by test-mating has been described; and then a maximum likelihood estimate is obtained by an iterative counting procedure. Early results of Haldane have been reviewed and modified. Methods of estimating the gene frequencies, their ratio, and the inbreeding coefficient from test-mating results have also been given.

I am grateful to Professor C. A. B. Smith for pointing out that the counting method is readily applicable to the problem under consideration. I thank him for the generous advice given me during the preparation of the manuscript.

REFERENCES

CEPPELLINI, R., SINISCALCO, M. & SMITH, C. A. B. (1955). The estimation of gene frequencies in a random mating population. Ann. Hum. Genet., Lond. 20, 97-115.

HALDANE, J. B. S. (1938). Indirect evidence for the mating system in natural populations. J. Genet. 36, 213-220.

SMITH, C. A. B. (1957). Counting methods in genetical statistics. Ann. Hum. Genet., Lond. 21, 254-276.