hum. genet. parametric and nonparametric linkage analysis: a … · 2018-08-03 · kruglyak et al.:...

Am. J. Hum. Genet. 58:1347-1363, 1996

Parametric and Nonparametric Linkage Analysis: A UnifiedMultipoint ApproachLeonid Kruglyak,' Mark J. Daly,' Mary Pat Reeve-Daly,' and Eric S. Lander',2'Whitehead Institute for Biomedical Research and 2Department of Biology, Massachusetts Institute of Technology, Cambridge

Summary Introduction

In complex disease studies, it is crucial to performmultipoint linkage analysis with many markers and touse robust nonparametric methods that take account ofall pedigree information. Currently available methodsfall short in both regards. In this paper, we describe howto extract complete multipoint inheritance informationfrom general pedigrees of moderate size. This informa-tion is captured in the multipoint inheritance distribu-tion, which provides a framework for a unified approachto both parametric and nonparametric methods of link-age analysis. Specifically, the approach includes the fol-lowing: (1) Rapid exact computation of multipoint LODscores involving dozens of highly polymorphic markers,even in the presence of loops and missing data. (2) Non-parametric linkage (NPL) analysis, a powerful new ap-proach to pedigree analysis. We show that NPL is robustto uncertainty about mode of inheritance, is much morepowerful than commonly used nonparametric methods,and loses little power relative to parametric linkage anal-ysis. NPL thus appears to be the method of choice forpedigree studies of complex traits. (3) Information-con-tent mapping, which measures the fraction of the totalinheritance information extracted by the availablemarker data and points out the regions in which typingadditional markers is most useful. (4) Maximum-likeli-hood reconstruction of many-marker haplotypes, evenin pedigrees with missing data. We have implementedNPL analysis, LOD-score computation, information-content mapping, and haplotype reconstruction in a newcomputer package, GENEHUNTER. The packageallows efficient multipoint analysis of pedigree data to beperformed rapidly in a single user-friendly environment.

Received January 12, 1996; accepted for publication March 4,1996.Address for correspondence and reprints: Dr. Eric S. Lander or

Dr. Leonid Kruglyak, Whitehead Institute for Biomedical Research, 9Cambridge Center, Cambridge, MA 02142. E-mail: [email protected] or [email protected]© 1996 by The American Society of Human Genetics. All rights reserved.0002-9297/96/5806-0028$02.00

Linkage analysis aims to extract all available inheritanceinformation from pedigrees and to test for coinheritanceof chromosomal regions with a trait. In principle, onecan use either parametric methods, which involve testingwhether the inheritance pattern fits a specific model fora trait-causing gene, or nonparametric methods, whichinvolve testing whether the inheritance pattern deviatesfrom expectation under independent assortment.Although easily stated, this goal has proved hard to

implement in practice. A major obstacle has been thecomputational difficulty of making inferences based onimperfect information, arising from incomplete struc-ture of human pedigrees and incomplete informativenessof genetic markers. Parametric and nonparametric meth-ods have generally adopted rather different solutions,neither of which is wholly satisfactory:

1. Parametric analysis. The LOD-score method is themost widely used approach to parametric linkageanalysis (Morton 1955); its theoretical foundationsare well understood, and computer programs tocarry out LOD-score calculations are available (Ott1991; Terwilliger and Ott 1994). The major diffi-culty is computational-extracting the full linkageinformation in a pedigree requires the use of a densegenetic linkage map, but such multipoint analysis isinfeasible for more than a handful of loci because ofthe inherent constraints of the Elston-Stewart algo-rithm (Elston and Stewart 1971). The problem hasbeen circumvented in the case of specific pedigreestructures, through the use of alternative algorithms(Lathrop et al. 1986; Lander and Green 1987; Krug-lyak et al. 1995), and recent improvements to theElston-Stewart algorithm promise to make multi-point analysis with a limited number of loci morepractical (O'Connell and Weeks 1995). Nonetheless,complete multipoint analysis remains a bottleneckfor general pedigrees-even those of moderate size.

2. Nonparametric analysis. Because parametric linkageanalysis can be highly sensitive to misspecificationof the linkage model (Clerget-Darpoux et al. 1986),nonparametric analysis is a key tool for all but thesimplest of traits. Nonparametric analysis has beenperformed primarily by one of two methods. The

1347

Am. J. Hum. Genet. 58:1347-1363, 1996

first approach is to break pedigrees into nuclear fami-lies and apply sib-pair analysis; this is inefficient be-cause it wastes a great deal of inheritance informa-tion contained in pedigree structure. To partly utilizepedigree information, Weeks and Lange (1988,1992) developed the affected-pedigree-membermethod (APM). APM is not a true linkage method.It sidesteps the thorny issue of tracing the inheritancepattern in a pedigree by focusing on whether affectedrelatives happen to show the same alleles at a locus(i.e., identity/identical by state [IBS]), regardless ofwhether the allele is actually inherited from a com-mon ancestor (i.e., identity/identical by descent[IBD]). The extent of IBS sharing among all pairs ofaffected members of the pedigree is compared withMendelian expectation under the hypothesis of nolinkage. The APM approach has several drawbacks:(i) It focuses only on IBS information and ignoresgenotype information for additional members of thepedigree, even when this information can be used toresolve whether shared alleles are actually IBD. (ii)It involves comparisons only among pairs of individ-uals, which can be less powerful than tests based onlarger sets of affected individuals (Whittemore andHalpern 1994a; also, see below). (iii) It lacks a truemultipoint formulation. Multilocus APM simplyadds together statistics from several marker loci(Weeks and Lange 1992), rather than extracting link-age information about any given point. It thus testsfor linkage to an extended chromosomal regionrather than to a point, and therefore it cannot beused to localize a particular locus relative to a mapmarkers. By failing to extract the full inheritance in-formation, APM is potentially prone to false-positiveand false-negative results.

To avoid these inherent problems of IBS-based meth-ods, Curtis and Sham (1994) have recently proposed anapproach, called extended relative pair analysis (ERPA),that uses the risk-calculation facility of the LINKAGEpackage (Lathrop et al. 1984) to compute IBD-sharingprobabilities for all pairs of affected individuals in apedigree. ERPA is thus a true linkage approach to non-parametric analysis. It is limited, however, in severalkey respects: the comparisons are inherently confined torelative pairs; the statistical test for linkage is ad hoc;and the method cannot handle large numbers of loci,because of the basic algorithm used in the LINKAGEpackage. Other approaches to nonparametric analysishave also been described (e.g., by Curtis and Sham1995).The purpose of this paper is to describe a unified

approach to both parametric analysis and nonparamet-ric analysis. The key is to separate two issues: (1) ex-tracting information about the inheritance pattern in a

pedigree (which depends only on the genetic markers)and (2) defining a statistic to assess linkage for a giveninheritance pattern (which depends only on the natureof the trait).

This approach generalizes our recent methods forcomplete multipoint sib-pair analysis (Kruglyak andLander 1995) to the situation of arbitrary pedigrees.The generalization required the development of a newlinkage algorithm for arbitrary pedigrees, as well as thedefinition of new statistics for performing nonparamet-ric analysis.The paper is organized in four parts. First, we discuss

how to extract all available inheritance informationfrom a pedigree. Specifically, we present a completemultipoint algorithm for determining the probabilitydistribution over possible inheritance patterns at eachpoint in the genome. Second, we apply these conceptsto define a unified multipoint framework for both para-metric and nonparametric analysis. In the former case,the approach provides a rapid multipoint linkage algo-rithm for traditional LOD-score calculations. In the lat-ter case, it provides a powerful new approach to pedi-gree analysis, which we refer to as nonparametriclinkage (NPL) analysis. Third, we evaluate the power ofNPL analysis in applications to both simulated and ac-tual data. In all cases examined, NPL analysis is consid-erably more powerful than APM. Finally, we show howthe framework presented here also allows reconstructionof haplotypes in pedigrees.We have implemented these methods in a computer

program, GENEHUNTER, for both parametric andnonparametric analysis. With current workstations, theprogram can rapidly analyze moderately sized pedigreesof the sort used in genetic studies of complex traits.

Definitions

Given a pedigree, we define nonfounders to be thoseindividuals whose parents are in the pedigree. Withoutloss of generality, we will assume that pedigrees are de-fined to include both parents of any individual who hasa sib, half-sib, or parent in the pedigree. (If such parentsare unavailable for study, they are simply included inthe pedigree with unknown phenotypic and genotypicstatus). Individuals whose parents are not in the pedigreeare designated as founders. Throughout, n will denotethe number of nonfounders, and f the number of found-ers, in a pedigree. Founders will be assumed to be unre-lated; that is, they are assumed to carry 2f alleles thatare distinct by descent (although some may be IBS).

Representing and Computing Inheritance Information

The Inheritance VectorLinkage analysis can be divided into two steps: (i)

inferring information about the inheritance pattern of a

1348

Kruglyak et al.: Parametric and Nonparametric Linkage Analysis

pedigree and (ii) deciding whether the inheritance infor-mation indicates the presence of a trait-causing gene.

Ideally, one would like to know the precise inheritancepattern at every locus in the genome. The inheritancepattern at each point x is completely described by a

binary inheritance vector v(x) = (plml Pp2,m2,... ,pn,mn), whose coordinates describe the outcome ofthe paternal and maternal meioses giving rise to the n

nonfounders in the pedigree (Lander and Green 1987).Specifically, pi = 0 or 1, according to whether the grand-paternal or grandmaternal allele was transmitted in thepaternal meiosis giving rise to the ith nonfounder; micarries the same information for the corresponding ma-

ternal meiosis. Thus, the inheritance vector completelyspecifies which of the 2f distinct founder alleles are in-herited by each nonfounder. The notion of the inheri-tance vector is illustrated in figure 1A. The set of all 22npossible inheritance vectors will be denoted V. Similarrepresentations of inheritance have been proposed inthe context of Monte Carlo linkage analysis (Sobel andLange 1993; Thompson 1994), as well as in other appli-cations (Whittemore and Halpern 1994a, 1994b; Guo1995).

The Inheritance DistributionIn practice, it is not feasible to determine the true

inheritance vector at every point in the genome, sincethis would require genotyping all pedigree memberswith an infinitely dense map of fully informative mark-ers. Because key pedigree members are frequently un-

available and genetic markers have limited heterozygos-ity, genotype data will provide only partial informationabout inheritance.

Partial information extracted from a pedigree can berepresented by a probability distribution over the possi-ble inheritance vectors at each locus in the genome-

that is, P(v(x) = w) for all inheritance vectors wE V. Inthe absence of any genotype information, all inheritancevectors are equally likely according to Mendel's firstlaw, and the probability distribution is uniform (abbre-viated as Puniform). As genotype information is added,the probability distribution is concentrated on certaininheritance vectors. The probability distribution over

possible inheritance vectors will be referred to as theinheritance distribution; the notion is illustrated in figure1B and C.

Calculating the Inheritance Distribution by Use ofHidden Markov Models (HMMs)To extract the full information from a data set, one

must calculate the inheritance distribution conditionalon the genotypes at all marker loci (abbreviatedPcomplete). Lander and Green (1987) described how, inprinciple, an HMM can be used to solve this problem.In brief, the approach considers the inheritance pattern

A.

[xl,x2] [x3,x4]

[xl,x3] [x5,x6]

[xl ,x5] 5

C.

inheritance vector0000000100100011010001010110011110001001101010111100110111101111

prior1/161/161/161/161/161/161/161/161/161/161/161/161/161/161/161/16

posterior1/81/80

0

1/81/80

0

1/81/80

0

1/81/80

0

B.

AB AC

AC3 BB

AB 5

true1

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Figure 1 Illustration of the inheritance vector and its distribu-tion, for a simple pedigree. A, Pedigree shown with individuals labeled"1" through "5." The distinct-by-descent founder alleles are labeled"xl" through "x6"; they are assumed to be phase known, with thepaternally derived allele listed first. The four meiotic events whoseoutcomes determine inheritance in the pedigree are indicated byarrows; the labels correspond to the coordinates in the inheritancevector. The inheritance outcome shown is specified by inheritancevector (0,0,0,0)-that is, the paternally derived allele is transmittedin every meiosis. B, Same pedigree, now shown with actual genotypesat a marker with three alleles, A, B, and C. Only the outcome ofmeiosis 3 is unambiguously determined by the genotype data-thepaternally derived allele is transmitted, fixing the third bit in the inheri-tance vector at 0. C, Inheritance distribution for the 16 possible inheri-tance vectors. "prior" denotes distribution before any genotyping hasbeen performed; "posterior" denotes distribution based on genotypesin panel B; and "true" denotes distribution based on fully informative,phase-known genotypes as in panel A.

across the genome as a Markov process (with recombi-nation causing transitions among states) that is ob-served, imperfectly, only at marker loci. One uses theimperfect observations at each marker (more precisely,the probability distribution over inheritance vectors ateach marker locus, conditional only on the data for thelocus itself [abbreviated as Pmarker]), to reconstruct theprobability distribution at any point, conditional on theentire data set, according to the standard forward-back-ward conditioning approach employed in HMMs (Rabi-ner 1989). In the basic Lander-Green algorithm, the time

1349

-a

Am. J. Hum. Genet. 58:1347-1363, 1996

required for theHMM reconstruction step with m mark-ers is O(m 2"). Because this scales linearly with thenumber of loci but exponentially with the number ofnonfounders, the approach is best suited to completemultipoint analyses in pedigrees of moderate size. Incontrast, the Elston-Stewart algorithm scales exponen-tially with loci but linearly with nonfounders and thusis best suited for studying one or a few markers in largepedigrees.

First SpeedupKruglyak et al. (1995) recently showed how to de-

crease the time required for the HMM reconstructionstep from O(m * 2") to O(m * n22n), thereby effectivelydoubling the pedigree size to which the HMM approachcan be applied. With this speedup, the approach hasbeen implemented in special cases, to allow completemultipoint analysis for homozygosity mapping, linkageanalysis in nuclear families, and sib-pair analysis (Krug-lyak et al. 1995; Kruglyak and Lander 1995). To applythe approach to general pedigrees, it is necessary to havean algorithm for calculating the initial distributions usedin the HMM, Pmarker for pedigrees of arbitrary structure.We have now devised such an algorithm, which is de-scribed in appendix A.

Second SpeedupWe have devised a further substantial acceleration of

the HMM, by taking advantage of a certain degeneracyamong the inheritance vectors. Since a pedigree containsno information about founder phase, inheritance vectorsthat differ only by phase changes in the founders arecompletely equivalent and must therefore have equalprobabilities. In a pedigree with f founders, the inheri-tance vectors can thus be organized into equivalenceclasses consisting of 2f equivalent members. The HMMalgorithm can be modified to work with just a singlerepresentative from each equivalence class, as describedin appendix B. This reduces both the time and spacerequirements of the calculation by a factor of 2f, furtherincreasing the size of pedigrees that may be analyzed.The running time for analysis of m markers is thusO(m - n22-f )

Computer ImplementationWe have implemented theHMM approach with these

two speedups in a new computer package, GENE-HUNTER. On current workstations, GENEHUNTERcan comfortably handle pedigrees with 2n - f - 16,or, typically, approximately a dozen nonfounders. Someexamples of pedigrees that can be readily analyzed aregiven in figure 2.The same methods also can be used to estimate the

number of recombination events between two markers.GENEHUNTER includes an option to compute this

A

B

C

Figure 2 Examples of pedigrees that can be analyzed by usingGENEHUNTER. A, Simple three-generation pedigree segregating adominant disorder. Individuals in the last two generations are avail-able for study. B, Complex inbred pedigree that occurred in the studyof Werner syndrome (Thompson and Wijsman 1994). Only the af-fected individual in the last generation is available for study. C, Pedi-gree with three affected fourth cousins. Only the affected individualsin the last generation are available for study.

number for pairs of consecutive markers, which can beuseful for detecting genotyping errors that cause mapinflation.

Information-Content MappingIn studying a pedigree, it is useful to know how much

of the total inheritance information has been extracted

1350

OIT-*2

25 26

30 31 0 41 42


at each point in the genome, given the available genotypedata. We introduced a notion of "information-contentmapping" in our previous work on sib-pair analysis(Kruglyak and Lander 1995). Information content pro-vides a measure of how closely a study approaches thegoal of completely determining the inheritance outcome,and it points out the regions where typing additionalmarkers is most useful. Here, we modify our previousapproach and extend it to arbitrary pedigrees.The classical information-theoretic measure of resid-

ual uncertainty in a probability distribution is its en-tropy, defined by E = -IPilog2Pi, where Pi is the proba-bility of the ith outcome and where log2 is used in orderfor the entropy to be measured in bits (Shannon 1948).The entropy of the probability distribution over inheri-tance vectors thus naturally reflects information content.

In the absence of genotype data, the probability distri-bution is uniform over all 22n-f equivalence classes ofinheritance vectors. The entropy of the distribution iseasily seen to be E = 2n - f bits. This result makesintuitive sense, since we are completely uncertain aboutthe outcome of the 2n - fmeioses for which informationcan be obtained. If the inheritance vector is known withcertainty (e.g., at a fully informative marker), the proba-bility distribution is completely concentrated on a singleoutcome. The entropy is thus E = 0, which again makesintuitive sense.The information content of the inheritance pattern at

point x will be defined by

IE(X) = 1 - E(x)/Eo, (1)

where E(x) is the entropy of the multipoint inheritancedistribution at x and where Eo = 2n - f bits is theentropy in the absence of genotype data. Informationcontent IE(X) = 1 indicates perfect informativeness at x,whereas information content IE(X) = 0 indicates totaluncertainty about inheritance in the pedigree at x. Sinceentropy is an additive measure, it can be summed overall pedigrees in the data set. Equation (1) is then usedwith total entropy to obtain the overall information con-tent of a study.

IE is a general measure of information content. It doesnot depend on any particular test for linkage and hasthe desirable property that it always lies between 0 and1. (This contrasts with a somewhat different measure ofinformation content, which we discussed in previouswork on sib-pair analysis [Kruglyak and Lander 1995].)An example of information content for different mapdensities is shown in figure 3.

Unified Linkage Analysis

We now define both parametric and nonparametricanalysis from a unified perspective, which is based on

cu

0(a

Chromosome position (cM)

Figure 3 Information-content mapping for various marker den-sities. Genotypes for markers spaced every 2 cM on a 100-cM map,with typical microsatellite informativeness levels (heterogeneity .75),were simulated for 10 sibships having four sibs each, with missingparents. The five curves show the information content with markersgenotyped at SO-cM, 25-cM, 12-cM, 6-cM, and 2-cM average spacing(corresponding, respectively, to 3, 5, 9, 17, and 51 markers genotypedacross the map). The average information content increases from-40% to 54%, 63%, 72%, and 85%, respectively.

the notion of inheritance vectors. In the ideal situation-the precise inheritance vector v(x) at each point x isknown with certainty-linkage analysis simply involvesquantifying the extent to which the inheritance vectorindicates the presence of a disease gene. This can be doneby specifying a scoring function S(vA)) that depends onthe inheritance vector v and the observed phenotypes 1in the pedigree.To extend the analysis to the more realistic situation

in which one has only a probability distribution overv(x), one can generalize the scoring function by takingits expected value over the inheritance distribution:

(2)S(xA) = I S(w,$)P[v(x) = w].wEV

Given the probability distribution over inheritance vec-tors at every point x, it is then straightforward to calcu-late S throughout the genome. Specifically, one couldcalculate once and store the 22n-f values of S(vA). Foreach point x, one could then compute the linear combi-nation in equation (2) in time 0(22n-f ). We now con-sider various choices of scoring functions S that corre-spond to parametric linkage analysis and NPL analysis.

Parametric Linkage Analysis

Scoring FunctionIn parametric linkage analysis, one assumes a model

describing the probability of phenotype given genotype

1351

Am. J. Hum. Genet. 58:1347-1363, 1996

at the disease locus and calculates the likelihood ratiounder the hypothesis that a disease gene is at x, versusthe hypothesis that it is unlinked to x. In the specialcase when the inheritance vector is known, the scoringfunction S is simply the likelihood ratio. It is given by

LR(v) = P(( I v)E P((F W)Puniform(W)

wE- V

P((D v) is simply the likelihood of observed phenotypes(D, conditioned on the particular inheritance vector v; itdepends only on the penetrance values and allele fre-quencies at the disease locus. For each v, one can effi-ciently compute P((D v) by a simple adaptation of stan-dard peeling methods for pedigrees without loops(Elston and Stewart 1971; Lange and Elston 1975; Can-nings et al. 1978; Whittemore and Halpern 1994b) andby a combination of peeling, loop breaking, and enumer-ation of founder genotypes for pedigrees with loops (fordetails, see appendix C). Calculating the likelihood foreach of the 22n-f equivalence classes of inheritance vec-tors is rapid for moderate-sized pedigrees, both with andwithout loops.

In the general case, we take the expectation of thescoring function over the inheritance distribution, as inequation (2):

LR(x) = A LR(w)P(v(x) = w)we V

I P(4) W)Pcomplete(W)= wEV

I P(F W)Puniform(W)wEV

This expression is easily seen to be equivalent to thetraditional definition of the likelihood ratio-the nu-merator is proportional to the multipoint likelihoodwhen the disease gene is at x, whereas the denominatoris proportional to the unlinked likelihood. According tolong-standing tradition, one reports the LOD score,loglo(LR).

Because traditional LOD-score analysis can be ex-pressed in the unified framework above, the fast HMMapproach provides a rapid algorithm for performingcomplete multipoint linkage analysis in moderate-sizedpedigrees. The LOD scores obtained by this method areexact-no approximations are involved. The only dif-ference with conventional algorithms is the speed ofcomputation when many markers are considered simul-taneously.

ImplementationWe have implemented parametric linkage analysis

within GENEHUNTER. The program can computeLOD scores for arbitrary pedigrees under particular

models of inheritance, allowing the user to specify allelefrequencies at the disease locus and penetrances for lia-bility classes (including age- and sex-dependent pene-trances). The program also allows the user to test forlinkage under genetic heterogeneity by using an admix-ture model (Ott 1991; Terwilliger and Ott 1994) to esti-mate the proportion of linked families a. Alternatively,the user can specify the admixture parameter a.To illustrate its performance, GENEHUNTER was

applied to simulated data for the pedigrees shown infigure 2. For each pedigree, we simulated genotype datafor a genetic map of 20 markers under the hypothesisof a disease-causing gene located in the middle of themap. We then calculated complete multipoint LODscores at each marker and at four points within eachinterval between markers, that is, at 96 distinct maplocations (fig. 4). On a DEC Alpha workstation, thecomputation times for these 96 21-point LOD scores(disease locus plus all 20 markers) were 24 min, 82 min,and 280 min, for pedigrees A, B. and C, respectively.(The respective values of 2n - f are 14, 15, and 16).For each of the three pedigrees, the maximum LOD

score computed by using complete multipoint analysisapproaches the theoretical maximum LOD score thatwould be obtained with an infinitely polymorphicmarker located at a recombination fraction of zero fromthe disease gene. In particular, for pedigree C in figure2, the three isolated fourth cousins have a probability of( 1/2)13 of sharing an allele IBD, resulting in a theoreticalmaximum LOD score of 3.91. The multipoint LODscore nearly achieves this maximum, with a LOD of3.84 (fig. 4C), indicating that it has extracted essentiallyall inheritance information. In contrast, the maximumLOD score attainable with a single marker is only 1.87,and the maximum LOD score with two flanking markersis 1.98. In this case, multipoint analysis increases theLOD score from moderately interesting to significant,providing almost 100-fold-higher odds in favor of link-age than does two-point analysis.To further explore the value of multipoint analysis,

we considered the simpler case of a pedigree with twoaffected fourth cousins and all other pedigree membersunavailable for study. We once again simulated a 20-marker map under the hypothesis of a linked rare domi-nant gene. The IBD-sharing probability for two fourthcousins is 1/256, yielding a theoretical maximum LODscore of 2.41. In figure 5, we plot the maximum LODscore achieved by analyzing k = 1, . . . ,20 consecutivemarkers simultaneously. Complete 20-marker analysisyields a LOD score of 2.2 (91% of theoretical maxi-mum). In contrast, the highest two-point LOD scoreis only 0.83 (34% of theoretical maximum), and evensimultaneous six-marker analysis yields, at most, a LODscore of 1.74 (72% of theoretical maximum). These re-sults underscore the value that multipoint analysis with

1352


A

(M

00

0

B

0w

'00

-Jlaao

C

a)0nU,

'00

-..

Map position (cM) Map position (cM) Map position (cM)

Figure 4 Multipoint LOD-score plots for the pedigrees shown in figure 2. Genotypes for 20 markers were simulated under the assumptionof a disease gene at the location indicated by an arrow. A total of 96 21-point LOD scores were computed, with the disease locus tested ateach marker and at four evenly spaced locations in each interval between markers. Marker positions are indicated by tick marks on thehorizontal axis. A, Pedigree of figure 2A, with a rare dominant gene (frequency 10-4). B, Pedigree of figure 2B, with a rare recessive gene(frequency 10-4). C, Pedigree of figure 2C, with a very rare dominant gene (frequency 10-6).

many markers has for extracting the full inheritanceinformation. Such multipoint analysis is clearly desir-able, since it requires only 40 s on a SUN SPARC work-station running GENEHUNTER.To compare the performance of GENEHUNTER with

that of other linkage packages, we analyzed the pedigreewith two affected fourth cousins, using FASTLINK(Cottingham et al. 1993) and VITESSE (O'Connell andWeeks 1995), both running on a SUN SPARC worksta-tion. FASTLINK required 32 min to compute LODscores when using overlapping sets of two markers (28three-point calculations), with a maximum LOD scoreof 0.98. Four-point calculations failed to complete after-1-00 h. VITESSE required 85 s to compute LOD scoreswhen using two markers simultaneously, 30 min to com-pute LOD scores when using three markers simultane-ously (54 four-point calculations; maximum LOD scoreof 1.28), and 19 h 14 min to compute lod scores whenusing four markers simultaneously (68 five-point calcu-lations; maximum LOD score of 1.43). Six-point calcu-lations failed to complete after '100 h. These otherprograms thus can perform multipoint analysis with ahandful of markers, but not the complete multipointcalculations necessary to extract all available inheritanceinformation. On the other hand, these programs are ableto handle very large pedigrees that are beyond the com-putational limitations of GENEHUNTER.GENEHUNTER's speed is independent of the number

of alleles per marker (thereby allowing highly polymor-phic markers to be used without recoding) and is essen-tially independent of the amount of missing informationin the pedigree. The program has been tested extensively

by comparing the results with those produced by LINK-AGE (Lathrop et al. 1984) and FASTLINK (Cottinghamet al. 1993), for a variety of family structures and modesof inheritance (in analyses using a small number ofmarkers). In all case examined, the three programs pro-duced identical answers.

NPL Analysis

Scoring FunctionsWe begin by considering the special case in which the

inheritance vector is known with certainty. The inheri-tance vector fully determines which of the 2f distinctfounder alleles was inherited by each person and thuscompletely specifies IBD sharing in the pedigree. Theonly issue is to define a suitable scoring function tomeasure whether affected individuals share alleles IBDmore often than expected under random segregation.One simple approach would be to assign a score of 1 ifall affected individuals in a pedigree share an allele IBDand to assign a score of 0 otherwise (Thomas et al.1994). However, this statistic is likely not to be robustin the presence of phenocopies and common disease al-leles. We consider below two useful scoring functions,Spairs and Sall, previously discussed by Whittemore andHalpern (1994a); other scoring functions can be defined.

1. IBD sharing in pairs.-One possible approach is tocount pairwise allele sharing among affected relatives.Given the inheritance vector v, Spairs(v) is defined to bethe number of pairs of alleles from distinct affected pedi-gree members that are IBD. The traditional APM statis-tic (Weeks and Lange 1988) also counts pairwise allele

1353

Am. J. Hum. Genet. 58:1347-1363, 1996

I-

0

0

2-

..s.............................................................................................

*0 4

000

00

0

t i-I- . .

0 5 10 15 20

Marker number

Figure 5 Maximum LOD score achieved in a pedigree with twoaffected fourth cousins, plotted as a function of k, the number ofmarkers analyzed simultaneously. Genotypes for 20 markers weresimulated by assuming the presence of a very rare dominant diseasegene (frequency 10-6) in the middle of an 18-cM map, as in figure4C. LOD scores were computed with the disease locus tested at themarkers and at four points within each interval between markers;genotype data from overlapping sets of k consecutive markers wereused. Black dots show results for multipoint analysis with sets of1,2,3,. . . ,11, and 20 markers; and the dotted line shows the theoreti-cal maximum LOD score of 2.41 for this pedigree. The maximumLOD score of 2.2, achieved with 20 markers, is the highest possiblewith this marker density and polymorphism; doubling the markerdensity and performing multipoint analysis with 40 markers raises themaximum LOD score to 2.4 (data not shown).

sharing, but it is based on sharing IBS rather than onsharing IBD; the two statistics will coincide only atmarkers for which IBS unambiguously determines IBD.

2. IBD sharing in larger sets.-One can often increasestatistical power by considering larger sets of affectedrelatives, rather than just pairs. For example, it is moreimpressive to find that five affected relatives share thesame allele IBD than to find that each pair of them sharessome allele IBD. Whittemore and Halpern (1994a) pro-posed an interesting statistic to capture the allele sharingassociated with a given inheritance vector v. Let a denotethe number of affected individuals in the pedigree, let hbe a collection of alleles obtained by choosing one allelefrom each of these affected individuals, and let bi(h)denote the number of times that the ith founder alleleappears in h (for i = 1, . .. ,2f). The score Sall is definedas

Sali(V) = 2 a L[ bi(h)!1h i= 1

where the sum is taken over the 2a possible ways tochoose h. In effect, the score is the average number of

permutations that preserve a collection obtained bychoosing one allele from each affected person. It givessharply increasing weight as the number of affected indi-viduals sharing a particular allele increases.

For either approach, we define a normalized score

Z(v) = [S(v) - p1/a, (3)

where J and a are the mean and SD of S under Puniformthe uniform distribution over the possible inheritancevectors. (These quantities can be calculated by enumera-tion over all vectors.) Under the null hypothesis of nolinkage (i.e., Puniform), the normalized score Z has mean0 and variance 1.To combine scores among m pedigrees, one can take

a linear combination

m

Z= xYiZi,i=l

(4)

where m is the number of pedigrees, Zi denotes the nor-malized score for the ith pedigree, and the y, areweighting factors. The weighting factors should be cho-sen so that i = 1, so that Z has mean 0 and variance1 under the null hypothesis of no linkage. We will useYj = i1Fm in the applications below; this choice appearsto provide a good compromise between small and largepedigrees. It may be possible to increase power by select-ing y, according to the nature of the pedigrees, but wewill not explore this issue here, other than to note thatthe optimal choice will likely depend on the (usuallyunknown) genetic architecture of particular diseases.We will refer to Z as the NPL score for the collection

of pedigrees. In some cases, we will speak of NPLpairsand NPLall scores, to indicate the scoring function underconsideration.

Statistical SignificanceSuppose that analysis of pedigrees yields an NPL sta-

tistic of Zobs. What is the significance level of this obser-vation? There are two simple approaches:

1. Exact distribution. It is straightforward to computethe exact probability distribution of the overall scoreZ under the null hypothesis of no linkage. Specifi-cally, one can calculate the distribution for each pedi-gree by enumerating all possible inheritance vectors;the distribution for the collection of pedigrees is thenobtained by convolving these distributions. One canthen simply look up the exact value, P(Z ' Zobs).

2. Normal approximation. Under the null hypothesisof no linkage, the score Z will tend toward a standardnormal variable as one studies many similar pedi-grees. (This follows from the central limit theorem,since Z is an appropriately normalized sum of inde-

1354

I


pendent random variables.) The significance level ofan observation Zobs can then be approximated byconsulting a table of tail probabilities for the stan-dard normal. Although less precise than the exactdistribution, the normal approximation is useful insome settings.

Imperfect DataWe have so far considered the situation in which the

inheritance vector is known with certainty. In fact, it isstraightforward to extend Z to the general case, by tak-ing its expected value over the inheritance distribution,as in equation (2): Z(x,F) = 4wev Z(wA) Prob[v(x)= w], where the probability distribution over inheri-tance vectors here refers to the joint distribution overall pedigrees. To be precise, for a single pedigree wereplace S(v) by SKin equation (3); the normalized scoresfor individual pedigrees are then combined into an over-all score as in equation (4).The only complication is in evaluating the statistical

significance of Z. Because Z is the expectation over theobserved inheritance distribution, its statistical proper-ties depend on the distribution of possible inheritancedistributions (given the markers and pedigree structure).This distribution could be explicitly studied by MonteCarlo sampling from all possible realizations of themarker data. However, it is not hard to show that Zhas the following properties under the null hypothesisof no linkage (see appendix D):

1.

2.

mean(Z) = mean(Z) = 0;

variance(Z) - variance(Z) = 1;

3. Z is asymptotically normally distributed as one stud-ies a large number of similar pedigrees.Moreover, Z approaches Z as information content ap-

proaches 100%, under both the null hypothesis of no

linkage and the alternative hypothesis of linkage. Giventhese properties, it seems reasonable to evaluate the sta-tistical significance of an observation Zobs by using thenull distribution of Z expected in the case of completeinformativeness. The significance level is likely to beconservative (in view of 1 and 2; eqq. [5] and [6]) andbecomes increasingly accurate as information contentincreases. We will refer to this approach as the perfect-data approximation.

Simulation studies (see below) show that the perfect-data approximation is indeed conservative but that itsacrifices relatively little power except when informationcontent is very low. Indeed, significance levels appear tobe within twofold of the empirical values obtained fromsimulations. The approximation should thus not hamperinitial detection of interesting regions and should grow

I

II

inl

Figure 6 Pedigree structure used in the power simulations. Dis-ease inheritance was simulated for four models: dominant, recessive,and two intermediate models (for details on the models, see table 1).Pedigrees were selected for analysis if they had at least three affectedsin generation III, including at least one affected individual in eachsibship. Individuals in generation I were assumed to be unavailablefor genotyping. Genotypic data were simulated for 11 markers spacedevery 10 cM on a 100-cM map; all markers had five equally frequentalleles (heterogeneity .8). The disease locus was assumed to lie in themiddle of the map, exactly at marker 6. A total of 100 pedigrees weresimulated for each model, and 100 sets of 5 or 10 pedigrees each(depending on the model; see table 1) were resampled from this initialset for power calculations.

increasingly accurate as one genotypes additional mark-ers in these regions.

ImplementationWe have implemented the calculation of both NPLpairs

and NPLall scores within GENEHUNTER. Given theinheritance distribution at each point in the genome, thecalculation of NPL scores is rapid. The program reportsthe normalized score Zi for each pedigree, the overallstatistic Z, and the significance levels for the perfect-data approximation based on the exact approach.Although we have focused only on Spairs and Sal, it is

straightforward to substitute other scoring functions toinclude further information about sharing among af-fected individuals or even about nonsharing betweenaffected individuals and unaffected individuals. Suchscoring functions can be easily incorporated into GENE-HUNTER.

Evaluation of NPL Analysis

Power ComparisonsWe compared the performance of the various linkage

methods on simulated data, assuming dominant, reces-sive, and two intermediate models and a 10-cM geneticmap with markers having heterozygosity of 80%. Thepedigree structure used in the simulations is shown infigure 6 (for details, see the legend to fig. 6). The pedi-grees were analyzed by using complete multipoint para-metric linkage analysis (under the model used to gener-ate the data), complete multipoint NPL analysis (usingboth the Spairs and Saii scoring functions), and APM anal-ysis (Weeks and Lange 1988, 1992). The performance

1355

Am. J. Hum. Genet. 58:1347-1363, 1996

Table 1

Power Comparisons Based on Simulations

P == =

.05 .01 .001 .0001 .00001 .05 .01 .001 .0001 .00001

Power to Detect Linkage with N Pedigrees Expected No. of Pedigrees Required for(%) Detection of Linkage

Dominant: Dominant:NPLai1 100 99 95 92 81 NPLaii 1 1 2 3 4NPLpairs 99 95 67 33 10 NPLpairs 2 3 5 7 9APM 62 42 15 5 2 APM 5 8 15 22 29LOD 100 100 100 100 100 LOD 1 1 2 2 3

Recessive: Recessive:NPLall 100 99 96 79 66 NPLaii 1 2 3 3 4NPLpairs 100 100 97 82 62 NPLpairs 1 2 3 4 5APM 87 59 34 19 10 APM 2 4 6 9 11LOD 100 100 100 100 100 LOD 1 1 2 3 3

Complex 1: Complex 1:NPLail 64 40 17 7 4 NPLaIj 10 20 34 50 65NPLpairs 58 25 7 1 0 NPLpairs 16 31 55 79 102APM 40 23 10 2 1 APM 21 42 74 107 141LOD 77 57 27 4 1 LOD 6 11 19 28 36

Complex 2: Complex 2:NPLaii 100 99 98 92 79 NPLaii 2 3 4 6 8NPLpairs 100 99 92 71 49 NPLpairs 2 4 6 8 11APM 83 68 41 14 5 APM 4 8 14 20 26LOD 99 99 97 95 87 LOD 1 2 4 5 7

NoTE.-N = S was used for the dominant and recessive models, and N = 10 was used for the two complex models. Power was defined asthe number of data sets (from 100) in which the appropriate threshold was exceeded. Model parameters were as follows (fd = disease genefrequency; P++, P+d, and Pdd = penetrances of + +, +d, and dd genotypes, respectively): for the dominant model, fd = .01, p++= .001, P+d = .999, and Pdd = .999; for the recessive model, fd = .05, pI+ = .001, P+d = .001, and pdd = .999; for the complex model 1, fd =.05, P++ = .05, P+d = .4, and Pdd = .6; and, for the complex model 2, fd = .01, p++ = .01, P+d = .45, and Pdd = .75. These parameters correspondto disease incidence of 2.1%, 0.35%, 8.5%, and 1.9%, respectively, and to phenocopy rates of 5%, 29%, 53%, and 52%, respectively. Thethresholds used for asymptotic significance levels of .05, .01, .001, and .0001, and .00001 were .59, 1.17, 2.07, 3.00, and 3.95, respectively, forthe LOD score (LOD) and 1.65, 2.33, 3.09, 3.72, and 4.27, respectively, for the normal scores (NPL and APM). LOD scores were computed byusing GENEHINTER, in order to carry out multipoint analysis with 11 markers in reasonable time. Multipoint NPL statistics and multipointLOD scores were computed by using all 11 markers simultaneously. As noted in the text, multilocus APM does not compute a multipoint statisticas a function of location. Instead, a statistic testing linkage to a region is computed. Recombination between loci is not fully taken into account,with the result that the statistic can decrease as additional flanking markers are considered, even in the presence of linkage. Therefore, we computedsingle-locus APM statistics by using the marker at the true locus, as well as multilocus APM statistics including 1, 2, 3, 4, and 5 closest flankingmarkers on each side, and we chose the highest statistic for each replicate when estimating power.

of the parametric LOD-score method under the correctmodel was used as a benchmark, although it should benoted that the correct model is usually unknown andthat model misspecification can lead to considerable lossof power.We used two criteria to assess performance: (1) the

power to detect a locus in a fixed sample of pedigreesand (2) the expected number of pedigrees required todetect a locus. Both measures were completed for vari-ous nominal significance levels (p = .05, .01, .001,.0001, and .00001). The results are summarized in table1. Three main conclusions emerge.

First, the NPLai statistic performed better than theNPLpairs statistic in all cases studied (except for the reces-

sive case, where the two showed comparable perfor-mance). This accords with the intuition that testingwhether the same allele is found IBD in many affectedrelatives is a more powerful strategy than consideringone relative pair at a time. The NPLa,, statistic thus ap-pears to have the desirable property of robustness, andit was used in all other comparisons.

Second, the NPL statistic was much more powerfulthan the APM statistic, for all models examined. Onaverage, at a given significance level, NPL required twoto seven times fewer pedigrees for detecting linkage. Thegreater power of NPL is explained by its efficient use ofall available information from simultaneous consider-ation of both all relatives and all markers.

13S6


Table 2

Empirical False-Positive Rates Observed in 50,000 Simulations

FALSE-POSITIVE RATE AT P =

.05 .01 .001 .0001 .00001

Sall, exact .03 .004 .0003 .00002 0Sall, normal .04 .008 .001 .0002 .00006Spairs, exact .03 .005 .0004 0 0Spairs, normal .04 .008 .001 .0001 0

NoTE.-Genotypes for 50,000 data sets consisting of seven smalltwo- and three-generation pedigrees were simulated, under the as-sumption that there is no linked disease-causing locus. "exact" refersto p values obtained from the exact distribution of scores; and "nor-mal" refers to p values obtained from the normal approximation. Theperfect-data approximation is used in both cases.

Third, the performance of NPL was roughly compara-ble to that of the LOD-score analysis under the correctmodel. NPLaI thus appears to provide a nonparametricpedigree-analysis method that loses relatively little powerwhen compared with the best parametric method. Thisfeature is particularly significant because the NPL methodrequires neither consideration of multiple models of inher-itance (thus avoiding corrections for multiple testing) noradvance knowledge of the correct model of inheritance(thus avoiding problems of misspecification).

In addition to the power comparisons, we also exam-ined the false-positive rate of NPL via simulation (table2). The theoretical significance levels based on the perfect-data approximation provide a somewhat conservative test(with empirical false-positive rates roughly half those ex-pected from theory), whereas those based on the asymp-totic approximation of normality are closer to (and occa-sionally exceed) the empirical values. In summary, theprocedures for evaluating statistical significance that areoutlined above appear to be reasonable.

Application to Idiopathic Generalized EpilepsyTo compare NPL and APM on real data, we reana-

lyzed pedigrees with idiopathic generalized epilepsy(IGE) that recently were reported by Zara et al. (1995).IGE is a neurological disorder of unknown etiologycharacterized by recurring seizures. The pedigrees areshown in figure 7. Zara et al. used APM to obtain evi-dence for linkage of IGE to chromosome 8q24. Single-locus APM gave the strongest evidence for linkage atD8S256, with an APM statistic of 3.44, when allelefrequencies taken from GDB were used, and 2.90, whenallele frequencies estimated from the study sample wereused. (We quote only the APM scores obtained with the1/V weighting function, which gave the strongest re-sults). These APM scores correspond, respectively, totheoretical p values of .0003 and .002 when the statistic

is assumed to follow a normal distribution and to empir-ical p values of .002 and .006 on the basis of simulations(Zara et al. 1995). Multilocus APM using D8S284,D8S256, and D8S534 gave somewhat weaker evidencefor linkage. The statistics were 2.647 (GDB allele fre-quencies) and 1.478 (sample allele frequencies), corre-sponding to theoretical p values of .004 and .018, re-spectively, and to empirical p values of .008 and .07,respectively. Zara et al. considered these results as sug-gestive of the presence of an IGE susceptibility locus on8q24 and stressed the need for confirmation in addi-tional family sets.We reanalyzed these data, using the NPL statistic

Spairs, which is the appropriate IBD generalization of theIBS APM statistic. A single-marker analysis yielded ascore of 2.26 at D8S256 (p = .02). A completemultipoint analysis involving all three markers yieldeda lower score, 1.79 (p = .063). The results were almostidentical for both choices of allele frequencies.

Interestingly, NPL detects less evidence for linkagethan does APM. Why? It turns out that the APM analysisgives weight to several instances of allele sharing thatare IBS but not IBD. For example, it is clear that D8S256is completely uninformative for linkage in family 13,since both parents are homozygous for the "10" allele(fig. 7). NPL assigns a score of 0 to this pedigree, sincethe allele sharing among affected individuals does notreflect IBD. In contrast, APM gives substantial weightto the observation of allele sharing at this locus. Indeed,the APM score for this pedigree is 1.08. Similarly, af-fected individuals 4 and 5 in family 8 share the "9"allele at D8S256 and thus contribute to the APM score.However, consideration of haplotypes clearly showsthat this allele is not shared IBD. NPL analysis correctlydoes not detect this as sharing.

This example illustrates key advantages of NPL. Be-cause NPL assesses IBD sharing on the basis of informa-tion from multiple markers and all genotyped relatives,it is less likely to be misled by chance sharing of allelesand is less sensitive to specification of allele frequencies.In contrast, APM has been reported to be prone to falsepositives, particularly when the single-locus method isused and correct allele frequencies are not known(Babron et al. 1993; Weeks and Harby 1995).

Application to SchizophreniaTo further evaluate the performance of NPL, we ap-

plied it to the data reported by Straub et al. (1995) onthe 265 pedigrees in the Irish Study of High DensitySchizophrenia Families (ISHDSF). This study used bothparametric and nonparametric methods to map a poten-tial schizophrenia-vulnerability locus to chromosome6p24-22 and provided evidence for genetic heterogene-ity. A total of 16 markers spanning 38.4 cM on 6p wereexamined.

1357

Am. J. Hum. Genet. 58:1347-1363, 1996

Pedigree 3 Pedge 5 Pedigree 7

1 28 4 27 9 5 94 7 6 2

3 4 7432 448 48 4 2 478 4a89 9 9 5 9 5 9 9 9 5 9 57 2 7 6 7 6 7 2 7 6 7 6

8 4 8 4 8 65 1 S a 6 56 1 6 1 2 6

Pedigree 13

12a 6 8 310 10 10 106 6 4 2

3 4a a 8 310 10 10 106 4 6 2

Pedigree 15 Pedigree 17 Pedigree 19

1 26 3 2 45 2 6 106 2 9 7

3 42 6 2 66 5 6 29 6 9 2

1 26 8

4 42 2

3 4

736 5 310 4 3 107 2 3 6

563 43 2

1 23 6 6 6

6 3 6 310 10 10 105 4 8 8

Pedigree 8

E--\1 2

7 69 91 2

4 3 5 68 7 66 8 6 3 62 9 6 9 2 9 9 97 1 6 2 7 2 3 5

7 8 96 7 6 7 3 69 9 9 9 9 92 1 2 1 3 2

Pedigree 30

1 24 34 106 8

S

8 4 2 8 4 3

2 4 105 2 49 6 6 5 6 6

6 7

2 4 8 8104 5 266 5 9

Figure 7 Ten pedigrees used in the IGE study, redrawn from the report by Zara et al. (1995). Blackened symbols indicate individualsconsidered as affected in the analysis. Genotypes for D8S284, D8S256, and D8S534 (from top to bottom) are shown under each individual'ssymbol.

In their parametric analysis, Straub et al. computedtwo-point LOD scores under the assumption of hetero-geneity, using four genetic models each with four diag-nostic categories. They obtained the strongest evidencefor linkage at marker D6S296, under the "Pen" modeland the broad diagnostic category (D1-D8), with a two-point LOD score of 3.51 (p = .0002) at 0 = .004 andproportion of linked pedigrees a = .40. To extract allavailable linkage information, we extended the paramet-ric study from single-marker analyses to a complete 16-marker multipoint analysis under the same model anddiagnostic category (fig. 8A). The LOD curve peaked atD6S470 (2.6 cM proximal of D6S296), with a LODscore of 2.96 and a = .26. The multipoint results are

likely to be more accurate because they do not rely on

properties of a single marker. Indeed, the estimate ofthe heterogeneity parameter a agrees more closely withthe estimate of 15%-30% that Straub et al. (1995) ob-tained by other means. Multipoint analysis also allowedus to compute the two-LOD support region, which ex-

tended over a 24-cM interval from D6S477 to D6S422.Straub et al. also performed nonparametric single-

marker analysis with ESPA (extended-sib-pair analysis[Sandkuijl 1989]), but they obtained much weaker evi-dence of linkage. Under the broad diagnostic category,they found values of p = .17 at D6S296 and p = .03 atD6S285, which is 16 cM proximal of D6S296. (Under thenarrower D1-DS diagnostic category, a p value of .005

was found at D6S285.) To compare the power of NPLanalysis, we performed a complete 16-marker analysiswith the NPLaII statistic. Under the broad diagnostic cate-

gory, we obtained a maximum NPL score of 3.25 (p =

.0005) at D6S470, in the same position as that of themultipoint LOD-score peak (fig. 8B). Secondary peaks of-2.9 were obtained at D6S260 and D6S422.Our results support the findings by Straub et al.

(1995) and illustrate the advantages of the new analysismethod. NPL analysis provided the same degree of evi-dence for linkage as did the parametric LOD-scoremethod, without the need to examine multiple modelsof inheritance. (Of course, the choice of diagnostic cate-gories remains important.) NPL provided strong evi-dence for linkage, whereas the other nonparametricmethod, ESPA, showed, at best, weak evidence. We at-

1358

Pedigree 12


A B C

4-

3-

0

0~

z2-

1-

Map position (cM)

1-

0.8-4,

r.0To~~~~~~~~~~~~~~~~~~~~~'

wo0.4-

0.2-

10 20 30

Map position (cM)

* *. .0* .000

0

10 20 30

Map position (cM)

Figure 8 Analysis of pedigree data from the Irish schizophrenia study by Straub et al. (1995). Map position is in Kosambi centimorgams,from D6S477. A, Multipoint LOD scores computed under the Pen model, against a fixed map of 16 markers on chromosome 6p (LOD scores

were computed at all markers and at four points within each interval between markers). B, Multipoint NPL scores (Sai, statistic). C, Information-content mapping. The solid line shows the multipoint information content across the map when all 16 markers are used simultaneously; andblack dots show the information content of individual markers.

tribute the superior performance of NPL to its efficientuse of information from multiple markers and from allfamily members. To examine this point, we calculatedthe information content for single-marker analysis ver-

sus that for the complete multipoint analysis (fig. 8C).Whereas the information content of individual markersranged from -50% to 70%, the information contentfor the map of markers was -80% across the entireregion, indicating a substantial gain in informativenessfor the multipoint approach. In summary, the variousanalyses indicate that NPL provides a useful, powerful,and robust method for demonstrating linkage to a dis-ease with an uncertain mode of inheritance in a heteroge-neous data set.

Haplotype Determination

Last, we turn to the problem of inference of haplo-types-that is, the determination of the particularfounder alleles carried on each chromosome. It is oftenuseful to infer haplotypes in order to identify doublecrossovers that may reveal erroneous data, to visualizesingle crossovers that may help confine a gene hunt,and to seek evidence of an ancestral chromosome inan isolated population. Some systematic methods forhaplotype reconstruction that have been suggested pre-

viously have been based on rule-based heuristics, ap-

proximate-likelihood calculations, or exact-likelihoodcalculations (Weeks et al. 1995). The first two methodsare ad hoc and sometimes fail to produce the best haplo-type reconstruction, particularly in the presence of miss-ing data. The third method has hitherto been limited tosmall numbers of markers and pedigrees with few miss-

ing data, owing to computational problems (Weeks etal. 1995).

Inheritance vectors provide a general framework forhaplotype reconstruction. Since the inheritance vectorscompletely determine the haplotype, the problem re-

duces to choosing the "optimal" inheritance vector atthe loci to be haplotyped. In the HMM literature, thisis a well-studied question known as the "hidden-statereconstruction problem." There are two standard solu-tions, based on somewhat different optimality criteria(Rabiner 1989):

1. The first approach is to treat each locus separatelyand select the most likely inheritance vector at eachlocus (i.e., such that Pcomplete(w) is largest). Thismethod is clearly trivial to implement, given the in-heritance distribution.

2. The second approach is to treat the loci together andselect the most likely set of vectors at the loci (i.e.,the vectors having the largest joint probability whenconsidered as a sample path of the underlying Mar-kov chain). This can be accomplished by using theViterbi algorithm (Rabiner 1989).

The first method has the advantages that it is simpleand easily reveals regions of uncertainty (in which dis-tinct vectors have similar probabilities). The secondmethod has theoretical appeal because it finds the glob-ally most likely inheritance pattern. In practice, we findthat both approaches tend to yield similar performanceand results.We have implemented both methods for haplotype

reconstruction within GENEHUNTER. The program

0

a0go0-j

us 9 rl

1359

0.61

A-

Am. J. Hum. Genet. 58:1347-1363, 1996

259 4 1 45 4 2 S4 4 4 337158362

I

349

5

4

3

7

S

8362

95

4

3715

83

3

2

42269

1S

6

9S

4173473

6 162 6

-Q279152441377314577336 16 6

31112 5x44212763945713566I6

621710

55

249455

x4 36627137810 71155

54

404955

34637736

x7 5

10 7125756

22

69

5

15

6

6738715

4

289 95 84 43 37 76 1

5 47 112 27 5

6 6

414 95 83 46 37 73 18 47 111 25 5

4 6

254142446222167910 5

11X6 5

26

I

95

4371S

8362

24

22

69S

15

6

Figure 9 Example of haplotype reconstruction. Eleven markerswere spaced every 2 cM across a 20-cM region. Individuals in genera-

tion 1 were not genotyped. Crossovers are denoted by x's.

reports the most likely haplotype, marking crossovers

and highlighting double crossovers. It also indicates re-

gions of haplotype uncertainty. The methods work forany number of markers and find the most likely haplo-type even in pedigrees with missing data. An example isshown in figure 9.

Discussion

Collecting and genotyping data sets of a size sufficientfor mapping complex disease genes is a formidable task,and it is desirable to have powerful analysis tools thatefficiently make use of all available data. This requires(a) multipoint methods that extract inheritance informa-tion from all markers and (b) robust, nonparametriclinkage methods that take account of all pedigree mem-bers. Currently available methods fall short in both re-

gards.In this paper, we describe algorithms for extracting

all available inheritance information about segregationat every point in the genome, based on genotypes atany number of markers considered simultaneously. Suchmultipoint analysis is important for several reasons: (i)

Even with relatively polymorphic microsatellite loci,multipoint analysis of many markers is required to inferIBD across several generations in pedigrees with sub-stantial missing data; it can thus substantially increasethe power to detect linkage. (ii) Multipoint analysis ismore robust to misspecification of allele frequencies andstatistical fluctuations at individual markers and can

provide a confidence interval for the location of the gene.

(iii) We expect that human genetic studies will soon

employ a third-generation genetic map consisting of bi-allelic markers, because such markers are potentiallyamenable to high-throughput automation; their lowerdegree of informativeness can be offset through the use

of a somewhat denser map, but this will depend cruciallyon the ability to perform extensive multipoint analysis.The available inheritance information is captured in

the multipoint inheritance distribution. This distributionprovides a natural definition of information-contentmapping, which measures the extent to which all inheri-tance information has been extracted. In addition, theinheritance distribution allows reliable reconstruction ofmany-marker haplotypes, even in pedigrees with missingdata.The inheritance distribution provides a unified frame-

work for both parametric and nonparametric analysis.We have shown how to apply it to perform multipointparametric LOD-score calculations and to define a

multipoint nonparametric method, NPL. The frame-work also makes it straightforward to incorporate otherlinkage statistics.We have studied the performance of NPL in applica-

tions to both simulated and actual data. NPL appears

to have many advantages over the commonly used APMmethod, including much greater power to detect linkageand less sensitivity to misspecification of allele frequen-cies. In fact, in our comparisons, NPL was nearly as

powerful as LOD-score analysis under the correct para-

metric model-but without the need to know and spec-

ify the model in advance. Because it appears to be robustto uncertainty about mode of inheritance and to loselittle power compared with parametric methods, NPLwould seem to be the method of choice for linkage analy-sis of pedigree data for complex traits. Of the two NPLmethods described, we favor the NPLaii statistic and rec-

ommend using the perfect-data approximation to thesignificance level, for both the exact and the normaldistributions. (The exact distribution provides a more

accurate estimate of significance.)The methods described in this paper are computation-

ally feasible for pedigrees of moderate size (2n - fS 16, with current workstations), although not for largemultigenerational pedigrees of the sort used to map sim-ple dominant disorders such as Huntington disease. For-tunately, moderate-sized pedigrees are precisely the kindof pedigrees being used in most complex-disease studies.

1360


Such pedigrees are easier to collect for diseases charac-terized by late onset, low penetrance, and diagnosticuncertainty. They are also more likely to reflect the ge-netic etiology of the disease in the general populationand are less likely to show intrafamilial genetic heteroge-neity.The methods described here have all been incorpo-

rated into a new interactive computer package, GENE-HUNTER. The computer program is written in C andis freely available from the authors, by anonymous ftp(at ftp-genome.wi.mit.edu, in the directory distribution/software/genehunter) or from our World Wide Web site(http: / / www-genome.wi.mit.edu / ftp/distribution /software/genehunter). We hope that GENEHUNTERwill ease the task of efficiently analyzing pedigree datafrom genetic studies of complex traits.

AcknowledgmentsL.K. would like to thank David Haussler for useful discus-

sions and the Aspen Center for Physics for its hospitality dur-ing the 1995 workshop on biological sequences. We thankRichard Straub, Kenneth Kendler, and colleagues for sharingdata from ISHDSF; Massimo Pandolfo and colleagues for shar-ing their IGE data; and Melanie Mahtani, Elizabeth Widen,Mark McCarthy, Leif Groop, and other members of the Botniadiabetes project for helpful discussions and for sharing unpub-lished data. This work was supported in part by NationalCenter for Human Genome Research grants HGO0017 (toL.K.) and HG00098 (to E.S.L.).

Appendix AComputing the Single-Locus Inheritance Distributionat Codominant Loci

We describe an algorithm for computing Pmarker, the in-heritance distribution at a codominant marker locus,conditional only on the data for that locus. We beginby noting that it is sufficient to calculate P(Omarker v),the probability of observing the marker data for eachinheritance vector v. One can then apply Bayes's theo-rem, together with the fact that all inheritance vectorsare equally likely a priori, to calculate the probabilitydistribution over inheritance vectors.

Let X = {x1 ,x2, .. . ,X2f} be symbols corresponding tothe 2f founder alleles at the marker locus, which areassumed to be distinct by descent. An inheritance vectorv specifies the precise founder alleles inherited by eachindividual in the pedigree; let xi1(v) and xi2(v) denote thealleles carried by the ith individual. Let A = {a1, . .. ,ak}denote the observable allelic states; note that distinctfounder alleles may have the same state. Let ai1 and ai2be the two observed alleles carried by the ith individual.An assignment of the founder alleles is a mapping

function, f: X -- A. For any inheritance vector v, an

assignment f is said to be v-compatible with the observedmarker data if {f[x,1(v)],f[xi2(v)]) = ja,1 ,aiJ2 for all indi-viduals who have been genotyped. (In other words, theassignment of founder alleles specified by f and the trans-mission specified by v are compatible with the observedgenotype data.) Let p(ai) denote the population fre-quency of alleles having state ai. The probability of theassignment f is 17i p[f(xi)], which is the chance that thefounder alleles will happen to have the states specifiedin the assignment.

For a given inheritance vector v, the quantityP(oFmarker v) is equal to the sum of the probabilities ofall v-compatible assignments. It thus suffices to find allv-compatible assignments. This can be done through asimple graph-theoretic process. Given v, define a graphG(v) whose vertices are the founder alleles{x1,x2, ... ,X2f} and whose edges are ei =[Xil(V),Xi2(V)bwhere i runs over all genotyped individuals. (Pairs ofvertices in G(v) can be connected by multiple edges.)Label edge ei with the corresponding genotype fai1 ,ai2.The v-compatible assignments are those such that thelabel on each edge is consistent with the assignment ofthe two vertices of that edge. Choose an arbitrary start-ing vertex y. If y has no edges, then the correspondingfounder allele does not appear in any genotyped individ-ual, and it may be assigned to any ai. If y has edges, thenits assignment necessarily must lie in the intersection ofthe labels on all edges from y; there are thus, at most,two choices for y. Given the assignment of y, the assign-ment of each neighboring vertex z is uniquely deter-mined (since the pair of assignments of y and z mustcorrespond to the label on any edge connecting them).Similarly, assignments are uniquely determined forneighbors of neighbors of y, and so on. In other words,the assignment of y automatically forces the assignmentof all other vertices in the same connected component.(If this process leads to no assignment conflicts, it pro-duces the unique v-compatible assignment for the com-ponent, given the assignment of y. If it produces a con-flict, there is no v-compatible assignment, given theassignment of y.) Each connected component can betreated separately.

For any given inheritance vector v, the running timeis easily seen to be O(n) to find all v-compatible assign-ments of graph G(v) and thus to compute P(oFmarker v).The overall running time is thus O(n22n-f) to computeP(oFmarker v) for all v and to apply Bayes's theorem tocalculate Pmarker-

Appendix B

The REDUCE Algorithm for Fast HMM Computation:Second SpeedupAs in the earlier description of the algorithm (Kruglyaket al. 1995), we identify the set of all n-bit binary vectors

1361

Am. J. Hum. Genet. 58:1347-1363, 1996

with (Z2)V, the additive vector space over the field withtwo elements (i.e., vector addition is component-wisemodulo 2). Switching the phase of the ith founder corre-sponds to addition of a vector s, in which the bits repre-senting that founder's meioses equal 1 and all other bitsare 0. Let S denote the subspace spanned by the vectorssl,... ,sf. Equivalence classes of vectors that differ onlyby founder phase are precisely the cosets of S; each cosetcontains 2fvectors. Because a pedigree contains no infor-mation about founder phase, vectors that differ only byfounder phase have equal probability; that is, probabil-ity is constant on equivalence classes (cosets). The rowsof the matrices Wk are also constant on cosets:Wa,= W~p , where a and ,3 are two vectors in the same coset.We therefore can interpret probability vectors and Wkmatrices as indexed by cosets rather than by vectors andcan perform the matrix-reduction algorithm as before,with cosets replacing vectors. This reduces the complex-ity of the problem by a factor of 2f (i.e., from 22n to22-f ), resulting in comparable savings in both time andmemory.

Appendix CComputing the Single-Locus Inheritance Distributionat Disease Loci

We describe an algorithm for computing Pdisease the in-heritance distribution at a disease-causing locus, condi-tional only on the phenotype data (disease for the disease.The disease will be assumed to have an arbitrary butspecified mode of inheritance and two alleles, normaland disease, of specified frequency. As in appendix A,we note that it is sufficient to calculate P(Ddisease v), theprobability of observing the phenotypic data for eachinheritance vector v.We need to compute P(Fdisease v) = 1GP({gi} v)

x HliP(4i Igi), where P({gi) v) is the joint probability ofthe genotypes {gi) of all pedigree members, conditionalon the inheritance vector v, and where P(DI, gi) is theprobability that the ith pedigree member has phenotype(i, conditional on having genotype gi (this probabilityis determined by the penetrance function for pedigreemember i). Conditioning on the inheritance vectormeans that {gi) is completely determined by the foundergenotypes. The sum over (gil can be computed in threeways:

1. Direct summation over founder genotypes. With ffounders, this requires computing 0(4 ) terms,which grows exponentially with the number offounders.

2. Peeling in pedigrees without loops. In peeling (Elstonand Stewart 1971; Lange and Elston 1975; Whitte-more and Halpern 1994b), one identifies peripheral

nuclear families that are connected to the rest of thepedigree by a single individual, designated the pivot.One then computes the probability of the phenotypedata in the nuclear family, conditional on the geno-type of the pivot, and replaces the penetrance func-tion of the pivot with these probabilities. In tradi-tional peeling, one sums over all possible alleletransmissions by the parents in the nuclear families.Conditioning on an inheritance vector specifies alltransmissions; thus, only one term needs to be com-puted for each pair of parental genotypes. Peelingruns in time to O(NF), where NF is the number ofnuclear families in the pedigree.

3. Loop breaking in pedigrees with loops. If loops re-main after all peripheral families have been peeledoff, one can either (a) sum over the genotypes of allremaining founders, as in procedure 1, or (b) breakloops, by creating loop breakers or "twins" and con-ditioning on their genotypes (Lange and Elston 1975;Whittemore and Halpern 1994b), and then peel offadditional founders, as in procedure 2. Each founderwho must be summed over in direct enumeration andeach "twin" contribute a factor of 4 to the numberof terms that must be computed; that is, the algo-rithm runs in time O(NF + 4r+t) where NF is thenumber of nuclear families that can be peeled off, ris the number of founders summed over at the end,and t is the number of loop breakers. We minimizethe running time by choosing to break or not breakloops so that r + t is as small as possible.

In pedigrees without loops, peeling runs in constanttime for each inheritance vector and is very rapid. Al-though computing time increases for more-complex ped-igrees, we consider only pedigrees of moderate size,which cannot contain both many loops and manyfounders. Therefore, in practice, we can rapidly computeP(Ddisease v) -and hence LOD scores -for any pedigreefor which the REDUCE algorithm is computationallyfeasible.

Appendix D

Distribution of Z

We now prove equations (5) and (6), concerning thedistribution ofZ under the null hypothesis of no linkage.Let the operator EG denote expectation over all possiblerealizations of the observed marker genotype data, G.Note that, for every inheritance vector w, EG[P(W G)]= Puniform(W) under the null hypothesis.To prove equation (5), we note that Z = YwEvP(w G)

Z(w) for observed data G. Applying the operator EG,we have

1362

Kruglyak et al.: Parametric and Nonparametric Linkage Analysis 1363

mean (Z) = EG(Z)

= EG LE P(WIG) Z(w)1

- E EG[P(W G)Z(w)]wEV

E P.niform(W)Z(W)wEV

- mean(Z).

To prove equation (6), we note that

2

E: P(wl G)Z(w) S E P(wl G)[Z(W)]2 .

The inequality follows from a straightforward applica-tion of Jensen's inequality (Royden 1968), since f(x)= x2 is a convex function. When the operator EG isapplied to both sides, the inequality becomes var(Z)

var(Z).

ReferencesBabron MC, Martinez M, Bonaiti-Pellie C, Clerget-DarpouxF (1993) Linkage detection by the affected-pedigree-membermethod: what is really tested? Genet Epidemiol 10:389-394

Cannings C, Thompson EA, Skolnick MH (1978) Probabilityfunctions on complex pedigrees. Adv Appl Prob 10:26-61

Clerget-Darpoux F, Bonaiti-Pellie C, Hochez J (1986) Effectsof misspecifying genetic parameters in lod score analysis.Biometrics 42:393-399

Cottingham RW Jr, Idury RM, Schaffer AA (1993) Fastersequential genetic linkage computations. Am J Hum Genet53:252-263

Curtis D, Sham PC (1994) Using risk calculation to implementan extended relative pair analysis. Ann Hum Genet 58:151 -162

(1995) Model-free linkage analysis using likelihoods.Am J Hum Genet 57:703-716

Elston RC, Stewart J (1971) A general model for the geneticanalysis of pedigree data. Hum Hered 21:523-542

Guo S-W (1995) Proportion of genome shared identical bydescent by relatives: concept, computation, and applica-tions. Am J Hum Genet 56:1468-1476

Kruglyak L, Daly MJ, Lander ES (1995) Rapid multipointlinkage analysis of recessive traits in nuclear families, includ-ing homozygosity mapping. Am J Hum Genet 56:519-527

Kruglyak L, Lander ES (1995) Complete multipoint sib-pairanalysis of qualitative and quantitative traits. Am J HumGenet 57:439-454

Lander ES, Green P (1987) Construction of multilocus geneticmaps in humans. Proc Natl Acad Sci USA 84:2363-2367

Lange K, Elston RC (1975) Extensions to pedigree analysis.I. Likelihood calculations for simple and complex pedigrees.Hum Hered 25:95-105

Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies formultilocus linkage analysis in humans. Proc Natl Acad SciUSA 81:3443-3446

Lathrop GM, Lalouel JM, White RL (1986) Construction ofhuman linkage maps: likelihood calculations for multipointanalysis. Genet Epidemiol 3:39-52

Morton NE (1955) Sequential tests for the detection of link-age. Am J Hum Genet 7:277-318

O'Connell JR, Weeks DE (1995) The VITESSE algorithm forrapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nat Genet 11:402-408

Ott J (1991) Analysis of human genetic linkage, rev. ed. JohnsHopkins University Press, Baltimore and London

Rabiner LR (1989) A tutorial on hidden Markov models andselected applications in speech recognition. Proc IEEE 77:257-286

Royden HL (1968) Real analysis, 2d ed. Macmillan Publish-ing, New York

Sandkuijl LA (1989) Analysis of affected sib-pairs using infor-mation form extended families. In: Elston RC, Spence MA,Hodge SE, MacCluer JW (eds) Multipoint mapping andlinkage based upon affected pedigree members: GeneticAnalysis Workshop 6. Alan R Liss, New York

Shannon CE (1948) A mathematical theory of communication.Bell Syst Tech J 27:379-423

Sobel E, Lange K (1993) Metropolis sampling in pedigree anal-ysis. Stat Methods Med Res 2:263-282

Straub RE, MacLean CJ, O'Neill FA, Burke J, Murphy B,Duke F, Shinkwin R, et al (1995) A potential vulnerabilitylocus for schizophrenia on chromosome 6p24-22: evidencefor genetic heterogeneity. Nat Genet 11:287-293

Terwilliger JD, Ott J (1994). Handbook of human geneticlinkage. Johns Hopkins University Press, Baltimore

Thomas A, Skolnick MH, Lewis CM (1994) Genomic mis-match scanning in pedigrees. Int Math Assoc J Math ApplMed Biol 11:1-16

Thompson EA (1994) Monte Carlo likelihood in genetic map-ping. Stat Sci 9:355-366

Thompson EA, Wijsman EM (1994) Multilocus homozygositymapping and autozygosity patterns. Am J Hum Genet SupplS5:A167

Weeks DE, Harby LD (1995) The affected-pedigree-membermethod: power to detect linkage. Hum Hered 45:13-24

Weeks DE, Lange K (1988) The affected-pedigree-membermethod of linkage analysis. Am J Hum Genet 42:315-326

(1992) A multilocus extension of the affected-pedigree-member method of linkage analysis. Am J Hum Genet 50:859-868

Weeks DE, Sobel E, O'Connell JR, Lange K (1995) Computerprograms for multilocus haplotyping of general pedigrees.Am J Hum Genet 56:1506-1507

Whittemore AS, Halpern J (1994a) A class of tests for linkageusing affected pedigree members. Biometrics 50:118-127

(1994b) Probability of gene identity by descent: com-putation and applications. Biometrics 50:109-117

Zara F, Bianchi A, Avnanzini G. Di Donato S, Castellotti B,Patel PI, Pandolfo M (1995) Mapping of genes predisposingto idiopathic generalized epilepsy. Hum Mol Genet 4:1201 -1207

hum. genet. parametric and nonparametric linkage analysis: a … · 2018-08-03 · kruglyak et al.:...

Documents