
Computational Statistics and Data Analysis 55 (2011) 2104–2113


Association study for the relationship between a haplotype or haplotype set and multiple quantitative responses

Makoto Tomita a,∗, Noboru Hashimoto b, Yutaka Tanaka c

a Clinical Research Center, Tokyo Medical and Dental University Hospital, Faculty of Medicine, 113-8519, Japan
b Nagoya University Graduate School of Medicine, 466-8550, Japan
c Department of Information Systems and Mathematical Sciences, Nanzan University, 489-0863, Japan

Article info

Article history:
Received 18 February 2010
Received in revised form 4 November 2010
Accepted 4 January 2011
Available online 21 January 2011

Keywords:
Multivariate analysis
Quantitative responses
Haplotype
Likelihood ratio test

Abstract

Though there have been several works on the analysis of the association between genotype and phenotype, little can be found on the association analysis between a haplotype or haplotype set and multivariate quantitative responses. For example, QTLmarc is available for the analysis of multivariate responses, but it cannot be applied to the case of stochastic diplotype configurations and complex genetic models. The present paper proposes a method of association analysis between diplotype configuration and multivariate quantitative responses assuming the dominant, recessive and additive models. A comparative study is performed between the proposed method and QTLmarc by applying the two methods to numerical examples and small size simulated data sets with actual genotype information taken from the data set of the Hapmap project and artificial quantitative phenotype data which follow multivariate normal distributions. The results show that the proposed method is superior to QTLmarc in finding the assumed association.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The association between genotype and phenotype has recently been studied actively in post-genomic research. Here "genotype" means not only the genotype itself but also haplotypes and diplotype configurations that are estimated from genotypes of the sample, and "phenotype" indicates qualitative or quantitative variables which may be related to some specific diseases. A quantitative phenotype variable, called QTL (quantitative trait locus), includes covariates such as BMI and glucose level.

Some algorithms have been proposed so far to analyze the association between the genotype information and the quantitative phenotypic QTL. The algorithm QTLhaplo (Shibata et al., 2004) deals with the association between the genotype and the univariate phenotype, assuming the normality of the conditional distribution of the phenotype given the genotype information. The likelihood is calculated on the basis of frequencies of diplotype configurations (joint probability of the frequencies of the haplotypes that compose the diplotype) and the density function of a normal distribution. The algorithm QTLmarc (Kamitsuji and Kamatani, 2006) has been proposed for multivariate analysis of multiple quantitative responses; however, it can deal only with the case where each subject's haplotype is determined uniquely from its genotype. It is doubtful whether it can evaluate the association properly in general cases. Therefore, it is valuable to develop a general method of association analysis for multivariate quantitative responses. In this paper, we extend the algorithm QTLhaplo so that it can deal with the association between the genotype and multiple quantitative variables assuming three types of models, i.e., the dominant, recessive and additive models.

∗ Corresponding author. E-mail address: [email protected] (M. Tomita).

0167-9473/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2011.01.002


2. Method

2.1. Univariate models

Shibata et al. (2004) describe the algorithm QTLhaplo as follows. Suppose that there exist $l$ linked loci. As DNA is of double helix structure and each haplotype has its counterpart, the number of possible haplotypes is $L = 2^l$ in total. Let the relative frequencies of the haplotypes be given as $\Theta = (\theta_1, \ldots, \theta_j, \ldots, \theta_L)$, where $\theta_j \ge 0$ and $\sum_{j=1}^{L} \theta_j = 1$. As each subject has a combination of two haplotypes, there are $L^2$ possible combinations $a_1, a_2, \ldots, a_{L^2}$. The probability that the $i$th subject has a diplotype configuration $a_k$ of the $l$th and the $m$th haplotypes is given by $P(d_i = a_k \mid \Theta) = \theta_l \theta_m$, where $d_i$ is a diplotype configuration for the $i$th subject. Also suppose that the $i$th subject has quantitative phenotype $\psi_i$ with probability density function $f$. Now consider that a sample of size $N$ is observed in an experiment. The phenotype for each diplotype configuration is assumed to follow a normal distribution with a common variance but with a mean which varies depending on the diplotype configuration. The outcome of the experiment can then be expressed as $(\Theta, D, \Psi)$, where $D = (d_1, \ldots, d_N)$ indicates the vector of the diplotype configurations and $\Psi = (\psi_1, \ldots, \psi_N)$ indicates the matrix of the quantitative phenotypes. The observations for $N$ subjects are classified into the two groups of subjects with and without a specified haplotype $h_t$ in the diplotype configurations, and from these observations the group means $\mu_k$ and the common variance $\sigma^2$ can be estimated for the distributions of quantitative phenotypes, respectively. The problem is to test whether there exists any difference in the distribution of the phenotypes between the two groups. For a dominant model, $D_+$ is defined as the set of subjects with diplotype configurations containing haplotype $h_t$, and $D_-$ is defined as the set of those without $h_t$. Then, the distribution of phenotypes is given by $N(\mu_1, \sigma^2)$ for diplotype $d_i \in D_+$ or by $N(\mu_2, \sigma^2)$ for $d_i \in D_-$. Denote the probability density functions by $f_{\mu_j}(x)$, $j = 1, 2$. Thus the probability density function for $\psi_i$ is defined by $f_{\mu_1}(x) = f(\psi_i = x \mid d_i \in D_+)$ in case $d_i \in D_+$ and $f_{\mu_2}(x) = f(\psi_i = x \mid d_i \in D_-)$ in case $d_i \in D_-$.

Let A denote the specified haplotype $h_t$ and B any haplotype other than $h_t$. Then every diplotype configuration is expressed as AA, AB, or BB, and the sets $D_+$ and $D_-$ can be defined as follows. In the case of dominant models, AA and AB belong to $D_+$, while BB belongs to $D_-$. In the case of recessive models, AA belongs to $D_+$, while AB and BB belong to $D_-$. For additive models, the distributions of $\psi_i$ for AA, BB and AB are given by $N(\mu_1, \sigma^2)$, $N(\mu_2, \sigma^2)$ and $N(\mu_3, \sigma^2)$, respectively, where $\mu_3 = (\mu_1 + \mu_2)/2$.
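The bookkeeping of this subsection can be sketched in a few lines of Python. This is a minimal illustration and not the QTLhaplo implementation: the function names, the two-locus alleles and the frequencies in `theta` are all hypothetical. It enumerates the $2^l$ haplotypes of $l$ biallelic loci, evaluates $P(d_i = a_k \mid \Theta) = \theta_l \theta_m$ for an ordered pair of haplotypes, and counts the copies of the target haplotype (2, 1 and 0 for AA, AB and BB).

```python
from itertools import product

def enumerate_haplotypes(alleles_per_locus):
    """All haplotypes over biallelic loci; l loci give 2**l haplotypes."""
    return ["".join(h) for h in product(*alleles_per_locus)]

def diplotype_probability(hap_a, hap_b, theta):
    """P(d_i = a_k | Theta) = theta_l * theta_m for an ordered haplotype pair."""
    return theta[hap_a] * theta[hap_b]

def copies_of_target(diplotype, target):
    """Number of target haplotypes in the pair: 2 (AA), 1 (AB) or 0 (BB)."""
    return sum(1 for h in diplotype if h == target)

# Hypothetical two-locus example with made-up haplotype frequencies.
haps = enumerate_haplotypes([("T", "C"), ("G", "A")])
theta = {"TG": 0.4, "TA": 0.1, "CG": 0.3, "CA": 0.2}

total = sum(diplotype_probability(a, b, theta) for a in haps for b in haps)
print(haps, total, copies_of_target(("TG", "CA"), "TG"))
```

Summing over all $L^2$ ordered pairs returns 1, which is a quick check that the product form is a proper probability distribution over diplotype configurations.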

2.2. Extension to multivariate models

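We now extend the above univariate model to a multivariate model. Suppose that the quantitative phenotype vector $\Psi_i$ follows a multinormal distribution with a common variance–covariance matrix but with a different mean vector corresponding to the group defined by the diplotype configurations. In dominant/recessive models, the density function is given by

\[
f(\Psi_i = x \mid d_i = a_k, \mu, \Sigma) =
\begin{cases}
\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_1)' \Sigma^{-1} (x - \mu_1) \right\} & \text{if } a_k \in D_+, \\[2ex]
\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_2)' \Sigma^{-1} (x - \mu_2) \right\} & \text{if } a_k \in D_-,
\end{cases}
\tag{1}
\]

while in an additive model it is given by

\[
f(\Psi_i = x \mid d_i = a_k, \mu, \Sigma) =
\begin{cases}
\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_1)' \Sigma^{-1} (x - \mu_1) \right\} & \text{if } a_k \in D_{AA}, \\[2ex]
\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_2)' \Sigma^{-1} (x - \mu_2) \right\} & \text{if } a_k \in D_{BB}, \\[2ex]
\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} \left(x - \tfrac{\mu_1 + \mu_2}{2}\right)' \Sigma^{-1} \left(x - \tfrac{\mu_1 + \mu_2}{2}\right) \right\} & \text{if } a_k \in D_{AB},
\end{cases}
\tag{2}
\]

where $\mu$ and $\Sigma$ indicate the mean vector and the variance–covariance matrix, respectively, $x$ is the vector of individual quantitative phenotypes, and $p$ is the number of phenotype variables.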
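As a rough sketch of Eqs. (1) and (2), the code below evaluates the density for the bivariate case $p = 2$ only, using the closed-form inverse of a 2×2 matrix so that no linear algebra library is needed. The function names are hypothetical, and the numerical values of `MU1`, `MU2` and `SIGMA` are borrowed from the numerical study in Section 3 purely for illustration.

```python
import math

def mvn2_density(x, mu, sigma):
    """Bivariate (p = 2) normal density; sigma = ((s11, s12), (s12, s22))."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu) via the 2x2 inverse.
    quad = (s22 * d0 * d0 - 2.0 * s12 * d0 * d1 + s11 * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def group_mean(copies, model, mu1, mu2):
    """Mean vector for a diplotype with 0, 1 or 2 copies of the target haplotype."""
    if model == "dominant":
        return mu1 if copies >= 1 else mu2
    if model == "recessive":
        return mu1 if copies == 2 else mu2
    if model == "additive":
        if copies == 2:
            return mu1
        if copies == 0:
            return mu2
        return tuple((a + b) / 2.0 for a, b in zip(mu1, mu2))
    raise ValueError("unknown model: " + model)

MU1, MU2 = (120.0, 19.0), (118.0, 20.0)
SIGMA = ((256.0, 40.0), (40.0, 8.0))

x = (119.0, 19.4)  # hypothetical phenotype vector
for model in ("dominant", "recessive", "additive"):
    print(model, mvn2_density(x, group_mean(1, model, MU1, MU2), SIGMA))
```

Only the mean vector changes between groups; the covariance matrix is shared, which is why a single density routine serves all three genetic models.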

2.3. Likelihood function

The observed data consist of the genotype and quantitative phenotype of $N$ subjects. Let $G_{obs} = (g_1, g_2, \ldots, g_N)$ and $\Psi_{obs} = (w_1, w_2, \ldots, w_N)$ be the vector of the observed genotypes and the matrix of the quantitative phenotypes, respectively. Then the likelihood function is given by

\[
L(\Theta, \mu, \Sigma) \propto \prod_{i=1}^{N} \sum_{a_k \in A_i} P(d_i = a_k \mid \Theta) \, f(\psi_i = w_i \mid d_i = a_k, \mu, \Sigma),
\]

where $A_i$ denotes the set of diplotype configurations $a_k$ which are consistent with $g_i$, and $f$ is the probability density function for $N(\mu, \Sigma)$, where $\mu$ depends on $a_k$. Under the null hypothesis it is assumed that the distribution of the phenotype does not depend on the haplotype, i.e., the mean vector $\mu$ is equal to a common vector $\mu_0$. Under the alternative hypothesis, two multinormal distributions, $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$, are defined if the model is dominant or recessive, and three multinormal distributions, $N(\mu_1, \Sigma)$, $N(\mu_2, \Sigma)$ and $N((\mu_1 + \mu_2)/2, \Sigma)$, are defined if the model is additive. The phenotype $x$ of the $i$th subject follows a distribution given by Eq. (1) or (2), depending on the assumed model.
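The likelihood above is a product over subjects of a sum over the diplotype configurations consistent with each genotype. A minimal sketch of its logarithm follows, again restricted to $p = 2$. The data structure (a list of candidate haplotype pairs plus a phenotype vector per subject), the haplotype frequencies in `THETA`, and the phenotype values are all hypothetical; the haplotype probability is taken as $\theta_l \theta_m$ exactly as written in Section 2.1.

```python
import math

def mvn2_density(x, mu, sigma):
    """Bivariate normal density; sigma = ((s11, s12), (s12, s22))."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    quad = (s22 * d0 * d0 - 2.0 * s12 * d0 * d1 + s11 * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def log_likelihood(subjects, theta, mean_for, sigma):
    """sum_i log sum_{a_k in A_i} theta_l * theta_m * f(w_i | a_k)."""
    total = 0.0
    for candidates, w in subjects:
        mix = sum(theta[a] * theta[b] * mvn2_density(w, mean_for(a, b), sigma)
                  for a, b in candidates)
        total += math.log(mix)
    return total

# Hypothetical frequencies and phenotypes, for illustration only.
THETA = {"TGCA": 0.4, "TGCT": 0.3, "CACA": 0.2, "CACT": 0.1}
SIGMA = ((256.0, 40.0), (40.0, 8.0))
MU1, MU2, MU0 = (120.0, 19.0), (118.0, 20.0), (119.0, 19.5)

def dominant_mean(a, b):
    """Dominant model: one copy of the target haplotype is enough for D+."""
    return MU1 if "TGCA" in (a, b) else MU2

def null_mean(a, b):
    """Null model: the mean does not depend on the diplotype."""
    return MU0

SUBJECTS = [
    ([("TGCA", "TGCA")], (121.0, 19.2)),                    # phase known
    ([("TGCA", "CACT"), ("TGCT", "CACA")], (117.5, 20.3)),  # phase ambiguous
]

print(log_likelihood(SUBJECTS, THETA, dominant_mean, SIGMA))
print(log_likelihood(SUBJECTS, THETA, null_mean, SIGMA))
```

For the phase-ambiguous subject the two candidate configurations fall into different groups under the dominant model, so the phenotype itself helps decide which configuration is more plausible.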

2.4. Estimation of parameters and likelihood ratio test

If the complete data $d_1, d_2, \ldots, d_N$ and $\psi_1, \psi_2, \ldots, \psi_N$ were available, the maximum likelihood estimators for $\mu$, $\Sigma$ and the haplotype frequencies $\Theta = (\theta_1, \theta_2, \ldots, \theta_L)$ could be obtained as

\[
\hat{\theta}_j = n_j / (2N) \quad (j = 1, 2, \ldots, L),
\]

\[
\hat{\mu}_1 = \sum_{d_i \in D_+} \psi_i / N_+, \qquad \hat{\mu}_2 = \sum_{d_i \in D_-} \psi_i / N_-,
\]

\[
\hat{\Sigma} = \frac{1}{N} \left\{ \sum_{d_i \in D_+} (\psi_i - \hat{\mu}_1)(\psi_i - \hat{\mu}_1)' + \sum_{d_i \in D_-} (\psi_i - \hat{\mu}_2)(\psi_i - \hat{\mu}_2)' \right\},
\]

where $n_j$ counts how often the $j$th haplotype appears among the $N$ subjects, and $N_+$ and $N_-$ denote the numbers of subjects who possess or do not possess haplotype $h_t$, respectively. In practice, however, the complete data are not available and we can observe only the genotypes and phenotypes of the subjects.
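The complete-data estimators are plain counts and group averages. The sketch below illustrates them under the assumption that every diplotype is known; the four subjects and their phenotype values are made up, and the function names are hypothetical.

```python
from collections import Counter

def haplotype_frequencies(diplotypes):
    """theta_j = n_j / (2N): each of the N subjects contributes two haplotypes."""
    counts = Counter(h for pair in diplotypes for h in pair)
    n_subjects = len(diplotypes)
    return {h: c / (2 * n_subjects) for h, c in counts.items()}

def mean_vector(rows):
    """Component-wise average of a non-empty list of phenotype vectors."""
    n = len(rows)
    return tuple(sum(r[j] for r in rows) / n for j in range(len(rows[0])))

def complete_data_means(diplotypes, phenotypes, target, model="dominant"):
    """Group means for D+ and D- when every diplotype is known."""
    plus, minus = [], []
    for pair, psi in zip(diplotypes, phenotypes):
        copies = sum(1 for h in pair if h == target)
        in_plus = copies >= 1 if model == "dominant" else copies == 2
        (plus if in_plus else minus).append(psi)
    return mean_vector(plus), mean_vector(minus)

# Hypothetical complete data for four subjects.
DIPLOTYPES = [("TGCA", "TGCA"), ("TGCA", "TGCT"), ("CACA", "CACA"), ("TGCT", "CACA")]
PHENOS = [(122.0, 18.5), (118.0, 19.5), (117.0, 20.5), (119.0, 19.5)]

freq = haplotype_frequencies(DIPLOTYPES)
mu1, mu2 = complete_data_means(DIPLOTYPES, PHENOS, "TGCA", model="dominant")
print(freq, mu1, mu2)
```

Switching `model` to `"recessive"` moves the heterozygous subject from $D_+$ to $D_-$, which changes both group means without touching the haplotype frequencies.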

For a subject, two or more haplotype pairs might correspond to one genotype. Therefore, there can be two or more candidates for the diplotype configuration of the subject, given the genotype. Several procedures have been developed for estimating haplotype frequencies $\Theta$ using the EM algorithm, MCMC, or other algorithms. However, as the purpose of our study is different, we omit a detailed explanation of those algorithms here.

Here we assume that the vector of haplotype frequencies has been estimated using some appropriate software. These estimates can be used to estimate the mean vectors and the variance–covariance matrix in our multivariate models.

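In the case of dominant and recessive models the log-likelihood function is expressed as $l = \log L(\Theta, \mu, \Sigma)$ and the maximum likelihood estimators for $\mu_1$ and $\mu_2$ are obtained by

\[
\hat{\mu}_1 = \frac{\sum_{i=1}^{N} \psi_i \, (u_b/u_0)}{\sum_{i=1}^{N} (u_b/u_0)}
\quad \text{and} \quad
\hat{\mu}_2 = \frac{\sum_{i=1}^{N} \psi_i \, (v_b/v_0)}{\sum_{i=1}^{N} (v_b/v_0)},
\]

where, for the $i$th subject,

\[
u_b = \sum_{a_k \in D_+ \cap A_i} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu_1, \Sigma),
\]

\[
u_0 = \sum_{a_k \in A_i} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu, \Sigma),
\]

\[
v_b = \sum_{a_k \in A_i \cap D_-} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu_2, \Sigma),
\]

\[
v_0 = \sum_{a_k \in A_i} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu, \Sigma).
\]

The variance–covariance matrix is estimated as

\[
\hat{\Sigma} = \frac{1}{n} \left\{ \sum_{i=1}^{N} (\psi_i - \hat{\mu}_1)(\psi_i - \hat{\mu}_1)^T (u_b/u_0) + \sum_{i=1}^{N} (\psi_i - \hat{\mu}_2)(\psi_i - \hat{\mu}_2)^T (v_b/v_0) \right\},
\]

where $n = \sum_{i=1}^{N} (u_b/u_0) + \sum_{i=1}^{N} (v_b/v_0)$. Here $D_+$ is the set of diplotype configurations which contain haplotype $h_t$ in the way prescribed by the dominant or recessive model. In the case of additive models, the mean vectors and the variance–covariance matrix are estimated by solving the following equations.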
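One update of the dominant/recessive estimators can be sketched as a pair of weighted averages followed by a pooled weighted scatter matrix. In this sketch the per-subject ratios $u_b/u_0$ and $v_b/v_0$ are assumed to have been computed already and are passed in as the hypothetical lists `w_plus` and `w_minus`; the phenotype values are made up.

```python
def weighted_mean(phenotypes, weights):
    """sum_i w_i * psi_i / sum_i w_i, component by component."""
    total = sum(weights)
    p = len(phenotypes[0])
    return tuple(sum(w * psi[j] for psi, w in zip(phenotypes, weights)) / total
                 for j in range(p))

def weighted_scatter(phenotypes, weights, mu):
    """sum_i w_i * (psi_i - mu)(psi_i - mu)^T as a nested list."""
    p = len(mu)
    s = [[0.0] * p for _ in range(p)]
    for psi, w in zip(phenotypes, weights):
        d = [psi[j] - mu[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += w * d[a] * d[b]
    return s

def update_dominant_recessive(phenotypes, w_plus, w_minus):
    """One update of (mu1, mu2, Sigma) given the ratios u_b/u_0 and v_b/v_0."""
    mu1 = weighted_mean(phenotypes, w_plus)
    mu2 = weighted_mean(phenotypes, w_minus)
    n = sum(w_plus) + sum(w_minus)
    s1 = weighted_scatter(phenotypes, w_plus, mu1)
    s2 = weighted_scatter(phenotypes, w_minus, mu2)
    p = len(mu1)
    sigma = [[(s1[a][b] + s2[a][b]) / n for b in range(p)] for a in range(p)]
    return mu1, mu2, sigma

# Hypothetical phenotypes; the middle subject is phase-ambiguous.
PHENOS = [(121.0, 19.2), (118.5, 19.8), (117.0, 20.4)]
W_PLUS = [1.0, 0.6, 0.0]
W_MINUS = [0.0, 0.4, 1.0]
print(update_dominant_recessive(PHENOS, W_PLUS, W_MINUS))
```

Because the ratios themselves depend on the current values of $\mu_1$, $\mu_2$ and $\Sigma$, an update of this kind would be repeated, recomputing the ratios each time, until the estimates stop changing. When every weight is 0 or 1 the update reduces to the complete-data estimators.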

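\[
\left( \sum_{i=1}^{N} \frac{u_b}{u_0} + \frac{1}{4} \sum_{i=1}^{N} \frac{w_b}{w_0} \right) \mu_1
+ \frac{1}{4} \sum_{i=1}^{N} \frac{w_b}{w_0} \, \mu_2
= \sum_{i=1}^{N} \left( \frac{u_b}{u_0} + \frac{1}{2} \frac{w_b}{w_0} \right) \psi_i,
\]

\[
\frac{1}{4} \sum_{i=1}^{N} \frac{w_b}{w_0} \, \mu_1
+ \left( \sum_{i=1}^{N} \frac{v_b}{v_0} + \frac{1}{4} \sum_{i=1}^{N} \frac{w_b}{w_0} \right) \mu_2
= \sum_{i=1}^{N} \left( \frac{v_b}{v_0} + \frac{1}{2} \frac{w_b}{w_0} \right) \psi_i,
\]

\[
\Sigma = \frac{1}{n} \left\{ \sum_{i=1}^{N} (\psi_i - \mu_1)(\psi_i - \mu_1)^T (u_b/u_0)
+ \sum_{i=1}^{N} (\psi_i - \mu_2)(\psi_i - \mu_2)^T (v_b/v_0)
+ \sum_{i=1}^{N} \left(\psi_i - \tfrac{\mu_1 + \mu_2}{2}\right) \left(\psi_i - \tfrac{\mu_1 + \mu_2}{2}\right)^T (w_b/w_0) \right\}.
\]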

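In the above equations, the $u$'s, $v$'s and $w$'s are defined for the $i$th subject by

\[
u_b = \sum_{a_k \in A_i \cap D_{AA}} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu_1, \Sigma),
\]

\[
v_b = \sum_{a_k \in A_i \cap D_{BB}} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu_2, \Sigma),
\]

\[
w_b = \sum_{a_k \in A_i \cap D_{AB}} P(d_i = a_k \mid \Theta) \, f\left(\psi_i \mid d_i = a_k, \tfrac{\mu_1 + \mu_2}{2}, \Sigma\right),
\]

and

\[
u_0 = v_0 = w_0 = \sum_{a_k \in A_i} P(d_i = a_k \mid \Theta) \, f(\psi_i \mid d_i = a_k, \mu, \Sigma),
\]

where in the last sum $\mu$ is the mean vector that Eq. (2) assigns to $a_k$.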

These two systems of equations can be solved by using iterative procedures with appropriate initial values. The converged solution provides maximum likelihood estimates.
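For fixed weights, the two mean equations of the additive model form a small linear system in $\mu_1$ and $\mu_2$ with scalar coefficients, so each coordinate can be solved by Cramer's rule. The sketch below assumes the right-hand sides carry the weights $u_b/u_0 + \tfrac{1}{2} w_b/w_0$ and $v_b/v_0 + \tfrac{1}{2} w_b/w_0$, which is what differentiating the weighted log-likelihood gives; this form is an assumption of the sketch. The lists `u`, `v`, `w` hold the per-subject ratios and, like the phenotypes, are hypothetical.

```python
def solve_additive_means(phenotypes, u, v, w):
    """Solve the additive-model mean equations for fixed ratios u, v, w."""
    su, sv, sw = sum(u), sum(v), sum(w)
    a11 = su + sw / 4.0   # coefficient of mu1 in the first equation
    a12 = sw / 4.0        # shared off-diagonal coefficient
    a22 = sv + sw / 4.0   # coefficient of mu2 in the second equation
    p = len(phenotypes[0])
    b1 = [sum((ui + wi / 2.0) * psi[j] for psi, ui, wi in zip(phenotypes, u, w))
          for j in range(p)]
    b2 = [sum((vi + wi / 2.0) * psi[j] for psi, vi, wi in zip(phenotypes, v, w))
          for j in range(p)]
    det = a11 * a22 - a12 * a12
    mu1 = tuple((a22 * b1[j] - a12 * b2[j]) / det for j in range(p))
    mu2 = tuple((a11 * b2[j] - a12 * b1[j]) / det for j in range(p))
    return mu1, mu2

# Hypothetical example: one AA subject, one BB subject and one AB subject
# whose phenotype lies exactly at the midpoint of the other two.
PHENOS = [(120.0, 19.0), (118.0, 20.0), (119.0, 19.5)]
U = [1.0, 0.0, 0.0]
V = [0.0, 1.0, 0.0]
W = [0.0, 0.0, 1.0]
print(solve_additive_means(PHENOS, U, V, W))
```

In a full fit this solve would alternate with recomputing the ratios and $\Sigma$ until convergence, as the text describes.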

Likelihood ratio tests can be applied to the association analysis between the haplotypes and the phenotypes. Let $L_{0\max}$ and $L_{\max}$ be the maximized likelihood functions under the null and alternative hypotheses, respectively. It is known that under the null hypothesis the log-likelihood ratio statistic $-2 \log(L_{0\max}/L_{\max})$ asymptotically follows a chi-squared distribution. The degrees of freedom are given by the difference between the numbers of parameters under the two hypotheses.
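For two degrees of freedom the chi-squared tail probability has the closed form $\exp(-x/2)$, so the p value of the likelihood ratio test can be computed without a statistics library. The sketch below covers only that case; other degrees of freedom would need a numerical routine. The function names are hypothetical, and the two statistics are the values reported for the target haplotype in Tables 5 and 7, where df = 2.

```python
import math

def lrt_statistic(loglik_null, loglik_alt):
    """-2 log(L0max / Lmax), computed from maximized log-likelihoods."""
    return -2.0 * (loglik_null - loglik_alt)

def chi2_sf_df2(x):
    """Upper tail probability of a chi-squared variable with 2 df: exp(-x/2)."""
    return math.exp(-x / 2.0)

for stat in (15.3214, 11.8377):
    print(stat, round(chi2_sf_df2(stat), 5))
```

The two printed p values, 0.00047 and 0.00269, agree with the P-value column for TGCA in Tables 5 and 7.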

2.5. QTLmarc algorithm

Kamitsuji and Kamatani (2006) proposed an algorithm, "QTLmarc", for estimating haplotypes associated with several quantitative phenotypes. Their method is a discriminant analysis of the two groups defined by whether or not the genotype of the sample contains a specified haplotype, on the basis of a linear combination of quantitative phenotype values. The goodness of discrimination is evaluated with the area under the ROC curve (AUC). As the two groups are defined simply as above, their method cannot deal with the differences among the dominant, recessive and additive models. In the next section, we show two numerical examples, in which the performance of our method is compared with that of QTLmarc.
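To make the AUC criterion concrete, the sketch below scores each subject by a linear combination of the two phenotypes and computes the AUC as the fraction of carrier/non-carrier pairs in which the carrier has the higher score (ties count one half). It is not the QTLmarc implementation: the coefficients, the phenotype values and the carrier labels are hypothetical, and how QTLmarc chooses its linear combination is not reproduced here.

```python
def linear_score(psi, coef):
    """Linear combination of the phenotype values."""
    return sum(c * x for c, x in zip(coef, psi))

def auc(scores_carriers, scores_noncarriers):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    wins = 0.0
    for s_pos in scores_carriers:
        for s_neg in scores_noncarriers:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(scores_carriers) * len(scores_noncarriers))

# Hypothetical phenotypes and coefficients.
COEF = (1.0, -2.0)
CARRIERS = [(121.0, 19.0), (120.0, 18.8), (118.0, 19.9)]
NONCARRIERS = [(118.0, 20.2), (117.0, 20.0), (119.5, 19.7)]

value = auc([linear_score(p, COEF) for p in CARRIERS],
            [linear_score(p, COEF) for p in NONCARRIERS])
print(value)
```

An AUC near 0.5 means the score does not separate carriers from non-carriers, while a value near 1 means almost every carrier outranks every non-carrier.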

3. Numerical study and the results of analyses

We consider the case where there are three loci with two kinds of alleles as genotypes and two quantitative phenotype variables.

As a genotype input data set, an actual data set for 44 subjects was downloaded from the Hapmap project, and the information of a region (107,189 loci) of the X chromosome was used. Among this large number of loci, 40 loci with linkage disequilibria were selected by Tomita et al. (2008), who studied this area of the X chromosome from the Hapmap project. See Table 1 for detailed information (rs numbers, chromosome positions) on the loci and Table 2 for major/minor allele frequencies. Note that we do not have any information on haplotypes, as the data set does not contain phase information, and that there are no missing observations.

Fig. 1 shows the loci and the LD(D′) map with chromosome positions. Various linkage disequilibria can be found in Fig. 1. The LD map was made using the GUI software "Haploview" (Barrett et al., 2005). In Fig. 1, we can confirm that there are blocks, "Block 1 and 2", defined by the method of Wang et al. (2002).

We estimated haplotype–diplotype configurations and selected htSNPs (haplotype tagging SNPs) using the GUI software "Integrated Environmental System for SNPs Data Analysis" (Tomita et al., 2004). After using the software, we selected 4 loci {7, 11, 13, 19} for Blocks 1 and 2, since we wanted to have as many varieties as possible for stochastic diplotype configurations. To select htSNPs, we used the method of Kamatani et al. (2004) with cumulative haplotype frequency 0.95. As a result, we obtained the diplotype configurations shown in Table 3.


Table 1
Information of loci (rs number, chromosome position).

Locus#   rs#          Position
1        rs197000     28409449
2        rs197005     28413819
3        rs197006     28416655
4        rs197012     28424066
5        rs197014     28430651
6        rs197016     28433916
7        rs197018     28435927
8        rs197021     28441493
9        rs197022     28442352
10       rs5943527    28442857
11       rs642519     28445155
12       rs404274     28446901
13       rs196983     28448532
14       rs115126     28449670
15       rs115125     28449943
16       rs196985     28452952
17       rs196986     28453150
18       rs17348455   28453190
19       rs196988     28453742
20       rs196990     28456060
21       rs1265497    28458373
22       rs6630730    28468511
23       rs196982     28468715
24       rs196975     28475390
25       rs1468134    28526074
26       rs724087     28644979
27       rs5985808    28645245
28       rs4103136    28645826
29       rs12863731   28650807
30       rs1586093    28675987
31       rs5985930    28680456
32       rs5985809    28681164
33       rs5943575    28681984
34       rs2521807    28703122
35       rs634270     28705667
36       rs6630793    28709418
37       rs5943579    28723040
38       rs628704     28724158
39       rs629965     28724381
40       rs11095138   28725017

Table 2
Summary for major/minor allele frequencies for each locus.

         Major allele               Minor allele               Missing
Locus    Name  Count  Frequency     Name  Count  Frequency     Name  Count   Total
1        A     49     0.5568        G     39     0.4432        *     0       88
2        A     49     0.5568        G     39     0.4432        *     0       88
3        T     49     0.5568        C     39     0.4432        *     0       88
4        G     49     0.5568        T     39     0.4432        *     0       88
5        T     49     0.5568        C     39     0.4432        *     0       88
6        A     49     0.5568        C     39     0.4432        *     0       88
7        T     56     0.6364        C     32     0.3636        *     0       88
8        A     49     0.5568        G     39     0.4432        *     0       88
9        A     49     0.5568        G     39     0.4432        *     0       88
10       C     49     0.5568        T     39     0.4432        *     0       88
11       G     64     0.7273        A     24     0.2727        *     0       88
12       G     46     0.5227        A     42     0.4773        *     0       88
13       C     76     0.8636        T     12     0.1364        *     0       88
14       T     67     0.7614        C     21     0.2386        *     0       88
15       A     67     0.7614        C     21     0.2386        *     0       88
16       A     67     0.7614        G     21     0.2386        *     0       88
17       T     46     0.5227        C     42     0.4773        *     0       88
18       T     54     0.6136        A     34     0.3864        *     0       88
19       T     46     0.5227        A     42     0.4773        *     0       88
20       C     75     0.8523        T     13     0.1477        *     0       88
21       G     67     0.7614        A     21     0.2386        *     0       88
22       C     67     0.7614        T     21     0.2386        *     0       88
23       A     45     0.5114        G     43     0.4886        *     0       88
24       C     66     0.75          T     22     0.25          *     0       88
25       T     45     0.5114        C     43     0.4886        *     0       88
26       A     44     0.5           G     44     0.5           *     0       88
27       T     59     0.6705        C     29     0.3295        *     0       88
28       G     59     0.6705        C     29     0.3295        *     0       88
29       A     72     0.8182        T     16     0.1818        *     0       88
30       C     66     0.75          A     22     0.25          *     0       88
31       T     62     0.7045        C     26     0.2955        *     0       88
32       C     66     0.75          T     22     0.25          *     0       88
33       T     62     0.7045        C     26     0.2955        *     0       88
34       G     60     0.6818        A     28     0.3182        *     0       88
35       T     62     0.7045        C     26     0.2955        *     0       88
36       A     52     0.5909        G     36     0.4091        *     0       88
37       G     70     0.7955        A     18     0.2045        *     0       88
38       A     70     0.7955        C     18     0.2045        *     0       88
39       A     68     0.7727        G     20     0.2273        *     0       88
40       T     58     0.6591        A     30     0.3409        *     0       88


Table 3
Diplotype configurations estimated by htSNPs loci {7, 11, 13, 19} for each subject.

Sub#   Diplotype#   Probability   Pop. freq.   Diplotype configuration
1      1            1             0.0812       TGCA TGCA
2      1            1             0.0573       TGCT TGCA
3      1            1             0.0812       TGCA TGCA
4      1            1             0.0405       TGCT TGCT
5      1            1             0.0105       CACA CACA
6      1            1             0.0069       CGCT CACT
7      1            1             0.0028       CGCA CGCA
8      1            1             0.0321       TGCA TGTT
9      1            1             0.0268       TGCT CACT
10     1            1             0.0014       TACA TACA
11     1            1             0.0405       TGCT TGCT
12     1            0.5671        0.007        CGCA CACT
       2            0.4329        0.0053       CGCT CACA
13     1            1             0.0177       CACT CACT
14     1            1             0.0812       TGCA TGCA
15     1            1             0.0812       TGCA TGCA
16     1            1             0.0321       TGCA TGTT
17     1            1             0.0105       CACA CACA
18     1            1             0.0405       TGCT TGCT
19     1            1             0.0812       TGCA TGCA
20     1            1             0.0321       TGCA TGTT
21     1            1             0.0069       CGCT CACT
22     1            1             0.0405       TGCT TGCT
23     1            1             0.0812       TGCA TGCA
24     1            0.9367        0.0291       TGCA CACA
       2            0.0633        0.002        TACA CGCA
25     1            0.6271        0.0379       TGCA CACT
       2            0.3406        0.0206       TGCT CACA
       3            0.0324        0.002        TACA CGCT
26     1            1             0.0127       TGTT TGTT
27     1            1             0.0573       TGCT TGCA
28     1            1             0.0105       TGCT CGCT
29     1            1             0.0321       TGCA TGTT
30     1            1             0.0105       TGCT CGCT
31     1            0.6271        0.0379       TGCA CACT
       2            0.3406        0.0206       TGCT CACA
       3            0.0324        0.002        TACA CGCT
32     1            0.6271        0.0379       TGCA CACT
       2            0.3406        0.0206       TGCT CACA
       3            0.0324        0.002        TACA CGCT
33     1            1             0.0028       CGCA CGCA
34     1            1             0.0136       CACT CACA
35     1            1             0.0321       TGCA TGTT
36     1            1             0.0006       CGTT CGTT
37     1            1             0.0405       TGCT TGCT
38     1            0.6271        0.0379       TGCA CACT
       2            0.3406        0.0206       TGCT CACA
       3            0.0324        0.002        TACA CGCT
39     1            1             0.0075       TGCT TACA
40     1            0.6271        0.0379       TGCA CACT
       2            0.3406        0.0206       TGCT CACA
       3            0.0324        0.002        TACA CGCT
41     1            1             0.0812       TGCA TGCA
42     1            0.929         0.0115       TGTT CACA
       2            0.071         0.0009       TACA CGTT
43     1            1             0.0127       TGTT TGTT
44     1            1             0.0177       CACT CACT

TGCA is the target haplotype.

We assumed haplotype "TGCA" as the target haplotype for our study. The data set of our study was then divided into three subsets. The first subset consists of the subjects who have two "TGCA"s, i.e., "TGCA–TGCA", as the diplotype, the second subset consists of those with one target haplotype, and the third subset consists of those without the target haplotype.

To describe the association we consider three models, i.e., the dominant, recessive and additive models. In the dominant model, group $D_+$ is composed of the first and second subsets and group $D_-$ is composed of the third subset. In the recessive model, group $D_+$ is composed of the first subset and group $D_-$ is composed of the second and third subsets. In the additive model, there are three groups which correspond to the three subsets defined above.
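For a subject whose diplotype is not determined uniquely, membership of the three subsets is itself probabilistic. The sketch below takes the three candidate configurations listed for subject 25 in Table 3 and accumulates the probability of carrying 0, 1 or 2 copies of the target haplotype. The function names are hypothetical, and the probabilities are those given the genotype alone; in the likelihood of Section 2.3 they are further reweighted by the phenotype density.

```python
def copy_number_distribution(candidates, target):
    """Probability of 0, 1 or 2 copies of the target haplotype."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for prob, hap_a, hap_b in candidates:
        copies = (hap_a == target) + (hap_b == target)
        dist[copies] += prob
    return dist

def prob_in_d_plus(dist, model):
    """Probability of belonging to D+ under the dominant or recessive model."""
    if model == "dominant":
        return dist[1] + dist[2]
    if model == "recessive":
        return dist[2]
    raise ValueError("D+ is only defined for dominant/recessive models")

# Candidate configurations of subject 25 (values from Table 3).
SUBJECT_25 = [
    (0.6271, "TGCA", "CACT"),
    (0.3406, "TGCT", "CACA"),
    (0.0324, "TACA", "CGCT"),
]

dist = copy_number_distribution(SUBJECT_25, "TGCA")
print(dist, prob_in_d_plus(dist, "dominant"), prob_in_d_plus(dist, "recessive"))
```

Under the dominant model this subject belongs to $D_+$ with probability 0.6271, while under the recessive model it cannot belong to $D_+$ because none of its candidate configurations is "TGCA–TGCA".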


Fig. 1. LD(D′) map with rs number and chromosome position for data.

Regarding the two quantitative variables, observations were generated assuming that they follow a bivariate normal distribution $N(\mu_1, \Sigma)$, $N(\mu_2, \Sigma)$ or $N(\mu_3, \Sigma)$, depending on the number of copies of haplotype "TGCA" that the subject had, where

\[
\mu_1 = (120, 19), \qquad \mu_2 = (118, 20), \qquad
\Sigma = \begin{pmatrix} 256 & 40 \\ 40 & 8 \end{pmatrix}.
\]

The bivariate normal random numbers were generated by using the library "mvtnorm" in R.

In our numerical study, we first generated two data sets. In the first data set, it is assumed that the genotype-to-phenotype relationship can be described by a recessive model, where the phenotype follows $N(\mu_1, \Sigma)$ or $N(\mu_2, \Sigma)$, depending on whether the subject belongs to group $D_+$ or group $D_-$, respectively. In the second data set, it is assumed that the genotype-to-phenotype relationship can be described by an additive model, where in the first group the phenotype follows $N(\mu_1, \Sigma)$, in the second group the phenotype follows $N(\mu_2, \Sigma)$, and in the third group the phenotype follows $N(\mu_3, \Sigma)$, where $\mu_3 = (\mu_1 + \mu_2)/2$.
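The paper generated its phenotypes with the R library "mvtnorm". As a stand-in, the following sketch draws bivariate normal vectors in Python through a Cholesky factor of the covariance matrix, using the parameter values quoted in this section. The function names and the seed are arbitrary, and the mapping from copy number to mean vector follows the AA/BB/AB convention of Section 2.1 (midpoint for one copy), which is an assumption of this sketch.

```python
import math
import random

def cholesky2(sigma):
    """Lower-triangular factor (l11, l21, l22) of a 2x2 covariance matrix."""
    (s11, s12), (_, s22) = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return l11, l21, l22

def sample_bivariate_normal(mu, sigma, rng):
    """One draw from N(mu, sigma): x = mu + L z with z standard normal."""
    l11, l21, l22 = cholesky2(sigma)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

SIGMA = ((256.0, 40.0), (40.0, 8.0))
MU1, MU2 = (120.0, 19.0), (118.0, 20.0)
MU3 = tuple((a + b) / 2.0 for a, b in zip(MU1, MU2))
MEAN_BY_COPIES = {2: MU1, 0: MU2, 1: MU3}  # additive model, Section 2.1 convention

rng = random.Random(2011)  # arbitrary seed for reproducibility
draws = [sample_bivariate_normal(MEAN_BY_COPIES[c], SIGMA, rng) for c in (2, 1, 0)]
print(cholesky2(SIGMA), draws)
```

The factor satisfies $LL^T = \Sigma$, so the two coordinates of each draw have variances 256 and 8 and covariance 40.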

We obtained the results of association analysis with our method, as shown in Tables 5 and 7, by using our R program. The "q-value" in Tables 5 and 7 indicates the q value of the false discovery rate (FDR) (Benjamini and Hochberg, 1995). Tables 4 and 6 show the results of analysis using QTLmarc (Kamitsuji and Kamatani, 2006).
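The q-value column of Tables 5 and 7 runs from 0.00625 to 0.05 in equal steps, which matches the Benjamini and Hochberg thresholds $0.05\,k/m$ for $m = 8$ haplotypes. The sketch below computes those thresholds and applies the standard step-up rule to the p values of Table 5. The function names are hypothetical, and the FDR level 0.05 is inferred from the table rather than stated in the text.

```python
def bh_thresholds(m, alpha=0.05):
    """Benjamini-Hochberg thresholds alpha * k / m for ranks k = 1..m."""
    return [alpha * k / m for k in range(1, m + 1)]

def bh_reject(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes its threshold
    return sorted(order[:cutoff])

# P values from Table 5, in the order TGCA, CGTT, CGCA, CACT, TACA, TGCT, CACA, TGTT.
P_TABLE5 = [0.00047, 0.08603, 0.29054, 0.35484, 0.46381, 0.79955, 0.85331, 0.87533]

print(bh_thresholds(len(P_TABLE5)))
print(bh_reject(P_TABLE5))
```

Only index 0, the target haplotype TGCA, is rejected: its p value lies below the first threshold and no larger-ranked p value passes its own threshold.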

The results of our method show that the p values for the target haplotype "TGCA" are smaller than the q values, and that the p values for the other haplotypes are larger than the q values, in both Tables 5 and 7. On the other hand, the results of QTLmarc in Table 4 show that the p value for the target haplotype is larger than 0.05, while the p values for six other haplotypes are smaller than 0.05. In Table 6 the p value for the target haplotype is 0.042, while the p values for five other haplotypes are also smaller than 0.05. Under their method, many other haplotypes have p values similar to or smaller than that of the target haplotype. Therefore we have to judge that the target haplotype is not significant in Table 6. Thus the target haplotype "TGCA" is found to be significant by our method, but not by QTLmarc. These results suggest that QTLmarc is not applicable to models other than the dominant model.

For the dominant model we carried out a small size simulation study to compare the powers of QTLmarc and our method. We assumed a dominant model with $\mu_1 = (120, 19)$ (the null case) and $\Sigma = (\sigma_{11}, \sigma_{12}, \sigma_{22}) = (256, 40, 8)$, set the values of the $\mu$'s for the alternative cases so that the Mahalanobis distances became approximately 0.25, 0.50, 0.75, 1.00 and 1.25, and then generated 100 sets of data for each case.
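The distances used to place the alternative mean vectors can be checked directly. The sketch below evaluates the quadratic form $(x - \mu_1)' \Sigma^{-1} (x - \mu_1)$ for the mean vectors listed in Table 8. The function name is hypothetical, and note that it is the quadratic form itself (the squared distance, without a square root) that reproduces the tabulated values.

```python
def mahalanobis_sq(x, mu, sigma):
    """Quadratic form (x - mu)' Sigma^{-1} (x - mu) for a 2x2 covariance."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (s22 * d0 * d0 - 2.0 * s12 * d0 * d1 + s11 * d1 * d1) / det

MU1 = (120.0, 19.0)
SIGMA = ((256.0, 40.0), (40.0, 8.0))
ALTERNATIVES = [(119.0, 19.5), (118.5, 19.7), (118.1, 19.84), (118.0, 20.0), (117.77, 20.12)]

for mu_x in ALTERNATIVES:
    print(mu_x, round(mahalanobis_sq(mu_x, MU1, SIGMA), 3))
```

The five values come out as 0.25, 0.508, 0.753, 1.0 and 1.252, in line with the "Mahalanobis distance" row of Table 8.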

The results of the analysis are given in Table 8. This table shows that our method has higher power than QTLmarc for detecting the genotype-to-phenotype relationship in the case of the dominant model.


Table 4
Results of analysis using QTLmarc of the first data set based on a recessive model.

Haplotype   Number of carriers   AUC      P-value
TACA        9                    0.8114   0.00068
CGCA        10                   0.7367   0.00349
TGTA        6                    0.7286   0.00583
TGTT        8                    0.7192   0.00607
CACA        11                   0.7175   0.00768
TGCT        22                   0.6906   0.00270
TACT        8                    0.6727   0.03782
CGTT        2                    0.6274   0.15388
CGCT        12                   0.6106   0.10776
CACT        13                   0.6097   0.12171
TGCA        22                   0.5371   0.33023
CATA        NA                   NA       NA
TATA        NA                   NA       NA
TGTA        NA                   NA       NA
CATT        NA                   NA       NA
TATT        NA                   NA       NA

AUC is the area under the ROC (receiver operating characteristic) curve. TGCA is the target haplotype.

Table 5
Results of the analysis using our method of the first data set based on a recessive model.

Haplotype   Haplotype frequency   χ2        P-value   q-value   df
TGCA        0.2368                15.3214   0.00047   0.00625   2
CGTT        0.0263                4.9060    0.08603   0.01250   2
CGCA        0.0526                2.4720    0.29054   0.01875   2
CACT        0.1228                2.0721    0.35484   0.02500   2
TACA        0.0877                1.5365    0.46381   0.03125   2
TGCT        0.1842                0.4474    0.79955   0.03750   2
CACA        0.1140                0.3173    0.85331   0.04375   2
TGTT        0.0877                0.2663    0.87533   0.05000   2

TGCA is the target haplotype.

Table 6
Results of analysis using QTLmarc of the second data set based on an additive model.

Haplotype   Number of carriers   AUC      P-value
TACA        9                    0.7605   0.00011
TGCT        22                   0.6825   0.00448
TACT        8                    0.6928   0.00960
CGCA        10                   0.6873   0.01345
CACA        11                   0.6624   0.03368
TGCA        22                   0.6328   0.04209
CGTT        2                    0.6774   0.06749
TGTA        6                    0.6745   0.06923
CGCT        12                   0.6092   0.09593
CACT        13                   0.6034   0.11959
TGTT        8                    0.5836   0.23181
CATA        NA                   NA       NA
TATA        NA                   NA       NA
CGTA        NA                   NA       NA
CATT        NA                   NA       NA
TATT        NA                   NA       NA

AUC is the area under the ROC (receiver operating characteristic) curve. TGCA is the target haplotype.

Table 7
Results of the analysis using our method of the second data set based on an additive model.

Haplotype   Haplotype frequency   χ2        P-value   q-value   df
TGCA        0.2368                11.8377   0.00269   0.00625   2
CGTT        0.0263                5.7841    0.05546   0.01250   2
TACA        0.0877                3.3868    0.18390   0.01875   2
TGCT        0.1842                1.7475    0.41738   0.02500   2
CACT        0.1228                1.2107    0.54588   0.03125   2
CGCA        0.0526                0.5627    0.75476   0.03750   2
CACA        0.1140                0.5247    0.76925   0.04375   2
TGTT        0.0877                0.0452    0.97768   0.05000   2

TGCA is the target haplotype.


Table 8
Power comparison between QTLmarc and our method for dominant models based on a simulation study with 100 iterations.

Mean vector µx         {119, 19.5}   {118.5, 19.7}   {118.1, 19.84}   {118, 20}   {117.77, 20.12}
Mahalanobis distance   0.25          0.508           0.753            1           1.252
Power (QTLmarc)        0.23          0.56            0.73             0.83        0.90
Power (Ours)           0.23          0.59            0.77             0.94        0.98

Table 9
The chi-squared (top) and AIC (bottom) statistics obtained with the analysis assuming dominant, recessive and additive models of the three datasets.

                            Hypothesis test by
True model   Statistic      Dominant model   Recessive model   Additive model
Dominant     chi-squared    11.5837132       0.0384127         4.2594829
             AIC            −7.5837132       3.9615873         −0.2594829
Recessive    chi-squared    3.4820052        15.3214361        8.2125950
             AIC            0.5179948        −11.3214361       −4.2125950
Additive     chi-squared    8.7507100        11.6215961        11.8376920
             AIC            −4.7507100       −7.6215961        −7.8376920

4. Discussions and summary

We proposed a method of multivariate association analysis and developed an R program. It is an extension of QTLhaplo to the case of multivariate analysis, which can analyze cases where the diplotype configuration is not determined uniquely from the genotype and which can assume any of the dominant, recessive and additive models. To study the effectiveness of our method for the recessive and additive models, we carried out a numerical study to analyze two data sets with real genotype data (Hapmap project) and simulated QTL data, and for the dominant model we performed a small size simulation study to compare the powers with QTLmarc. The results showed that our method was more effective for detecting the genotype-to-phenotype relationship than QTLmarc of Kamitsuji and Kamatani (2006) in all of the dominant, recessive and additive models.

We speculate on the reason for the large difference in the results for the recessive and additive models. In Tables 4 and 6, the number of carriers of the target haplotype "TGCA" is 22. However, after we estimated the haplotype and diplotype configurations by the maximum likelihood method, there are stochastically only 7 subjects with both haplotypes "TGCA–TGCA" (so that there are only 7 carriers in the recessive model) and only 13 subjects with the haplotype "TGCA" (so that there are only 7 and 13 carriers in the additive model). The QTLmarc algorithm also selected subjects #39 and #42 as carriers, although they have no target haplotype. This may be because its haplotype estimation is based on allele counting and its model fitting is applicable only to the dominant model. For the dominant model the power of our method is higher than that of QTLmarc, which may be explained by the first advantage mentioned below. For dominant models the differences are small, possibly because QTLmarc treats subjects #39 and #42, who have no target haplotype, as carriers.
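
The carrier counts discussed above depend on how a diplotype is coded under each genetic model. The following is a minimal sketch of that coding, not the authors' R program: the function names, the example diplotypes and the posterior probabilities are hypothetical, and the posterior-weighted score only illustrates how a stochastically determined diplotype can contribute fractionally; it is not the likelihood used in the paper.

```python
def carrier_codes(diplotype, target):
    """Code one diplotype (a pair of haplotypes) under the three genetic models."""
    copies = sum(1 for hap in diplotype if hap == target)
    return {
        "dominant": int(copies >= 1),   # carrier if at least one copy
        "recessive": int(copies == 2),  # carrier only if both copies
        "additive": copies,             # number of copies: 0, 1 or 2
    }


def expected_code(posterior, target, model):
    """Average the model coding over candidate diplotypes.

    `posterior` maps each candidate diplotype to its estimated probability
    (hypothetical values here; in practice they would come from haplotype
    estimation).
    """
    return sum(p * carrier_codes(d, target)[model] for d, p in posterior.items())


# A subject whose diplotype is known with certainty.
homozygote = ("TGCA", "TGCA")
heterozygote = ("TGCA", "CACT")
non_carrier = ("CACT", "CGTT")
for d in (homozygote, heterozygote, non_carrier):
    print(d, carrier_codes(d, "TGCA"))

# A subject whose diplotype is ambiguous (illustrative probabilities only).
ambiguous = {("TGCA", "TGCA"): 0.6, ("TGCA", "CACT"): 0.4}
for model in ("dominant", "recessive", "additive"):
    print(model, expected_code(ambiguous, "TGCA", model))
```

A heterozygous subject counts as a carrier under the dominant coding but not under the recessive coding, which is why the number of carriers shrinks when the recessive model is assumed.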

So far we have discussed the comparison of power between our method and QTLmarc. In actual data analysis, however, the objective is to find out whether there exists any haplotype that is closely related to the phenotypes, assuming an appropriate one among the dominant, recessive and additive models. For this purpose we may operationally choose the model with the highest significance. Note that, if we wish to use the AIC statistics, we can compute them up to an additive constant from the values of the likelihood ratio statistics, because the log-likelihoods of the null models are common to the dominant, recessive and additive models. To study how this idea works in actual data analysis, we analyzed each of the three datasets generated in the manner described in Section 3 assuming the three models, i.e., the dominant, recessive and additive models. The chi-squared statistics and the AIC statistics are given in Table 9. The true models were detected correctly in all cases, but there may be a tendency for the dataset generated under the additive model to be misjudged as coming from one of the other models.
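
The relation between the likelihood ratio statistics and the AIC values can be made concrete. Since AIC = −2 log L + 2k and the null log-likelihood is shared, the AIC of each alternative model equals 2k minus its likelihood ratio statistic, up to that common constant. The sketch below assumes k = 2 additional parameters per model; this value is an assumption, chosen because it reproduces the AIC rows of Table 9 (for example 4 − 11.5837132 = −7.5837132). The variable and function names are illustrative.

```python
# Chi-squared (likelihood ratio) statistics from Table 9:
# outer key = true model, inner key = model assumed in the test.
chi2 = {
    "dominant":  {"dominant": 11.5837132, "recessive": 0.0384127,  "additive": 4.2594829},
    "recessive": {"dominant": 3.4820052,  "recessive": 15.3214361, "additive": 8.2125950},
    "additive":  {"dominant": 8.7507100,  "recessive": 11.6215961, "additive": 11.8376920},
}

K = 2  # assumed number of extra parameters in each alternative model


def relative_aic(lr_stat, k=K):
    """AIC up to the additive constant shared through the common null model."""
    return 2 * k - lr_stat


selected = {}
for true_model, stats in chi2.items():
    aic = {assumed: relative_aic(s) for assumed, s in stats.items()}
    # The smallest AIC corresponds to the largest statistic when k is equal.
    selected[true_model] = min(aic, key=aic.get)
    print(true_model, {m: round(v, 7) for m, v in aic.items()}, "->", selected[true_model])
```

Because all three models are given the same k here, ranking by AIC is equivalent to ranking by the chi-squared statistic; the two criteria would differ only if the models had different numbers of parameters.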

Our method has two advantages over the QTLmarc algorithm. One is that it can treat genotype data with stochastically determined diplotypes; the other is that it can assume any model among the dominant, recessive and additive models. We expect that our method will be useful in association studies of complex diseases such as schizophrenia and autism, where the causes of the diseases are not yet resolved and there exist multiple candidate responses.

Acknowledgements

We are deeply grateful to the referees, who provided carefully considered feedback and valuable comments. This work was partly supported by KAKENHI (21700317; Grant-in-Aid for Young Scientists (B)).

References

Barrett, J.C., Fry, B., Maller, J., Daly, M.J., 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21 (2), 263–265.

Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B 57 (1), 289–300.


Kamatani, N., Sekine, A., Kitamoto, T., Iida, A., Saito, S., Kogame, A., Inoue, E., Kawamoto, M., Harigai, M., Nakamura, Y., 2004. Large-scale single-nucleotide polymorphism (SNP) and haplotype analyses, using dense SNP maps, of 199 drug-related genes in 752 subjects: the analysis of the association between uncommon SNPs within haplotype blocks and the haplotypes constructed with haplotype-tagging SNPs. American Journal of Human Genetics 75 (2), 190–203.

Kamitsuji, S., Kamatani, N., 2006. Estimation of haplotype associated with several quantitative phenotypes based on maximization of area under a receiver operating characteristic (ROC) curve. Journal of Human Genetics 51 (4), 314–325.

Shibata, K., Ito, T., Kitamura, Y., Iwaaki, N., Tanaka, H., Kamatani, N., 2004. Simultaneous estimation of haplotype frequencies and quantitative trait parameters: applications to the test of association between phenotype and diplotype configuration. Genetics 168, 525–539.

Tomita, M., Hatsumichi, M., Kurihara, K., 2008. Identify LD blocks based on hierarchical spatial data. Computational Statistics and Data Analysis 52 (4), 1806–1820.

Tomita, M., Inoue, E., Kamatani, N., 2004. Integrated environmental system for SNPs data analysis. In: Program and Abstracts of The 13th Takeda Science Foundation Symposium on Bioscience, 92.

Wang, N., Akey, J.M., Zhang, K., Chakraborty, R., Jin, L., 2002. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. American Journal of Human Genetics 71 (5), 1227–1234.