
Entropy Coefficient of Determination and Its Application

Nobuoki Eshima

Department of Biostatistics, Faculty of Medicine, Oita University, Oita 879-5593, Japan.

E-mail: [email protected]

The objective of this seminar is to introduce the entropy coefficient of determination (ECD) for measuring the explanatory or predictive power of GLMs and to consider how ECD is used in data analysis. In the first section, the classical regression and GLM frameworks are compared, and properties of GLMs concerning entropy are discussed. First, the information of an event and the entropy of a random variable are explained, and the Kullback-Leibler information that describes the difference between two distributions is treated. Second, the log odds ratio and the mean in the GLM are considered from a viewpoint of entropy. ECD is interpreted as the ratio of variation of a response variable explained by the explanatory variables, and is compared with some other explanatory power measures with respect to the following properties: (i) interpretability; (ii) being the multiple correlation coefficient or the coefficient of determination in normal linear regression models; (iii) entropy-based property; (iv) applicability to all GLMs. In addition to these, it may be appropriate for a measure to have the following property: (v) monotonicity in the complexity of the linear predictor. In Section 2, first the asymptotic properties of the maximum likelihood estimator (MLE) of ECD are discussed. The confidence interval of ECD is considered on the basis of the approximate normality of the non-central chi-square distribution. Second, in canonical link GLMs the contributions of factors are treated according to the decomposition of ECD. Numerical examples are also given.

1 Coefficient of determination for generalized linear models

1.1 Information Theory

Let $X$ be a categorical variable with categories $\{C_1, C_2, \ldots, C_K\}$; in what follows, the categories are formally described as $\{1, 2, \ldots, K\}$. Then, the information of $X = k$ is defined by

$$I(X = k) = \log\frac{1}{\Pr(X = k)}. \qquad (1.1)$$

In the above information, the base of the logarithm is $e$ and the unit is the nat. If the base is 2, the unit is the bit. In this seminar, the base of the logarithm is $e$. The mean of the above information, which is called entropy, is defined by

$$H(X) \equiv \sum_{k=1}^{K}\Pr(X = k)\,I(X = k) = \sum_{k=1}^{K}\Pr(X = k)\log\frac{1}{\Pr(X = k)}. \qquad (1.2)$$

Entropy is a measure of the uncertainty in random variable $X$ or its sample space. Let $p_k = \Pr(X = k)$ $(k = 1, 2, \ldots, K)$. Then, we have the following theorem.

Theorem 1.1. Let $\boldsymbol{p} = \{p_1, p_2, \ldots, p_K\}$ and $\boldsymbol{q} = \{q_1, q_2, \ldots, q_K\}$ be two distributions. Then,

$$\sum_{k=1}^{K} p_k\log p_k \ge \sum_{k=1}^{K} p_k\log q_k. \qquad (1.3)$$

Proof:

$$\sum_{k=1}^{K} p_k\log p_k - \sum_{k=1}^{K} p_k\log q_k = \sum_{k=1}^{K} p_k\log\frac{p_k}{q_k} = -\sum_{k=1}^{K} p_k\log\frac{q_k}{p_k} \ge \sum_{k=1}^{K} p_k\left(1 - \frac{q_k}{p_k}\right) = 0$$

$$\left(\because\ \log x \le x - 1,\ \text{with } x = \frac{q_k}{p_k}\right).$$

Equality holds if and only if $p_k = q_k$ $(k = 1, 2, \ldots, K)$. $\square$

From (1.3) it follows that

$$H(\boldsymbol{p}) = -\sum_{k=1}^{K} p_k\log p_k \le -\sum_{k=1}^{K} p_k\log q_k,$$

where $H(\boldsymbol{p})$ denotes the entropy of distribution $\boldsymbol{p}$. Setting $q_k = \frac{1}{K}$ $(k = 1, 2, \ldots, K)$, we have

$$H(\boldsymbol{p}) \le -\sum_{k=1}^{K} p_k\log\frac{1}{K} = \log K.$$

The following quantity is referred to as the Kullback-Leibler (KL) information or divergence.

$$D(\boldsymbol{p}\,\|\,\boldsymbol{q}) \equiv \sum_{k=1}^{K} p_k\log\frac{p_k}{q_k}\ (\ge 0). \qquad (1.4)$$

This information is interpreted as the difference, or the loss of information, incurred by using distribution $\boldsymbol{q}$ instead of the true distribution $\boldsymbol{p}$.

Example 1.1. Let $X \sim BN(3, \tfrac{1}{2})$ and let $Y$ be uniform, $\Pr(Y = k) = \tfrac{1}{4}$ $(k = 1, 2, 3, 4)$. Then, we have

$$D(X\,\|\,Y) = \tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{3}{8}\log\tfrac{3/8}{1/4} + \tfrac{3}{8}\log\tfrac{3/8}{1/4} = \tfrac{3}{4}\ln\tfrac{3}{2} - \tfrac{1}{4}\ln 2 = 0.13081.$$

The above quantity is the difference between $BN(3, \tfrac{1}{2})$ and the uniform distribution on $\{1, 2, 3, 4\}$. The reverse KL information is

$$D(Y\,\|\,X) = 2\cdot\tfrac{1}{4}\log\tfrac{1/4}{1/8} + 2\cdot\tfrac{1}{4}\log\tfrac{1/4}{3/8} = 0.1438.$$

In this case, $D(X\,\|\,Y) \ne D(Y\,\|\,X)$. $\square$
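The two values in Example 1.1 can be checked numerically; the following minimal Python sketch computes both directions of the KL information (1.4) for the binomial and uniform distributions.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler information D(p || q) in nats, as in (1.4)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([1/8, 3/8, 3/8, 1/8])   # BN(3, 1/2)
q = np.array([1/4, 1/4, 1/4, 1/4])   # uniform on {1, 2, 3, 4}

print(kl(p, q))  # approximately 0.13081
print(kl(q, p))  # approximately 0.1438, so the KL information is not symmetric
```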

For continuous distributions, the KL information can be defined similarly as

$$D(f(x)\,\|\,g(x)) \equiv \int f(x)\log\frac{f(x)}{g(x)}\,dx\ (\ge 0), \qquad (1.5)$$

where $f(x)$ and $g(x)$ are density functions.

Example 1.2. Let $f(x) \sim N(\mu_1, \sigma^2)$ and $g(x) \sim N(\mu_2, \sigma^2)$. Then,

$$D(f(x)\,\|\,g(x)) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}\ \left(= D(g(x)\,\|\,f(x))\right).$$

1.2 Entropy in GLMs

Let $\mathbf{X}$ and $Y$ be a $p \times 1$ explanatory variable vector and a response variable, respectively, and let $f(y|\mathbf{x})$ be the conditional probability or density function of $Y$ given $\mathbf{X} = \mathbf{x}$. The function $f(y|\mathbf{x})$ is assumed to be a member of the following exponential family of distributions:

$$f(y|\mathbf{x}) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}, \qquad (1.6)$$

where $\theta$ and $\phi$ are parameters, and $a(\phi)$, $b(\theta)\ (> 0)$ and $c(y, \phi)$ are specific functions. This is the random component. Let $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)^T$. For a link function $h(u)$ (link component) and the linear predictor $\eta = \boldsymbol{\beta}^T\mathbf{x}$ (systematic component), the conditional expectation of $Y$ given $\mathbf{X} = \mathbf{x}$ is described as follows:

$$E(Y|\mathbf{X} = \mathbf{x}) = \frac{db(\theta)}{d\theta} = h^{-1}(\boldsymbol{\beta}^T\mathbf{x}). \qquad (1.7)$$

Let us assume that the link function $h(u)$ is a strictly increasing, differentiable function. The conditional variance of response $Y$ given $\mathbf{X} = \mathbf{x}$ is as follows:

$$\mathrm{Var}(Y|\mathbf{X} = \mathbf{x}) = a(\phi)\frac{d^2 b(\theta)}{d\theta^2}.$$

From this, $a(\phi)$ relates to the dispersion of $Y$, so it is referred to as the dispersion parameter. Since $\theta$ is a function of $\eta = \boldsymbol{\beta}^T\mathbf{x}$, for simplicity the function is denoted by $\theta = \theta(\boldsymbol{\beta}^T\mathbf{x})$. Let us consider the following log odds ratio:

$$\log OR(\mathbf{x}, \mathbf{x}_0; y, y_0) = \log\frac{f(y|\mathbf{x})/f(y_0|\mathbf{x})}{f(y|\mathbf{x}_0)/f(y_0|\mathbf{x}_0)} = \log\frac{f(y|\mathbf{x})f(y_0|\mathbf{x}_0)}{f(y_0|\mathbf{x})f(y|\mathbf{x}_0)} = \frac{1}{a(\phi)}(y - y_0)\left(\theta(\boldsymbol{\beta}^T\mathbf{x}) - \theta(\boldsymbol{\beta}^T\mathbf{x}_0)\right), \qquad (1.8)$$

where $\mathbf{x}_0$ and $y_0$ are baselines of $\mathbf{X}$ and $Y$, respectively. The above log odds ratio is viewed as an inner product of $\theta(\boldsymbol{\beta}^T\mathbf{x})$ and $y$ with respect to the dispersion parameter $a(\phi)$. Since

$$\log OR(\mathbf{x}, \mathbf{x}_0; y, y_0) = \left\{\left(-\log f(y_0|\mathbf{x})\right) - \left(-\log f(y|\mathbf{x})\right)\right\} - \left\{\left(-\log f(y_0|\mathbf{x}_0)\right) - \left(-\log f(y|\mathbf{x}_0)\right)\right\},$$

the log odds ratio (1.8) is the change of the uncertainty of response $Y$ in explanatory variable vector $\mathbf{X}$, and, as seen in the above log odds ratio, the predictor $\boldsymbol{\beta}^T\mathbf{x}$ is related to the reduction of uncertainty of response $Y$ through the link function. For levels of the factor vector $\mathbf{X} = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$, the averages of $Y$, $\theta$ and $Y\theta$ are defined as follows:

$$E(\theta) = \frac{\sum_{k=1}^{K}\theta(\boldsymbol{\beta}^T\mathbf{x}_k)}{K}, \qquad E(Y) = \frac{\sum_{k=1}^{K}E(Y|\mathbf{X} = \mathbf{x}_k)}{K},$$

and

$$E(Y\theta) = \frac{\sum_{k=1}^{K}E(Y|\mathbf{X} = \mathbf{x}_k)\,\theta(\boldsymbol{\beta}^T\mathbf{x}_k)}{K}.$$

Remark 1.1. Let $n_k$ be the sample sizes at factor levels $\mathbf{x}_k$ $(k = 1, 2, \ldots, K)$, and $n = \sum_{k=1}^{K} n_k$. Then, the above averages (expectations) are replaced by the weighted ones, e.g.

$$E(\theta) = \sum_{k=1}^{K}\frac{n_k}{n}\theta(\boldsymbol{\beta}^T\mathbf{x}_k).$$

When we take the expectation of the inner product (1.8), we have

$$\frac{\mathrm{Cov}(\theta, Y)}{a(\phi)} + \frac{\left(E(\theta) - \theta(\boldsymbol{\beta}^T\mathbf{x}_0)\right)\left(E(Y) - y_0\right)}{a(\phi)},$$

where $\mathrm{Cov}(\theta, Y) = E(\theta Y) - E(\theta)E(Y)$. For $y_0 = E(Y)$, the quantity becomes

$$\frac{\mathrm{Cov}(\theta, Y)}{a(\phi)} \qquad (1.9)$$

and it can be viewed as the average change of uncertainty of response variable $Y$ in explanatory variable vector $\mathbf{X}$. We have Theorem 1.2 and Corollary 1.1.

Theorem 1.2. In the GLM with (1.6), the quantity (1.9) is expressed by the Kullback-Leibler information:

$$\frac{\mathrm{Cov}(\theta, Y)}{a(\phi)} = \frac{\sum_{k=1}^{K} KL\left(f(y), f(y|\mathbf{x}_k)\right)}{K}, \qquad (1.10)$$

where $f(y) = \frac{\sum_{k=1}^{K} f(y|\mathbf{x}_k)}{K}$ and

$$KL\left(f(y), f(y|\mathbf{x}_k)\right) = \int f(y|\mathbf{x}_k)\log\left(\frac{f(y|\mathbf{x}_k)}{f(y)}\right)dy + \int f(y)\log\left(\frac{f(y)}{f(y|\mathbf{x}_k)}\right)dy = D\left(f(y|\mathbf{x}_k)\,\|\,f(y)\right) + D\left(f(y)\,\|\,f(y|\mathbf{x}_k)\right) \quad (k = 1, 2, \ldots, K).$$

Corollary 1.1. In the GLM with (1.6), the covariance of $Y$ and $\theta(\boldsymbol{\beta}^T\mathbf{X})$ is nonnegative, and it is zero if and only if $\mathbf{X}$ and $Y$ are independent, i.e. $f(y|\mathbf{x}_k) = f(y)$ $(k = 1, 2, \ldots, K)$.

Example 1.3. An ordinary linear regression model is

$$Y = \alpha + \boldsymbol{\beta}^T\mathbf{x} + e,$$

where $e$ is a normal error with mean 0 and variance $\sigma^2$. Let $f(y|\mathbf{x})$ be a normal density function with mean $\mu$ and variance $\sigma^2$. Then, the random component is

$$f(y|\mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) = \exp\left\{\frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \log\left(\sqrt{2\pi\sigma^2}\right)\right\}.$$

In this expression, setting

$$\theta = \mu, \quad a(\phi) = \sigma^2, \quad b(\theta) = \tfrac{1}{2}\theta^2 \quad \text{and} \quad c(y, \phi) = -\frac{y^2}{2\sigma^2} - \log\left(\sqrt{2\pi\sigma^2}\right),$$

and taking the linear predictor $\eta = \alpha + \boldsymbol{\beta}^T\mathbf{x}$ and the link function $\theta = \eta$, the normal linear regression model can be viewed as a GLM.

Example 1.4. Let $Y$ be a binary variable with $p = \Pr(Y = 1)\ (= \mu)$. The random component is

$$f(y|\mathbf{x}) = p^y(1 - p)^{1-y} = \left\{p(1 - p)^{-1}\right\}^y(1 - p) = \exp\left\{y\log\frac{p}{1 - p} + \log(1 - p)\right\}.$$

Then,

$$\theta = \log\frac{p}{1 - p}, \quad a(\phi) = 1, \quad b(\theta) = -\log(1 - p) \quad \text{and} \quad c(y, \phi) = 0.$$

Setting the link function $h(p) = \log\frac{p}{1 - p}$, we have the following logistic regression (logit) model:

$$f(y|x) = \exp\left\{(\alpha + \beta x)y + \log(1 - p)\right\} = \frac{\exp\{(\alpha + \beta x)y\}}{1 + \exp(\alpha + \beta x)}.$$

For the logit model with explanatory variables $X_1, X_2, \ldots, X_p$, the model is expressed as

$$f(y|\mathbf{x}) = \frac{\exp\left\{\left(\alpha + \sum_{i=1}^{p}\beta_i x_i\right)y\right\}}{1 + \exp\left(\alpha + \sum_{i=1}^{p}\beta_i x_i\right)}. \qquad (1.11)$$

1.3 Basic Predictive Power Measures for GLMs

In the sense of the previous discussion, it may be appropriate to assess the predictive or explanatory power of factors based on entropy. In the GLM framework, predictive power measures are compared, and the advantage of ECD is mentioned. First, some predictive power measures for regression models are briefly discussed. In general regression models, for a variation function $D$, the predictive power can be measured as follows:

$$R^2_D = \frac{D(Y) - D(Y|\mathbf{X})}{D(Y)}, \qquad (1.12)$$

where $D(Y)$ and $D(Y|\mathbf{X})$ imply a variation function of $Y$ and a conditional or error variation function given $\mathbf{X}$, respectively (Efron, 1978; Agresti, 1986; Korn & Simon, 1991). Predictive power measures based on the likelihood function (Theil, 1970; Goodman, 1971) are constructed from powers of the likelihood function, i.e.

$$R^2_L = 1 - \left(\frac{l(0)}{l(\boldsymbol{\beta})}\right)^{\frac{2}{n}},$$

where $l(\boldsymbol{\beta})$ is the likelihood function and $n$ is the sample size. Let $R$ be the multiple correlation coefficient in the ordinary linear regression model. The above measure becomes $R^2$ in ordinary linear regression cases and increases with model complexity; however, it is difficult to interpret the measure in general (Zheng & Agresti, 2000). The entropy measure (Haberman, 1982) for categorical responses is based on the entropy of $Y$, $H(Y)$, and the conditional entropy $H(Y|\mathbf{X})$, i.e.

$$R^2_E = \frac{H(Y) - H(Y|\mathbf{X})}{H(Y)}.$$

The above measure is included in (1.12). The correlation coefficient of response $Y$ and its conditional expectation given factor $\mathbf{X}$, $\mathrm{Corr}(E(Y|\mathbf{X}), Y)$, is recommended for measuring the predictive power of GLMs, because the correlation measure can be applied to all types of GLMs except polytomous response cases (Zheng & Agresti, 2000). This measure is the correlation coefficient between response $Y$ and the regression on $\mathbf{X}$, and is referred to as the regression correlation coefficient.

With respect to entropy, $R^2_L$ and $R^2_E$ may be suitable for GLMs. By considering the average change of the log odds ratio, Eshima & Tabata (2007) proposed the following basic predictive power measure:

$$mPP(Y|\mathbf{X}) \equiv \frac{\mathrm{Cov}(\theta, Y)}{a(\phi)}. \qquad (1.13)$$

The above measure is expressed by the Kullback-Leibler information (Eshima & Tabata, 2007), and it is increasing in $\mathrm{Cov}(\theta, Y)$ and decreasing in $a(\phi)$. Since

$$\mathrm{Var}(Y|\mathbf{X} = \mathbf{x}) = a(\phi)\frac{d^2 b(\theta)}{d\theta^2},$$

the function $a(\phi)$ may be interpreted as the error variation of $Y$ in entropy, i.e. the residual randomness of $Y$ given $\mathbf{X}$. From this, $\mathrm{Cov}(\theta, Y)$ can be interpreted as the entropy of $Y$ explained by $\mathbf{X}$. Hence, measure (1.13) is the ratio of the explained variation of $Y$ to the error variation of $Y$ in entropy. The entropy variation function $D_E$ is defined by

$$D_E(Y) \equiv \mathrm{Cov}(\theta, Y) + a(\phi).$$

Since $\theta$ is a function of $\mathbf{X}$, $\mathrm{Cov}(\theta, Y|\mathbf{X}) = 0$. From this, the conditional entropy variation of $Y$ given $\mathbf{X}$ is

$$D_E(Y|\mathbf{X}) \equiv a(\phi).$$

Considering this, ECD is defined as follows:

$$ECD(\mathbf{X}, Y) = \frac{\mathrm{Cov}(\theta, Y)}{\mathrm{Cov}(\theta, Y) + a(\phi)} = \frac{mPP(Y|\mathbf{X})}{mPP(Y|\mathbf{X}) + 1}\ \left(= \frac{D_E(Y) - D_E(Y|\mathbf{X})}{D_E(Y)}\right). \qquad (1.14)$$

From (1.14), ECD is included in (1.12), and ECD can be viewed as the proportion of the explained variation of $Y$ in entropy. For the normal linear regression model, it follows that

$$ECD(\mathbf{X}, Y) = R^2.$$

Let $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)^T$ be a regression coefficient vector. For canonical links $\theta = \sum_{i=1}^{p}\beta_i X_i$, ECD (1.14) and the entropy correlation coefficient (ECC) are decomposed as follows:

$$ECD(\mathbf{X}, Y) = \frac{\sum_{i=1}^{p}\beta_i\mathrm{Cov}(X_i, Y)}{\sum_{i=1}^{p}\beta_i\mathrm{Cov}(X_i, Y) + a(\phi)}, \qquad (1.15)$$

$$ECorr(\mathbf{X}, Y) = \frac{\sum_{i=1}^{p}\beta_i\mathrm{Cov}(X_i, Y)}{\sqrt{\mathrm{Var}(\theta)}\sqrt{\mathrm{Var}(Y)}}.$$

None of the predictive power measures except ECD and ECC can make the above type of decomposition for GLMs with canonical links. In addition to the desirable properties (i) to (v) of measures for GLMs, decomposability such as (1.15) may also be a suitable property for a predictive power measure, because the relative importance of $X_i$ may be assessed by $\beta_i\mathrm{Cov}(X_i, Y)$. Moreover, ECD is scale-invariant in GLMs with multivariate responses; however, ECC is not. In this respect, ECD is superior to ECC.

Table 1.1 Properties of five predictive power measures in GLMs: the regression correlation coefficient $\mathrm{Corr}(E(Y|\mathbf{X}), Y)$, $R^2_L$, $R^2_E$, ECC and ECD are compared with respect to (i) interpretability, (ii) being $R$ or $R^2$ in normal linear regression, (iii) being entropy-based, (iv) applicability to all GLMs, (v) monotonicity in the complexity of the linear predictor, and (vi) decomposability.

Table 1.1 summarizes the properties of the five measures mentioned above. Measures $\mathrm{Corr}(E(Y|\mathbf{X}), Y)$ and $ECorr(\mathbf{X}, Y)$ may have property (v) in most cases; however, it is not easy to prove the property in general. From this table, $ECD(\mathbf{X}, Y)$ is the most desirable predictive power measure for GLMs.

Example 1.5. Let $\mathbf{X}$ and $\mathbf{Y}$ be $p$- and $q$-dimensional random vectors, respectively, and let the joint distribution be a $(p + q)$-variate normal distribution with the following covariance matrix:

$$\Sigma = \begin{pmatrix}\Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}.$$

Let the inverse of the above matrix be denoted by

$$\Sigma^{-1} = \begin{pmatrix}\Sigma^{XX} & \Sigma^{XY}\\ \Sigma^{YX} & \Sigma^{YY}\end{pmatrix}.$$

Then, $\theta = \Sigma^{YX}\mathbf{X}$ and $a(\phi) = 1$, so we have

$$mPP(\mathbf{Y}|\mathbf{X}) = \mathrm{tr}\,\Sigma^{YX}\Sigma_{XY}.$$

From this,

$$ECD(\mathbf{X}, \mathbf{Y}) = \frac{\mathrm{tr}\,\Sigma^{YX}\Sigma_{XY}}{\mathrm{tr}\,\Sigma^{YX}\Sigma_{XY} + 1}.$$

Let $\lambda_i$ $(i = 1, 2, \ldots, \min\{p, q\})$ be the squared canonical correlation coefficients. Then,

$$ECD(\mathbf{X}, \mathbf{Y}) = \frac{\sum_{i=1}^{\min\{p, q\}}\frac{\lambda_i}{1 - \lambda_i}}{\sum_{i=1}^{\min\{p, q\}}\frac{\lambda_i}{1 - \lambda_i} + 1}.$$

For $q = 1$, ECD reduces to the usual coefficient of determination $\lambda_1\ (= R^2)$. $\square$
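The canonical-correlation form of ECD above is easy to evaluate numerically; a minimal Python sketch (the function name is illustrative) also confirms the reduction to $R^2$ in the $q = 1$ case.

```python
import numpy as np

def ecd_from_canonical_corr(lam):
    """ECD computed from squared canonical correlations, as in Example 1.5."""
    lam = np.asarray(lam, dtype=float)
    s = np.sum(lam / (1.0 - lam))
    return s / (s + 1.0)

print(ecd_from_canonical_corr([0.36]))        # q = 1: returns 0.36, i.e. R^2
print(ecd_from_canonical_corr([0.36, 0.10]))  # two squared canonical correlations
```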

Example 1.6. In the logistic regression model (1.11), we have

$$ECD(\mathbf{X}, Y) = \frac{\sum_{i=1}^{p}\beta_i\mathrm{Cov}(X_i, Y)}{\sum_{i=1}^{p}\beta_i\mathrm{Cov}(X_i, Y) + 1}.$$
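Given estimated coefficients, the logistic-regression form of ECD can be computed from sample covariances as in (1.15); the following Python sketch is a minimal illustration with simulated data, using a hypothetical coefficient vector in place of ML estimates obtained elsewhere.

```python
import numpy as np

def ecd_logit(X, y, beta):
    """ECD for a logistic (canonical-link) model as in Example 1.6:
    a(phi) = 1 and the explained entropy is sum_i beta_i * Cov(X_i, Y)."""
    covs = np.array([np.cov(X[:, i], y, bias=True)[0, 1] for i in range(X.shape[1])])
    explained = float(beta @ covs)
    return explained / (explained + 1.0)

# Hypothetical illustration: simulated data and a coefficient vector that would
# normally be the ML estimate of the logit model (1.11).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
beta = np.array([1.0, -0.5])
prob = 1.0 / (1.0 + np.exp(-(0.2 + X @ beta)))
y = rng.binomial(1, prob)
print(ecd_logit(X, y, beta))
```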


2 Application of ECD

2.1 Asymptotic Property of the ML Estimator of ECD

Let $f(y)$ and $g(\mathbf{x})$ be the marginal density or probability functions of $Y$ and $\mathbf{X}$, respectively. Then, the association measure is expressed as

$$mPP(Y|\mathbf{X}) = \iint f(y|\mathbf{x})g(\mathbf{x})\log\left(\frac{f(y|\mathbf{x})}{f(y)}\right)d\mathbf{x}\,dy + \iint f(y)g(\mathbf{x})\log\left(\frac{f(y)}{f(y|\mathbf{x})}\right)d\mathbf{x}\,dy. \qquad (2.1)$$

If $Y$ is discrete, the integrals are replaced with summations. If $\mathbf{X}$ is not random and takes values $\mathbf{x}_k$ $(k = 1, 2, \ldots, K)$, the above measure can be modified as follows:

$$mPP(Y|\mathbf{X}) = \sum_{k=1}^{K}\frac{n_k}{n}\left\{\int f(y|\mathbf{x}_k)\log\left(\frac{f(y|\mathbf{x}_k)}{f(y)}\right)dy + \int f(y)\log\left(\frac{f(y)}{f(y|\mathbf{x}_k)}\right)dy\right\}, \qquad (2.2)$$

where $n_k$ are the sample sizes at levels $\mathbf{x}_k$ $(k = 1, 2, \ldots, K)$ and $n = \sum_{k=1}^{K} n_k$. We have the following theorem.

Theorem 2.1. Let $\widehat{mPP}(Y|\mathbf{X})$, $\hat{f}(y|\mathbf{x}_k)$, and $\hat{f}(y)$ be the ML estimators of $mPP(Y|\mathbf{X})$, $f(y|\mathbf{x}_k)$, and $f(y)$, respectively. If the null model, i.e. $\boldsymbol{\beta} = \mathbf{0}$, holds, the ML estimator of (2.2) multiplied by the sample size $n$, i.e.

$$n\cdot\widehat{mPP}(Y|\mathbf{X}) = \sum_{k=1}^{K}n_k\left\{\int\hat{f}(y|\mathbf{x}_k)\log\left(\frac{\hat{f}(y|\mathbf{x}_k)}{\hat{f}(y)}\right)dy + \int\hat{f}(y)\log\left(\frac{\hat{f}(y)}{\hat{f}(y|\mathbf{x}_k)}\right)dy\right\},$$

is asymptotically distributed according to the chi-square distribution with $p$ degrees of freedom as the sample sizes $n_k$ tend to infinity.

Proof. For simplicity of the discussion, the theorem is proven in the case where $Y$ is a polytomous variable with levels or categories $\{1, 2, \ldots, J\}$. Let $\pi_{j|k} = \Pr(Y = j|\mathbf{X} = \mathbf{x}_k)$ and $\pi_j = \Pr(Y = j)$, and let $\hat{\pi}_{j|k}$ and $\hat{\pi}_j$ be the ML estimators of $\pi_{j|k}$ and $\pi_j$, respectively. Under the null hypothesis and for sufficiently large $n_k$, we have

$$n\cdot\widehat{mPP}(Y|\mathbf{X}) = \sum_{k=1}^{K}n_k\sum_{j=1}^{J}\left\{\hat{\pi}_{j|k}\log\frac{\hat{\pi}_{j|k}}{\hat{\pi}_j} + \hat{\pi}_j\log\frac{\hat{\pi}_j}{\hat{\pi}_{j|k}}\right\}$$

$$= \sum_{k=1}^{K}\sum_{j=1}^{J}\frac{\left(n_k\hat{\pi}_{j|k} - n_k\hat{\pi}_j\right)^2}{2n_k\hat{\pi}_{j|k}} + \sum_{k=1}^{K}\sum_{j=1}^{J}\frac{\left(n_k\hat{\pi}_{j|k} - n_k\hat{\pi}_j\right)^2}{2n_k\hat{\pi}_j} + o(n)$$

$$= \sum_{k=1}^{K}\sum_{j=1}^{J}\frac{\left(n_k\hat{\pi}_{j|k} - n_k\hat{\pi}_j\right)^2}{n_k\hat{\pi}_{j|k}} + o(n),$$

where $\frac{o(n)}{n}\xrightarrow{\ P\ } 0$ $(n\to\infty)$. Hence, the theorem follows. $\square$

When the explanatory variables $\mathbf{X}$ are random, the following theorem holds similarly.

Theorem 2.2. If the null model, i.e. $\boldsymbol{\beta} = \mathbf{0}$, holds, the ML estimator of (2.1) multiplied by the sample size $n$, i.e.

$$n\cdot\widehat{mPP}(Y|\mathbf{X}) = n\left\{\iint\hat{f}(y|\mathbf{x})\hat{g}(\mathbf{x})\log\left(\frac{\hat{f}(y|\mathbf{x})}{\hat{f}(y)}\right)d\mathbf{x}\,dy + \iint\hat{f}(y)\hat{g}(\mathbf{x})\log\left(\frac{\hat{f}(y)}{\hat{f}(y|\mathbf{x})}\right)d\mathbf{x}\,dy\right\},$$

is asymptotically distributed according to the chi-square distribution with $p$ degrees of freedom as the sample size $n$ tends to infinity.

Since

$$ECD(\mathbf{X}, Y) = \frac{\mathrm{Cov}(\theta, Y)/a(\phi)}{\mathrm{Cov}(\theta, Y)/a(\phi) + 1} = \frac{mPP(Y|\mathbf{X})}{mPP(Y|\mathbf{X}) + 1},$$

the ML estimator of $ECD(\mathbf{X}, Y)$ is

$$\widehat{ECD}(\mathbf{X}, Y) = \frac{\widehat{mPP}(Y|\mathbf{X})}{\widehat{mPP}(Y|\mathbf{X}) + 1}.$$

From this, we can test the hypothesis $ECD(\mathbf{X}, Y) = 0$ based on the following statistic:

$$\chi^2 = n\cdot\widehat{mPP}(Y|\mathbf{X})\ \left(= n\frac{\widehat{\mathrm{Cov}}(\theta, Y)}{a(\hat{\phi})}\right). \qquad (2.3)$$

The above statistic is asymptotically distributed according to a non-central chi-square distribution with non-centrality parameter

$$\lambda = n\cdot mPP(Y|\mathbf{X})$$

and $p$ degrees of freedom. Let

$$c = 1 + \frac{\lambda}{p + \lambda} \quad \text{and} \quad \nu' = \frac{(p + \lambda)^2}{p + 2\lambda}.$$

The statistic $\frac{\chi^2}{c}$ is asymptotically distributed according to the chi-square distribution with $\nu'$ degrees of freedom. As $\nu'$ becomes large, this chi-square distribution tends to a normal distribution with mean $\nu'$ and variance $2\nu'$. From this, for a sufficiently large sample size $n$, the statistic (2.3) divided by $n$,

$$\frac{\chi^2}{n} = \widehat{mPP}(Y|\mathbf{X}),$$

is asymptotically normally distributed with mean $\frac{c\nu'}{n}$ and variance $\frac{2c^2\nu'}{n^2}$ (Patnaik, 1949). For sufficiently large $n$, we have

$$\frac{c\nu'}{n}\approx mPP(Y|\mathbf{X}), \qquad \frac{2c^2\nu'}{n^2}\approx\frac{2c}{n}mPP(Y|\mathbf{X}).$$

From this, the asymptotic standard error (ASE) of $\widehat{mPP}(Y|\mathbf{X})$ is $\sqrt{\frac{2c}{n}\widehat{mPP}(Y|\mathbf{X})}$.
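Putting these approximations together, the test of $ECD(\mathbf{X}, Y) = 0$ and the ASE can be computed from an estimated mPP, the sample size $n$ and the degrees of freedom $p$. The following Python sketch is a minimal illustration; the last delta-method step for the standard error of the ECD estimator itself is an assumption added here for illustration (the text above only gives the ASE of the mPP estimator), although it is consistent with the standard errors quoted in the examples below.

```python
import numpy as np
from scipy import stats

def ecd_inference(mpp_hat, n, p):
    """Chi-square test of ECD = 0 (statistic (2.3)) and Patnaik-type ASEs."""
    chi2 = n * mpp_hat
    p_value = stats.chi2.sf(chi2, df=p)          # central chi-square under the null
    lam = n * mpp_hat                            # plug-in non-centrality
    c = 1.0 + lam / (p + lam)
    ase_mpp = np.sqrt(2.0 * c * mpp_hat / n)     # ASE of the mPP estimator
    ecd_hat = mpp_hat / (mpp_hat + 1.0)
    ase_ecd = ase_mpp / (mpp_hat + 1.0) ** 2     # delta method: d/dm [m/(m+1)] = 1/(m+1)^2
    return chi2, p_value, ecd_hat, ase_ecd

# Illustrative values in line with Example 2.1 below: mPP-hat of about 0.905
# (i.e. ECD roughly 0.475) with n = 482 beetles and p = 1.
print(ecd_inference(0.905, 482, 1))              # ASE of ECD roughly 0.024
```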

Example 2.1. Agresti (2002, pp. 247-250) analyzed the beetle mortality data with the complementary log-log model, in which beetles were exposed to gaseous carbon disulfide at various concentrations, and the numbers of beetles killed after the 5-hour exposure were observed (Table 2.1). Let $X$ be the gaseous carbon disulfide concentration, and $\alpha$ and $\beta$ the model parameters. Then, for the complementary log-log link we have

$$\theta = \log\left(\frac{1 - \exp\{-\exp(\alpha + \beta x)\}}{\exp\{-\exp(\alpha + \beta x)\}}\right).$$

For the maximum likelihood estimates $\hat{\alpha} = -39.52$ (ASE = 3.23) and $\hat{\beta} = 22.01$ (ASE = 1.80), ECD is calculated as

$$ECD = 0.475\ (\mathrm{ASE} = 0.024).$$

This ECD indicates that 47.5% of the variation of response variable $Y$ in entropy is explained by the gaseous carbon disulfide concentration $X$.

Table 2.1. Beetles Killed after Exposure to Carbon Disulfide

Log Dose   No. of Beetles   No. of Beetles Killed
1.691      60                6
1.724      60               13
1.755      62               18
1.784      56               28
1.811      63               52
1.837      59               53
1.861      62               61
1.884      60               60
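The ECD of Example 2.1 can be reconstructed from Table 2.1 and the quoted ML estimates by evaluating $\mathrm{Cov}(\theta, Y)$ with the weights $n_k/n$ of Remark 1.1; the following Python sketch is a minimal version of that calculation, under the assumption that $E(Y|x_k)$ is replaced by the fitted complementary log-log probabilities.

```python
import numpy as np

# Table 2.1: log dose and number of beetles per concentration
x  = np.array([1.691, 1.724, 1.755, 1.784, 1.811, 1.837, 1.861, 1.884])
nk = np.array([60, 60, 62, 56, 63, 59, 62, 60])

alpha, beta = -39.52, 22.01               # ML estimates quoted in Example 2.1

eta   = alpha + beta * x
p     = 1.0 - np.exp(-np.exp(eta))        # fitted Pr(Y = 1 | x_k)
theta = np.log(p / (1.0 - p))             # canonical parameter (log odds)

w = nk / nk.sum()                         # weights n_k / n, as in Remark 1.1
cov_theta_y = np.sum(w * theta * p) - np.sum(w * theta) * np.sum(w * p)

mPP = cov_theta_y                         # a(phi) = 1 for the binomial model
print(mPP / (mPP + 1.0))                  # approximately 0.475, as in Example 2.1
```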

2.2 GLMs with canonical links

Most regression analyses with GLMs are performed using canonical links. Let $\mathbf{X} = (X_1, X_2, \ldots, X_p)^T$ be a $p \times 1$ factor or explanatory variable vector; let $Y$ be a response variable; let $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)^T$ be a regression parameter vector; and let $\theta = \sum_{i=1}^{p}\beta_i X_i$ be the canonical link. Then, ECD is decomposed as

$$ECD(\mathbf{X}, Y) = \frac{\sum_{i=1}^{p}\beta_i\mathrm{Cov}(X_i, Y)}{\sum_{j=1}^{p}\beta_j\mathrm{Cov}(X_j, Y) + a(\phi)}. \qquad (2.4)$$

The above decomposition consists of components that relate to the regression coefficients $\beta_i$, and the contribution of $X_i$ to $Y$ may be defined by using $\beta_i\mathrm{Cov}(X_i, Y)$. If the $X_i$ are independent or the experimental design is a multiway layout experiment, the contribution ratio of $X_i$ to the response is defined by

$$CR(X_i) = \frac{\beta_i\mathrm{Cov}(X_i, Y)}{\sum_{k=1}^{p}\beta_k\mathrm{Cov}(X_k, Y)}.$$

In general, the $X_i$ are correlated or the experimental model has higher-order interactions. Then, the contribution ratio of $X_i$ is defined by

$$CR(X_i) = \frac{\mathrm{Cov}(\theta, Y) - \mathrm{Cov}(\theta, Y|X_i)}{\mathrm{Cov}(\theta, Y)}.$$

Example 2.2. The present discussion is applied to the ordinary two-way layout experimental design model. Let $X_1$ and $X_2$ be factors with levels $\{1, 2, \ldots, I\}$ and $\{1, 2, \ldots, J\}$, respectively. Then, the linear predictor is a function of $(X_1, X_2) = (i, j)$, i.e.

$$\theta = \alpha_i + \beta_j + (\alpha\beta)_{ij}.$$

For model identification, the following constraints are placed on these parameters:

$$\sum_{i=1}^{I}\alpha_i = \sum_{j=1}^{J}\beta_j = \sum_{i=1}^{I}(\alpha\beta)_{ij} = \sum_{j=1}^{J}(\alpha\beta)_{ij} = 0.$$

Let

$$X_{ki} = \begin{cases}1 & (X_k = i)\\ 0 & (X_k \ne i)\end{cases} \qquad (k = 1, 2).$$

Then, the dummy vectors

$$\mathbf{X}_1 = (X_{11}, X_{12}, \ldots, X_{1I})^T \quad \text{and} \quad \mathbf{X}_2 = (X_{21}, X_{22}, \ldots, X_{2J})^T$$

are identified with factors $X_1$ and $X_2$, respectively. From this, the systematic component of the above model can be written as follows:

$$\theta = \boldsymbol{\alpha}^T\mathbf{X}_1 + \boldsymbol{\beta}^T\mathbf{X}_2 + \boldsymbol{\gamma}^T\mathbf{X}_1\otimes\mathbf{X}_2,$$

where

$$\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_I)^T, \quad \boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_J)^T, \quad \boldsymbol{\gamma} = ((\alpha\beta)_{11}, (\alpha\beta)_{12}, \ldots, (\alpha\beta)_{1J}, \ldots, (\alpha\beta)_{IJ})^T,$$

and

$$\mathbf{X}_1\otimes\mathbf{X}_2 = (X_{11}X_{21}, X_{11}X_{22}, \ldots, X_{11}X_{2J}, \ldots, X_{1I}X_{2J})^T.$$

Let $\mathrm{Cov}(\mathbf{X}_1, Y)$, $\mathrm{Cov}(\mathbf{X}_2, Y)$ and $\mathrm{Cov}(\mathbf{X}_1\otimes\mathbf{X}_2, Y)$ be the corresponding covariance matrices. Then the total effect of $X_1$ and $X_2$ is

$$mPP(Y|(\mathbf{X}_1, \mathbf{X}_2)) = \frac{\mathrm{Cov}(\theta, Y)}{\sigma^2} = \frac{\mathrm{tr}\,\boldsymbol{\alpha}^T\mathrm{Cov}(\mathbf{X}_1, Y)}{\sigma^2} + \frac{\mathrm{tr}\,\boldsymbol{\beta}^T\mathrm{Cov}(\mathbf{X}_2, Y)}{\sigma^2} + \frac{\mathrm{tr}\,\boldsymbol{\gamma}^T\mathrm{Cov}(\mathbf{X}_1\otimes\mathbf{X}_2, Y)}{\sigma^2}$$

$$= \frac{\frac{1}{I}\sum_{i=1}^{I}\alpha_i^2}{\sigma^2} + \frac{\frac{1}{J}\sum_{j=1}^{J}\beta_j^2}{\sigma^2} + \frac{\frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I}(\alpha\beta)_{ij}^2}{\sigma^2}.$$

The above three terms are referred to as the main effect of $X_1$, that of $X_2$, and the interactive effect, respectively. Then, ECD is calculated as follows:

$$ECD((\mathbf{X}_1, \mathbf{X}_2), Y) = \frac{\frac{1}{I}\sum_{i=1}^{I}\alpha_i^2 + \frac{1}{J}\sum_{j=1}^{J}\beta_j^2 + \frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I}(\alpha\beta)_{ij}^2}{\frac{1}{I}\sum_{i=1}^{I}\alpha_i^2 + \frac{1}{J}\sum_{j=1}^{J}\beta_j^2 + \frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I}(\alpha\beta)_{ij}^2 + \sigma^2}.$$

In this case,

$$CR(\mathbf{X}_1) = \frac{\mathrm{Cov}(\theta, Y) - \mathrm{Cov}(\theta, Y|\mathbf{X}_1)}{\mathrm{Cov}(\theta, Y)} = \frac{\frac{1}{I}\sum_{i=1}^{I}\alpha_i^2}{\mathrm{Cov}(\theta, Y)},$$

$$CR(\mathbf{X}_2) = \frac{\mathrm{Cov}(\theta, Y) - \mathrm{Cov}(\theta, Y|\mathbf{X}_2)}{\mathrm{Cov}(\theta, Y)} = \frac{\frac{1}{J}\sum_{j=1}^{J}\beta_j^2}{\mathrm{Cov}(\theta, Y)}.$$

The remainder

$$1 - CR(\mathbf{X}_1) - CR(\mathbf{X}_2) = \frac{\frac{1}{IJ}\sum_{j=1}^{J}\sum_{i=1}^{I}(\alpha\beta)_{ij}^2}{\mathrm{Cov}(\theta, Y)}$$

is due to the effect of the interaction.

Table 2.2. Length of Home Visit in Minutes by Public Health Nurses, by Nurse's Age Group and Type of Patient

Factor X2 (nurse's age group): 1 (20 to 29 years), 2 (30 to 39), 3 (40 to 49), 4 (50 and over). Five visits were recorded in each cell.

Factor X1 (type of patient)   Group 1           Group 2           Group 3           Group 4
1 (Cardiac)                   20 25 22 27 21    25 30 29 28 30    24 28 24  5 30    28 31 26 29 32
2 (Cancer)                    30 45 30 35 36    30 29 31 30 30    39 42 36 42 40    40 45 50 45 60
3 (C.V.A.)                    31 30 40 35 30    32 35 30 40 30    41 45 40 40 35    42 50 40 55 45
4 (Tuberculosis)              20 21 20 20 19    23 25 28 30 31    24 25 30 26 23    29 30 28 27 30

Table 2.2 shows two-way layout experimental data from a study of the length of time spent on individual home visits by public health nurses (Daniel, 1999, pp. 348-353). In this example, analysis of the effects of the factors, i.e. the type of patient and the age of the nurse, on the nurses' behavior is of interest. Let $Y$ be the length of a home visit, and let factors $X_1$ and $X_2$ denote the type of patient and the age of the nurse, respectively. The results of the two-way analysis of variance are shown in Table 2.3. The main and interactive effects of the factors are significant. In this case, the levels of factor vector $\mathbf{X} = (X_1, X_2)$ are $(i, j)$ $(i = 1, 2, 3, 4;\ j = 1, 2, 3, 4)$. Although factors $X_1$ and $X_2$ are independent (orthogonal), the model has interaction terms between them. By using the present approach and the variance decomposition in Table 2.3, we have

$$ECD((\mathbf{X}_1, \mathbf{X}_2), Y) = \frac{3226.45 + 1185.05 + 704.45}{6423.55} = 0.796\ (SE = 0.018).$$

From this, 79.6% of the entropy is explained by the two factors. The contributions of the factors are calculated as follows:

$$CR(\mathbf{X}_1) = \frac{3226.45}{6423.55} = 0.502 \quad \text{and} \quad CR(\mathbf{X}_2) = \frac{1185.05}{6423.55} = 0.184.$$

The contribution of $X_1$ to $Y$ is about three times greater than that of $X_2$.

Table 2.3. Analysis of variance of length of time spent on individual home visits by public health nurses

Source      SS        df    MS        F        p
X1          3226.45    3    1075.5    52.641   0.000
X2          1185.05    3    395.02    19.334   0.000
X1 × X2     704.45     9    78.272     3.831   0.000
Residual    1307.6    64    20.431
Total       6423.55   79
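The ECD and contribution ratios quoted above follow from the sums of squares in Table 2.3 by simple arithmetic; a minimal Python sketch, using the total sum of squares as the denominator as in the text:

```python
# Sums of squares from Table 2.3
ss_x1, ss_x2, ss_int, ss_resid = 3226.45, 1185.05, 704.45, 1307.6
ss_total = ss_x1 + ss_x2 + ss_int + ss_resid            # 6423.55

ecd   = (ss_x1 + ss_x2 + ss_int) / ss_total              # 0.796
cr_x1 = ss_x1 / ss_total                                 # 0.502
cr_x2 = ss_x2 / ss_total                                 # 0.184
print(round(ecd, 3), round(cr_x1, 3), round(cr_x2, 3))
```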

Example 2.3. Table 2.4 shows death penalty data used to study the effects of the defendant's and victim's racial characteristics on whether persons convicted of homicide received the death penalty (Agresti, 2002, pp. 48-49). The data were analyzed with the following logit model:

$$f(y|\mathbf{x}) = \frac{\exp\{(\alpha + \beta_d x_d + \beta_v x_v)y\}}{1 + \exp(\alpha + \beta_d x_d + \beta_v x_v)},$$

where $X_d$ and $X_v$ denote the defendant's and victim's race, i.e. 0 = black and 1 = white, and the response death penalty $Y$ takes the value 0 = no or 1 = yes. The estimated parameters were $\hat{\alpha} = -3.596$ (SE = 0.507), $\hat{\beta}_d = -0.868$ (SE = 0.367) and $\hat{\beta}_v = 2.404$ (SE = 0.601) (Agresti, 2002, p. 201). From the results, the odds of the death penalty for a white defendant, given the victim's race, are $\exp(-0.868) = 0.420$ times those for a black defendant, and the odds of the death penalty for a white victim, given the defendant's race, are $\exp(2.404) = 11.067$ times those for a black victim. Since the predictive power measured with ECD is ECD = 0.036 (SE = 0.014), the effects of the defendant's and victim's races on the death penalty are small. The contribution ratios of the explanatory variables on the response are calculated according to (2.4) as follows:

$$CR(X_d) = 0.603 \quad \text{and} \quad CR(X_v) = 0.813.$$

Table 2.4. Death Penalty Data

                                     Death Penalty
Victim's Race   Defendant's Race     Yes      No
White           White                 53     414
                Black                 11      37
Black           White                  0      16
                Black                  4     139
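The ECD of Example 2.3 can be reconstructed from Table 2.4 and the quoted coefficient estimates; the following Python sketch is a minimal version of that calculation ($a(\phi) = 1$ for the logit model, and the sample covariances are taken over the 674 cases).

```python
import numpy as np

# Table 2.4 expanded to one row per case: (defendant white, victim white, death penalty)
cells = [  # (x_d, x_v, y, count)
    (1, 1, 1, 53), (1, 1, 0, 414),
    (0, 1, 1, 11), (0, 1, 0, 37),
    (1, 0, 1, 0),  (1, 0, 0, 16),
    (0, 0, 1, 4),  (0, 0, 0, 139),
]
data = np.repeat(np.array([c[:3] for c in cells], float),
                 [c[3] for c in cells], axis=0)
xd, xv, y = data[:, 0], data[:, 1], data[:, 2]

def cov(a, b):
    return float(np.mean(a * b) - np.mean(a) * np.mean(b))

beta_d, beta_v = -0.868, 2.404                 # ML estimates quoted in Example 2.3
explained = beta_d * cov(xd, y) + beta_v * cov(xv, y)
print(explained / (explained + 1.0))           # approximately 0.036
```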

2.3 Application to a generalized logit model

A baseline-category logit model is considered. Let $X_1$ and $X_2$ be categorical factors that take levels $\{1, 2, \ldots, I\}$ and $\{1, 2, \ldots, J\}$, respectively, and let $Y$ be a categorical response variable with levels $\{1, 2, \ldots, K\}$. Let

$$X_{ai} = \begin{cases}1 & (X_a = i)\\ 0 & (X_a \ne i)\end{cases} \quad (a = 1, 2) \qquad \text{and} \qquad Y_k = \begin{cases}1 & (Y = k)\\ 0 & (Y \ne k).\end{cases} \qquad (2.5)$$

Then, the dummy variable vectors

$$\mathbf{X}_1 = (X_{11}, X_{12}, \ldots, X_{1I})^T, \quad \mathbf{X}_2 = (X_{21}, X_{22}, \ldots, X_{2J})^T \quad \text{and} \quad \mathbf{Y} = (Y_1, Y_2, \ldots, Y_K)^T$$

are identified with factors $X_1$, $X_2$ and response $Y$, respectively. From this, the systematic component of the baseline-category logit model is assumed as follows:

$$\boldsymbol{\theta} = \boldsymbol{\alpha} + B^{(1)}\mathbf{X}_1 + B^{(2)}\mathbf{X}_2,$$

where

$$\boldsymbol{\alpha} = \begin{pmatrix}\alpha_1\\ \alpha_2\\ \vdots\\ \alpha_K\end{pmatrix}, \quad B^{(1)} = \begin{pmatrix}\beta^{(1)}_{11} & \beta^{(1)}_{12} & \cdots & \beta^{(1)}_{1I}\\ \beta^{(1)}_{21} & \beta^{(1)}_{22} & \cdots & \beta^{(1)}_{2I}\\ \vdots & \vdots & & \vdots\\ \beta^{(1)}_{K1} & \beta^{(1)}_{K2} & \cdots & \beta^{(1)}_{KI}\end{pmatrix} \quad \text{and} \quad B^{(2)} = \begin{pmatrix}\beta^{(2)}_{11} & \beta^{(2)}_{12} & \cdots & \beta^{(2)}_{1J}\\ \beta^{(2)}_{21} & \beta^{(2)}_{22} & \cdots & \beta^{(2)}_{2J}\\ \vdots & \vdots & & \vdots\\ \beta^{(2)}_{K1} & \beta^{(2)}_{K2} & \cdots & \beta^{(2)}_{KJ}\end{pmatrix}.$$

Then, the logit model is described as

$$\Pr(\mathbf{Y} = \mathbf{y}|\mathbf{x}_1, \mathbf{x}_2) = \frac{\exp\left(\mathbf{y}^T B^{(1)}\mathbf{x}_1 + \mathbf{y}^T B^{(2)}\mathbf{x}_2 + \mathbf{y}^T\boldsymbol{\alpha}\right)}{\sum_{\mathbf{z}}\exp\left(\mathbf{z}^T B^{(1)}\mathbf{x}_1 + \mathbf{z}^T B^{(2)}\mathbf{x}_2 + \mathbf{z}^T\boldsymbol{\alpha}\right)},$$

where $\sum_{\mathbf{z}}$ implies the summation over all $\mathbf{z}$. In this model, we have

$$ECD(\mathbf{X}, \mathbf{Y}) = \frac{\mathrm{tr}\,B^{(1)}\mathrm{Cov}(\mathbf{X}_1, \mathbf{Y}) + \mathrm{tr}\,B^{(2)}\mathrm{Cov}(\mathbf{X}_2, \mathbf{Y})}{\mathrm{tr}\,B^{(1)}\mathrm{Cov}(\mathbf{X}_1, \mathbf{Y}) + \mathrm{tr}\,B^{(2)}\mathrm{Cov}(\mathbf{X}_2, \mathbf{Y}) + 1}.$$
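The trace terms in this expression can be computed directly once the factors and the response are coded as 0/1 dummy matrices; the following Python sketch is a minimal illustration (the function name and data layout are hypothetical, not fixed by the model).

```python
import numpy as np

def explained_entropy(B, X_dummy, Y_dummy):
    """tr(B Cov(X, Y)) for one factor block of the baseline-category logit model.
    B is K x I (coefficients), X_dummy is n x I and Y_dummy is n x K (0/1 indicators)."""
    Xc = X_dummy - X_dummy.mean(axis=0)
    Yc = Y_dummy - Y_dummy.mean(axis=0)
    cov_xy = Xc.T @ Yc / len(X_dummy)       # I x K matrix of Cov(X_i, Y_k)
    return float(np.trace(B @ cov_xy))

# ECD would then be (e1 + e2) / (e1 + e2 + 1), with
# e1 = explained_entropy(B1, X1_dummy, Y_dummy) and
# e2 = explained_entropy(B2, X2_dummy, Y_dummy).
```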

In the following example, the ECD approach is demonstrated.

Example 2.4. The data from an investigation of factors influencing the primary food choice of alligators (Table 2.5) are analyzed (Agresti, 2002, pp. 268-271). In this example, the explanatory variables are $X_1$: the lake where the alligators live, {1. Hancock, 2. Oklawaha, 3. Trafford, 4. George}, and $X_2$: the size of the alligators, {1. Small, 2. Large}; and the response variable is $Y$: the primary food choice of the alligators, {1. Fish, 2. Invertebrate, 3. Reptile, 4. Bird, 5. Other}. In this analysis, the generalized logit model described in this section is used, and we set $I = 4$, $J = 2$, and $K = 5$ in (2.5). From this model the following estimates of the regression coefficients are obtained (Agresti, 2002, pp. 268-271):

$$\hat{B}^{(1)} = \begin{pmatrix}-0.826 & -0.006 & -1.516 & 0\\ -2.485 & 0.931 & -0.394 & 0\\ 0.417 & 2.454 & 1.419 & 0\\ -0.131 & -0.659 & -0.429 & 0\\ 0 & 0 & 0 & 0\end{pmatrix} \quad \text{and} \quad \hat{B}^{(2)} = \begin{pmatrix}-0.332 & 0\\ 1.127 & 0\\ -0.683 & 0\\ -0.962 & 0\\ 0 & 0\end{pmatrix}.$$

By using the above estimates, we have

$$\mathrm{tr}\,\hat{B}^{(1)}\widehat{\mathrm{Cov}}(\mathbf{X}_1, \mathbf{Y}) = 0.258 \quad \text{and} \quad \mathrm{tr}\,\hat{B}^{(2)}\widehat{\mathrm{Cov}}(\mathbf{X}_2, \mathbf{Y}) = 0.107,$$

$$219\cdot\widehat{mPP}(\mathbf{Y}|X_1, X_2) = 219\times(0.258 + 0.107) = 79.935\ (df = 16,\ P = 0.000).$$

From this, the effect of $X_1$ and $X_2$ is significant. From (1.14), we can calculate the ECD as follows:

$$\widehat{ECD}(\mathbf{X}, \mathbf{Y}) = \frac{0.258 + 0.107}{0.258 + 0.107 + 1} = 0.267\ (\mathrm{SE} = 0.042).$$

Although the effects of the factors are statistically significant, the predictive power of the logit model may be small, i.e. only 26.7% of the variation of the response variable in entropy is explained by the explanatory variables. The effect of Lake on Food is about 2.4 times greater than that of Size.

Table 2.5. Alligator Food Choice Data

                                      Primary Food Choice
Lake       Size of Alligator     Fish   Invertebrate   Reptile   Bird   Other
Hancock    ≤ 2.3 m (S)             23        4            2        2      8
           > 2.3 m (L)              7        0            1        3      5
Oklawaha   S                        5       11            1        0      3
           L                       13        8            6        1      0
Trafford   S                        5       11            2        1      5
           L                        8        7            6        3      5
George     S                       16       19            1        2      3
           L                       17        1            0        1      3

2.4 Conclusion

In the GLM framework, regression models are described with random, systematic and link components, and GLMs are widely applied in data analyses; however, the explanatory powers of GLMs have not been measured in practical data analyses, except for the ordinary linear regression model. In this seminar, GLMs have been discussed from a viewpoint of entropy, ECD has been introduced for measuring the explanatory or predictive power of GLMs, and the utility of ECD has been demonstrated for practical data analyses.

Acknowledgement. The author would like to thank Prof. Claudio Borroni and all the members of the Department of Quantitative Methods for Economics and Business Sciences, the University of Milan (Università degli Studi di Milano), for giving me a valuable opportunity to present this seminar.

References

[1] Agresti, A. (1986). Applying R²-type measures to ordered categorical data. Technometrics, 28, 133-138.

[2] Agresti, A. (2002). Categorical Data Analysis, Second Edition. John Wiley & Sons, Inc.: New York.

[3] Ash, A. & Shwartz, M. (1999). R²: a useful measure of model performance when predicting a dichotomous outcome. Statistics in Medicine, 18, 375-384.

[4] Daniel, W. W. (1999). Biostatistics: A Foundation for Analysis in the Health Sciences, Seventh Edition. John Wiley & Sons, Inc.: New York.

[5] Efron, B. (1978). Regression and ANOVA with zero-one data: measures of residual variation. Journal of the American Statistical Association, 73, 113-121.

[6] Eshima, N. & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive power of generalized linear models. Statistics and Probability Letters, 77, 588-593.

[7] Eshima, N. & Tabata, M. (2010). Entropy coefficient of determination for generalized linear models. Computational Statistics and Data Analysis, 54, 1381-1389.

[8] Eshima, N. & Tabata, M. (2011). Three predictive power measures for generalized linear models: Entropy coefficient of determination, entropy correlation coefficient and regression correlation coefficient. Computational Statistics and Data Analysis, 55, 3049-3058.

[9] Goodman, L. A. (1971). The analysis of multinomial contingency tables: stepwise procedures and direct estimation methods for building models for multiple classifications. Technometrics, 13, 33-61.

[10] Haberman, S. J. (1982). Analysis of dispersion of multinomial responses. Journal of the American Statistical Association, 77, 568-580.

[11] Kent, J. T. (1983). Information gain and a general measure of correlation. Biometrika, 70, 163-173.

[12] Korn, E. L. and Simon, R. (1991). Explained residual variation, explained risk and goodness of fit. American Statistician, 45, 201-206.

[13] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd Edition. Chapman and Hall: London.

[14] Mittlböck, M. and Schemper, M. (1996). Explained variation for logistic regression. Statistics in Medicine, 15, 1987-1997.

[15] Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135, 370-384.

[16] Patnaik, P. B. (1949). The non-central χ²- and F-distributions and their applications. Biometrika, 36, 202-232.

[17] Theil, H. (1970). On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76, 103-154.

[18] Zheng, B. and Agresti, A. (2000). Summarizing the predictive power of a generalized linear model. Statistics in Medicine, 19, 1771-1781.