Research Article
A Unified Definition of Mutual Information with Applications in Machine Learning

Guoping Zeng

Elevate, 4150 International Plaza, Fort Worth, TX 76109, USA

Correspondence should be addressed to Guoping Zeng; guopingtx@yahoo.com

Received 21 December 2014; Revised 16 March 2015; Accepted 17 March 2015

Academic Editor: Zexuan Zhu

Copyright © 2015 Guoping Zeng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

There are various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles. However, there are some mathematical flaws in these definitions. For instance, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Class 2 definitions redefine marginal probabilities from the joint probabilities. In fact, the marginal probabilities are given from the ensembles and should not be redefined from the joint probabilities. Both Class 1 and Class 2 definitions assume a joint distribution exists. Yet, they all ignore an important fact: the joint distribution or the joint probability measure is not unique. In this paper, we first present a new unified definition of mutual information to cover all the various definitions and to fix their mathematical flaws. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.

1. Introduction

Mutual information has emerged in recent years as an important measure of statistical dependence. It has been used as a criterion for feature selection in engineering, especially in machine learning (see [1–3] and references therein).

Mutual information is a concept rooted in information theory. Its predecessor, called the rate of transmission, was first introduced by Shannon in 1948 in a classical paper [4] for the communication system. Shannon first introduced a concept called entropy for a single discrete chance variable. He then defined the joint entropy and conditional entropy for two discrete chance variables using the joint distribution. Finally, he defined the rate of transmission as the difference between the entropy and the conditional entropy. While Shannon did not define a chance variable in his paper, it is understood to be a synonym of a random variable.

Since Shannon's pioneering work [4], there have been various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.

Class 1 definitions of mutual information depend on the joint distribution of two random variables. More specifically, Kullback ([5], 1959) defined entropy, conditional entropy, and joint entropy using compact mathematical formulas. Pinsker ([6], 1960 and 1964) treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. His definition of mutual information was more general in that he implicitly assumed the two random variables had different probability spaces. Ash ([7], 1965) explicitly assumed the two random variables had the same probability space and followed Shannon's way to define mutual information. Cover and Thomas ([8], 2006) defined mutual information in a simple way by avoiding mentioning probability spaces.

Class 2 definitions depend on the joint probability measure on the joint sample space of two ensembles. Among such definitions, Fano ([9], 1961), Abramson ([10], 1963), and Gallager ([11], 1968) developed their definitions in a similar way. They first defined the entropy of an ensemble, the conditional entropy, and the joint entropy of two ensembles. Next, they defined the mutual information of a joint event. Noting that the mutual information of a joint event is a random variable,



they calculated the mean value of this random variable and called the result the mean information of two ensembles.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Moreover, except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume that a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.

In this paper, we first present a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different mutual information. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.

The rest of the paper is organized as follows. In Section 2, we briefly review the basic concepts in probability theory. In Section 3, we examine various definitions of mutual information. In Section 4, we first propose a new unified definition of mutual information and then establish some properties of the newly defined mutual information. In Section 5, we first propose a method to calculate mutual information in machine learning and then apply the newly defined mutual information to credit scoring. The paper is concluded in Section 6.

Throughout the paper, we will restrict our focus to mutual information for finite discrete random variables.

2. Basic Concepts in Probability Theory

Let us review some basic concepts of probability theory. They can be found in many books on probability theory, such as [12].

Definition 1. A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

(1) $\Omega$ is a set called a sample space; elements of $\Omega$ are denoted by $\omega$ and are called outcomes;

(2) $\mathcal{F}$ is a $\sigma$-field consisting of all subsets of $\Omega$; elements of $\mathcal{F}$ are called events;

(3) $P$ is called a probability measure; it is a mapping from $\mathcal{F}$ to $[0, 1]$ with $P(\Omega) = 1$ such that, if $A_1, A_2, \ldots$ are pairwise disjoint,
$$P\Bigl(\bigcup_i A_i\Bigr) = \sum_i P(A_i). \tag{1}$$

Definition 2. A discrete probability space is a probability space $(\Omega, \mathcal{F}, P)$ such that $\Omega$ is finite or countable, $\Omega = \{\omega_1, \omega_2, \ldots\}$. In this case, $\mathcal{F}$ is chosen to be all the subsets of $\Omega$, and the probability measure $P$ can be defined in terms of a series of nonnegative numbers $p_1, p_2, \ldots$ whose sum is 1. If $A$ is any subset of $\Omega$, then
$$P(A) = \sum_{\omega_i \in A} p_i. \tag{2}$$
In particular,
$$P(\{\omega_i\}) = p_i. \tag{3}$$

For simplicity, we will write $P(\{\omega\})$ as $P(\omega)$. From Definition 2, we see that, for a discrete probability space, the probability measure is characterized by the pointwise mapping $p: \{\omega_1, \omega_2, \ldots\} \to [0, 1]$ in (2). The probability of an event $A$ is computed simply by adding the probabilities of the individual points of $A$.

Definition 3. A random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $(-\infty, \infty)$ such that, for every Borel set $B$, $X^{-1}(B) = \{X \in B\} \in \mathcal{F}$. Here we use the notation $\{X \in B\} = \{\omega \in \Omega : X(\omega) \in B\}$.

Definition 4. If $X$ is a random variable, then, for every Borel subset $B$ of $\mathbb{R}$, we define a function by $\mu_X(B) = P(X \in B) = P(X^{-1}(B))$. Then $\mu_X$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_X)$ and is called the probability distribution of $X$.

Definition 5. A random variable $X$ is discrete if its range is finite or countable. In particular, any random variable on a discrete probability space is discrete since $\Omega$ is countable.

Definition 6. A (discrete) random variable $X$ on a discrete probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $\mathbb{R}$, where $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $\mathbb{R}$ is the set of real numbers. If the range of $X$ is $\{x_1, x_2, \ldots\}$, then the function $f_X: \{x_1, x_2, \ldots\} \to [0, 1]$ defined by
$$f_X(x_i) = P(X = x_i) \tag{4}$$
is called the probability mass function of $X$, whereas the probabilities $P(X = x_1), P(X = x_2), \ldots$ are called the probability distribution of $X$.

Note that, by Definition 2,
$$\sum_i P(X = x_i) = 1. \tag{5}$$
Thus, a discrete random variable may be characterized by its probability mass function.

3. Various Definitions of Mutual Information

Since Shannon's pioneering work [4], there have been various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.


3.1. Shannon's Original Definition

Definition 7. Let $x$ be a chance variable with probabilities $p_1, p_2, \ldots, p_n$ whose sum is 1. Then
$$H(x) = -\sum_{i=1}^{n} p_i \log p_i \tag{6}$$
is called the entropy of $x$.

Suppose two chance variables $x$ and $y$ have $m$ and $n$ possibilities, respectively. Let indices $i$ and $j$ range over all the $m$ possibilities and all the $n$ possibilities, respectively. Let $p(i)$ be the probability of $i$ and $p(i, j)$ the probability of the joint occurrence of $i$ and $j$. Denote the conditional probability of $i$ given $j$ by $p(i \mid j)$ and the conditional probability of $j$ given $i$ by $p(j \mid i)$.

Definition 8. The joint entropy of $x$ and $y$ is defined as
$$H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j). \tag{7}$$

Definition 9. The conditional entropy of $y$, $H_x(y)$, is defined as
$$H_x(y) = -\sum_{i,j} p(i, j) \log p(j \mid i) = -\sum_{i,j} p(i, j) \log \frac{p(i, j)}{\sum_j p(i, j)}. \tag{8}$$
The conditional entropy of $x$, $H_y(x)$, can be defined similarly.

Then the following relations hold:
$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x), \qquad H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \tag{9}$$

Definition 10. The rate of transmission of information, $R$, is defined as the difference between $H(x)$ and $H_y(x)$. Then $R$ can be written in two other forms:
$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \tag{10}$$

Remark 11. Shannon did not derive the explicit formula for $R$,
$$R = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\, p(j)}. \tag{11}$$
However, he did imply it in Appendix 7 of [4].

3.2. Class 1 Definitions

3.2.1. Kullback's Definition. Kullback [5] redefined entropy more mathematically in a standalone homework question as follows. Consider two discrete random variables $x$, $y$, where
$$p_{ij} = \operatorname{Prob}(x = x_i, y = y_j) > 0, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n,$$
$$p_{i\cdot} = \sum_{j=1}^{n} p_{ij}, \qquad p_{\cdot j} = \sum_{i=1}^{m} p_{ij}, \qquad \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = \sum_{i=1}^{m} p_{i\cdot} = \sum_{j=1}^{n} p_{\cdot j} = 1. \tag{12}$$
Define the joint entropy, entropy, and conditional entropy as follows:
$$\begin{aligned}
H(x, y) &= -\sum_i \sum_j p_{ij} \log p_{ij}, \\
H(x) &= -\sum_i p_{i\cdot} \log p_{i\cdot}, \qquad H(y) = -\sum_j p_{\cdot j} \log p_{\cdot j}, \\
H(y \mid x_i) &= -\sum_j \frac{p_{ij}}{p_{i\cdot}} \log \frac{p_{ij}}{p_{i\cdot}}, \\
H(y \mid x) &= \sum_i p_{i\cdot}\, H(y \mid x_i) = -\sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_{i\cdot}}.
\end{aligned} \tag{13}$$
Then $H(x, y) = H(x) + H(y \mid x) \le H(x) + H(y)$ and $H(y) \ge H(y \mid x)$.

3.2.2. Information Conveyed. Ash [7] began with two random variables $X$ and $Y$ and assumed $X$ and $Y$ had the same probability space. He systematically defined the entropy, conditional entropy, and joint entropy following Shannon's path in [4]. At the end, he denoted $H(X) - H(X \mid Y)$ by $I(X \mid Y)$ and called it the information conveyed about $X$ by $Y$.

3.2.3. Information of One Variable with respect to the Other. Pinsker [6] treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. Suppose $\xi$ is a random variable defined on a probability space $(\Omega, S_\omega, P_\xi)$ and taking values in a measurable space $(X, S_x)$, and $\eta$ is a random variable defined on a probability space $(\Psi, S_\psi, P_\eta)$ and taking values in a measurable space $(Y, S_y)$. Then the pair $\xi, \eta$ of random variables may be regarded as a single random variable $(\xi, \eta)$ with values in the product space $X \times Y$ of all pairs $(x, y)$ with $x \in X$, $y \in Y$. The distribution $P_{(\xi\eta)}(\cdot) = P_{\xi\eta}(\cdot)$ of $(\xi, \eta)$ is called the joint distribution of the random variables $\xi$ and $\eta$. By the product of the distributions $P_\xi(\cdot)$ and $P_\eta(\cdot)$, denoted by $P_{\xi\times\eta}(\cdot)$, we mean the distribution defined on $S_x \times S_y$ by
$$P_{\xi\times\eta}(E \times F) = P_\xi(E) \times P_\eta(F) \tag{14}$$
for $E \in S_x$ and $F \in S_y$. If the joint distribution $P_{\xi\eta}(\cdot)$ coincides with the product distribution $P_{\xi\times\eta}(\cdot)$, the random variables $\xi$ and $\eta$ are said to be independent. If $\xi$ and $\eta$ are discrete random variables, say $X$ and $Y$ contain countably many points $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, then
$$I(\xi, \eta) = \sum_{i,j} P_{\xi\eta}(x_i, y_j) \log \frac{P_{\xi\eta}(x_i, y_j)}{P_\xi(x_i)\, P_\eta(y_j)}. \tag{15}$$
$I$ is called the information of $\xi$ and $\eta$ with respect to the other.

3.2.4. A Modern Definition in Information Theory. Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$. Let $Y$ be a discrete random variable with alphabet $\mathcal{Y}$ and probability mass function $p(y) = \Pr\{Y = y\}$, $y \in \mathcal{Y}$. Suppose $X$ and $Y$ have a joint mass function (joint distribution) $p(x, y)$. Then the mutual information $I(X, Y)$ can be defined as
$$I(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{16}$$
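To make the discrete formula concrete, the following is a minimal sketch (not from the original paper) of how (16) can be evaluated numerically. It assumes NumPy, takes the joint pmf as a 2-D array, uses the convention $0 \log 0 = 0$, and leaves the logarithm base as a free choice.

```python
import numpy as np

def mutual_information(joint, base=2.0):
    """Mutual information I(X, Y) from a joint pmf given as a 2-D array.

    Rows index the values of X, columns the values of Y; entries must be
    nonnegative and sum to 1.  Terms with p(x, y) = 0 contribute 0.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)    # marginal p(y)
    nz = joint > 0
    ratio = joint[nz] / (px @ py)[nz]        # p(x, y) / (p(x) p(y))
    return float(np.sum(joint[nz] * np.log(ratio)) / np.log(base))

# Example: a joint pmf for binary X and Y
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
print(mutual_information(p_xy))             # I(X, Y) in bits
```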

3.3. Class 2 Definitions. In Class 2 definitions, random variables are replaced by ensembles, and mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results from Fano [9] and Abramson [10].

Suppose that discrete ensemble $X$ has a sample space $\{a_1, a_2, \ldots, a_K\}$ and discrete ensemble $Y$ has a sample space $\{b_1, b_2, \ldots, b_L\}$. Consider the joint sample space $\{(a_k, b_j)\}$, $1 \le k \le K$, $1 \le j \le L$. A probability measure on the joint sample space is given by the joint probability $P_{XY}(a_k, b_j)$, defined for $1 \le k \le K$, $1 \le j \le L$. The combination of a joint sample space and a probability measure for outcomes $x$ and $y$ is called a joint $XY$ ensemble. Then the marginal probabilities can be found as
$$P_X(a_k) = \sum_{j=1}^{L} P_{XY}(a_k, b_j), \quad k = 1, 2, \ldots, K. \tag{17}$$
In more abbreviated notation, this is written as
$$P(x) = \sum_{y} P(x, y). \tag{18}$$
Likewise,
$$P_Y(b_j) = \sum_{k=1}^{K} P_{XY}(a_k, b_j), \quad j = 1, 2, \ldots, L. \tag{19}$$
In more abbreviated notation, this is written as
$$P(y) = \sum_{x} P(x, y). \tag{20}$$
If $P_X(a_k) > 0$, the conditional probability that the outcome of $y$ is $b_j$, given that the outcome of $x$ is $a_k$, is defined as
$$P_{Y|X}(b_j \mid a_k) = \frac{P_{XY}(a_k, b_j)}{P_X(a_k)}. \tag{21}$$
The mutual information between the events $x = a_k$ and $y = b_j$ is defined as
$$I_{XY}(a_k, b_j) = \log \frac{P_{X|Y}(a_k \mid b_j)}{P_X(a_k)} = \log \frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)} = \log \frac{P_{Y|X}(b_j \mid a_k)}{P_Y(b_j)} = I_{YX}(b_j, a_k). \tag{22}$$
Since the mutual information defined above is a random variable on the joint $XY$ ensemble, its mean value, which is called the average mutual information and is denoted by $I(X, Y)$, is given by
$$I(X, Y) = \sum_{k=1}^{K} \sum_{j=1}^{L} P_{XY}(a_k, b_j) \log \frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)}. \tag{23}$$

Remark 12. By means of an information channel consisting of a transmitter of alphabet $A$ with elements $a_i$ and total elements $t$, and a receiver of alphabet $B$ with elements $b_j$ and total elements $r$, Abramson [10] denoted $H(A) - H(A \mid B) = \sum_{A,B} P(a, b) \log \bigl(P(a, b)/(P(a)\, P(b))\bigr)$ by $I(A, B)$ and called it the mutual information of $A$ and $B$.

The mutual information $I(X, Y)$ between two continuous random variables $X$ and $Y$ [8] (also called the rate of transmission in [1]) is defined as
$$I(X, Y) = \iint P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)} \, dx \, dy, \tag{24}$$
where $P(x, y)$ is the joint probability density function of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal density functions associated with $X$ and $Y$, respectively. The mutual information between two continuous random variables is also called the differential mutual information.

However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.

4. A New Unified Definition of Mutual Information

In Section 3, we reviewed various definitions of mutual information. Shannon's original definition laid the foundation


of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume that a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.

4.1. Unified Definition of Mutual Information. Let $X$ be a finite discrete random variable on a discrete probability space $(\Omega_1, \mathcal{F}_1, P_1)$ with $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\}$ and range $\{x_1, x_2, \ldots, x_K\}$ with $K \le n$. Let $Y$ be a discrete random variable on a probability space $(\Omega_2, \mathcal{F}_2, P_2)$ with $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\}$ and range $\{y_1, y_2, \ldots, y_L\}$ with $L \le m$.

If $X$ and $Y$ have the same probability space $(\Omega, \mathcal{F}, P)$, then the joint distribution is simply
$$P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x, Y(\omega) = y\}). \tag{25}$$
However, when $X$ and $Y$ have different probability spaces, and so different probability measures, the joint distribution is more complicated.

Definition 13. The joint sample space of random variables $X$ and $Y$ is defined as the product $\Omega_1 \times \Omega_2$ of all pairs $(\omega_i, \rho_j)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is defined as the product of all pairs $(A_1, A_2)$, where $A_1$ and $A_2$ are elements of $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. A joint probability measure $P_{XY}$ of $P_1$ and $P_2$ is a probability measure $P_{XY}(A \times B)$ on $\mathcal{F}_1 \times \mathcal{F}_2$ such that, for any $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$,
$$P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}), \qquad P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B). \tag{26}$$
$(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})$ is called the joint probability space of $X$ and $Y$, and $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$, the joint distribution of $X$ and $Y$.

Combining Definitions 2 and 13, we immediately obtain the following result.

Proposition 14. A sequence of nonnegative numbers $p_{ij}$, $1 \le i \le K$, $1 \le j \le L$, whose sum is 1 can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$: $P_{XY}(\omega_i, \rho_j) = p_{ij}$. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. If, in addition, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$ the following hold:
$$\sum_{j=1}^{L} p_{ij} = P_X(\omega_i), \qquad \sum_{i=1}^{K} p_{ij} = P_Y(\rho_j), \tag{27}$$
then $P_{XY}(\omega_i, \rho_j) = p_{ij}$ is a joint distribution of $X$ and $Y$.

For convenience, from now on we will shorten $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ to $P_{XY}(x_i, y_j)$.

This two-dimensional measure should not be confused with the one-dimensional joint distribution when $X$ and $Y$ have the same probability space.

Remark 15. If $(\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2)$, instead of using the two-dimensional measure $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ we may use the one-dimensional measure $P_1(\{X = x_i\} \cap \{Y = y_j\})$. Then (26) always holds. In this sense, our new definition of joint distribution reduces to the definition of joint distribution with the same probability space.

Definition 16. The conditional probability of $Y = y_j$ given $X = x_i$ is defined as
$$P_{Y|X}(Y = y_j \mid X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \tag{28}$$

Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure, or simply the product distribution.

Proof. Let random variables $X$ and $Y$ be defined as before. Define a function from $\Omega_1 \times \Omega_2$ to $[0, 1]$ as follows:
$$P_{XY}(\omega_i, \rho_j) = P_1(\omega_i)\, P_2(\rho_j). \tag{29}$$
Then
$$\sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(\omega_i)\, P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i) \sum_{j=1}^{m} P_2(\rho_j) = 1. \tag{30}$$
Hence $P_{XY}$ can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ by Definition 2. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. Moreover, for any $A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1$ of $s$ elements,
$$\begin{aligned}
P_{XY}(A \times \Omega_2) &= \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\omega_{i_u}, \rho_j) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u})\, P_2(\rho_j) \\
&= \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u}) = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A).
\end{aligned} \tag{31}$$
Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \subseteq \Omega_2$. Hence $P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i)\, P_2(Y = y_j)$ is a joint probability measure of $X$ and $Y$ by Definition 13.

Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $P_{XY}(\cdot)$ if $P_{XY}(\cdot)$ coincides with the product distribution $P_{X\times Y}(\cdot)$.

Definition 19. The joint entropy $H(X, Y)$ is defined as
$$H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \tag{32}$$

Definition 20. The conditional entropy $H(Y \mid X)$ is defined as
$$H(Y \mid X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \tag{33}$$

Definition 21. The mutual information $I(X, Y)$ between $X$ and $Y$ is defined as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)}. \tag{34}$$

As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, $I(X, Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be 0; this corresponds to the limit of $x \log x$ as $x$ goes to 0.

It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X: \Omega_1 \to \mathbb{R}$ and $Y: \Omega_2 \to \mathbb{R}$ as one-to-one mappings:
$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K, \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \tag{35}$$
Then we have
$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \tag{36}$$

It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used to do feature selection, as seen later. In addition, our new definition can lead to different values for different joint distributions, as demonstrated in the following example.

Example 22. Assume random variables $X$ and $Y$ have the following probability distributions:
$$P_1(Y = 0) = \frac{1}{3}, \qquad P_1(Y = 1) = \frac{2}{3}, \qquad P_2(X = 1) = \frac{1}{3}, \qquad P_2(X = 2) = \frac{1}{3}, \qquad P_2(X = 3) = \frac{1}{3}. \tag{37}$$
We can generate four different joint probability distributions, which can lead to different values of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find mutual information:

(1) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 1/3$, $P(2, 1) = 0$, $P(3, 0) = 0$, $P(3, 1) = 1/3$;

(2) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 1/3$, $P(3, 1) = 0$;

(3) $P(1, 0) = 1/3$, $P(1, 1) = 0$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 0$, $P(3, 1) = 1/3$;

(4) $P(1, 0) = 1/9$, $P(1, 1) = 2/9$, $P(2, 0) = 1/9$, $P(2, 1) = 2/9$, $P(3, 0) = 1/9$, $P(3, 1) = 2/9$.
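As an illustration (not part of the original text), the candidate joint distributions of Example 22 can be checked numerically; the sketch below assumes NumPy, hard-codes the four pmfs with rows $X = 1, 2, 3$ and columns $Y = 0, 1$, and prints the mutual information of each in bits. Distribution (4) is the product distribution, so its mutual information is 0.

```python
import numpy as np

def mi(joint):
    """Mutual information (in bits) of a joint pmf given as a 2-D array."""
    j = np.asarray(joint, dtype=float)
    px, py = j.sum(axis=1, keepdims=True), j.sum(axis=0, keepdims=True)
    nz = j > 0
    return float(np.sum(j[nz] * np.log2(j[nz] / (px @ py)[nz])))

# All four tables share the marginals P2(X = i) = 1/3 and P1(Y = 0) = 1/3, P1(Y = 1) = 2/3.
candidates = {
    "(1)": [[0, 1/3], [1/3, 0], [0, 1/3]],
    "(2)": [[0, 1/3], [0, 1/3], [1/3, 0]],
    "(3)": [[1/3, 0], [0, 1/3], [0, 1/3]],
    "(4)": [[1/9, 2/9], [1/9, 2/9], [1/9, 2/9]],
}
for name, table in candidates.items():
    print(name, round(mi(table), 4))
```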

4.2. Properties of the Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].

Definition 23. The relative entropy, or Kullback-Leibler distance, between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as
$$D(P, Q) = \sum_i p_i \log \frac{p_i}{q_i}. \tag{38}$$

Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P, Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.

Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
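A small sketch (illustrative only, assuming NumPy) of Definition 23 and Remark 25: computing $D(P, Q)$ in both directions shows the asymmetry, and $D(P, P) = 0$ as Lemma 24 requires.

```python
import numpy as np

def kl_distance(p, q):
    """Kullback-Leibler distance D(P, Q) = sum_i p_i log(p_i / q_i), in bits.

    Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_distance(p, q), kl_distance(q, p))  # the two directions differ
print(kl_distance(p, p))                     # 0
```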


The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Proof. Using a mapping from two-dimensional indices to a one-dimensional index,
$$(i, j) \longrightarrow (i - 1) L + j \triangleq n, \quad \text{for } i = 1, \ldots, K, \; j = 1, 2, \ldots, L, \tag{39}$$
and using another mapping from the one-dimensional index back to two-dimensional indices,
$$i = \Bigl\lceil \frac{n}{L} \Bigr\rceil, \qquad j = n - (i - 1) L, \quad \text{for } n = 1, 2, \ldots, KL, \tag{40}$$
we rewrite $I(X, Y)$ as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr) \log \frac{P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L\rceil}\bigr)\, P_2\bigl(Y = y_{n-(\lceil n/L\rceil-1)L}\bigr)}. \tag{41}$$
Since
$$\sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1, \qquad \sum_{n=1}^{KL} P_1\bigl(X = x_{\lceil n/L\rceil}\bigr)\, P_2\bigl(Y = y_{n-(\lceil n/L\rceil-1)L}\bigr) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \tag{42}$$
both reindexed families are probability distributions over $n = 1, 2, \ldots, KL$, and we obtain
$$I(X, Y) = \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr) \log \frac{P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L\rceil}\bigr)\, P_2\bigl(Y = y_{n-(\lceil n/L\rceil-1)L}\bigr)}, \tag{43}$$
which, by Definition 23, is exactly the Kullback-Leibler distance between the joint distribution and the product distribution.

Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies
$$I(X, Y) \ge 0, \tag{44}$$
with equality if and only if $X$ and $Y$ are independent.

Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, $I(X, Y) \ge 0$, with equality if and only if $P_{XY}(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}) = P_1(X = x_{\lceil n/L\rceil})\, P_2(Y = y_{n-(\lceil n/L\rceil-1)L})$ for $n = 1, 2, \ldots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\, P_2(Y = y_j)$ for $i = 1, \ldots, K$ and $j = 1, 2, \ldots, L$, or $X$ and $Y$ are independent.

Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,
$$I(X, Y) = 0. \tag{45}$$

Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \ldots, L$,
$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\, P_2(Y = y_j). \tag{46}$$
Thus, $X$ and $Y$ are independent. By Property 2, $I(X, Y) = 0$.

Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then
$$0 \le H(X) \le \log K, \tag{47}$$
with equality on the right if and only if the $K$ values are equally probable.

Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X). \tag{48}$$

Proof. Consider
$$\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(y_j \mid x_i)}{P_2(Y = y_j)} \\
&= -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) + \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(y_j \mid x_i) \\
&= -\sum_{j=1}^{L} \Bigl( \sum_{i=1}^{K} P_{XY}(x_i, y_j) \Bigr) \log P_2(Y = y_j) - \Bigl( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(y_j \mid x_i) \Bigr) \\
&= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - \Bigl( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(y_j \mid x_i) \Bigr) \\
&= H(Y) - H(Y \mid X).
\end{aligned} \tag{49}$$
Since (34) is symmetric in $X$ and $Y$, the same argument gives $I(X, Y) = H(X) - H(X \mid Y) = I(Y, X)$.

Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following property.

Property 4. Let $X$ and $Y$ be two discrete random variables with $K$ and $L$ values, respectively. Then
$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \tag{50}$$
Moreover, $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.

5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set $S$ contains $N$ examples (also known as samples, observations, or records) from an input space $X = \{X_1, X_2, \ldots, X_M\}$ and their associated output values $y$ from an output space $Y$ (i.e., the dependent variable). Here $X_1, X_2, \ldots, X_M$ are called features, that is, independent variables. Hence $S$ can be expressed as
$$S = \{x_{i1}, x_{i2}, \ldots, x_{iM}, y_i\}, \quad i = 1, 2, \ldots, N, \tag{51}$$
where feature $X_j$ has values $x_{1j}, x_{2j}, \ldots, x_{Nj}$ for $j = 1, 2, \ldots, M$.

A fundamental objective in machine learning is to find a functional relationship between the input $X$ and the output $Y$. In general, there are a very large number of features, many of which are not needed. Sometimes the output $Y$ is not determined by the complete set of the input features $X_1, X_2, \ldots, X_M$; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature $X_i$ and the output $Y$.

(2) Select the best set of attributes according to some criterion.

5.1. Calculation of the Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature $X$ and the output $Y$. Assume $X$ has $K$ different values $\omega_1, \omega_2, \ldots, \omega_K$. If $X$ has missing values, we will use $\omega_1$ to represent all the missing values. Assume $Y$ has $L$ different values $\rho_1, \rho_2, \ldots, \rho_L$.

Let us build a two-way frequency, or contingency, table by making $X$ the row variable and $Y$ the column variable, as in [8] (see Table 1). Let $O_{ij}$ be the frequency (which could be 0) of $(\omega_i, \rho_j)$ for $i = 1$ to $K$ and $j = 1$ to $L$. Let the row and column marginal totals be $n_{i\cdot}$ and $n_{\cdot j}$, respectively. Then
$$n_{i\cdot} = \sum_j O_{ij}, \qquad n_{\cdot j} = \sum_i O_{ij}, \qquad N = \sum_i \sum_j O_{ij} = \sum_i n_{i\cdot} = \sum_j n_{\cdot j}. \tag{52}$$
Let us denote the relative frequency $O_{ij}/N$ by $p_{ij}$. We then have the two-way relative frequency table; see Table 2. Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \tag{53}$$
$\{p_{i\cdot}\}_{i=1}^{K}$, $\{p_{\cdot j}\}_{j=1}^{L}$, and $\{p_{ij}\}$ can each serve as a probability measure.

Now we can define random variables for $X$ and $Y$ as follows. For convenience, we will use the same names $X$ and $Y$ for the random variables:
$$X: (\Omega_1, \mathcal{F}_1, P_X) \longrightarrow \mathbb{R} \tag{54}$$
with $X(\omega_i) = x_i$, where $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_K\}$ and $P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot}$ for $i = 1, 2, \ldots, K$. Note that $x_1, x_2, \ldots, x_K$ could be any real numbers as long as they are distinct, to guarantee that $X$ is a one-to-one mapping. In this case, $P_X(X = x_i) = P_X(\omega_i)$.


Table 1: Frequency table.

          ρ_1     ρ_2     ···     ρ_j     ···     ρ_L     Total
ω_1       O_11    O_12    ···     O_1j    ···     O_1L    n_1·
ω_2       O_21    O_22    ···     O_2j    ···     O_2L    n_2·
···       ···     ···     ···     ···     ···     ···     ···
ω_i       O_i1    O_i2    ···     O_ij    ···     O_iL    n_i·
···       ···     ···     ···     ···     ···     ···     ···
ω_K       O_K1    O_K2    ···     O_Kj    ···     O_KL    n_K·
Total     n_·1    n_·2    ···     n_·j    ···     n_·L    N

Table 2: Relative frequency table.

          ρ_1     ρ_2     ···     ρ_j     ···     ρ_L     Total
ω_1       p_11    p_12    ···     p_1j    ···     p_1L    p_1·
ω_2       p_21    p_22    ···     p_2j    ···     p_2L    p_2·
···       ···     ···     ···     ···     ···     ···     ···
ω_i       p_i1    p_i2    ···     p_ij    ···     p_iL    p_i·
···       ···     ···     ···     ···     ···     ···     ···
ω_K       p_K1    p_K2    ···     p_Kj    ···     p_KL    p_K·
Total     p_·1    p_·2    ···     p_·j    ···     p_·L    1

Similarly,
$$Y: (\Omega_2, \mathcal{F}_2, P_Y) \longrightarrow \mathbb{R} \tag{55}$$
with $Y(\rho_j) = y_j$, where $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_L\}$ and $P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j}$ for $j = 1, 2, \ldots, L$. Also, $y_1, y_2, \ldots, y_L$ could be any real numbers as long as they are distinct, to guarantee that $Y$ is a one-to-one mapping. In this case, $P_Y(Y = y_j) = P_Y(\rho_j)$.

Now define a mapping $P_{XY}$ from $\Omega_1 \times \Omega_2$ to $\mathbb{R}$ as follows:
$$P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \tag{56}$$
Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \tag{57}$$
$\{p_{ij}\}$ is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\, P_Y(\rho_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}. \tag{58}$$
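The calculation in (52)–(58) can be packaged as a short routine. The sketch below is illustrative rather than the paper's implementation; it assumes NumPy and pandas, builds the contingency table with pandas.crosstab, and keeps missing feature values as their own category, as suggested above.

```python
import numpy as np
import pandas as pd

def mutual_information_from_data(x, y, base=2.0):
    """Mutual information between a feature x and a target y, both given as
    1-D arrays of observed values, computed from the contingency table as in (58)."""
    x = pd.Series(x).fillna("MISSING")            # missing values become one category
    table = pd.crosstab(x, pd.Series(y))          # O_ij, the K x L frequency table
    p = table.to_numpy(dtype=float) / table.to_numpy().sum()   # p_ij = O_ij / N
    p_row = p.sum(axis=1, keepdims=True)          # p_i.
    p_col = p.sum(axis=0, keepdims=True)          # p_.j
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (p_row @ p_col)[nz])) / np.log(base))

# Toy example: a categorical feature and a binary good/bad outcome
x = ["A", "A", "B", "B", "B", "C", None, "C"]
y = [ 1,   0,   1,   1,   0,   0,   1,    0 ]
print(mutual_information_from_data(x, y))
```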

It follows from Corollary 26 that, if $X$ has only one value, then $I(X, Y) = 0$. On the other hand, if $X$ has all distinct values, the following result shows that the mutual information will reach its maximum value.

Proposition 28. If all the values of $X$ are distinct, then $I(X, Y) = H(Y)$.

Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations; that is, $K = N$. From Tables 1 and 2, we observe that

(1) $O_{ij} = 0$ or 1 for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;

(2) $p_{ij} = O_{ij}/N = 0$ or $1/N$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;

(3) for each $j = 1, 2, \ldots, L$, since $O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}$, there are $n_{\cdot j}$ nonzero $O_{ij}$'s, or equivalently $n_{\cdot j}$ nonzero $p_{ij}$'s;

(4) $p_{i\cdot} = 1/N$, $i = 1, 2, \ldots, K$.

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119901119894119895log

119901119894119895

119901119894sdot119901sdot119895

=

119870

sum119894=1

1199011198941log

1199011198941

119901119894sdot119901sdot1

+

119870

sum119894=1

1199011198942log

1199011198942

119901119894sdot119901sdot2

+ sdot sdot sdot +

119870

sum119894=1

119901119894119871log

119901119894119871

119901119894sdot119901sdot119871

10 Mathematical Problems in Engineering

= sum1199011198941 =0

1

119873log 1119873

119901sdot1119873

+ sum1199011198942 =0

1

119873log 1119873

119901sdot2119873

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1119873

119901sdot119871119873

= sum1199011198941 =0

1

119873log 1

119901sdot1

+ sum1199011198942 =0

1

119873log 1

119901sdot2

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1

119901sdot119871

=119899sdot1

119873log 1

119901sdot1

+119899sdot2

119873log 1

119901sdot2

+ sdot sdot sdot +119899sdot119871

119873log 1

119901sdot119871

= 119901sdot1log 1

119901sdot1

+ 119901sdot2log 1

119901sdot2

+ sdot sdot sdot + 119901sdot119871log 1

119901sdot119871

= 119867 (119884)

(59)

5.2. Applications of the Newly Defined Mutual Information in Credit Scoring. Credit scoring is used to describe the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable, good and bad, and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair $(X, Y)$ and then do feature selection based on the values of mutual information. We propose three ways.

5.2.1. Absolute Values Method. From Property 4, we see that mutual information $I(X, Y)$ is nonnegative and upper bounded by $\log L$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with $Y$ is larger than some threshold chosen based on needs.

5.2.2. Relative Values. From Property 4, we have $0 \le I(X, Y)/H(Y) \le 1$. Note that $I(X, Y)/H(Y)$ is the relative mutual information, which measures how much information $X$ captures from $Y$. Thus, we may select those features whose relative mutual information $I(X, Y)/H(Y)$ is larger than some threshold between 0 and 1, chosen based on needs.

5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function. It can be proved by the Taylor expansion, as in Kullback's book [5].

Lemma 29. Let $p$ and $q$ be two positive numbers less than or equal to 1. Then
$$p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \tag{60}$$
The equality holds if and only if $p = q$. Moreover, the closer $p$ is to $q$, the better the approximation is.
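For completeness, here is a short sketch of the Taylor argument behind (60). Writing $\delta = p - q$ and using $\ln(1 + t) \approx t - t^2/2$ for small $t$,
$$p \ln \frac{p}{q} = (q + \delta) \ln\Bigl(1 + \frac{\delta}{q}\Bigr) \approx (q + \delta)\Bigl(\frac{\delta}{q} - \frac{\delta^2}{2q^2}\Bigr) = \delta + \frac{\delta^2}{2q} - \frac{\delta^3}{2q^2} \approx (p - q) + \frac{(p - q)^2}{2q}.$$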

Now let us denote $N \times I(X, Y)$ by $\hat{I}(X, Y)$. Then, applying Lemma 29, we obtain
$$\begin{aligned}
2\hat{I}(X, Y) &= 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N} \\
&\approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \Bigl( O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N} \Bigr) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - 2\, \frac{\sum_i n_{i\cdot} \sum_j n_{\cdot j}}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2.
\end{aligned} \tag{61}$$

The last equality in (61) says that this expression is exactly the Pearson chi-square statistic; according to [5], it follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. Hence 2N × I(X, Y) approximately follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. This is the well-known chi-square test for independence of two random variables. It allows us to use the chi-square distribution to assign a significance level corresponding to the value of mutual information and (K − 1)(L − 1).
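A quick numerical check of this approximation on a hypothetical contingency table (the counts are made up): the statistic 2N × I(X, Y) computed with natural logarithms should be close to the Pearson chi-square statistic whenever the observed counts stay near the expected ones.

```python
import numpy as np

O = np.array([[30, 10],
              [20, 20],
              [10, 30]], dtype=float)              # hypothetical contingency table
N = O.sum()
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N     # expected counts n_i. n_.j / N

mask = O > 0
g_stat = 2.0 * (O[mask] * np.log(O[mask] / E[mask])).sum()   # 2N * I(X, Y), natural log
chi2_stat = ((O - E) ** 2 / E).sum()                          # Pearson chi-square statistic

print(g_stat, chi2_stat)    # the two values should be close to each other
```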

The null and alternative hypotheses are as follows:

H0: X and Y are independent (i.e., there is no relationship between them).

H1: X and Y are dependent (i.e., there is a relationship between them).


The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic
$$ \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^{2}}{n_{i\cdot}n_{\cdot j}/N} \approx 2N \times I(X,Y) \tag{62} $$
is greater than χ²_U, the upper-tail critical value from a chi-square distribution with (K − 1)(L − 1) degrees of freedom. That is,
$$ \text{Select feature } X \text{ if } I(X,Y) > \frac{\chi^{2}_{U}}{2N}. \tag{63} $$

Take credit scoring, for example. In this case L = 2. Assume feature X has 10 different values, that is, K = 10. Using a level of significance of α = 0.05, we find χ²_U to be 16.9 from a chi-square table with (K − 1)(L − 1) = 9 degrees of freedom, and select this feature only if I(X, Y) > 16.9/(2N).
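The tabulated value 16.9 can also be reproduced with a chi-square quantile function instead of a printed table; for instance, a sketch using SciPy (assuming it is available):

```python
from scipy.stats import chi2

alpha = 0.05
df = (10 - 1) * (2 - 1)                  # (K - 1)(L - 1) with K = 10, L = 2
chi2_upper = chi2.ppf(1 - alpha, df)     # upper-tail critical value, about 16.92
print(chi2_upper)
```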

Assume a training set has N examples. We can do feature selection by the following procedure (a code sketch follows the list).

(i) Step 1. Choose a level of significance α, say 0.05.

(ii) Step 2. Find K, the number of values of feature X.

(iii) Step 3. Build the contingency table for X and Y.

(iv) Step 4. Calculate I(X, Y) from the contingency table.

(v) Step 5. Find χ²_U with (K − 1)(L − 1) degrees of freedom from a chi-square table or any other source such as SAS.

(vi) Step 6. Select X if I(X, Y) > χ²_U/(2N) and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.
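A possible end-to-end sketch of Steps 1–7 (the function and argument names are hypothetical, and natural logarithms are used, as the chi-square approximation requires):

```python
import numpy as np
from scipy.stats import chi2

def mi_nats(table):
    """I(X, Y) in nats from a contingency table of counts."""
    p = table / table.sum()
    outer = np.outer(p.sum(axis=1), p.sum(axis=0))
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / outer[mask])).sum())

def select_features(features, y, alpha=0.05):
    """Steps 1-7: keep the features whose I(X, Y) exceeds chi2_U / (2N).

    features: dict mapping a feature name to a 1-D array of its values.
    y: 1-D array of the dependent variable (e.g., good/bad).
    """
    N = len(y)
    y_values = np.unique(y)
    L = len(y_values)
    selected = []
    for name, x in features.items():                          # Step 7: every feature
        x_values = np.unique(x)                               # Step 2: the K values of X
        K = len(x_values)
        if K < 2:
            continue        # a constant feature carries no information (Corollary 26)
        table = np.array([[np.sum((x == xv) & (y == yv))      # Step 3: contingency table
                           for yv in y_values] for xv in x_values], dtype=float)
        mi = mi_nats(table)                                   # Step 4: I(X, Y)
        chi2_upper = chi2.ppf(1 - alpha, (K - 1) * (L - 1))   # Step 5: critical value
        if mi > chi2_upper / (2.0 * N):                       # Step 6: select or discard
            selected.append(name)
    return selected
```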

If the number of features selected by the above procedure is smaller or larger than desired, you may adjust the level of significance α and reselect features using the procedure.

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X, Y), the more dependent X is on Y. However, Proposition 28 says that if X has all distinct values, then I(X, Y) reaches the maximum value H(Y) and I(X, Y)/H(Y) reaches the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first; mutual information is then recalculated from the binned values. For numerical variables we may adopt a three-step process (a code sketch follows the list).

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Bin the remaining numerical features.

(iii) Step 3. Select features by mutual information.
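A possible sketch of Step 2 for a numerical feature, using simple quantile bins (the bin count and the data are illustrative; any binning algorithm could be substituted):

```python
import numpy as np

def quantile_bin(x, n_bins=10):
    """Group a numerical feature into at most n_bins quantile-based bins."""
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    return np.digitize(x, edges)      # the bin index plays the role of a grouped value

# Replace the raw values of a surviving numerical feature by their bin labels,
# then recompute I(X, Y) on the binned feature exactly as before (Step 3).
x = np.random.default_rng(0).normal(size=1000)    # hypothetical numerical feature
x_binned = quantile_bin(x, n_bins=10)
```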

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features whose p values are less than some specified level. However, this method does not carry over to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into a bin. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is computed as the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.
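For comparison, a sketch of how information value is usually computed from a binned feature (per-bin counts of good and bad accounts), following the usual weight-of-evidence definition in the credit-scoring literature [15–19]; the counts and the smoothing constant are assumptions for illustration, the latter only to avoid division by zero on empty bins.

```python
import numpy as np

def information_value(good, bad, eps=0.5):
    """Information value from per-bin counts of good and bad accounts."""
    good = np.asarray(good, dtype=float) + eps   # small smoothing for empty bins (assumption)
    bad = np.asarray(bad, dtype=float) + eps
    pct_good = good / good.sum()
    pct_bad = bad / bad.sum()
    woe = np.log(pct_good / pct_bad)             # weight of evidence per bin
    return float(((pct_good - pct_bad) * woe).sum())

print(information_value(good=[40, 35, 25], bad=[10, 20, 30]))   # illustrative bins
```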

6. Conclusions

In this paper we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.


References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[3] A. Navot, On the role of feature selection in machine learning [Ph.D. thesis], Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, 1964, Holden-Day, San Francisco, USA).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA, and John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.
[20] K. Kennedy, Credit scoring using machine learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.


To apply mutual information to credit scoring we firstcalculatemutual information for every pair of (119883 119884) and thendo feature selection based on values of mutual informationWe propose three ways

521 Absolute Values Method From Property 4 we seethat mutual information 119868(119883 119884) is nonnegative and upperbounded by log(119871) and that 119868(119883 119884) = 0 if and only if 119883 and119884 are independent In this sense high mutual informationindicates a large reduction in uncertainty while low mutualinformation indicates a small reduction In particular zeromutual information means the two random variables areindependent Hence we may select those features whosemutual information with 119884 is larger than some thresholdbased on needs

522 Relative Values From Property 4 we have 0 le

119868(119883 119884)119867(119884) le 1 Note that 119868(119883 119884)119867(119884) is relativemutual information which measures howmuch information119883 catches from 119884 Thus we may select those features whoserelativemutual information 119868(119883 119884)119867(119884) is larger than somethreshold between 0 and 1 based on needs

523 Chi-Square Test for Independency For convenience wewill use the natural logarithm in mutual information Wefirst state an approximation formula for the natural logarithmfunction It can be proved by the Taylor expansion like inKullbackrsquos book [5]

Lemma 29 Let 119901 and 119902 be two positive numbers less than orequal to 1 Then

119901 ln119901

119902asymp (119901 minus 119902) +

(119901 minus 119902)2

2119902 (60)

The equality holds if and only if 119901 = 119902 Moreover the close 119901 isto 119902 the better the approximation is

Now let us denote $N \times I(X;Y)$ by $\tilde{I}(X;Y)$. Then, applying Lemma 29, we obtain

\[
\begin{aligned}
2\tilde{I}(X;Y) &= 2N\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\ln\frac{p_{ij}}{p_{i\cdot}\,p_{\cdot j}}
= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij}\ln\frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} \\
&= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij}\ln\frac{O_{ij}}{n_{i\cdot}n_{\cdot j}/N} \\
&\approx 2\sum_{i=1}^{K}\sum_{j=1}^{L}\Bigl(O_{ij}-\frac{n_{i\cdot}n_{\cdot j}}{N}\Bigr)
+ \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{\bigl(O_{ij}-n_{i\cdot}n_{\cdot j}/N\bigr)^{2}}{n_{i\cdot}n_{\cdot j}/N} \\
&= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} - \frac{2\sum_{i}n_{i\cdot}\sum_{j}n_{\cdot j}}{N}
+ \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{\bigl(O_{ij}-n_{i\cdot}n_{\cdot j}/N\bigr)^{2}}{n_{i\cdot}n_{\cdot j}/N} \\
&= 2N - \frac{2N\cdot N}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{\bigl(O_{ij}-n_{i\cdot}n_{\cdot j}/N\bigr)^{2}}{n_{i\cdot}n_{\cdot j}/N} \\
&= \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{\bigl(O_{ij}-n_{i\cdot}n_{\cdot j}/N\bigr)^{2}}{n_{i\cdot}n_{\cdot j}/N} = \chi^{2}.
\end{aligned}
\tag{61}
\]

The last equation means that the previous expression, $\sum_{i=1}^{K}\sum_{j=1}^{L}\bigl(O_{ij}-n_{i\cdot}n_{\cdot j}/N\bigr)^{2}/\bigl(n_{i\cdot}n_{\cdot j}/N\bigr)$, follows a $\chi^{2}$ distribution. According to [5], it follows a $\chi^{2}$ distribution with $(K-1)(L-1)$ degrees of freedom. Hence, $2N \times I(X;Y)$ approximately follows a $\chi^{2}$ distribution with $(K-1)(L-1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables. This allows using the Chi-square distribution to assign a significance level corresponding to the values of mutual information and $(K-1)(L-1)$.
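The relation $2N \times I(X;Y) \approx \chi^{2}$ can be checked numerically; the snippet below is an illustrative sketch with an arbitrary $3 \times 2$ contingency table, and it assumes SciPy's chi2_contingency for the classical statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

O = np.array([[30, 10],
              [20, 20],
              [10, 30]])                 # hypothetical K = 3 by L = 2 contingency table
N = O.sum()
p = O / N
p_i = p.sum(axis=1, keepdims=True)
p_j = p.sum(axis=0, keepdims=True)
I_nat = np.sum(p * np.log(p / (p_i @ p_j)))          # I(X; Y) in natural logarithms

stat, pval, dof, expected = chi2_contingency(O, correction=False)
print(2 * N * I_nat, stat, dof)          # roughly 20.9 vs 20.0, with dof = (K-1)(L-1) = 2
```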

The null and alternative hypotheses are as follows:

$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them).

$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).


The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^{2}$ statistic

\[
\sum_{i=1}^{K}\sum_{j=1}^{L}\frac{\bigl(O_{ij}-n_{i\cdot}n_{\cdot j}/N\bigr)^{2}}{n_{i\cdot}n_{\cdot j}/N} \approx 2N \times I(X;Y) \tag{62}
\]

is greater than $\chi^{2}_{U}$, the upper-tail critical value from a Chi-square distribution with $(K-1)(L-1)$ degrees of freedom. That is,

\[
\text{Select feature } X \text{ if } I(X;Y) > \frac{\chi^{2}_{U}}{2N}. \tag{63}
\]

Take credit scoring for example. In this case, $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^{2}_{U}$ to be 16.9 from a Chi-square table with $(K-1)(L-1) = 9$ degrees of freedom, and we select this feature only if $I(X;Y) > 16.9/(2N)$.
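The critical value quoted above can be reproduced, for instance, with SciPy (an illustrative snippet, assuming SciPy is available):

```python
from scipy.stats import chi2

chi2_U = chi2.ppf(1 - 0.05, df=(10 - 1) * (2 - 1))   # K = 10, L = 2
print(chi2_U)                                        # about 16.92
```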

Assume a training set has $N$ examples. We can do feature selection by the following procedure.

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.

(ii) Step 2. Find $K$, the number of values of feature $X$.

(iii) Step 3. Build the contingency table for $X$ and $Y$.

(iv) Step 4. Calculate $I(X;Y)$ from the contingency table.

(v) Step 5. Find $\chi^{2}_{U}$ with $(K-1)(L-1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.

(vi) Step 6. Select $X$ if $I(X;Y) > \chi^{2}_{U}/(2N)$ and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected from the above procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure. A code sketch of this procedure is given below.
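The following Python sketch is one possible reading of Steps 1–7; it is not from the original paper, and the helper mi_nat, the pandas DataFrame frame, and the column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mi_nat(x, y):
    """I(X; Y) in natural logarithms from a contingency table (0 log 0 = 0)."""
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy()
    p = table / table.sum()
    p_i = p.sum(axis=1, keepdims=True)
    p_j = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (p_i @ p_j)[nz])))

def select_features(frame, features, target="Y", alpha=0.05):
    """Steps 1-7: keep feature X when I(X; Y) > chi2_U / (2N)."""
    y = frame[target].to_numpy()
    N, L = len(y), frame[target].nunique()
    selected = []
    for name in features:                                    # Step 7: repeat for all features
        K = frame[name].nunique()                            # Step 2
        I = mi_nat(frame[name].to_numpy(), y)                # Steps 3-4
        chi2_U = chi2.ppf(1 - alpha, df=(K - 1) * (L - 1))   # Step 5
        if I > chi2_U / (2 * N):                             # Step 6
            selected.append(name)
    return selected
```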

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we have proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X;Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X;Y)$ will reach the maximum value $H(Y)$ and $I(X;Y)/H(Y)$ will reach the maximum value 1.

Therefore, if $X$ has too many different values, one may bin or group these values first. Based on the binned values, mutual information is calculated again. For numerical variables, we may adopt a three-step process (see the sketch after this list).

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the rest of the numerical features.

(iii) Step 3. Select features by mutual information.
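As an illustration of this three-step process (the DataFrame frame, the column names, and the helper mi_nat from the sketch above are assumptions), a numerical feature with many distinct values could be quantile-binned before recomputing mutual information:

```python
import pandas as pd

# Step 2: bin a hypothetical numerical feature "income" into 10 quantile groups,
# so that I(X; Y) is not pushed toward its maximum H(Y) as in Proposition 28.
frame["income_bin"] = pd.qcut(frame["income"], q=10, duplicates="drop")

# Step 3: recompute mutual information on the binned values.
I_binned = mi_nat(frame["income_bin"].astype(str).to_numpy(), frame["Y"].to_numpy())
```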

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with p-values less than some specific value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose covariance (in absolute value) is larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into a bin. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences of all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
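For comparison, the sketch below computes information value in the form commonly used in the credit-scoring literature [15–19]; the exact formula is not given in this paper, so the implementation is an assumption, and $y = 0$ is taken to mark good accounts and $y = 1$ bad accounts.

```python
import numpy as np
import pandas as pd

def information_value(x_binned, y):
    """IV = sum over bins of (dist_good - dist_bad) * ln(dist_good / dist_bad)."""
    table = pd.crosstab(pd.Series(x_binned), pd.Series(y))
    dist_good = table[0] / table[0].sum()       # share of good accounts per bin
    dist_bad = table[1] / table[1].sum()        # share of bad accounts per bin
    woe = np.log(dist_good / dist_bad)          # undefined if a bin has no goods or no bads
    return float(((dist_good - dist_bad) * woe).sum())
```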

6. Conclusions

In this paper, we have presented a unified definition for mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.


References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[3] A. Navot, On the role of feature selection in machine learning [Ph.D. thesis], Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein in 1964, published by Holden-Day, San Francisco, USA).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.
[20] K. Kennedy, Credit scoring using machine learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.




Acknowledgments

The author has benefited from a brief discussion with DrZhigang Zhou and Dr Fuping Huang of Elevate about prob-ability theory

12 Mathematical Problems in Engineering

References

[1] G D Tourassi E D Frederick M K Markey and C E FloydJr ldquoApplication of the mutual information criterion for featureselection in computer-aided diagnosisrdquoMedical Physics vol 28no 12 pp 2394ndash2402 2001

[2] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[3] A Navot On the role of feature selection in machine learning[PhD thesis] Hebrew University 2006

[4] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 no 3 pp 379ndash423 1948

[5] S Kullback Information Theory and Statistics John Wiley ampSons New York NY USA 1959

[6] M S Pinsker Information and Information Stability of RandomVariables and Processes Academy of Science USSR 1960(English Translation by A Feinstein in 1964 and published byHolden-Day San Francisco USA)

[7] R B Ash Information Theory Interscience Publishers NewYork NY USA 1965

[8] T M Cover and J A Thomas Elements of Information TheoryJohn Wiley amp Sons New York NY USA 2nd edition 2006

[9] R M Fano Transmission of Information MIT Press Cam-bridge Mass USA John Wiley amp Sons New York NY USA1961

[10] N Abramson Information Theory and Coding McGraw-HillNew York NY USA 1963

[11] RGGallager InformationTheory andReliable CommunicationJohn Wiley amp Sons New York NY USA 1968

[12] R B Ash and C A Doleans-Dade Probability amp MeasureTheory Academic Press San Diego Calif USA 2nd edition2000

[13] I Braga ldquoA constructive density-ratio approach tomutual infor-mation estimation experiments in feature selectionrdquo Journal ofInformation and Data Management vol 5 no 1 pp 134ndash1432014

[14] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003

[15] M Refaat Credit Risk Scorecards Development and Implemen-tation Using SAS Lulucom New York NY USA 2011

[16] N Siddiqi Credit Risk Scorecards Developing and ImplementingIntelligent Credit Scoring John Wiley amp Sons New York NYUSA 2006

[17] G Zeng ldquoMetric divergence measures and information valuein credit scoringrdquo Journal of Mathematics vol 2013 Article ID848271 10 pages 2013

[18] G Zeng ldquoA rule of thumb for reject inference in credit scoringrdquoMathematical Finance Letters vol 2014 article 2 2014

[19] G Zeng ldquoA necessary condition for a good binning algorithmin credit scoringrdquo Applied Mathematical Sciences vol 8 no 65pp 3229ndash3242 2014

[20] K KennedyCredit scoring usingmachine learning [PhD thesis]School of Computing Dublin Institute of Technology DublinIreland 2013

[21] R J McEliece The Theory of Information and Coding Cam-bridge University Press Cambridge UK Student edition 2004

[22] J H Friedman ldquoGreedy function approximation a gradientboosting machinerdquo The Annals of Statistics vol 29 no 5 pp1189ndash1232 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: Research Article A Unified Definition of Mutual

4 Mathematical Problems in Engineering

random variables ξ and η. By the product of the distributions P_ξ(⋅) and P_η(⋅), denoted by P_{ξ×η}(⋅), we mean the distribution defined on S_x × S_y by

P_{ξ×η}(E × F) = P_ξ(E) P_η(F)   (14)

for E ∈ S_x and F ∈ S_y. If the joint distribution P_{ξη}(⋅) coincides with the product distribution P_{ξ×η}(⋅), the random variables ξ and η are said to be independent. If ξ and η are discrete random variables, say X and Y contain countably many points x_1, x_2, … and y_1, y_2, …, then

I(ξ, η) = \sum_{i,j} P_{ξη}(x_i, y_j) \log \frac{P_{ξη}(x_i, y_j)}{P_ξ(x_i) P_η(y_j)}.   (15)

I is called the information of ξ and η with respect to each other.

3.2.4. A Modern Definition in Information Theory. Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let X be a discrete variable with alphabet 𝒳 and probability mass function p(x) = Pr{X = x}, x ∈ 𝒳. Let Y be a discrete variable with alphabet Υ and probability mass function p(y) = Pr{Y = y}, y ∈ Υ. Suppose X and Y have a joint mass function (joint distribution) p(x, y). Then the mutual information I(X; Y) can be defined as

I(X; Y) = \sum_{x ∈ 𝒳} \sum_{y ∈ Υ} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.   (16)

3.3. Class 2 Definitions. In Class 2 definitions, random variables are replaced by ensembles, and mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results of Fano [9] and Abramson [10].

Suppose that discrete ensemble X has a sample space {a_1, a_2, …, a_K} and discrete ensemble Y has a sample space {b_1, b_2, …, b_L}. Consider the joint sample space {(a_k, b_j) : 1 ≤ k ≤ K, 1 ≤ j ≤ L}. A probability measure on the joint sample space is given by the joint probability P_XY(a_k, b_j), defined for 1 ≤ k ≤ K and 1 ≤ j ≤ L. The combination of a joint sample space and a probability measure for outcomes x and y is called a joint XY ensemble. Then the marginal probabilities can be found as

P_X(a_k) = \sum_{j=1}^{L} P_XY(a_k, b_j),  k = 1, 2, …, K.   (17)

In more abbreviated notation, this is written as

P(x) = \sum_{y} P(x, y).   (18)

Likewise,

P_Y(b_j) = \sum_{k=1}^{K} P_XY(a_k, b_j),  j = 1, 2, …, L.   (19)

In more abbreviated notation, this is written as

P(y) = \sum_{x} P(x, y).   (20)

If P_X(a_k) > 0, the conditional probability that the outcome y is b_j, given that the outcome of x is a_k, is defined as

P_{Y|X}(b_j | a_k) = \frac{P_XY(a_k, b_j)}{P_X(a_k)}.   (21)

The mutual information between the events x = a_k and y = b_j is defined as

I_XY(a_k; b_j) = \log \frac{P_{X|Y}(a_k | b_j)}{P_X(a_k)} = \log \frac{P_XY(a_k, b_j)}{P_X(a_k) P_Y(b_j)} = \log \frac{P_{Y|X}(b_j | a_k)}{P_Y(b_j)} = I_YX(b_j; a_k).   (22)

Since the mutual information defined above is a random variable on the joint XY ensemble, its mean value, which is called the average mutual information and is denoted by I(X; Y), is given by

I(X; Y) = \sum_{k=1}^{K} \sum_{j=1}^{L} P_XY(a_k, b_j) \log \frac{P_XY(a_k, b_j)}{P_X(a_k) P_Y(b_j)}.   (23)

Remark 12. By means of an information channel consisting of a transmitter of alphabet A with elements a_i and total elements t, and a receiver of alphabet B with elements b_j and total elements r, Abramson [10] denoted H(A) − H(A | B) = \sum_{A,B} P(a, b) \log (P(a, b)/(P(a) P(b))) by I(A; B) and called it the mutual information of A and B.

The mutual information I(X; Y) between two continuous random variables X and Y [8] (also called the rate of transmission in [1]) is defined as

I(X; Y) = \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)} \, dx \, dy,   (24)

where P(x, y) is the joint probability density function of X and Y, and P(x) and P(y) are the marginal density functions associated with X and Y, respectively. The mutual information between two continuous random variables is also called the differential mutual information.

However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.

4. A New Unified Definition of Mutual Information

In Section 3, we reviewed various definitions of mutual information. Shannon's original definition laid the foundation of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume that a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.

4.1. Unified Definition of Mutual Information. Let X be a finite discrete random variable on a discrete probability space (Ω_1, ℱ_1, P_1) with Ω_1 = {ω_1, ω_2, …, ω_n} and range {x_1, x_2, …, x_K}, with K ≤ n. Let Y be a discrete random variable on a probability space (Ω_2, ℱ_2, P_2) with Ω_2 = {ρ_1, ρ_2, …, ρ_m} and range {y_1, y_2, …, y_L}, with L ≤ m.

If X and Y have the same probability space (Ω, ℱ, P), then the joint distribution is simply

P_XY(X = x, Y = y) = P({ω ∈ Ω : X(ω) = x, Y(ω) = y}).   (25)

However, when X and Y have different probability spaces, and so different probability measures, the joint distribution is more complicated.

Definition 13. The joint sample space of random variables X and Y is defined as the product Ω_1 × Ω_2 of all pairs (ω_i, ρ_j), i = 1, 2, …, n and j = 1, 2, …, m. The joint σ-field ℱ_1 × ℱ_2 is defined as the product of all pairs (A_1, A_2), where A_1 and A_2 are elements of ℱ_1 and ℱ_2, respectively. A joint probability measure P_XY of P_1 and P_2 is a probability measure on ℱ_1 × ℱ_2, P_XY(A × B), such that for any A ⊆ Ω_1 and B ⊆ Ω_2,

P_1(A) = P_XY(A × Ω_2) = \sum_{j=1}^{m} P_XY(A × {ρ_j}),
P_2(B) = P_XY(Ω_1 × B) = \sum_{i=1}^{n} P_XY({ω_i} × B).   (26)

(Ω_1 × Ω_2, ℱ_1 × ℱ_2, P_XY) is called the joint probability space of X and Y, and P_XY({X = x_i} × {Y = y_j}), for i = 1, 2, …, K and j = 1, 2, …, L, the joint distribution of X and Y.

Combining Definitions 2 and 13, we immediately obtain the following results.

Proposition 14. A sequence of nonnegative numbers p_ij, 1 ≤ i ≤ K, 1 ≤ j ≤ L, whose sum is 1 can serve as a probability measure on ℱ_1 × ℱ_2: P_XY(ω_i, ρ_j) = p_ij. The probability of any event A × B ⊆ Ω_1 × Ω_2 is computed simply by adding the probabilities of the individual points (ω, ρ) ∈ A × B. If, in addition, for i = 1, 2, …, K and j = 1, 2, …, L the following hold:

\sum_{j=1}^{L} p_ij = P_X(ω_i),  \sum_{i=1}^{K} p_ij = P_Y(ρ_j),   (27)

then P_XY(ω_i, ρ_j) = p_ij is a joint distribution of X and Y.

For convenience, from now on we will shorten P_XY({X = x_i} × {Y = y_j}) to P_XY(x_i, y_j).

This two-dimensional measure should not be confused with the one-dimensional joint distribution used when X and Y have the same probability space.

Remark 15. If (Ω_1, ℱ_1, P_1) = (Ω_2, ℱ_2, P_2), instead of using the two-dimensional measure P_XY({X = x_i} × {Y = y_j}), we may use the one-dimensional measure P_1(X = x_i and Y = y_j). Then (26) always holds. In this sense, our new definition of the joint distribution reduces to the definition of the joint distribution with the same probability space.

Definition 16. The conditional probability of Y = y_j given X = x_i is defined as

P_{Y|X}(Y = y_j | X = x_i) = \frac{P_XY(x_i, y_j)}{P_1(X = x_i)}.   (28)

Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure or simply the product distribution.

Proof. Let random variables X and Y be defined as before. Define a function from Ω_1 × Ω_2 to [0, 1] as follows:

P_XY(ω_i, ρ_j) = P_1(ω_i) P_2(ρ_j).   (29)

Then

\sum_{i=1}^{n} \sum_{j=1}^{m} P_XY(ω_i, ρ_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(ω_i) P_2(ρ_j) = \sum_{i=1}^{n} P_1(ω_i) \sum_{j=1}^{m} P_2(ρ_j) = 1.   (30)

Hence P_XY can serve as a probability measure on ℱ_1 × ℱ_2 by Definition 2. The probability of any event A × B ⊆ Ω_1 × Ω_2 is computed simply by adding the probabilities of the individual points (ω, ρ) ∈ A × B. Moreover, for any A = {ω_{i_1}, ω_{i_2}, …, ω_{i_s}} ⊆ Ω_1 of s elements,

P_XY(A × Ω_2) = \sum_{j=1}^{m} P_XY(A × {ρ_j})
             = \sum_{j=1}^{m} \sum_{u=1}^{s} P_XY({ω_{i_u}} × {ρ_j})
             = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(ω_{i_u}) P_2(ρ_j)
             = \sum_{j=1}^{m} P_2(ρ_j) \sum_{u=1}^{s} P_1(ω_{i_u})
             = \sum_{u=1}^{s} P_1(ω_{i_u}) = P_1(A).   (31)

Similarly, P_XY(Ω_1 × B) = P_2(B) for any B ⊆ Ω_2. Hence P_XY({X = x_i} × {Y = y_j}) = P_1(X = x_i) P_2(Y = y_j) is a joint probability measure of X and Y by Definition 13.
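As a quick numerical illustration of Theorem 17 (a sketch only; the marginal vectors below are arbitrary and not taken from the paper), the product measure is simply the outer product of the two marginal probability vectors, and its row and column sums recover P_1 and P_2 as required by (26):

import numpy as np

# Hypothetical marginals for X (three sample points) and Y (two sample points)
p1 = np.array([0.2, 0.5, 0.3])   # P_1 over omega_1, omega_2, omega_3
p2 = np.array([0.4, 0.6])        # P_2 over rho_1, rho_2

# Product probability measure P_XY(omega_i, rho_j) = P_1(omega_i) * P_2(rho_j)
p_xy = np.outer(p1, p2)

print(p_xy.sum())        # 1.0, so P_XY is a probability measure
print(p_xy.sum(axis=1))  # recovers p1, the first marginal condition in (26)
print(p_xy.sum(axis=0))  # recovers p2, the second marginal condition in (26)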

Definition 18. Random variables X and Y are said to be independent under a joint distribution P_XY(⋅) if P_XY(⋅) coincides with the product distribution P_{X×Y}(⋅).

Definition 19. The joint entropy H(X, Y) is defined as

H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log P_XY(x_i, y_j).   (32)

Definition 20. The conditional entropy H(Y | X) is defined as

H(Y | X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log P_{Y|X}(Y = y_j | X = x_i)
         = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log \frac{P_XY(x_i, y_j)}{P_1(X = x_i)}.   (33)

Definition 21. The mutual information I(X; Y) between X and Y is defined as

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log \frac{P_XY(x_i, y_j)}{P_1(X = x_i) P_2(Y = y_j)}.   (34)

As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, I(X; Y) under one base is proportional to that under another base by the change-of-base formula. Moreover, we take 0 log 0 to be 0; this corresponds to the limit of x log x as x goes to 0.

It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let Ω_1 = {a_1, a_2, …, a_K} and Ω_2 = {b_1, b_2, …, b_L}. Define random variables X : Ω_1 → ℝ and Y : Ω_2 → ℝ as one-to-one mappings:

X(a_i) = x_i,  i = 1, 2, …, K,
Y(b_j) = y_j,  j = 1, 2, …, L.   (35)

Then we have

P_XY(x_i, y_j) = P_XY(a_i, b_j).   (36)

It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used to do feature selection, as seen later. In addition, our new definition leads to different values of mutual information for different joint distributions, as demonstrated in the following example.

Example 22. Assume random variables X and Y have the following probability distributions:

P_2(Y = 0) = 1/3,  P_2(Y = 1) = 2/3,
P_1(X = 1) = 1/3,  P_1(X = 2) = 1/3,  P_1(X = 3) = 1/3.   (37)

We can construct four different joint probability distributions, which generally lead to different values of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find the mutual information.

(1) P(1, 0) = 0, P(1, 1) = 1/3, P(2, 0) = 1/3, P(2, 1) = 0, P(3, 0) = 0, P(3, 1) = 1/3.
(2) P(1, 0) = 0, P(1, 1) = 1/3, P(2, 0) = 0, P(2, 1) = 1/3, P(3, 0) = 1/3, P(3, 1) = 0.
(3) P(1, 0) = 1/3, P(1, 1) = 0, P(2, 0) = 0, P(2, 1) = 1/3, P(3, 0) = 0, P(3, 1) = 1/3.
(4) P(1, 0) = 1/9, P(1, 1) = 2/9, P(2, 0) = 1/9, P(2, 1) = 2/9, P(3, 0) = 1/9, P(3, 1) = 2/9.
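As a quick check of this point (an illustrative computation, not part of the original example), one can evaluate (34) for each of the four joints above. Joint (4) is the product distribution, so its mutual information is 0, whereas under joint (1) the value of Y is completely determined by X and the mutual information attains H(Y). A short Python sketch:

import numpy as np

def mutual_info(p_xy):
    # I(X;Y) in bits for a joint probability matrix p_xy (rows: values of X, columns: values of Y)
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                      # convention: 0 log 0 = 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px * py)[mask])).sum())

joints = {
    "(1)": np.array([[0, 1/3], [1/3, 0], [0, 1/3]]),
    "(2)": np.array([[0, 1/3], [0, 1/3], [1/3, 0]]),
    "(3)": np.array([[1/3, 0], [0, 1/3], [0, 1/3]]),
    "(4)": np.array([[1/9, 2/9], [1/9, 2/9], [1/9, 2/9]]),
}
for name, p in joints.items():
    print(name, round(mutual_info(p), 4))
# Joint (4) gives 0, while the deterministic couplings attain H(Y) = (1/3)log2(3) + (2/3)log2(3/2), about 0.918 bits.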

4.2. Properties of Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].

Definition 23. The relative entropy or Kullback-Leibler distance between two discrete probability distributions P = {p_1, p_2, …, p_n} and Q = {q_1, q_2, …, q_n} is defined as

D(P, Q) = \sum_{i} p_i \log \frac{p_i}{q_i}.   (38)

Lemma 24 (see [8]). Let P and Q be two discrete probability distributions. Then D(P, Q) ≥ 0, with equality if and only if p_i = q_i for all i.

Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.

The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution P_XY and the product distribution P_X P_Y.

Property 1. The mutual information of random variables X and Y is the Kullback-Leibler distance between the joint distribution P_XY and the product distribution P_1 P_2.

Proof. Using a mapping from two-dimensional indices to a one-dimensional index,

(i, j) → (i - 1)L + j ≜ n,  for i = 1, …, K, j = 1, 2, …, L,   (39)

and using another mapping from the one-dimensional index back to two-dimensional indices,

i = ⌈n/L⌉,  j = n - (i - 1)L,  for n = 1, 2, …, KL,   (40)

we rewrite I(X; Y) as

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log \frac{P_XY(x_i, y_j)}{P_1(X = x_i) P_2(Y = y_j)}
        = \sum_{n=1}^{KL} P_XY(x_{⌈n/L⌉}, y_{n-(⌈n/L⌉-1)L}) \log \frac{P_XY(x_{⌈n/L⌉}, y_{n-(⌈n/L⌉-1)L})}{P_1(X = x_{⌈n/L⌉}) P_2(Y = y_{n-(⌈n/L⌉-1)L})}.   (41)

Since

\sum_{n=1}^{KL} P_XY(x_{⌈n/L⌉}, y_{n-(⌈n/L⌉-1)L}) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) = 1,
\sum_{n=1}^{KL} P_1(X = x_{⌈n/L⌉}) P_2(Y = y_{n-(⌈n/L⌉-1)L}) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i) P_2(Y = y_j) = 1,   (42)

both families of numbers in (42) are probability distributions over the KL one-dimensional indices, and we obtain

I(X; Y) = \sum_{n=1}^{KL} P_XY(x_{⌈n/L⌉}, y_{n-(⌈n/L⌉-1)L}) \log \frac{P_XY(x_{⌈n/L⌉}, y_{n-(⌈n/L⌉-1)L})}{P_1(X = x_{⌈n/L⌉}) P_2(Y = y_{n-(⌈n/L⌉-1)L})},   (43)

which is exactly the Kullback-Leibler distance between the joint distribution and the product distribution P_1 P_2.

Property 2. Let X and Y be two discrete random variables. The mutual information between X and Y satisfies

I(X; Y) ≥ 0,   (44)

with equality if and only if X and Y are independent.

Proof. Let us use the mappings between the two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, I(X; Y) ≥ 0, with equality if and only if P_XY(x_{⌈n/L⌉}, y_{n-(⌈n/L⌉-1)L}) = P_1(X = x_{⌈n/L⌉}) P_2(Y = y_{n-(⌈n/L⌉-1)L}) for n = 1, 2, …, KL; that is, P_XY(x_i, y_j) = P_1(X = x_i) P_2(Y = y_j) for i = 1, …, K and j = 1, 2, …, L, or X and Y are independent.

Corollary 26. If X is a constant random variable, that is, K = 1, then for any random variable Y,

I(X; Y) = 0.   (45)

Proof. Suppose the range of X is a constant x and the sample space has only one point ω. Then P_1(X = x) = P_1(ω) = 1. For any j = 1, 2, …, L,

P_XY(x, y_j) = \sum_{i=1}^{1} P_XY(x_i, y_j) = P_2(Y = y_j) = P_1(X = x) P_2(Y = y_j).   (46)

Thus X and Y are independent. By Property 2, I(X; Y) = 0.

Lemma 27 (see [8]). Let X be a discrete random variable with K values. Then

0 ≤ H(X) ≤ log K,   (47)

with equality on the right if and only if the K values are equally probable.

Property 3. Let X and Y be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:

I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = I(Y; X).   (48)

Proof. Consider

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log \frac{P_XY(x_i, y_j)}{P_1(X = x_i) P_2(Y = y_j)}
        = \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log \frac{P_{Y|X}(Y = y_j | X = x_i)}{P_2(Y = y_j)}
        = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log P_2(Y = y_j) + \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log P_{Y|X}(Y = y_j | X = x_i)
        = -\sum_{j=1}^{L} \sum_{i=1}^{K} P_XY(x_i, y_j) \log P_2(Y = y_j) - (-\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log P_{Y|X}(Y = y_j | X = x_i))
        = -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - (-\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(x_i, y_j) \log P_{Y|X}(Y = y_j | X = x_i))
        = H(Y) - H(Y | X),   (49)

where the second-to-last equality uses the marginal condition \sum_{i=1}^{K} P_XY(x_i, y_j) = P_2(Y = y_j) from (26). The identities I(X; Y) = H(X) - H(X | Y) = I(Y; X) follow in the same way by conditioning on Y instead of X.

Combining the above properties and noting that H(X | Y) and H(Y | X) are both nonnegative, we obtain the following properties.

Property 4. Let X and Y be two discrete random variables with K and L values, respectively. Then

0 ≤ I(X; Y) ≤ H(Y) ≤ log L,
0 ≤ I(X; Y) ≤ H(X) ≤ log K.   (50)

Moreover, I(X; Y) = 0 if and only if X and Y are independent.

5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set S contains N examples (also known as samples, observations, or records) from an input space X = {X_1, X_2, …, X_M} and their associated output values y from an output space Y (i.e., the dependent variable). Here X_1, X_2, …, X_M are called features, that is, independent variables. Hence S can be expressed as

S = {(x_{i1}, x_{i2}, …, x_{iM}, y_i) : i = 1, 2, …, N},   (51)

where feature X_j has values x_{1j}, x_{2j}, …, x_{Nj} for j = 1, 2, …, M.

A fundamental objective in machine learning is to find a functional relationship between the input X and the output Y. In general, there is a very large number of features, many of which are not needed. Sometimes the output Y is not determined by the complete set of input features X_1, X_2, …, X_M; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature X_i and the output Y.
(2) Select the best set of attributes according to some criterion.

5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures the dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature X and the output Y. Assume X has K different values ω_1, ω_2, …, ω_K. If X has missing values, we will use ω_1 to represent all the missing values. Assume Y has L different values ρ_1, ρ_2, …, ρ_L.

Let us build a two-way frequency or contingency table by taking X as the row variable and Y as the column variable, as in [8]. Let O_ij be the frequency (which could be 0) of (ω_i, ρ_j) for i = 1 to K and j = 1 to L. Let the row and column marginal totals be n_{i·} and n_{·j}, respectively. Then

n_{i·} = \sum_{j} O_ij,  n_{·j} = \sum_{i} O_ij,  N = \sum_{i} \sum_{j} O_ij = \sum_{i} n_{i·} = \sum_{j} n_{·j}.   (52)

Let us denote the relative frequency O_ij/N by p_ij. We then have the two-way relative frequency table; see Table 2. Since

\sum_{i=1}^{K} \sum_{j=1}^{L} p_ij = \sum_{i=1}^{K} p_{i·} = \sum_{j=1}^{L} p_{·j} = 1,   (53)

{p_{i·}}_{i=1}^{K}, {p_{·j}}_{j=1}^{L}, and {p_ij : 1 ≤ i ≤ K, 1 ≤ j ≤ L} can each serve as a probability measure.

Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables. Define

X : (Ω_1, ℱ_1, P_X) → ℝ   (54)

by X(ω_i) = x_i, where Ω_1 = {ω_1, ω_2, …, ω_K} and P_X(ω_i) = n_{i·}/N = p_{i·} for i = 1, 2, …, K. Note that x_1, x_2, …, x_K could be any real numbers as long as they are distinct, so as to guarantee that X is a one-to-one mapping. In this case, P_X(X = x_i) = P_X(ω_i).

Table 1: Frequency table.

          ρ_1      ρ_2      ⋯    ρ_j      ⋯    ρ_L      Total
ω_1       O_11     O_12     ⋯    O_1j     ⋯    O_1L     n_{1·}
ω_2       O_21     O_22     ⋯    O_2j     ⋯    O_2L     n_{2·}
⋯         ⋯        ⋯        ⋯    ⋯        ⋯    ⋯        ⋯
ω_i       O_i1     O_i2     ⋯    O_ij     ⋯    O_iL     n_{i·}
⋯         ⋯        ⋯        ⋯    ⋯        ⋯    ⋯        ⋯
ω_K       O_K1     O_K2     ⋯    O_Kj     ⋯    O_KL     n_{K·}
Total     n_{·1}   n_{·2}   ⋯    n_{·j}   ⋯    n_{·L}   N

Table 2: Relative frequency table.

          ρ_1      ρ_2      ⋯    ρ_j      ⋯    ρ_L      Total
ω_1       p_11     p_12     ⋯    p_1j     ⋯    p_1L     p_{1·}
ω_2       p_21     p_22     ⋯    p_2j     ⋯    p_2L     p_{2·}
⋯         ⋯        ⋯        ⋯    ⋯        ⋯    ⋯        ⋯
ω_i       p_i1     p_i2     ⋯    p_ij     ⋯    p_iL     p_{i·}
⋯         ⋯        ⋯        ⋯    ⋯        ⋯    ⋯        ⋯
ω_K       p_K1     p_K2     ⋯    p_Kj     ⋯    p_KL     p_{K·}
Total     p_{·1}   p_{·2}   ⋯    p_{·j}   ⋯    p_{·L}   1

Similarly, define

Y : (Ω_2, ℱ_2, P_Y) → ℝ   (55)

by Y(ρ_j) = y_j, where Ω_2 = {ρ_1, ρ_2, …, ρ_L} and P_Y(ρ_j) = n_{·j}/N = p_{·j} for j = 1, 2, …, L. Also, y_1, y_2, …, y_L could be any real numbers as long as they are distinct, so as to guarantee that Y is a one-to-one mapping. In this case, P_Y(Y = y_j) = P_Y(ρ_j).

Now define a mapping P_XY from Ω_1 × Ω_2 to ℝ as follows:

P_XY(ω_i, ρ_j) = p_ij = \frac{O_ij}{N}.   (56)

Since

\sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(ω_i, ρ_j) = 1,
\sum_{j=1}^{L} P_XY(ω_i, ρ_j) = \sum_{j=1}^{L} p_ij = p_{i·} = P_X(ω_i),
\sum_{i=1}^{K} P_XY(ω_i, ρ_j) = \sum_{i=1}^{K} p_ij = p_{·j} = P_Y(ρ_j),   (57)

{p_ij : 1 ≤ i ≤ K, 1 ≤ j ≤ L} is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_XY(ω_i, ρ_j) \log \frac{P_XY(ω_i, ρ_j)}{P_X(ω_i) P_Y(ρ_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_ij \log \frac{p_ij}{p_{i·} p_{·j}}.   (58)
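In practice, (58) amounts to a few array operations on the contingency table. The following Python sketch (function and variable names are illustrative, not from the paper) computes I(X; Y) directly from the table of counts O_ij; with base-2 logarithms the result is in bits, and with natural logarithms it is in nats:

import numpy as np

def mutual_info_from_counts(counts, base=2.0):
    # I(X;Y) from a K-by-L contingency table of counts O_ij, as in equation (58)
    O = np.asarray(counts, dtype=float)
    N = O.sum()
    p = O / N                               # p_ij
    p_row = p.sum(axis=1, keepdims=True)    # p_i.
    p_col = p.sum(axis=0, keepdims=True)    # p_.j
    mask = p > 0                            # convention: 0 log 0 = 0
    ratio = p[mask] / (p_row * p_col)[mask]
    return float((p[mask] * np.log(ratio)).sum() / np.log(base))

# Hypothetical three-value feature X versus a binary outcome Y
table = [[30, 10],
         [20, 20],
         [ 5, 15]]
print(mutual_info_from_counts(table))        # bits
print(mutual_info_from_counts(table, np.e))  # nats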

It follows from Corollary 26 that if X has only one value, then I(X; Y) = 0. On the other hand, if X has all distinct values, the following result shows that the mutual information will reach its maximum value.

Proposition 28. If all the values of X are distinct, then I(X; Y) = H(Y).

Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2, we observe the following:

(1) O_ij = 0 or 1 for all i = 1, 2, …, K and j = 1, 2, …, L;
(2) p_ij = O_ij/N = 0 or 1/N for all i = 1, 2, …, K and j = 1, 2, …, L;
(3) for each j = 1, 2, …, L, since O_1j + O_2j + ⋯ + O_Kj = n_{·j}, there are n_{·j} nonzero O_ij's, or equivalently n_{·j} nonzero p_ij's;
(4) p_{i·} = 1/N for i = 1, 2, …, K.

Using the above observations and the fact that 0 log 0 = 0, we have

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} p_ij \log \frac{p_ij}{p_{i·} p_{·j}}
        = \sum_{i=1}^{K} p_i1 \log \frac{p_i1}{p_{i·} p_{·1}} + \sum_{i=1}^{K} p_i2 \log \frac{p_i2}{p_{i·} p_{·2}} + ⋯ + \sum_{i=1}^{K} p_iL \log \frac{p_iL}{p_{i·} p_{·L}}
        = \sum_{p_i1 ≠ 0} \frac{1}{N} \log \frac{1/N}{p_{·1}/N} + \sum_{p_i2 ≠ 0} \frac{1}{N} \log \frac{1/N}{p_{·2}/N} + ⋯ + \sum_{p_iL ≠ 0} \frac{1}{N} \log \frac{1/N}{p_{·L}/N}
        = \sum_{p_i1 ≠ 0} \frac{1}{N} \log \frac{1}{p_{·1}} + \sum_{p_i2 ≠ 0} \frac{1}{N} \log \frac{1}{p_{·2}} + ⋯ + \sum_{p_iL ≠ 0} \frac{1}{N} \log \frac{1}{p_{·L}}
        = \frac{n_{·1}}{N} \log \frac{1}{p_{·1}} + \frac{n_{·2}}{N} \log \frac{1}{p_{·2}} + ⋯ + \frac{n_{·L}}{N} \log \frac{1}{p_{·L}}
        = p_{·1} \log \frac{1}{p_{·1}} + p_{·2} \log \frac{1}{p_{·2}} + ⋯ + p_{·L} \log \frac{1}{p_{·L}}
        = H(Y).   (59)

5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring is used to describe the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable (good or bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.

5.2.1. Absolute Values Method. From Property 4, we see that mutual information I(X; Y) is nonnegative and upper bounded by log L, and that I(X; Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction; in particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with Y is larger than some threshold chosen based on needs.

5.2.2. Relative Values. From Property 4, we have 0 ≤ I(X; Y)/H(Y) ≤ 1. Note that I(X; Y)/H(Y) is the relative mutual information, which measures how much of the information in Y is captured by X. Thus we may select those features whose relative mutual information I(X; Y)/H(Y) is larger than some threshold between 0 and 1, chosen based on needs.
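Both selection rules in Sections 5.2.1 and 5.2.2 reduce to ranking features and applying a cutoff. A small sketch (the thresholds, feature names, and counts here are hypothetical):

import numpy as np

def mi_and_entropy(counts):
    # Return (I(X;Y), H(Y)) in bits from a contingency table of counts
    p = np.asarray(counts, float)
    p = p / p.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    mi = float((p[m] * np.log2(p[m] / (px * py)[m])).sum())
    hy = float(-(py[py > 0] * np.log2(py[py > 0])).sum())
    return mi, hy

tables = {"feature_a": [[40, 10], [10, 40]],
          "feature_b": [[25, 25], [26, 24]]}

abs_threshold, rel_threshold = 0.01, 0.02    # illustrative cutoffs
for name, t in tables.items():
    mi, hy = mi_and_entropy(t)
    keep = (mi > abs_threshold) and (mi / hy > rel_threshold)
    print(name, round(mi, 4), round(mi / hy, 4), keep)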

5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in the mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].

Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then

p \ln \frac{p}{q} ≈ (p - q) + \frac{(p - q)^2}{2q}.   (60)

The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation is.

Now let us denote N × I(X; Y) by Ĩ(X; Y). Then, applying Lemma 29, we obtain

2Ĩ(X; Y) = 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_ij \ln \frac{p_ij}{p_{i·} p_{·j}}
         = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_ij \ln \frac{O_ij/N}{(n_{i·}/N)(n_{·j}/N)}
         = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_ij \ln \frac{O_ij}{n_{i·} n_{·j}/N}
         ≈ 2 \sum_{i=1}^{K} \sum_{j=1}^{L} (O_ij - \frac{n_{i·} n_{·j}}{N}) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_ij - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}
         = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_ij - 2 \frac{\sum_{i} n_{i·} \sum_{j} n_{·j}}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_ij - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}
         = 2N - \frac{2N · N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_ij - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}
         = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_ij - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}
         = χ².   (61)

The last equality says that 2Ĩ(X; Y) is approximately the familiar statistic \sum_{i=1}^{K} \sum_{j=1}^{L} (O_ij - n_{i·} n_{·j}/N)^2 / (n_{i·} n_{·j}/N). According to [5], this statistic follows a χ² distribution with (K - 1)(L - 1) degrees of freedom. Hence 2N × I(X; Y) approximately follows a χ² distribution with (K - 1)(L - 1) degrees of freedom. This is the well-known Chi-square test for the independence of two random variables. It allows using the Chi-square distribution to assign a significance level corresponding to the value of the mutual information and (K - 1)(L - 1).

The null and alternative hypotheses are as follows:

H_0: X and Y are independent (i.e., there is no relationship between them).
H_1: X and Y are dependent (i.e., there is a relationship between them).

The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic

\sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_ij - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N} ≈ 2N × I(X; Y)   (62)

is greater than χ²_U, the upper-tail critical value from a Chi-square distribution with (K - 1)(L - 1) degrees of freedom. That is,

Select feature X if I(X; Y) > \frac{χ²_U}{2N}.   (63)

Take credit scoring, for example. In this case L = 2. Assume feature X has 10 different values; that is, K = 10. Using a level of significance of α = 0.05, we find χ²_U to be 16.9 from a Chi-square table with (K - 1)(L - 1) = 9 degrees of freedom, and we select this feature only if I(X; Y) > 16.9/(2N).

Assume a training set has N examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance α, say 0.05.
(ii) Step 2. Find K, the number of values of feature X.
(iii) Step 3. Build the contingency table for X and Y.
(iv) Step 4. Calculate I(X; Y) from the contingency table.
(v) Step 5. Find χ²_U with (K - 1)(L - 1) degrees of freedom from a Chi-square table or any other source, such as SAS.
(vi) Step 6. Select X if I(X; Y) > χ²_U/(2N) and discard it otherwise.
(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance α and reselect features using the procedure.
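The procedure above is easy to automate. The sketch below is illustrative only; it assumes the features are already discrete (or binned) and uses scipy's chi2.ppf to obtain the critical value χ²_U, then applies the rule I(X; Y) > χ²_U/(2N):

import numpy as np
import pandas as pd
from scipy.stats import chi2

def mutual_info_nats(counts):
    # I(X;Y) in nats from a contingency table of counts
    p = counts / counts.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (px * py)[m])).sum())

def select_features(data, target, alpha=0.05):
    # Return the features whose mutual information with the target exceeds chi2_U / (2N)
    selected = []
    N = len(data)
    L = data[target].nunique()
    for col in data.columns.drop(target):
        counts = pd.crosstab(data[col], data[target]).to_numpy(dtype=float)
        K = counts.shape[0]
        chi2_U = chi2.ppf(1 - alpha, df=(K - 1) * (L - 1))
        if mutual_info_nats(counts) > chi2_U / (2 * N):
            selected.append(col)
    return selected

# Hypothetical usage: 'bad' is the binary outcome; all other columns are discrete features
# data = pd.read_csv("loans.csv")
# print(select_features(data, target="bad", alpha=0.05))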

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X; Y), the more dependent X is on Y. However, Proposition 28 says that if X has all distinct values, then I(X; Y) will reach the maximum value H(Y), and I(X; Y)/H(Y) will reach the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first; based on the binned values, the mutual information is then calculated again. For numerical variables, we may adopt the three-step process below (a small sketch follows the list):

(i) Step 1. Select features by removing those with small mutual information.
(ii) Step 2. Do binning for the remaining numerical features.
(iii) Step 3. Select features by mutual information.
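For instance, a numerical feature can be grouped into a handful of quantile bins before the mutual information is recomputed. A rough sketch (the bin count, variable names, and simulated data are illustrative only):

import numpy as np
import pandas as pd

def binned_mutual_info(x, y, bins=10):
    # I(X;Y) in nats after quantile-binning a numerical feature x against a discrete target y
    x_binned = pd.qcut(x, q=bins, duplicates="drop")   # Step 2: group the raw values into bins
    counts = pd.crosstab(x_binned, y).to_numpy(dtype=float)
    p = counts / counts.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (px * py)[m])).sum())

# Hypothetical example: a continuous income feature and a binary default flag
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(10, 0.5, size=1000))
default = pd.Series((income < income.median()).astype(int))
print(binned_mutual_info(income, default, bins=5))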

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select the features with p values less than some specified value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of doing feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select the features whose values are larger than some specified value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.

6. Conclusions

In this paper we have presented a unified definition for mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.


References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, Calif, USA, 1964).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.

[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.


Page 5: Research Article A Unified Definition of Mutual

Mathematical Problems in Engineering 5

of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume a joint distribution or a joint probability measure exists. Yet, they all ignore an important fact: the joint distribution or the joint probability measure is not unique.

4.1. Unified Definition of Mutual Information. Let $X$ be a finite discrete random variable on a discrete probability space $(\Omega_1, \mathcal{F}_1, P_1)$ with $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\}$ and range $\{x_1, x_2, \ldots, x_K\}$ with $K \le n$. Let $Y$ be a discrete random variable on probability space $(\Omega_2, \mathcal{F}_2, P_2)$ with $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\}$ and range $\{y_1, y_2, \ldots, y_L\}$ with $L \le m$.

If $X$ and $Y$ have the same probability space $(\Omega, \mathcal{F}, P)$, then the joint distribution is simply
$$P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x, Y(\omega) = y\}). \quad (25)$$
However, when $X$ and $Y$ have different probability spaces, and so different probability measures, the joint distribution is more complicated.

Definition 13. The joint sample space of random variables $X$ and $Y$ is defined as the product $\Omega_1 \times \Omega_2$ of all pairs $(\omega_i, \rho_j)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is defined as the product of all pairs $(A_1, A_2)$, where $A_1$ and $A_2$ are elements of $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. A joint probability measure $P_{XY}$ of $P_1$ and $P_2$ is a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$, $P_{XY}(A \times B)$, such that for any $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$,
$$P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}), \qquad P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B). \quad (26)$$
$(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})$ is called the joint probability space of $X$ and $Y$, and $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$, the joint distribution of $X$ and $Y$.

Combining Definitions 2 and 13, we immediately obtain the following result.

Proposition 14. A sequence of nonnegative numbers $p_{ij}$, $1 \le i \le K$, $1 \le j \le L$, whose sum is 1, can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$: $P_{XY}(\{(\omega_i, \rho_j)\}) = p_{ij}$. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. If, in addition, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$ the following hold:
$$\sum_{j=1}^{L} p_{ij} = P_X(\omega_i), \qquad \sum_{i=1}^{K} p_{ij} = P_Y(\rho_j), \quad (27)$$
then $P_{XY}(\{(\omega_i, \rho_j)\}) = p_{ij}$ is a joint distribution of $X$ and $Y$.

For convenience, from now on we will shorten $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ as $P_{XY}(x_i, y_j)$.

This two-dimensional measure should not be confused with the one-dimensional joint distribution when $X$ and $Y$ have the same probability space.

Remark 15. If $(\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2)$, instead of using the two-dimensional measure $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$, we may use the one-dimensional measure $P_1(X = x_i \text{ and } Y = y_j)$. Then (26) always holds. In this sense, our new definition of joint distribution reduces to the definition of joint distribution with the same probability space.

Definition 16. The conditional probability of $Y = y_j$ given $X = x_i$ is defined as
$$P_{Y|X}(Y = y_j \mid X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (28)$$

Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure or simply the product distribution.

Proof. Let random variables $X$ and $Y$ be defined as before. Define a function from $\Omega_1 \times \Omega_2$ to $[0, 1]$ as follows:
$$P_{XY}(\omega_i, \rho_j) = P_1(\omega_i)\, P_2(\rho_j). \quad (29)$$
Then
$$\sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(\omega_i)\, P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i) \sum_{j=1}^{m} P_2(\rho_j) = 1. \quad (30)$$
Hence $P_{XY}$ can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ by Definition 2. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. Moreover, for any $A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1$ of $s$ elements,
$$P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\{(\omega_{i_u}, \rho_j)\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u})\, P_2(\rho_j) = \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u}) = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A). \quad (31)$$
Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \subseteq \Omega_2$. Hence $P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i)\, P_2(Y = y_j)$ is a joint probability measure of $X$ and $Y$ by Definition 13.
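As a quick numerical illustration of Theorem 17, the following minimal Python sketch (with made-up marginals, not data from the paper) builds the product measure and verifies that the marginal conditions (26) hold:

```python
import numpy as np

# Hypothetical marginals P1 on Omega_1 (n = 3 points) and P2 on Omega_2 (m = 2 points)
P1 = np.array([0.2, 0.5, 0.3])
P2 = np.array([0.4, 0.6])

# Product probability measure of Theorem 17: P_XY(w_i, r_j) = P1(w_i) * P2(r_j)
P_XY = np.outer(P1, P2)

assert np.isclose(P_XY.sum(), 1.0)        # a valid probability measure, eq. (30)
assert np.allclose(P_XY.sum(axis=1), P1)  # summing over Omega_2 recovers P1, eq. (31)
assert np.allclose(P_XY.sum(axis=0), P2)  # summing over Omega_1 recovers P2
```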

Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $P_{XY}(\cdot)$ if $P_{XY}(\cdot)$ coincides with the product distribution $P_{X \times Y}(\cdot)$.

Definition 19. The joint entropy $H(X, Y)$ is defined as
$$H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \quad (32)$$

Definition 20. The conditional entropy $H(Y \mid X)$ is defined as
$$H(Y \mid X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (33)$$

Definition 21. The mutual information $I(X, Y)$ between $X$ and $Y$ is defined as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)}. \quad (34)$$

As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, $I(X, Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be 0; this corresponds to the limit of $x \log x$ as $x$ goes to 0.

It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X : \Omega_1 \to \mathbb{R}$ and $Y : \Omega_2 \to \mathbb{R}$ as one-to-one mappings by
$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K, \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \quad (35)$$
Then we have
$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \quad (36)$$

It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used to do feature selection, as seen later. In addition, our new definition leads to different values for different joint distributions, as demonstrated in the following example.

Example 22. Assume random variables $X$ and $Y$ have the following probability distributions:
$$P_1(X = 1) = P_1(X = 2) = P_1(X = 3) = \tfrac{1}{3}, \qquad P_2(Y = 0) = \tfrac{1}{3}, \quad P_2(Y = 1) = \tfrac{2}{3}. \quad (37)$$
We can generate four different joint probability distributions consistent with these marginals, and they do not all give the same value of mutual information (a small computational check follows the list below). However, under all the existing definitions, a joint distribution must be given in order to find mutual information.

(1) $P(1,0) = 0$, $P(1,1) = 1/3$, $P(2,0) = 1/3$, $P(2,1) = 0$, $P(3,0) = 0$, $P(3,1) = 1/3$.

(2) $P(1,0) = 0$, $P(1,1) = 1/3$, $P(2,0) = 0$, $P(2,1) = 1/3$, $P(3,0) = 1/3$, $P(3,1) = 0$.

(3) $P(1,0) = 1/3$, $P(1,1) = 0$, $P(2,0) = 0$, $P(2,1) = 1/3$, $P(3,0) = 0$, $P(3,1) = 1/3$.

(4) $P(1,0) = 1/9$, $P(1,1) = 2/9$, $P(2,0) = 1/9$, $P(2,1) = 2/9$, $P(3,0) = 1/9$, $P(3,1) = 2/9$.
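Under Definition 21, the value of mutual information depends on which joint distribution consistent with the marginals is chosen. A small Python check of Example 22 (a sketch only, using natural logarithms) is given below; distribution (4) is the product of the marginals and yields zero mutual information, while the others yield positive values.

```python
import numpy as np

# Marginals of Example 22: P1 over X in {1, 2, 3}, P2 over Y in {0, 1}
P1 = np.array([1/3, 1/3, 1/3])
P2 = np.array([1/3, 2/3])

# The four candidate joint distributions (rows: X = 1, 2, 3; columns: Y = 0, 1)
joints = [
    np.array([[0, 1/3], [1/3, 0], [0, 1/3]]),  # (1)
    np.array([[0, 1/3], [0, 1/3], [1/3, 0]]),  # (2)
    np.array([[1/3, 0], [0, 1/3], [0, 1/3]]),  # (3)
    np.outer(P1, P2),                          # (4) product distribution
]

def mutual_information(P_XY, P1, P2):
    """I(X, Y) of eq. (34), natural log, with 0 * log 0 = 0."""
    prod = np.outer(P1, P2)
    mask = P_XY > 0
    return float(np.sum(P_XY[mask] * np.log(P_XY[mask] / prod[mask])))

for k, P_XY in enumerate(joints, start=1):
    # every candidate reproduces the given marginals, as required by Proposition 14
    assert np.allclose(P_XY.sum(axis=1), P1) and np.allclose(P_XY.sum(axis=0), P2)
    print(k, mutual_information(P_XY, P1, P2))
```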

4.2. Properties of Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].

Definition 23. The relative entropy or Kullback-Leibler distance between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as
$$D(P \,\|\, Q) = \sum_{i} p_i \log \frac{p_i}{q_i}. \quad (38)$$

Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P \,\|\, Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.

Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
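To make Remark 25 concrete, here is a small Python sketch (with made-up distributions) computing $D(P\|Q)$ and $D(Q\|P)$; the two values generally differ, illustrating the lack of symmetry.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler distance D(P || Q) of eq. (38), natural log, with 0 * log 0 = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P = [0.5, 0.4, 0.1]  # made-up example distributions
Q = [0.3, 0.3, 0.4]
print(kl_divergence(P, Q), kl_divergence(Q, P))  # generally not equal
```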


The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Proof. Using a mapping from two-dimensional indices to a one-dimensional index,
$$(i, j) \longrightarrow (i - 1)L + j \triangleq n, \quad i = 1, \ldots, K, \; j = 1, 2, \ldots, L, \quad (39)$$
and using another mapping from the one-dimensional index back to two-dimensional indices,
$$i = \left\lceil \frac{n}{L} \right\rceil, \qquad j = n - (i - 1)L, \quad n = 1, 2, \ldots, KL, \quad (40)$$
we rewrite $I(X, Y)$ as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\bigr) \log \frac{P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L \rceil}\bigr)\, P_2\bigl(Y = y_{n-(\lceil n/L \rceil - 1)L}\bigr)}. \quad (41)$$
Since
$$\sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\bigr) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1, \qquad \sum_{n=1}^{KL} P_1\bigl(X = x_{\lceil n/L \rceil}\bigr)\, P_2\bigl(Y = y_{n-(\lceil n/L \rceil - 1)L}\bigr) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \quad (42)$$
both sequences form discrete probability distributions, and we obtain
$$I(X, Y) = \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\bigr) \log \frac{P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L \rceil}\bigr)\, P_2\bigl(Y = y_{n-(\lceil n/L \rceil - 1)L}\bigr)}, \quad (43)$$
which, by Definition 23, is the Kullback-Leibler distance between the joint distribution and the product distribution.
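The index maps (39) and (40) are inverses of each other; a quick Python check (illustrative sizes only) is shown below.

```python
import math

K, L = 4, 3  # illustrative sizes

def flatten(i, j, L):
    """Map (39): two-dimensional indices (i, j), 1-based, to a single index n."""
    return (i - 1) * L + j

def unflatten(n, L):
    """Map (40): one-dimensional index n back to (i, j)."""
    i = math.ceil(n / L)
    return i, n - (i - 1) * L

# the two maps are mutually inverse and enumerate exactly the indices 1..KL
assert all(unflatten(flatten(i, j, L), L) == (i, j)
           for i in range(1, K + 1) for j in range(1, L + 1))
assert sorted(flatten(i, j, L) for i in range(1, K + 1)
              for j in range(1, L + 1)) == list(range(1, K * L + 1))
```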

Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies
$$I(X, Y) \ge 0, \quad (44)$$
with equality if and only if $X$ and $Y$ are independent.

Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index in the proof of Property 1. By Lemma 24, $I(X, Y) \ge 0$, with equality if and only if $P_{XY}(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}) = P_1(X = x_{\lceil n/L \rceil})\, P_2(Y = y_{n-(\lceil n/L \rceil - 1)L})$ for $n = 1, 2, \ldots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\, P_2(Y = y_j)$ for $i = 1, \ldots, K$ and $j = 1, 2, \ldots, L$, or, equivalently, $X$ and $Y$ are independent.

Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,
$$I(X, Y) = 0. \quad (45)$$

Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \ldots, L$,
$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\, P_2(Y = y_j). \quad (46)$$
Thus, $X$ and $Y$ are independent. By Property 2, $I(X, Y) = 0$.

Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then
$$0 \le H(X) \le \log K, \quad (47)$$
with equality on the right if and only if the $K$ values are equally probable.

Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X). \quad (48)$$

Proof. Consider
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(Y = y_j \mid X = x_i)}{P_2(Y = y_j)}$$
$$= -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) + \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i)$$
$$= -\sum_{j=1}^{L} \Bigl( \sum_{i=1}^{K} P_{XY}(x_i, y_j) \Bigr) \log P_2(Y = y_j) - \Bigl( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) \Bigr)$$
$$= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - H(Y \mid X) = H(Y) - H(Y \mid X). \quad (49)$$
The equality $I(X, Y) = H(X) - H(X \mid Y)$ follows in the same way with the roles of $X$ and $Y$ exchanged, since (34) is symmetric in $X$ and $Y$.

Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following property.

Property 4. Let $X$ and $Y$ be two discrete random variables with $K$ and $L$ values, respectively. Then
$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \quad (50)$$
Moreover, $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.

5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set $S$ contains $N$ examples (also known as samples, observations, or records) from an input space $X = \{X_1, X_2, \ldots, X_M\}$ and their associated output values $y$ from an output space $Y$ (i.e., the dependent variable). Here $X_1, X_2, \ldots, X_M$ are called features, that is, independent variables. Hence, $S$ can be expressed as
$$S = \{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i)\}, \quad i = 1, 2, \ldots, N, \quad (51)$$
where feature $X_j$ has values $x_{1j}, x_{2j}, \ldots, x_{Nj}$ for $j = 1, 2, \ldots, M$.

A fundamental objective in machine learning is to find a functional relationship between the input $X$ and the output $Y$. In general, there are a very large number of features, many of which are not needed. Sometimes the output $Y$ is not determined by the complete set of input features $X_1, X_2, \ldots, X_M$; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature $X_i$ and the output $Y$.

(2) Select the best set of attributes according to some criterion.

5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature $X$ and the output $Y$. Assume $X$ has $K$ different values $\omega_1, \omega_2, \ldots, \omega_K$. If $X$ has missing values, we will use $\omega_1$ to represent all the missing values. Assume $Y$ has $L$ different values $\rho_1, \rho_2, \ldots, \rho_L$.

Let us build a two-way frequency or contingency table, as in [8], by making $X$ the row variable and $Y$ the column variable (see Table 1). Let $O_{ij}$ be the frequency (which could be 0) of $(\omega_i, \rho_j)$ for $i = 1$ to $K$ and $j = 1$ to $L$. Let the row and column marginal totals be $n_{i\cdot}$ and $n_{\cdot j}$, respectively. Then
$$n_{i\cdot} = \sum_{j} O_{ij}, \qquad n_{\cdot j} = \sum_{i} O_{ij}, \qquad N = \sum_{i} \sum_{j} O_{ij} = \sum_{i} n_{i\cdot} = \sum_{j} n_{\cdot j}. \quad (52)$$

Let us denote the relative frequency $O_{ij}/N$ by $p_{ij}$. We then have the two-way relative frequency table, Table 2. Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \quad (53)$$
$\{p_{i\cdot}\}_{i=1}^{K}$, $\{p_{\cdot j}\}_{j=1}^{L}$, and $\{p_{ij}\}$ ($i = 1, \ldots, K$, $j = 1, \ldots, L$) can each serve as a probability measure.

Now we can define random variables for $X$ and $Y$ as follows. For convenience, we will use the same names $X$ and $Y$ for the random variables. Define
$$X : (\Omega_1, \mathcal{F}_1, P_X) \longrightarrow \mathbb{R} \quad (54)$$
by $X(\omega_i) = x_i$, where $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_K\}$ and $P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot}$ for $i = 1, 2, \ldots, K$. Note that $x_1, x_2, \ldots, x_K$ could be any real numbers as long as they are distinct, to guarantee that $X$ is a one-to-one mapping. In this case, $P_X(X = x_i) = P_X(\omega_i)$.

Table 1: Frequency table.

            ρ_1     ρ_2     ...   ρ_j     ...   ρ_L     Total
  ω_1       O_11    O_12    ...   O_1j    ...   O_1L    n_1.
  ω_2       O_21    O_22    ...   O_2j    ...   O_2L    n_2.
  ...       ...     ...     ...   ...     ...   ...     ...
  ω_i       O_i1    O_i2    ...   O_ij    ...   O_iL    n_i.
  ...       ...     ...     ...   ...     ...   ...     ...
  ω_K       O_K1    O_K2    ...   O_Kj    ...   O_KL    n_K.
  Total     n.1     n.2     ...   n.j     ...   n.L     N

Table 2: Relative frequency table.

            ρ_1     ρ_2     ...   ρ_j     ...   ρ_L     Total
  ω_1       p_11    p_12    ...   p_1j    ...   p_1L    p_1.
  ω_2       p_21    p_22    ...   p_2j    ...   p_2L    p_2.
  ...       ...     ...     ...   ...     ...   ...     ...
  ω_i       p_i1    p_i2    ...   p_ij    ...   p_iL    p_i.
  ...       ...     ...     ...   ...     ...   ...     ...
  ω_K       p_K1    p_K2    ...   p_Kj    ...   p_KL    p_K.
  Total     p.1     p.2     ...   p.j     ...   p.L     1

Similarly, define
$$Y : (\Omega_2, \mathcal{F}_2, P_Y) \longrightarrow \mathbb{R} \quad (55)$$
by $Y(\rho_j) = y_j$, where $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_L\}$ and $P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j}$ for $j = 1, 2, \ldots, L$. Also, $y_1, y_2, \ldots, y_L$ could be any real numbers as long as they are distinct, to guarantee that $Y$ is a one-to-one mapping. In this case, $P_Y(Y = y_j) = P_Y(\rho_j)$.

Now define a mapping $P_{XY}$ from $\Omega_1 \times \Omega_2$ to $\mathbb{R}$ as follows:
$$P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \quad (56)$$
Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \quad (57)$$
$\{p_{ij}\}$ is a joint probability measure by Proposition 14.

Finally, we can calculate the mutual information as follows:
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\, P_Y(\rho_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}. \quad (58)$$

It follows from Corollary 26 that if $X$ has only one value, then $I(X, Y) = 0$. On the other hand, if $X$ has all distinct values, the following result shows that the mutual information will reach its maximum value.

Proposition 28. If all the values of $X$ are distinct, then $I(X, Y) = H(Y)$.

Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations, that is, $K = N$. From Tables 1 and 2 we observe that:

(1) $O_{ij} = 0$ or $1$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;

(2) $p_{ij} = O_{ij}/N = 0$ or $1/N$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;

(3) for each $j = 1, 2, \ldots, L$, since $O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}$, there are $n_{\cdot j}$ nonzero $O_{ij}$'s, or equivalently $n_{\cdot j}$ nonzero $p_{ij}$'s;

(4) $p_{i\cdot} = 1/N$ for $i = 1, 2, \ldots, K$.

Using the above observations and the fact that $0 \log 0 = 0$, we have
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = \sum_{j=1}^{L} \sum_{i \,:\, p_{ij} \neq 0} \frac{1}{N} \log \frac{1/N}{(1/N)\, p_{\cdot j}} = \sum_{j=1}^{L} \frac{n_{\cdot j}}{N} \log \frac{1}{p_{\cdot j}} = \sum_{j=1}^{L} p_{\cdot j} \log \frac{1}{p_{\cdot j}} = H(Y). \quad (59)$$

5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring is used to describe the process of evaluating the risk a customer poses of defaulting on a financial obligation [15-19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable, good and bad, and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair $(X, Y)$ and then do feature selection based on the values of mutual information. We propose three ways.

5.2.1. Absolute Values Method. From Property 4, we see that the mutual information $I(X, Y)$ is nonnegative and upper bounded by $\log L$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with $Y$ is larger than some threshold, chosen based on needs.

5.2.2. Relative Values. From Property 4, we have $0 \le I(X, Y)/H(Y) \le 1$. Note that $I(X, Y)/H(Y)$ is the relative mutual information, which measures how much information $X$ captures from $Y$. Thus, we may select those features whose relative mutual information $I(X, Y)/H(Y)$ is larger than some threshold between 0 and 1, chosen based on needs.
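A minimal sketch of the relative-values criterion is given below (the contingency counts are made up; $H(Y)$ is computed from the column marginals of Table 1 with natural logarithms).

```python
import numpy as np

# Made-up contingency table O_ij (rows: values of X; columns: Y = 0 / 1)
O = np.array([[20.0, 5.0],
              [10.0, 15.0],
              [ 5.0, 45.0]])
p = O / O.sum()
p_row, p_col = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)

mask = p > 0
I = np.sum(p[mask] * np.log(p[mask] / (p_row @ p_col)[mask]))  # eq. (58)
H_Y = -np.sum(p_col * np.log(p_col))                            # entropy of Y
print(I / H_Y)  # relative mutual information, between 0 and 1
```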

5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function. It can be proved by a Taylor expansion, as in Kullback's book [5].

Lemma 29. Let $p$ and $q$ be two positive numbers less than or equal to 1. Then
$$p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \quad (60)$$
Equality holds if and only if $p = q$. Moreover, the closer $p$ is to $q$, the better the approximation.

Now let us denote $N \times I(X, Y)$ by $\tilde{I}(X, Y)$. Then, applying Lemma 29, we obtain
$$2\tilde{I}(X, Y) = 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N}$$
$$\approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \Bigl( O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N} \Bigr) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\bigl( O_{ij} - n_{i\cdot} n_{\cdot j}/N \bigr)^2}{n_{i\cdot} n_{\cdot j}/N} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - \frac{2}{N} \sum_{i} n_{i\cdot} \sum_{j} n_{\cdot j} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\bigl( O_{ij} - n_{i\cdot} n_{\cdot j}/N \bigr)^2}{n_{i\cdot} n_{\cdot j}/N}$$
$$= 2N - 2N + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\bigl( O_{ij} - n_{i\cdot} n_{\cdot j}/N \bigr)^2}{n_{i\cdot} n_{\cdot j}/N} = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\bigl( O_{ij} - n_{i\cdot} n_{\cdot j}/N \bigr)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2. \quad (61)$$

The last equality means that the expression $\sum_{i=1}^{K} \sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N)$ is the usual $\chi^2$ statistic. According to [5], it follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. Hence $2N \times I(X, Y)$ approximately follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows using the Chi-square distribution to assign a significance level corresponding to the values of mutual information and $(K - 1)(L - 1)$.

The null and alternative hypotheses are as follows:

$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them).

$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).

The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic
$$\sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\bigl( O_{ij} - n_{i\cdot} n_{\cdot j}/N \bigr)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X, Y) \quad (62)$$
is greater than $\chi^2_U$, the upper-tail critical value from a Chi-square distribution with $(K - 1)(L - 1)$ degrees of freedom. That is,
$$\text{select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \quad (63)$$

Take credit scoring, for example. In this case, $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_U$ to be 16.9 from a Chi-square table with $(K - 1)(L - 1) = 9$ degrees of freedom, and we select this feature only if $I(X, Y) > 16.9/(2N)$.

Assume a training set has $N$ examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.

(ii) Step 2. Find $K$, the number of values of feature $X$.

(iii) Step 3. Build the contingency table for $X$ and $Y$.

(iv) Step 4. Calculate $I(X, Y)$ from the contingency table.

(v) Step 5. Find $\chi^2_U$ with $(K - 1)(L - 1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.

(vi) Step 6. Select $X$ if $I(X, Y) > \chi^2_U/(2N)$ and discard it otherwise.

(vii) Step 7. Repeat Steps 2-6 for all features.

If the number of features selected by the above procedure is smaller or larger than desired, you may adjust the level of significance $\alpha$ and reselect features using the procedure; a short implementation sketch of the whole procedure is given below.
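The following Python sketch implements Steps 1-7 for a set of candidate features, using criterion (63). It is illustrative only: scipy and pandas are assumed available, and the data frame, column names, and helper function are made up for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mutual_information_nats(x, y):
    """I(X, Y) of eq. (58) from raw observations, natural log."""
    p = pd.crosstab(x, y).to_numpy(dtype=float)
    p /= p.sum()
    p_row, p_col = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (p_row @ p_col)[mask])))

def select_features(data, features, target, alpha=0.05):
    """Steps 1-7: keep feature X if I(X, Y) > chi2_U / (2N), eq. (63)."""
    N, L = len(data), data[target].nunique()
    selected = []
    for f in features:                                        # Step 7: loop over features
        K = data[f].nunique()                                 # Step 2
        I = mutual_information_nats(data[f], data[target])    # Steps 3-4
        chi2_U = chi2.ppf(1 - alpha, df=(K - 1) * (L - 1))    # Step 5
        if I > chi2_U / (2 * N):                              # Step 6
            selected.append(f)
    return selected

# Hypothetical credit-scoring data: two candidate features and a good/bad flag
data = pd.DataFrame({
    "region": ["N", "S", "S", "E", "N", "E", "S", "N"],
    "grade":  ["A", "B", "A", "C", "B", "C", "A", "B"],
    "bad":    [0, 1, 0, 1, 0, 1, 0, 0],
})
print(select_features(data, ["region", "grade"], "bad"))
```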

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we have proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X, Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X, Y)$ will reach the maximum value $H(Y)$ and $I(X, Y)/H(Y)$ will reach the maximum value 1.

Therefore, if $X$ has too many different values, one may bin or group these values first. Based on the binned values, mutual information is calculated again. For numerical variables, we may adopt a three-step process (a small binning sketch follows the list):

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the rest of the numerical features.

(iii) Step 3. Select features by mutual information.
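As a rough sketch of Step 2 (assuming pandas; the quantile binning into four groups is an illustrative choice, not one prescribed by the paper), a numerical feature can be discretized before recomputing mutual information:

```python
import pandas as pd

# Hypothetical numerical feature with many distinct values
income = pd.Series([25_000, 31_000, 47_500, 52_000, 18_000, 92_000, 61_000, 38_000])

# Step 2: quantile binning into four groups; missing values could be kept
# as their own bin, as suggested in Section 5.1
income_binned = pd.qcut(income, q=4, duplicates="drop")

# Mutual information with the target is then recomputed on income_binned (Step 3)
print(income_binned.value_counts())
```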

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select the features with $p$ values less than some specified value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select the features whose values are larger than some specified value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into a bin. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.

6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.

References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[3] A. Navot, On the Role of Feature Selection in Machine Learning [Ph.D. thesis], Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, 1964, Holden-Day, San Francisco, USA).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.

[20] K. Kennedy, Credit Scoring Using Machine Learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, Student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article A Unified Definition of Mutual

6 Mathematical Problems in Engineering

the individual points of (120596 120588) isin 119860 times 119861 Moreover for any119860 = 120596

1198941 120596

1198942 120596

119894119904 isin Ω

1of 119904 elements

119875119883119884

(119860 times Ω2) =

119898

sum119895=1

119875119883119884

(119860 times 120588119895)

=

119898

sum119895=1

119904

sum119906=1

119875119883119884

(120596119894119906 times 120588

119895)

=

119898

sum119895=1

119904

sum119906=1

1198751(120596

119894119906) 119875

2(120588

119895)

=

119898

sum119895=1

1198752(120588

119895)

119904

sum119906=1

1198751(120596

119894119906)

=

119904

sum119906=1

1198751(120596

119894119906) = 119875

1(119860)

(31)

Similarly 119875119883119884

(Ω1times 119861) = 119875

2(119861) for any 119861 isin Ω

2 Hence

119875119883119884

(119883 = 119909119894 times 119884 = 119910

119895) = 119875

1(119883 = 119909

119894)119875

2(119884 = 119910

119895) is a

joint probability measure of119883 and 119884 by Definition 13

Definition 18 Random variables119883 and 119884 are said to be inde-pendent under a joint distribution 119875

119883119884(sdot) if 119875

119883119884(sdot) coincides

with the product distribution 119875119883times119884

(sdot)

Definition 19 The joint entropy119867(119883 119884) is defined as

119867(119883 119884) = minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119883119884(119909

119894 119910

119895) (32)

Definition 20 Theconditional entropy119867(119884 | 119883) is as follows

119867(119884 | 119883) = minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119884 = 119910

119895| 119883 = 119909

119894)

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log

119875119883119884

(119909119894 119910

119895)

1198751(119883 = 119909

119894)

(33)

Definition 21 The mutual information 119868(119883 119884) between 119883

and 119884 is defined as

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log

119875119883119884

(119909119894 119910

119895)

1198751(119883 = 119909

119894) 119875

2(119884 = 119910

119895)

(34)

As other measures in information theory the base oflogarithm in (34) is left unspecified Indeed 119868(119883 119884) underone base is proportional to that under another base by thechange-of-base formula Moreover we take 0 log 0 to be 0This corresponds to the limit of 119909 log119909 as 119909 goes to 0

It is obvious that our new definition covers Class 2definitions It also covers Class 1 definitions by the followingarguments LetΩ

1= 119886

1 119886

2 119886

119870 andΩ

2= 119887

1 1198872 119887

119871

Define random variables 119883 Ω1rarr R and 119884 Ω

2rarr R as

one-to-one mappings as

119883(119886119894) = 119909

119894 119894 = 1 2 119870

119884 (119887119895) = 119910

119895 119895 = 1 2 119871

(35)

Then we have

119875119883119884

(119909119894 119910

119895) = 119875

119883119884(119886

119894 119887119895) (36)

It is worth noting that our new definition of mutual informa-tion has some advantages over various existing definitionsFor instance it can be easily used to do feature selectionas seen later In addition our new definition leads differentvalues for different joint distribution as demonstrated in thefollowing example

Example 22 Assumerandom variables 119883 and 119884 have thefollowing probability distributions

1198751(119884 = 0) =

1

3 119875

1(119884 = 1) =

2

3

1198752(119883 = 1) =

1

3 119875

2(119883 = 2) =

1

3 119875

2(119883 = 3) =

1

3

(37)

We can generate four different joint probability distributionsto lead 4 different values of mutual information Howeverunder all the existing definitions a joint distribution must begiven in order to find mutual information

(1) 119875(1 0) = 0 119875(1 1) = 13 119875(2 0) = 13 119875(2 1) = 0119875(3 0) = 0 119875(3 1) = 13

(2) 119875(1 0) = 0 119875(1 1) = 13 119875(2 0) = 0 119875(2 1) = 13119875(3 0) = 13 119875(3 1) = 0

(3) 119875(1 0) = 13 119875(1 1) = 0 119875(2 0) = 0 119875(2 1) = 13119875(3 0) = 0 119875(3 1) = 13

(4) 119875(1 0) = 19 119875(1 1) = 29 119875(2 0) = 19 119875(2 1) =

29 119875(3 0) = 19 119875(3 1) = 29

42 Properties of Newly Defined Mutual Information Beforewe discuss some properties of mutual information we firstintroduce Kullback-Leibler distance [8]

Definition 23 The relative entropy or Kullback-Leibler dis-tance between two discrete probability distributions 119875 =

1199011 119901

2 119901

119899 and 119876 = 119902

1 119902

2 119902

119899 is defined as

119863 (119875 119876) = sum119894

119901119894log

119901119894

119902119894

(38)

Lemma 24 (see [8]) Let 119875 and 119876 be two discrete probabilitydistributions Then 119863(119875 119876) ge 0 with equality if and only if119901119894= 119902

119894for all 119894

Remark 25 The Kullback-Leibler distance is not a truedistance between distributions since it is not symmetric anddoes not satisfy the triangle inequality either Neverthelessit is often useful to think of relative entropy as a ldquodistancerdquobetween distributions

Mathematical Problems in Engineering 7

The following property shows that mutual informationunder a joint probability measure is the Kullback-Leiblerdistance between the joint distribution 119875

119883119884and the product

distribution 119875119883119875119884

Property 1 Mutual information of random variables 119883 and119884 is the Kullback-Leibler distance between the joint distribu-tion 119875

119883119884and the product distribution 119875

11198752

Proof Using a mapping from 2-dimensional indices to one-dimensional index

(119894 119895) 997888rarr (119894 minus 1) lowast 119871 + 119895 ≜ 119899

for 119894 = 1 119870 119895 = 1 2 119871(39)

and using another mapping from one-dimensional indexback to two-dimensional indices

119894 = lceil119899

119871rceil 119895 = 119899 minus (119894 minus 1) lowast 119871

for 119899 = 1 2 119871 119871 + 1 2119871 (119870 minus 1) 119871 + sdot sdot sdot + 119870119871

(40)

we rewrite 119868(119883 119884) as

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log

119875119883119884

(119909119894 119910

119895)

1198751(119883 = 119909

119894) 119875

2(119884 = 119910

119895)

=

119870119871

sum119899=1

119875119883119884

(119909lceil119899119871rceil

119910119899minus(lceil119899119871rceilminus1)lowast119871

)

sdot log119875119883119884

(119909lceil119899119871rceil

119910119899minus(lceil119899119871rceilminus1)lowast119871

)

1198751(119883 = 119909

lceil119899119871rceil) 119875

2(119884 = 119910

119899minus(lceil119899119871rceilminus1)lowast119871)

(41)

Since119870119871

sum119899=1

119875119883119884

(119909lceil119899119871rceil

119910119899minus(lceil119899119871rceilminus1)lowast119871

) =

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) = 1

119870119871

sum119899=1

1198751(119883 = 119909

lceil119899119871rceil) 119875

2(119884 = 119910

119899minus(lceil119899119871rceilminus1)lowast119871)

=

119870

sum119894=1

119871

sum119895=1

1198751(119883 = 119909

119894) 119875

2(119884 = 119910

119895) = 1

(42)

we obtain

119868 (119883 119884) =

119870119871

sum119899=1

119875119883119884

(119909lceil119899119871rceil

119910119899minus(lceil119899119871rceilminus1)lowast119871

)

sdot log119875119883119884

(119909lceil119899119871rceil

119910119899minus(lceil119899119871rceilminus1)lowast119871

)

1198751(119883 = 119909

lceil119899119871rceil) 119875

2(119884 = 119910

119899minus(lceil119899119871rceilminus1)lowast119871)

(43)

Property 2 Let 119883 and 119884 be two discrete random variablesThe mutual information between 119883 and 119884 satisfies

119868 (119883 119884) ge 0 (44)

with equality if and only if119883 and 119884 are independent

Proof Let us use the mappings between two-dimensionalindices and one-dimensional index in the proof of Property 1By Lemma 24 119868(119883 119884) ge 0 with equality if and onlyif 119875

119883119884(119909

lceil119899119871rceil 119910

119899minus(lceil119899119871rceilminus1)lowast119871) = 119875

1(119883 = 119909

lceil119899119871rceil)119875

2(119884 =

119910119899minus(lceil119899119871rceilminus1)lowast119871

) for 119899 = 1 2 119871 119871 + 1 2119871 (119870 minus 1)119871 +

sdot sdot sdot + 119870 that is 119875119883119884

(119909119894 119910

119895) = 119875

1(119883 = 119909

119894)119875

2(119884 = 119910

119895) for 119894 =

1 119870 and 119895 = 1 2 119871 or119883 and 119884 are independent

Corollary 26 If119883 is a constant random variable that is119870 =

1 then for any random variable 119884

119868 (119883 119884) = 0 (45)

Proof Suppose the range of119883 is a constant 119909 and the samplespace has only one point 120596 Then 119875

1(119883 = 119909) = 119875

1(120596) = 1

For any 119895 = 1 2 119871

119875119883119884

(119909 119910119895) =

1

sum119894=1

119875119883119884

(119909 119910119895) = 119875

2(119884 = 119910

119895)

= 1198751(119883 = 119909) 119875

2(119884 = 119910

119895)

(46)

Thus 119883 and 119884 are independent By Property 2 119868(119883 119884) = 0

Lemma 27 (see [8]) Let 119883 be discrete random variables with119870 values Then

0 le 119867 (119883) le log119870 (47)

with equality if and only if the119870 values are equally probable

Property 3 Let 119883 and 119884 be two discrete random variablesThen the following relationships among mutual informationentropy and conditional entry hold

119868 (119883 119884) = 119867 (119883) minus 119867 (119883 | 119884)

= 119867 (119884) minus 119867 (119884 | 119883) = 119868 (119884119883) (48)

Proof Consider

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log

119875119883119884

(119909119894 119910

119895)

1198751(119883 = 119909

119894) 119875

2(119884 = 119910

119895)

=

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log

119875119884|119883

(119909119894 119910

119895)

1198752(119884 = 119910

119895)

8 Mathematical Problems in Engineering

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

+

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895)

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= minus

119871

sum119895=1

119871

sum119894=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= minus

119871

sum119895=1

1198752(119884 = 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= 119867 (119884) minus 119867 (119884 | 119883)

(49)

Combining the above properties and noting that 119867(119883 | 119884)

and 119867(119884 | 119883) are both nonnegative we obtain the followingproperties

Property 4 Let 119883 and 119884 be 2 discrete random variables with119870 and 119871 values respectively Then

0 le 119868 (119883 119884) le 119867 (119884) le log 119871

0 le 119868 (119883 119884) le 119867 (119883) le log119870(50)

Moreover 119868(119883 119884) = 0 if and only if119883 and119884 are independent

5 Newly Defined Mutual Information inMachine Learning

Machine learning is the science of getting machines (com-puters) to automatically learn from data In a typical learningsetting a training set 119878 contains 119873 examples (also knownas samples observations or records) from an input space119883 = 119883

1 119883

2 119883

119872 and their associated output values

119910 from an output space 119884 (ie dependent variable) Here1198831 119883

2 119883

119872are called features that is independent vari-

ables Hence 119878 can be expressed as

119878 = 1199091198941 119909

1198942 119909

119894119872 119910

119894 119894 = 1 2 119873 (51)

where feature 119883119895has values 119909

1119895 119909

2119895 119909

119873119895for 119895 = 1 2

119872

A fundamental objective in machine learning is to finda functional relationship between input 119883 and output 119884In general there are a very large number of features manyof which are not needed Sometimes the output 119884 isnot determined by the complete set of the input features119883

1 119883

2 119883

119872 Rather it is decided by only a subset of

them This kind of reduction is called feature selection Itspurpose is to choose a subset of features to capture therelevant information An easy and natural way for featureselection is as follows

(1) Evaluate the relationship between each individualinput feature 119909

119894and the output 119884

(2) Select the best set of attributes according to somecriterion

51 Calculation of Newly Defined Mutual Information Sincemutual information measures dependency between randomvariables we may use it to do feature selection in machinelearning Let us calculate mutual information between aninput feature 119883 and output 119884 Assume 119883 has 119870 differentvalues 120596

1 120596

2 120596

119870 If 119883 has missing values we will use 120596

1

to represent all the missing values Assume 119884 has 119871 differentvalues 120588

1 120588

2 120588

119871

Let us build a two-way frequency, or contingency, table by taking X as the row variable and Y as the column variable, as in [8]. Let O_{ij} be the frequency (which could be 0) of (ω_i, ρ_j) for i = 1 to K and j = 1 to L. Let the row and column marginal totals be n_{i·} and n_{·j}, respectively. Then

n_{i·} = \sum_{j} O_{ij},    n_{·j} = \sum_{i} O_{ij},    N = \sum_{i} \sum_{j} O_{ij} = \sum_{i} n_{i·} = \sum_{j} n_{·j}.   (52)

Let us denote the relative frequency O_{ij}/N by p_{ij}. We then have the two-way relative frequency table; see Table 2. Since

\sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i·} = \sum_{j=1}^{L} p_{·j} = 1,   (53)

{p_{i·}}_{i=1}^{K}, {p_{·j}}_{j=1}^{L}, and {p_{ij}} can each serve as a probability measure.

Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables. Define

X : (Ω_1, F_1, P_X) → R   (54)

by X(ω_i) = x_i, where Ω_1 = {ω_1, ω_2, ..., ω_K} and P_X(ω_i) = n_{i·}/N = p_{i·} for i = 1, 2, ..., K. Note that x_1, x_2, ..., x_K could be any real numbers as long as they are distinct, to guarantee that X is a one-to-one mapping. In this case P_X(X = x_i) = P_X(ω_i).


Table 1: Frequency table.

          ρ_1     ρ_2     ···     ρ_j     ···     ρ_L     Total
ω_1       O_11    O_12    ···     O_1j    ···     O_1L    n_1·
ω_2       O_21    O_22    ···     O_2j    ···     O_2L    n_2·
···       ···     ···     ···     ···     ···     ···     ···
ω_i       O_i1    O_i2    ···     O_ij    ···     O_iL    n_i·
···       ···     ···     ···     ···     ···     ···     ···
ω_K       O_K1    O_K2    ···     O_Kj    ···     O_KL    n_K·
Total     n_·1    n_·2    ···     n_·j    ···     n_·L    N

Table 2: Relative frequency table.

          ρ_1     ρ_2     ···     ρ_j     ···     ρ_L     Total
ω_1       p_11    p_12    ···     p_1j    ···     p_1L    p_1·
ω_2       p_21    p_22    ···     p_2j    ···     p_2L    p_2·
···       ···     ···     ···     ···     ···     ···     ···
ω_i       p_i1    p_i2    ···     p_ij    ···     p_iL    p_i·
···       ···     ···     ···     ···     ···     ···     ···
ω_K       p_K1    p_K2    ···     p_Kj    ···     p_KL    p_K·
Total     p_·1    p_·2    ···     p_·j    ···     p_·L    1

Similarly, define

Y : (Ω_2, F_2, P_2) → R   (55)

by Y(ρ_j) = y_j, where Ω_2 = {ρ_1, ρ_2, ..., ρ_L} and P_Y(ρ_j) = n_{·j}/N = p_{·j} for j = 1, 2, ..., L. Also, y_1, y_2, ..., y_L could be any real numbers as long as they are distinct, to guarantee that Y is a one-to-one mapping. In this case P_Y(Y = y_j) = P_Y(ρ_j).

Now define a mapping P_{XY} from Ω_1 × Ω_2 to R as follows:

P_{XY}(ω_i, ρ_j) = p_{ij} = O_{ij}/N.   (56)

Since

\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(ω_i, ρ_j) = 1,
\sum_{j=1}^{L} P_{XY}(ω_i, ρ_j) = \sum_{j=1}^{L} p_{ij} = p_{i·} = P_X(ω_i),
\sum_{i=1}^{K} P_{XY}(ω_i, ρ_j) = \sum_{i=1}^{K} p_{ij} = p_{·j} = P_Y(ρ_j),   (57)

{p_{ij}} is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(ω_i, ρ_j) \log \frac{P_{XY}(ω_i, ρ_j)}{P_X(ω_i) P_Y(ρ_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i·} p_{·j}}.   (58)
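To make (52)-(58) concrete, the following Python sketch (our own illustration, not part of the original paper) computes I(X; Y) and the entropy of a marginal from a contingency table of counts, using the convention 0 log 0 = 0. The function names and the NumPy dependency are our assumptions.

import numpy as np

def mutual_information(counts, base=2.0):
    # counts: K x L contingency table of frequencies O_ij
    # (rows = values of X, columns = values of Y), as in Table 1.
    O = np.asarray(counts, dtype=float)
    N = O.sum()
    p = O / N                           # joint relative frequencies p_ij = O_ij / N
    p_row = p.sum(axis=1)               # marginals p_i.
    p_col = p.sum(axis=0)               # marginals p_.j
    outer = np.outer(p_row, p_col)      # products p_i. * p_.j
    mask = p > 0                        # convention 0 log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / outer[mask])) / np.log(base))

def entropy(marginal, base=2.0):
    # Entropy of a marginal distribution, e.g. H(Y) from (p_.1, ..., p_.L).
    q = np.asarray(marginal, dtype=float)
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)) / np.log(base))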

It follows from Corollary 26 that if X has only one value, then I(X; Y) = 0. On the other hand, if the values of X are all distinct, the following result shows that the mutual information reaches its maximum value.

Proposition 28. If all the values of X are distinct, then I(X; Y) = H(Y).

Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2 we observe that

(1) O_{ij} = 0 or 1 for all i = 1, 2, ..., K and j = 1, 2, ..., L;

(2) p_{ij} = O_{ij}/N = 0 or 1/N for all i = 1, 2, ..., K and j = 1, 2, ..., L;

(3) for each j = 1, 2, ..., L, since O_{1j} + O_{2j} + ··· + O_{Kj} = n_{·j}, there are n_{·j} nonzero O_{ij}'s, or equivalently n_{·j} nonzero p_{ij}'s;

(4) p_{i·} = 1/N for i = 1, 2, ..., K.

Using the above observations and the fact that 0 log 0 = 0, we have

I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i·} p_{·j}}

= \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i·} p_{·1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i·} p_{·2}} + ··· + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i·} p_{·L}}

= \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{·1}/N} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{·2}/N} + ··· + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{·L}/N}

= \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1}{p_{·1}} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1}{p_{·2}} + ··· + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1}{p_{·L}}

= \frac{n_{·1}}{N} \log \frac{1}{p_{·1}} + \frac{n_{·2}}{N} \log \frac{1}{p_{·2}} + ··· + \frac{n_{·L}}{N} \log \frac{1}{p_{·L}}

= p_{·1} \log \frac{1}{p_{·1}} + p_{·2} \log \frac{1}{p_{·2}} + ··· + p_{·L} \log \frac{1}{p_{·L}}

= H(Y).   (59)
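As a quick numerical check of Proposition 28 (and of Corollary 26), consider a hypothetical training set of N = 6 customers in which Y has 4 goods and 2 bads and X is an identifier-like feature with six distinct values. Reusing the functions sketched after (58), the following illustrative snippet (the counts are made up) gives I(X; Y) = H(Y).

# Hypothetical 6 x 2 table: N = 6 observations, Y has 4 goods and 2 bads,
# and X takes a different value on every observation (K = N = 6).
table = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])

I_xy = mutual_information(table)                   # about 0.918 bits
H_y  = entropy(table.sum(axis=0) / table.sum())    # also about 0.918 bits
# I_xy equals H_y, as Proposition 28 predicts. A constant feature, i.e. the
# single-row table [[4, 2]], gives I(X; Y) = 0, as in Corollary 26.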

5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15-19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable (good and bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.

5.2.1. Absolute Values Method. From Property 4, we see that the mutual information I(X; Y) is nonnegative and upper bounded by log L, and that I(X; Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction; in particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with Y is larger than some threshold chosen based on needs.

5.2.2. Relative Values. From Property 4 we have 0 \le I(X; Y)/H(Y) \le 1. Note that I(X; Y)/H(Y) is the relative mutual information, which measures how much information X captures from Y. Thus we may select those features whose relative mutual information I(X; Y)/H(Y) is larger than some threshold between 0 and 1, chosen based on needs; a sketch covering both this and the absolute-value criterion follows.
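A minimal sketch of the two threshold-based criteria in Sections 5.2.1 and 5.2.2, reusing the functions sketched after (58); the dictionary layout and the threshold values are illustrative assumptions of ours, not prescriptions from the paper.

def select_features(tables, abs_threshold=0.01, rel_threshold=0.10, base=2.0):
    # tables: dict mapping a feature name to its contingency table against Y.
    # Keep a feature when I(X; Y) > abs_threshold (Section 5.2.1) and
    # I(X; Y) / H(Y) > rel_threshold (Section 5.2.2).
    selected = []
    for name, counts in tables.items():
        t = np.asarray(counts, dtype=float)
        i_xy = mutual_information(t, base)
        h_y = entropy(t.sum(axis=0) / t.sum(), base)
        if i_xy > abs_threshold and h_y > 0 and i_xy / h_y > rel_threshold:
            selected.append(name)
    return selected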

5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].

Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then

p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}.   (60)

The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation.
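A small numerical check of the approximation in Lemma 29; the particular values of p and q are arbitrary choices of ours.

import math

p, q = 0.30, 0.25
exact  = p * math.log(p / q)                  # about 0.0547
approx = (p - q) + (p - q) ** 2 / (2 * q)     # 0.0550
# The two values differ by about 0.0003 here; the approximation degrades
# as p moves away from q.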

Now let us denote N × I(X; Y) by \tilde{I}(X; Y). Then, applying Lemma 29, we obtain

2\tilde{I}(X; Y) = 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i·} p_{·j}}

= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i·}/N)(n_{·j}/N)}

= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i·} n_{·j}/N}

\approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \Big( O_{ij} - \frac{n_{i·} n_{·j}}{N} \Big) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}

= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - \frac{2}{N} \sum_{i} n_{i·} \sum_{j} n_{·j} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}

= 2N - \frac{2 N \cdot N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N}

= \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N} = \chi^2.   (61)

The last equality means that the expression \sum_{i=1}^{K} \sum_{j=1}^{L} (O_{ij} - n_{i·} n_{·j}/N)^2 / (n_{i·} n_{·j}/N) is the χ² statistic. According to [5], it follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. Hence 2N × I(X; Y) approximately follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. This is the well-known chi-square test for independence of two random variables. It allows us to use the chi-square distribution to assign a significance level corresponding to the values of mutual information and (K − 1)(L − 1).

The null and alternative hypotheses are as follows:

H_0: X and Y are independent (i.e., there is no relationship between them);
H_1: X and Y are dependent (i.e., there is a relationship between them).


The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic

\sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i·} n_{·j}/N)^2}{n_{i·} n_{·j}/N} \approx 2N \times I(X; Y)   (62)

is greater than χ²_U, the upper-tail critical value from a chi-square distribution with (K − 1)(L − 1) degrees of freedom. That is,

select feature X if I(X; Y) > \frac{χ²_U}{2N}.   (63)

Take credit scoring for example. In this case L = 2. Assume feature X has 10 different values, that is, K = 10. Using a level of significance of α = 0.05, we find χ²_U to be 16.9 from a chi-square table with (K − 1)(L − 1) = 9 degrees of freedom, and we select this feature only if I(X; Y) > 16.9/(2N).

Assume a training set has N examples. We can do feature selection by the following procedure (a code sketch is given below):

(i) Step 1. Choose a level of significance α, say 0.05.

(ii) Step 2. Find K, the number of values of feature X.

(iii) Step 3. Build the contingency table for X and Y.

(iv) Step 4. Calculate I(X; Y) from the contingency table.

(v) Step 5. Find χ²_U with (K − 1)(L − 1) degrees of freedom from a chi-square table or any other source, such as SAS.

(vi) Step 6. Select X if I(X; Y) > χ²_U/(2N) and discard it otherwise.

(vii) Step 7. Repeat Steps 2-6 for all features.

If the number of features selected by the above procedure is smaller or larger than desired, one may adjust the level of significance α and reselect features using the procedure.
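A minimal sketch of Steps 1-6 for a single feature, under the assumption that SciPy is available for the chi-square critical value (the function name is ours); it reuses the mutual_information function sketched after (58), and for K = 10, L = 2, and α = 0.05 it reproduces the critical value 16.9 used above.

from scipy.stats import chi2

def chi_square_select(counts, alpha=0.05):
    # Keep feature X when 2 * N * I(X; Y), computed with the natural logarithm
    # as in Section 5.2.3, exceeds the upper-tail critical value chi2_U with
    # (K - 1)(L - 1) degrees of freedom.
    t = np.asarray(counts, dtype=float)
    N = t.sum()
    K, L = t.shape
    i_nat = mutual_information(t, base=np.e)    # I(X; Y) in nats
    chi2_U = chi2.ppf(1.0 - alpha, df=(K - 1) * (L - 1))
    return 2.0 * N * i_nat > chi2_U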

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X; Y), the more dependent X is on Y. However, Proposition 28 says that if X has all distinct values, then I(X; Y) reaches the maximum value H(Y) and I(X; Y)/H(Y) reaches the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first; the mutual information is then calculated again based on the binned values. For numerical variables we may adopt a three-step process (a sketch follows the list):

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the rest of the numerical features.

(iii) Step 3. Select features by mutual information.
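A sketch of Steps 2-3 for a single numerical feature, assuming pandas is available; the quantile binning and the choice of 10 bins are our illustrative assumptions, not the paper's prescription, and the mutual_information function from Section 5.1's sketch is reused.

import pandas as pd

def binned_mutual_information(x, y, n_bins=10, base=2.0):
    # Steps 2-3 for one numerical feature: group x into (at most) n_bins
    # quantile bins, rebuild the contingency table against y, and recompute I(X; Y).
    binned = pd.qcut(pd.Series(x), q=n_bins, duplicates="drop")
    counts = pd.crosstab(binned, pd.Series(y)).to_numpy()
    return mutual_information(counts, base)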

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select the features with p values less than some specified value. However, this method does not carry over to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select the features whose values are larger than some specified value. Yet mutual information is better than the covariance method [21] in that it measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is computed as the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.

6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.


References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, USA, 1964).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.

[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.



8 Mathematical Problems in Engineering

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

+

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895)

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= minus

119871

sum119895=1

119871

sum119894=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= minus

119871

sum119895=1

1198752(119884 = 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= 119867 (119884) minus 119867 (119884 | 119883)

(49)

Combining the above properties and noting that 119867(119883 | 119884)

and 119867(119884 | 119883) are both nonnegative we obtain the followingproperties

Property 4 Let 119883 and 119884 be 2 discrete random variables with119870 and 119871 values respectively Then

0 le 119868 (119883 119884) le 119867 (119884) le log 119871

0 le 119868 (119883 119884) le 119867 (119883) le log119870(50)

Moreover 119868(119883 119884) = 0 if and only if119883 and119884 are independent

5 Newly Defined Mutual Information inMachine Learning

Machine learning is the science of getting machines (com-puters) to automatically learn from data In a typical learningsetting a training set 119878 contains 119873 examples (also knownas samples observations or records) from an input space119883 = 119883

1 119883

2 119883

119872 and their associated output values

119910 from an output space 119884 (ie dependent variable) Here1198831 119883

2 119883

119872are called features that is independent vari-

ables Hence 119878 can be expressed as

119878 = 1199091198941 119909

1198942 119909

119894119872 119910

119894 119894 = 1 2 119873 (51)

where feature 119883119895has values 119909

1119895 119909

2119895 119909

119873119895for 119895 = 1 2

119872

A fundamental objective in machine learning is to finda functional relationship between input 119883 and output 119884In general there are a very large number of features manyof which are not needed Sometimes the output 119884 isnot determined by the complete set of the input features119883

1 119883

2 119883

119872 Rather it is decided by only a subset of

them This kind of reduction is called feature selection Itspurpose is to choose a subset of features to capture therelevant information An easy and natural way for featureselection is as follows

(1) Evaluate the relationship between each individualinput feature 119909

119894and the output 119884

(2) Select the best set of attributes according to somecriterion

51 Calculation of Newly Defined Mutual Information Sincemutual information measures dependency between randomvariables we may use it to do feature selection in machinelearning Let us calculate mutual information between aninput feature 119883 and output 119884 Assume 119883 has 119870 differentvalues 120596

1 120596

2 120596

119870 If 119883 has missing values we will use 120596

1

to represent all the missing values Assume 119884 has 119871 differentvalues 120588

1 120588

2 120588

119871

Let us build a two-way frequency or contingency table bymaking 119883 as the row variable and 119884 as the column variablelike in [8] Let119874

119894119895be the frequency (could be 0) of (120596

119894 120588

119895) for

119894 = 1 to 119870 and 119895 = 1 to 119871 Let the row and column marginaltotals be 119899

119894sdotand 119899

sdot119895 respectively Then

119899119894sdot= sum

119895

119874119894119895

119899sdot119895= sum

119894

119874119894119895

119873 = sum119894

sum119895

119874119894119895= sum

119894

119899119894sdot= sum

119895

119899sdot119895

(52)

Let us denote the relative frequency119874119894119895119873 by 119901

119894119895We have the

two-way relative frequency table see Table 2Since

119870

sum119894=1

119871

sum119895=1

119901119894119895=

119870

sum119894=1

119901119894sdot=

119871

sum119895=1

119901sdot119895= 1 (53)

119901119894sdot119870119894=1

119901sdot119895119871119895=1

and 119901119894119895119870119894=1

can each serve as a probabilitymeasure

Now we can define random variables for 119883 and 119884 asfollows For convenience we will use the same names 119883 and119884 for the random variables

119883 (Ω1F

1 119875

119883) 997888rarr 119877 (54)

as 119883(120596119894) = 119909

119894 where Ω

1= 120596

1 120596

2 120596

119870 and 119875

119883(120596

119894) =

119899119894sdot119873 = 119901

119894sdotfor 119894 = 1 2 119870 Note that 119909

1 119909

2 119909

119870

could be any real numbers as long as they are distinct toguarantee that 119883 is a one-to-one mapping In this case119875119883(119883 = 119909

119894) = 119875

119883(120596

119894)

Mathematical Problems in Engineering 9

Table 1 Frequency table

1205881

1205882

sdot sdot sdot 120588119895

sdot sdot sdot 120588119871

Total1205961

11987411

11987412

sdot sdot sdot 1198741119895

sdot sdot sdot 1198741119871

1198991∙

1205962

11987421

11987422

sdot sdot sdot 1198742119895

sdot sdot sdot 1198742119871

1198992∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119894

1198741198941

1198741198942

sdot sdot sdot 119874119894119895

sdot sdot sdot 119874119894119871

119899119894∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119870

1198741198701

1198741198702

sdot sdot sdot 119874119870119895

sdot sdot sdot 119874119870119871

119899119870∙

Total 119899∙1

119899∙2

sdot sdot sdot 119899∙119895

sdot sdot sdot 119899∙119871

119873

Table 2 Relative frequency table

1205881

1205882

sdot sdot sdot 120588119895

sdot sdot sdot 120588119871

Total1205961

11990111

11990112

sdot sdot sdot 1199011119895

sdot sdot sdot 1199011119871

1199011∙

1205962

11990121

11990122

sdot sdot sdot 1199012119895

sdot sdot sdot 1199012119871

1199012∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119894

1199011198941

1199011198942

sdot sdot sdot 119901119894119895

sdot sdot sdot 119901119894119871

119901119894∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119870

1199011198701

1199011198702

sdot sdot sdot 119901119870119895

sdot sdot sdot 119901119870119871

119901119870∙

Total 119901∙1

119901∙2

sdot sdot sdot 119901∙119895

sdot sdot sdot 119901∙119871

1

Similarly

119884 (Ω2F

2 119875

2) 997888rarr 119877 (55)

as 119884(120588119895) = 119910

119895 where Ω

2= 120588

1 120588

2 120588

119871 and 119875

119884(120588

119895) =

119899sdot119895119873 = 119901

sdot119895for 119895 = 1 2 119871 Also 119910

1 119910

2 119910

119870could be

any real numbers as long as they are distinct to guaranteethat 119884 is a one-to-one mapping In this case 119875

119884(119884 =

119910119895) = 119875

119884(120588

119895)

Now define a mapping 119875119883119884

fromΩ1timesΩ

2to 119877 as follows

119875119883119884

(120596119894 120588

119895) = 119901

119894119895=

119874119894119895

119873 (56)

Since119870

sum119894=1

119871

sum119895=1

119875119883119884

(120596119894 120588

119895) = 1

119871

sum119895=1

119875119883119884

(120596119894 120588

119895) =

119871

sum119895=1

119901119894119895= 119901

119894sdot= 119875

119883(120596

119894)

119870

sum119894=1

119875119883119884

(120596119894 120588

119895) =

119870

sum119894=1

119901119894119895= 119901

sdot119895= 119875

119884(120588

119895)

(57)

119901119894119895119870119894=1

is a joint probability measure by Proposition 14Finally we can calculate mutual information as follows

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119875119883119884

(120596119894 120588

119895) log

119875119883119884

(120596119894 120588

119895)

119875119883(120596

119894) 119875

119884(120588

119895)

=

119870

sum119894=1

119871

sum119895=1

119901119894119895log

119901119894119895

119901119894sdot119901sdot119895

(58)

It follows fromCorollary 26 that if119883 has only one value then119868(119883 119884) = 0 On the other hand if 119883 has all distinct valuesthe following result shows that mutual information will reachthe maximum value

Proposition 28 If all the values of 119883 are distinct then119868(119883 119884) = 119867(119884)

Proof If all the values of 119883 are distinct then the number ofdifferent values of119883 equals the number of observations thatis 119870 = 119873 From Tables 1 and 2 we observe that

(1) 119874119894119895= 0 or 1 for all 119894 = 1 2 119870 and 119895 = 1 2 119871

(2) 119901119894119895

= 119874119894119895119873 = 0 or 1119873 for all 119894 = 1 2 119870 and

119895 = 1 2 119871

(3) for each 119895 = 1 2 119871 since1198741119895+119874

2119895+sdot sdot sdot+119874

119870119895= 119899

sdot119895

there are 119899sdot119895nonzero 119874

119894119895rsquos or equivalently 119899

sdot119895nonzero

119901119894119895rsquos

(4) 119901119894sdot= 1119873 119894 = 1 2 119870

Using the above observations and the fact that 0 log 0 = 0 wehave

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119901119894119895log

119901119894119895

119901119894sdot119901sdot119895

=

119870

sum119894=1

1199011198941log

1199011198941

119901119894sdot119901sdot1

+

119870

sum119894=1

1199011198942log

1199011198942

119901119894sdot119901sdot2

+ sdot sdot sdot +

119870

sum119894=1

119901119894119871log

119901119894119871

119901119894sdot119901sdot119871

10 Mathematical Problems in Engineering

= sum1199011198941 =0

1

119873log 1119873

119901sdot1119873

+ sum1199011198942 =0

1

119873log 1119873

119901sdot2119873

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1119873

119901sdot119871119873

= sum1199011198941 =0

1

119873log 1

119901sdot1

+ sum1199011198942 =0

1

119873log 1

119901sdot2

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1

119901sdot119871

=119899sdot1

119873log 1

119901sdot1

+119899sdot2

119873log 1

119901sdot2

+ sdot sdot sdot +119899sdot119871

119873log 1

119901sdot119871

= 119901sdot1log 1

119901sdot1

+ 119901sdot2log 1

119901sdot2

+ sdot sdot sdot + 119901sdot119871log 1

119901sdot119871

= 119867 (119884)

(59)

52 Applications of Newly Defined Mutual Information inCredit Scoring Credit scoring is used to describe the processof evaluating the risk a customer poses of defaulting ona financial obligation [15ndash19] The objective is to assigncustomers to one of two groups good and bad Machinelearning has been successfully used to build models for creditscoring [20] In credit scoring119884 is a binary variable good andbad and may be represented by 0 and 1

To apply mutual information to credit scoring we firstcalculatemutual information for every pair of (119883 119884) and thendo feature selection based on values of mutual informationWe propose three ways

521 Absolute Values Method From Property 4 we seethat mutual information 119868(119883 119884) is nonnegative and upperbounded by log(119871) and that 119868(119883 119884) = 0 if and only if 119883 and119884 are independent In this sense high mutual informationindicates a large reduction in uncertainty while low mutualinformation indicates a small reduction In particular zeromutual information means the two random variables areindependent Hence we may select those features whosemutual information with 119884 is larger than some thresholdbased on needs

522 Relative Values From Property 4 we have 0 le

119868(119883 119884)119867(119884) le 1 Note that 119868(119883 119884)119867(119884) is relativemutual information which measures howmuch information119883 catches from 119884 Thus we may select those features whoserelativemutual information 119868(119883 119884)119867(119884) is larger than somethreshold between 0 and 1 based on needs

523 Chi-Square Test for Independency For convenience wewill use the natural logarithm in mutual information Wefirst state an approximation formula for the natural logarithmfunction It can be proved by the Taylor expansion like inKullbackrsquos book [5]

Lemma 29 Let 119901 and 119902 be two positive numbers less than orequal to 1 Then

119901 ln119901

119902asymp (119901 minus 119902) +

(119901 minus 119902)2

2119902 (60)

The equality holds if and only if 119901 = 119902 Moreover the close 119901 isto 119902 the better the approximation is

Now let us denote119873times119868(119883 119884) by 119868(119883 119884) Then applyingLemma 29 we obtain

2119868 (119883 119884) = 2119873

119870

sum119894=1

119871

sum119895=1

119901119894119895ln

119901119894119895

119901119894sdot119901sdot119895

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895ln

119874119894119895119873

(119899119894sdot119873) (119899

sdot119895119873)

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895ln

119874119894119895

119899119894sdot119899sdot119895119873

asymp 2

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus

119899119894sdot119899sdot119895

119873)

+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895minus 2

sum119894119899119894sdot

119873

sum119895119899sdot119895

119873

+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 2119873 minus 2119873

119873+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

=

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 1205942

(61)

The last equation means the previous expressionsum119870

119894=1sum119871

119895=1((119874

119894119895minus 119899

119894sdot119899sdot119895119873)

2(119899

119894sdot119899sdot119895119873)) follows 1205942 distribu-

tion According to [5] it follows 1205942 distribution with adegree of freedom of (119870 minus 1)(119871 minus 1) Hence 2119873 times 119868(119883 119884)

approximately follows 1205942 distribution with a degree offreedom of (119870minus 1)(119871 minus 1) This is the well-known Chi-squaretest for independence of two random variables This allowsusing the Chi-square distribution to assign a significantlevel corresponding to the values of mutual information and(119870 minus 1)(119871 minus 1)

The null and alternative hypotheses are as follows1198670 119883 and 119884 are independent (ie there is no

relationship between them)1198671119883 and119884 are dependent (ie there is a relationship

between them)

Mathematical Problems in Engineering 11

The decision rule is to reject the null hypothesis at the 120572 levelof significance if the 1205942 statistic

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

asymp 2119873 times 119868 (119883 119884) (62)

is greater than 1205942119880 the upper-tail critical value from a Chi-

square distribution with (119870 minus 1)(119871 minus 1) degrees of freedomThat is

Select feature 119883 if 119868 (119883 119884) gt1205942119880

2119873 (63)

Take credit scoring for example In this case 119871 = 2 Assumefeature119883 has 10 different values that is119870 = 10 Using a levelof significance of 120572 = 005 we find 1205942

119880to be 169 from a Chi-

square table with (119870 minus 1)(119871 minus 1) = 9 and select this featureonly if 119868(119883 119884) gt 1692119873

Assume a training set has119873 examples We can do featureselection by the following procedure

(i) Step 1 Choose a level of significance of 120572 say 005

(ii) Step 2 Find 119870 the number of values of feature119883

(iii) Step 3 Build the contingency table for119883 and 119884

(iv) Step 4 Calculate 119868(119883 119884) from the contingency table

(v) Step 5 Find 1205942119880with (119870minus1)(119871minus1) degrees of freedom

from a Chi-square table or any other sources such asSAS

(vi) Step 6 Select 119883 if 119868(119883 119884) gt 1692119873 and discard itotherwise

(vii) Step 7 Repeat Steps 2ndash6 for all features

If the number of features selected from the above procedure issmaller or larger thanwhat youwant youmay adjust the levelof significant 120572 and reselect features using the procedure

53 Adjustment of Mutual Information in Feature SelectionIn Section 52 we have proposed 3 ways to select featurebased on mutual information It seems that the larger themutual information 119868(119883 119884) the more dependent 119883 on 119884However Proposition 28 says that if 119883 has all distinctvalues then 119868(119883 119884)will reach the maximum value119867(119884) and119868(119883 119884)119867(119884) will reach the maximum value 1

Therefore if 119883 has too many different values one maybin or group these values first Based on the binned valuesmutual information is calculated again For numerical vari-ables we may adopt a three-step process

(i) Step 1 select features by removing those with smallmutual information

(ii) Step 2 do binning for the rest of numerical features

(iii) Step 3 select features by mutual information

54 Comparison with Existing Feature Selection MethodsThere are many other feature selection methods in machinelearning and credit scoring An easy way is to build a logisticmodel for each feature with respect to the dependent variableand then select features with 119901 values less than some specificvalues However thismethod does not apply to any nonlinearmodels in machine learning

Another easy way of feature selection is to calculate thecovariance of each feature with respect to the dependentvariable and then select features whose values are larger thansome specific value Yetmutual information is better than thecovariance method [21] in that mutual information measuresthe general dependence of random variables without makingany assumptions about the nature of their underlying rela-tionships

The most popular feature selection in credit scoring isdone by information value [15ndash19] To calculate informationbetween an independent variable and the dependent variablea binning algorithm is used to group similar attributes intoa bin Difference between the information of good accountsand that of bad accounts in each bin is then calculatedFinally information value is calculated as the sum of infor-mation differences of all bins Features with informationvalue larger than 002 are believed to have strong predictivepower However mutual information is a bettermeasure thaninformation value Information value focuses only on thelinear relationships of variables whereas mutual informationcan potentially offer some advantages information value fornonlinear models such as gradient boosting model [22]Moreover information value depends on binning algorithmsand the bin size Different binning algorithms andor differ-ent bin sizes will have different information value

6 Conclusions

In this paper we have presented a unified definition formutual information using random variables with differentprobability spaces Our idea is to define the joint distri-bution of two random variables by taking the marginalprobabilities into consideration With our new definition ofmutual information different joint distributions will resultin different values of mutual information After establishingsome properties of the new defined mutual informationwe proposed a method to calculate mutual information inmachine learning Finally we applied our newly definedmutual information to credit scoring

Conflict of Interests

The author declares that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The author has benefited from a brief discussion with DrZhigang Zhou and Dr Fuping Huang of Elevate about prob-ability theory

12 Mathematical Problems in Engineering

References

[1] G D Tourassi E D Frederick M K Markey and C E FloydJr ldquoApplication of the mutual information criterion for featureselection in computer-aided diagnosisrdquoMedical Physics vol 28no 12 pp 2394ndash2402 2001

[2] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[3] A Navot On the role of feature selection in machine learning[PhD thesis] Hebrew University 2006

[4] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 no 3 pp 379ndash423 1948

[5] S Kullback Information Theory and Statistics John Wiley ampSons New York NY USA 1959

[6] M S Pinsker Information and Information Stability of RandomVariables and Processes Academy of Science USSR 1960(English Translation by A Feinstein in 1964 and published byHolden-Day San Francisco USA)

[7] R B Ash Information Theory Interscience Publishers NewYork NY USA 1965

[8] T M Cover and J A Thomas Elements of Information TheoryJohn Wiley amp Sons New York NY USA 2nd edition 2006

[9] R M Fano Transmission of Information MIT Press Cam-bridge Mass USA John Wiley amp Sons New York NY USA1961

[10] N Abramson Information Theory and Coding McGraw-HillNew York NY USA 1963

[11] RGGallager InformationTheory andReliable CommunicationJohn Wiley amp Sons New York NY USA 1968

[12] R B Ash and C A Doleans-Dade Probability amp MeasureTheory Academic Press San Diego Calif USA 2nd edition2000

[13] I Braga ldquoA constructive density-ratio approach tomutual infor-mation estimation experiments in feature selectionrdquo Journal ofInformation and Data Management vol 5 no 1 pp 134ndash1432014

[14] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003

[15] M Refaat Credit Risk Scorecards Development and Implemen-tation Using SAS Lulucom New York NY USA 2011

[16] N Siddiqi Credit Risk Scorecards Developing and ImplementingIntelligent Credit Scoring John Wiley amp Sons New York NYUSA 2006

[17] G Zeng ldquoMetric divergence measures and information valuein credit scoringrdquo Journal of Mathematics vol 2013 Article ID848271 10 pages 2013

[18] G Zeng ldquoA rule of thumb for reject inference in credit scoringrdquoMathematical Finance Letters vol 2014 article 2 2014

[19] G Zeng ldquoA necessary condition for a good binning algorithmin credit scoringrdquo Applied Mathematical Sciences vol 8 no 65pp 3229ndash3242 2014

[20] K KennedyCredit scoring usingmachine learning [PhD thesis]School of Computing Dublin Institute of Technology DublinIreland 2013

[21] R J McEliece The Theory of Information and Coding Cam-bridge University Press Cambridge UK Student edition 2004

[22] J H Friedman ldquoGreedy function approximation a gradientboosting machinerdquo The Annals of Statistics vol 29 no 5 pp1189ndash1232 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article A Unified Definition of Mutual

8 Mathematical Problems in Engineering

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

+

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895)

= minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= minus

119871

sum119895=1

119871

sum119894=1

119875119883119884

(119909119894 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= minus

119871

sum119895=1

1198752(119884 = 119910

119895) log119875

2(119884 = 119910

119895)

minus (minus

119870

sum119894=1

119871

sum119895=1

119875119883119884

(119909119894 119910

119895) log119875

119884|119883(119909

119894 119910

119895))

= 119867 (119884) minus 119867 (119884 | 119883)

(49)

Combining the above properties and noting that 119867(119883 | 119884)

and 119867(119884 | 119883) are both nonnegative we obtain the followingproperties

Property 4 Let 119883 and 119884 be 2 discrete random variables with119870 and 119871 values respectively Then

0 le 119868 (119883 119884) le 119867 (119884) le log 119871

0 le 119868 (119883 119884) le 119867 (119883) le log119870(50)

Moreover 119868(119883 119884) = 0 if and only if119883 and119884 are independent

5 Newly Defined Mutual Information inMachine Learning

Machine learning is the science of getting machines (com-puters) to automatically learn from data In a typical learningsetting a training set 119878 contains 119873 examples (also knownas samples observations or records) from an input space119883 = 119883

1 119883

2 119883

119872 and their associated output values

119910 from an output space 119884 (ie dependent variable) Here1198831 119883

2 119883

119872are called features that is independent vari-

ables Hence 119878 can be expressed as

119878 = 1199091198941 119909

1198942 119909

119894119872 119910

119894 119894 = 1 2 119873 (51)

where feature 119883119895has values 119909

1119895 119909

2119895 119909

119873119895for 119895 = 1 2

119872

A fundamental objective in machine learning is to finda functional relationship between input 119883 and output 119884In general there are a very large number of features manyof which are not needed Sometimes the output 119884 isnot determined by the complete set of the input features119883

1 119883

2 119883

119872 Rather it is decided by only a subset of

them This kind of reduction is called feature selection Itspurpose is to choose a subset of features to capture therelevant information An easy and natural way for featureselection is as follows

(1) Evaluate the relationship between each individualinput feature 119909

119894and the output 119884

(2) Select the best set of attributes according to somecriterion

5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature X and the output Y. Assume X has K different values ω_1, ω_2, ..., ω_K. If X has missing values, we will use ω_1 to represent all the missing values. Assume Y has L different values ρ_1, ρ_2, ..., ρ_L.

Let us build a two-way frequency or contingency table by making X the row variable and Y the column variable, as in [8]. Let O_{ij} be the frequency (which could be 0) of (ω_i, ρ_j) for i = 1 to K and j = 1 to L. Let the row and column marginal totals be n_{i\cdot} and n_{\cdot j}, respectively. Then

n_{i\cdot} = \sum_{j} O_{ij}, \qquad n_{\cdot j} = \sum_{i} O_{ij}, \qquad N = \sum_{i}\sum_{j} O_{ij} = \sum_{i} n_{i\cdot} = \sum_{j} n_{\cdot j}.   (52)

Let us denote the relative frequency O_{ij}/N by p_{ij}. We then have the two-way relative frequency table; see Table 2. Since

\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1,   (53)

\{p_{i\cdot}\}_{i=1}^{K}, \{p_{\cdot j}\}_{j=1}^{L}, and \{p_{ij}\} can each serve as a probability measure.

Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables. Define

X : (Ω_1, \mathcal{F}_1, P_X) \to R   (54)

by X(ω_i) = x_i, where Ω_1 = {ω_1, ω_2, ..., ω_K} and P_X(ω_i) = n_{i\cdot}/N = p_{i\cdot} for i = 1, 2, ..., K. Note that x_1, x_2, ..., x_K could be any real numbers as long as they are distinct, to guarantee that X is a one-to-one mapping. In this case, P_X(X = x_i) = P_X(ω_i).


Table 1: Frequency table.

        ρ_1     ρ_2     ...     ρ_j     ...     ρ_L     Total
ω_1     O_11    O_12    ...     O_1j    ...     O_1L    n_1·
ω_2     O_21    O_22    ...     O_2j    ...     O_2L    n_2·
...     ...     ...     ...     ...     ...     ...     ...
ω_i     O_i1    O_i2    ...     O_ij    ...     O_iL    n_i·
...     ...     ...     ...     ...     ...     ...     ...
ω_K     O_K1    O_K2    ...     O_Kj    ...     O_KL    n_K·
Total   n_·1    n_·2    ...     n_·j    ...     n_·L    N

Table 2: Relative frequency table.

        ρ_1     ρ_2     ...     ρ_j     ...     ρ_L     Total
ω_1     p_11    p_12    ...     p_1j    ...     p_1L    p_1·
ω_2     p_21    p_22    ...     p_2j    ...     p_2L    p_2·
...     ...     ...     ...     ...     ...     ...     ...
ω_i     p_i1    p_i2    ...     p_ij    ...     p_iL    p_i·
...     ...     ...     ...     ...     ...     ...     ...
ω_K     p_K1    p_K2    ...     p_Kj    ...     p_KL    p_K·
Total   p_·1    p_·2    ...     p_·j    ...     p_·L    1

Similarly, define

Y : (Ω_2, \mathcal{F}_2, P_Y) \to R   (55)

by Y(ρ_j) = y_j, where Ω_2 = {ρ_1, ρ_2, ..., ρ_L} and P_Y(ρ_j) = n_{\cdot j}/N = p_{\cdot j} for j = 1, 2, ..., L. Also, y_1, y_2, ..., y_L could be any real numbers as long as they are distinct, to guarantee that Y is a one-to-one mapping. In this case, P_Y(Y = y_j) = P_Y(ρ_j).

Now define a mapping P_{XY} from Ω_1 × Ω_2 to R as follows:

P_{XY}(ω_i, ρ_j) = p_{ij} = O_{ij}/N.   (56)

Since

\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(ω_i, ρ_j) = 1,
\sum_{j=1}^{L} P_{XY}(ω_i, ρ_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(ω_i),
\sum_{i=1}^{K} P_{XY}(ω_i, ρ_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(ρ_j),   (57)

\{p_{ij}\} is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:

I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(ω_i, ρ_j) \log \frac{P_{XY}(ω_i, ρ_j)}{P_X(ω_i) P_Y(ρ_j)} = \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}.   (58)
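To make equations (52)–(58) concrete, here is a small Python sketch (not part of the original paper) that builds the contingency table from two discrete arrays and evaluates I(X, Y) with the convention 0 log 0 = 0. The function name mutual_information and the use of numpy/pandas are assumptions made purely for illustration; the logarithm base is left as a parameter.

```python
import numpy as np
import pandas as pd

def mutual_information(x, y, base=2):
    """Compute I(X, Y) of two discrete arrays via their contingency table.

    Implements equation (58): I(X, Y) = sum_ij p_ij * log(p_ij / (p_i. * p_.j)),
    with the convention 0 * log 0 = 0.
    """
    x = pd.Series(np.asarray(x), name="X")
    y = pd.Series(np.asarray(y), name="Y")
    table = pd.crosstab(x, y).to_numpy()     # O_ij, the K x L frequency table (Table 1)
    N = table.sum()
    p = table / N                            # p_ij, the relative frequencies (Table 2)
    p_row = p.sum(axis=1, keepdims=True)     # p_i.
    p_col = p.sum(axis=0, keepdims=True)     # p_.j
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * (np.log(p / (p_row * p_col)) / np.log(base))
    return float(np.nansum(terms))           # NaN terms come from p_ij = 0 and drop out

# Toy usage: a categorical feature with three values against a binary outcome.
x = ["a", "a", "b", "b", "c", "c", "c", "a"]
y = [0, 0, 1, 1, 0, 1, 1, 0]
print(mutual_information(x, y))
```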

It follows from Corollary 26 that if X has only one value, then I(X, Y) = 0. On the other hand, if X has all distinct values, the following result shows that the mutual information reaches its maximum value.

Proposition 28. If all the values of X are distinct, then I(X, Y) = H(Y).

Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2 we observe that

(1) O_{ij} = 0 or 1 for all i = 1, 2, ..., K and j = 1, 2, ..., L;

(2) p_{ij} = O_{ij}/N = 0 or 1/N for all i = 1, 2, ..., K and j = 1, 2, ..., L;

(3) for each j = 1, 2, ..., L, since O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}, there are n_{\cdot j} nonzero O_{ij}'s, or equivalently n_{\cdot j} nonzero p_{ij}'s;

(4) p_{i\cdot} = 1/N for i = 1, 2, ..., K.

Using the above observations and the fact that 0 \log 0 = 0, we have

I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}
        = \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot} p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot} p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot} p_{\cdot L}}
        = \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N}
        = \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}}
        = \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}}
        = p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}}
        = H(Y).   (59)
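As a quick numerical check of Proposition 28 (again only a sketch, assuming the mutual_information helper above is in scope): when every value of X is distinct, the computed I(X, Y) coincides with the empirical entropy H(Y).

```python
import numpy as np

def entropy(y, base=2):
    """H(Y) = -sum_j p_.j log p_.j, computed from the empirical label frequencies."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p) / np.log(base)).sum())

y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
x_distinct = list(range(len(y)))            # every value of X is distinct, so K = N
print(mutual_information(x_distinct, y))    # equals H(Y) by Proposition 28
print(entropy(y))                           # same value
```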

5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable (good and bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.

5.2.1. Absolute Values Method. From Property 4, we see that mutual information I(X, Y) is nonnegative and upper bounded by log L, and that I(X, Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with Y is larger than some threshold chosen based on needs.

5.2.2. Relative Values. From Property 4, we have 0 ≤ I(X, Y)/H(Y) ≤ 1. Note that I(X, Y)/H(Y) is the relative mutual information, which measures how much information X captures from Y. Thus we may select those features whose relative mutual information I(X, Y)/H(Y) is larger than some threshold between 0 and 1, chosen based on needs.

5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].

Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then

p \ln\frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}.   (60)

The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation is.
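The approximation in Lemma 29 is easy to probe numerically; the snippet below (illustrative only, with arbitrarily chosen pairs) compares p ln(p/q) with (p − q) + (p − q)²/(2q) and shows that the agreement degrades as p moves away from q.

```python
import math

# Compare p * ln(p/q) with its second-order approximation from Lemma 29.
for p, q in [(0.30, 0.25), (0.10, 0.12), (0.50, 0.20)]:
    exact = p * math.log(p / q)
    approx = (p - q) + (p - q) ** 2 / (2 * q)
    print(f"p={p:.2f}, q={q:.2f}: exact={exact:.5f}, approx={approx:.5f}")
```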

Now let us consider N × I(X, Y). Applying Lemma 29, we obtain

2N I(X, Y) = 2N \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \ln\frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}
           = 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} \ln\frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)}
           = 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} \ln\frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N}
           \approx 2 \sum_{i=1}^{K}\sum_{j=1}^{L} \Bigl(O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N}\Bigr) + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
           = 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} - \frac{2}{N} \sum_{i=1}^{K} n_{i\cdot} \sum_{j=1}^{L} n_{\cdot j} + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
           = 2N - 2N + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
           = \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2.   (61)

The last equality says that 2N × I(X, Y) is approximated by the statistic \sum_{i=1}^{K}\sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N). According to [5], this statistic follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. Hence 2N × I(X, Y) approximately follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows us to use the Chi-square distribution to assign a significance level corresponding to the values of mutual information and (K − 1)(L − 1).

The null and alternative hypotheses are as follows:

H_0: X and Y are independent (i.e., there is no relationship between them).

H_1: X and Y are dependent (i.e., there is a relationship between them).


The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic

\sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X, Y)   (62)

is greater than χ²_U, the upper-tail critical value from a Chi-square distribution with (K − 1)(L − 1) degrees of freedom. That is,

select feature X if I(X, Y) > \frac{\chi^2_U}{2N}.   (63)

Take credit scoring for example. In this case L = 2. Assume feature X has 10 different values, that is, K = 10. Using a level of significance α = 0.05, we find χ²_U to be 16.9 from a Chi-square table with (K − 1)(L − 1) = 9 degrees of freedom, and we select this feature only if I(X, Y) > 16.9/(2N).
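Rather than reading χ²_U off a printed table, the critical value can be obtained programmatically. The sketch below (an illustration, not from the paper) reproduces the K = 10, L = 2, α = 0.05 example with scipy; note that I(X, Y) must be computed with natural logarithms for the comparison in (63) to be valid, as stated at the start of Section 5.2.3.

```python
from scipy.stats import chi2

alpha = 0.05
K, L = 10, 2                        # feature with 10 values, binary outcome
dof = (K - 1) * (L - 1)             # 9 degrees of freedom
chi2_U = chi2.ppf(1 - alpha, dof)   # upper-tail critical value, roughly 16.9
print(chi2_U)

# Decision rule (63): keep the feature when I(X, Y) > chi2_U / (2 * N),
# where I(X, Y) is computed with natural logarithms.
N = 5000                            # hypothetical training-set size
print(chi2_U / (2 * N))             # threshold on I(X, Y)
```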

Assume a training set has N examples. We can do feature selection by the following procedure; a code sketch implementing these steps is given below.

(i) Step 1. Choose a level of significance α, say 0.05.

(ii) Step 2. Find K, the number of values of feature X.

(iii) Step 3. Build the contingency table for X and Y.

(iv) Step 4. Calculate I(X, Y) from the contingency table.

(v) Step 5. Find χ²_U with (K − 1)(L − 1) degrees of freedom from a Chi-square table or any other source, such as SAS.

(vi) Step 6. Select X if I(X, Y) > χ²_U/(2N) and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance α and reselect features using the procedure.
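The seven steps can be wired together as in the following sketch. It assumes the training set is a pandas DataFrame, that the mutual_information helper from Section 5.1 is available, and that natural logarithms are used so that 2N·I(X, Y) is comparable to the χ² critical value; the function and column names here are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def select_features(df, target, alpha=0.05):
    """Chi-square-based feature selection via mutual information (Steps 1-7)."""
    y = df[target]
    N = len(df)
    L = y.nunique()
    selected = []
    for feature in df.columns.drop(target):
        x = df[feature].fillna("MISSING")           # missing values form one category (omega_1)
        K = x.nunique()                             # Step 2
        I_xy = mutual_information(x, y, base=np.e)  # Steps 3-4, natural logarithm
        dof = (K - 1) * (L - 1)
        chi2_U = chi2.ppf(1 - alpha, dof)           # Step 5
        if I_xy > chi2_U / (2 * N):                 # Step 6, decision rule (63)
            selected.append(feature)
    return selected                                 # Step 7 handled by the loop

# Hypothetical usage: selected = select_features(training_df, target="bad_flag")
```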

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It may seem that the larger the mutual information I(X, Y), the more dependent X is on Y. However, Proposition 28 says that if X has all distinct values, then I(X, Y) will reach the maximum value H(Y), and I(X, Y)/H(Y) will reach the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first. Based on the binned values, mutual information is calculated again. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the rest of the numerical features.

(iii) Step 3. Select features by mutual information.

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select the features with p values less than some specified value. However, this method does not carry over to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select the features whose values are larger than some specified value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.

6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.


References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, 1964, Holden-Day, San Francisco, USA).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.

[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.


Page 9: Research Article A Unified Definition of Mutual

Mathematical Problems in Engineering 9

Table 1 Frequency table

1205881

1205882

sdot sdot sdot 120588119895

sdot sdot sdot 120588119871

Total1205961

11987411

11987412

sdot sdot sdot 1198741119895

sdot sdot sdot 1198741119871

1198991∙

1205962

11987421

11987422

sdot sdot sdot 1198742119895

sdot sdot sdot 1198742119871

1198992∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119894

1198741198941

1198741198942

sdot sdot sdot 119874119894119895

sdot sdot sdot 119874119894119871

119899119894∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119870

1198741198701

1198741198702

sdot sdot sdot 119874119870119895

sdot sdot sdot 119874119870119871

119899119870∙

Total 119899∙1

119899∙2

sdot sdot sdot 119899∙119895

sdot sdot sdot 119899∙119871

119873

Table 2 Relative frequency table

1205881

1205882

sdot sdot sdot 120588119895

sdot sdot sdot 120588119871

Total1205961

11990111

11990112

sdot sdot sdot 1199011119895

sdot sdot sdot 1199011119871

1199011∙

1205962

11990121

11990122

sdot sdot sdot 1199012119895

sdot sdot sdot 1199012119871

1199012∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119894

1199011198941

1199011198942

sdot sdot sdot 119901119894119895

sdot sdot sdot 119901119894119871

119901119894∙

sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot

120596119870

1199011198701

1199011198702

sdot sdot sdot 119901119870119895

sdot sdot sdot 119901119870119871

119901119870∙

Total 119901∙1

119901∙2

sdot sdot sdot 119901∙119895

sdot sdot sdot 119901∙119871

1

Similarly

119884 (Ω2F

2 119875

2) 997888rarr 119877 (55)

as 119884(120588119895) = 119910

119895 where Ω

2= 120588

1 120588

2 120588

119871 and 119875

119884(120588

119895) =

119899sdot119895119873 = 119901

sdot119895for 119895 = 1 2 119871 Also 119910

1 119910

2 119910

119870could be

any real numbers as long as they are distinct to guaranteethat 119884 is a one-to-one mapping In this case 119875

119884(119884 =

119910119895) = 119875

119884(120588

119895)

Now define a mapping 119875119883119884

fromΩ1timesΩ

2to 119877 as follows

119875119883119884

(120596119894 120588

119895) = 119901

119894119895=

119874119894119895

119873 (56)

Since119870

sum119894=1

119871

sum119895=1

119875119883119884

(120596119894 120588

119895) = 1

119871

sum119895=1

119875119883119884

(120596119894 120588

119895) =

119871

sum119895=1

119901119894119895= 119901

119894sdot= 119875

119883(120596

119894)

119870

sum119894=1

119875119883119884

(120596119894 120588

119895) =

119870

sum119894=1

119901119894119895= 119901

sdot119895= 119875

119884(120588

119895)

(57)

119901119894119895119870119894=1

is a joint probability measure by Proposition 14Finally we can calculate mutual information as follows

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119875119883119884

(120596119894 120588

119895) log

119875119883119884

(120596119894 120588

119895)

119875119883(120596

119894) 119875

119884(120588

119895)

=

119870

sum119894=1

119871

sum119895=1

119901119894119895log

119901119894119895

119901119894sdot119901sdot119895

(58)

It follows fromCorollary 26 that if119883 has only one value then119868(119883 119884) = 0 On the other hand if 119883 has all distinct valuesthe following result shows that mutual information will reachthe maximum value

Proposition 28 If all the values of 119883 are distinct then119868(119883 119884) = 119867(119884)

Proof If all the values of 119883 are distinct then the number ofdifferent values of119883 equals the number of observations thatis 119870 = 119873 From Tables 1 and 2 we observe that

(1) 119874119894119895= 0 or 1 for all 119894 = 1 2 119870 and 119895 = 1 2 119871

(2) 119901119894119895

= 119874119894119895119873 = 0 or 1119873 for all 119894 = 1 2 119870 and

119895 = 1 2 119871

(3) for each 119895 = 1 2 119871 since1198741119895+119874

2119895+sdot sdot sdot+119874

119870119895= 119899

sdot119895

there are 119899sdot119895nonzero 119874

119894119895rsquos or equivalently 119899

sdot119895nonzero

119901119894119895rsquos

(4) 119901119894sdot= 1119873 119894 = 1 2 119870

Using the above observations and the fact that 0 log 0 = 0 wehave

119868 (119883 119884) =

119870

sum119894=1

119871

sum119895=1

119901119894119895log

119901119894119895

119901119894sdot119901sdot119895

=

119870

sum119894=1

1199011198941log

1199011198941

119901119894sdot119901sdot1

+

119870

sum119894=1

1199011198942log

1199011198942

119901119894sdot119901sdot2

+ sdot sdot sdot +

119870

sum119894=1

119901119894119871log

119901119894119871

119901119894sdot119901sdot119871

10 Mathematical Problems in Engineering

= sum1199011198941 =0

1

119873log 1119873

119901sdot1119873

+ sum1199011198942 =0

1

119873log 1119873

119901sdot2119873

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1119873

119901sdot119871119873

= sum1199011198941 =0

1

119873log 1

119901sdot1

+ sum1199011198942 =0

1

119873log 1

119901sdot2

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1

119901sdot119871

=119899sdot1

119873log 1

119901sdot1

+119899sdot2

119873log 1

119901sdot2

+ sdot sdot sdot +119899sdot119871

119873log 1

119901sdot119871

= 119901sdot1log 1

119901sdot1

+ 119901sdot2log 1

119901sdot2

+ sdot sdot sdot + 119901sdot119871log 1

119901sdot119871

= 119867 (119884)

(59)

52 Applications of Newly Defined Mutual Information inCredit Scoring Credit scoring is used to describe the processof evaluating the risk a customer poses of defaulting ona financial obligation [15ndash19] The objective is to assigncustomers to one of two groups good and bad Machinelearning has been successfully used to build models for creditscoring [20] In credit scoring119884 is a binary variable good andbad and may be represented by 0 and 1

To apply mutual information to credit scoring we firstcalculatemutual information for every pair of (119883 119884) and thendo feature selection based on values of mutual informationWe propose three ways

521 Absolute Values Method From Property 4 we seethat mutual information 119868(119883 119884) is nonnegative and upperbounded by log(119871) and that 119868(119883 119884) = 0 if and only if 119883 and119884 are independent In this sense high mutual informationindicates a large reduction in uncertainty while low mutualinformation indicates a small reduction In particular zeromutual information means the two random variables areindependent Hence we may select those features whosemutual information with 119884 is larger than some thresholdbased on needs

522 Relative Values From Property 4 we have 0 le

119868(119883 119884)119867(119884) le 1 Note that 119868(119883 119884)119867(119884) is relativemutual information which measures howmuch information119883 catches from 119884 Thus we may select those features whoserelativemutual information 119868(119883 119884)119867(119884) is larger than somethreshold between 0 and 1 based on needs

523 Chi-Square Test for Independency For convenience wewill use the natural logarithm in mutual information Wefirst state an approximation formula for the natural logarithmfunction It can be proved by the Taylor expansion like inKullbackrsquos book [5]

Lemma 29 Let 119901 and 119902 be two positive numbers less than orequal to 1 Then

119901 ln119901

119902asymp (119901 minus 119902) +

(119901 minus 119902)2

2119902 (60)

The equality holds if and only if 119901 = 119902 Moreover the close 119901 isto 119902 the better the approximation is

Now let us denote119873times119868(119883 119884) by 119868(119883 119884) Then applyingLemma 29 we obtain

2119868 (119883 119884) = 2119873

119870

sum119894=1

119871

sum119895=1

119901119894119895ln

119901119894119895

119901119894sdot119901sdot119895

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895ln

119874119894119895119873

(119899119894sdot119873) (119899

sdot119895119873)

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895ln

119874119894119895

119899119894sdot119899sdot119895119873

asymp 2

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus

119899119894sdot119899sdot119895

119873)

+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895minus 2

sum119894119899119894sdot

119873

sum119895119899sdot119895

119873

+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 2119873 minus 2119873

119873+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

=

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 1205942

(61)

The last equation means the previous expressionsum119870

119894=1sum119871

119895=1((119874

119894119895minus 119899

119894sdot119899sdot119895119873)

2(119899

119894sdot119899sdot119895119873)) follows 1205942 distribu-

tion According to [5] it follows 1205942 distribution with adegree of freedom of (119870 minus 1)(119871 minus 1) Hence 2119873 times 119868(119883 119884)

approximately follows 1205942 distribution with a degree offreedom of (119870minus 1)(119871 minus 1) This is the well-known Chi-squaretest for independence of two random variables This allowsusing the Chi-square distribution to assign a significantlevel corresponding to the values of mutual information and(119870 minus 1)(119871 minus 1)

The null and alternative hypotheses are as follows1198670 119883 and 119884 are independent (ie there is no

relationship between them)1198671119883 and119884 are dependent (ie there is a relationship

between them)

Mathematical Problems in Engineering 11

The decision rule is to reject the null hypothesis at the 120572 levelof significance if the 1205942 statistic

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

asymp 2119873 times 119868 (119883 119884) (62)

is greater than 1205942119880 the upper-tail critical value from a Chi-

square distribution with (119870 minus 1)(119871 minus 1) degrees of freedomThat is

Select feature 119883 if 119868 (119883 119884) gt1205942119880

2119873 (63)

Take credit scoring for example In this case 119871 = 2 Assumefeature119883 has 10 different values that is119870 = 10 Using a levelof significance of 120572 = 005 we find 1205942

119880to be 169 from a Chi-

square table with (119870 minus 1)(119871 minus 1) = 9 and select this featureonly if 119868(119883 119884) gt 1692119873

Assume a training set has119873 examples We can do featureselection by the following procedure

(i) Step 1 Choose a level of significance of 120572 say 005

(ii) Step 2 Find 119870 the number of values of feature119883

(iii) Step 3 Build the contingency table for119883 and 119884

(iv) Step 4 Calculate 119868(119883 119884) from the contingency table

(v) Step 5 Find 1205942119880with (119870minus1)(119871minus1) degrees of freedom

from a Chi-square table or any other sources such asSAS

(vi) Step 6 Select 119883 if 119868(119883 119884) gt 1692119873 and discard itotherwise

(vii) Step 7 Repeat Steps 2ndash6 for all features

If the number of features selected from the above procedure issmaller or larger thanwhat youwant youmay adjust the levelof significant 120572 and reselect features using the procedure

53 Adjustment of Mutual Information in Feature SelectionIn Section 52 we have proposed 3 ways to select featurebased on mutual information It seems that the larger themutual information 119868(119883 119884) the more dependent 119883 on 119884However Proposition 28 says that if 119883 has all distinctvalues then 119868(119883 119884)will reach the maximum value119867(119884) and119868(119883 119884)119867(119884) will reach the maximum value 1

Therefore if 119883 has too many different values one maybin or group these values first Based on the binned valuesmutual information is calculated again For numerical vari-ables we may adopt a three-step process

(i) Step 1 select features by removing those with smallmutual information

(ii) Step 2 do binning for the rest of numerical features

(iii) Step 3 select features by mutual information

54 Comparison with Existing Feature Selection MethodsThere are many other feature selection methods in machinelearning and credit scoring An easy way is to build a logisticmodel for each feature with respect to the dependent variableand then select features with 119901 values less than some specificvalues However thismethod does not apply to any nonlinearmodels in machine learning

Another easy way of feature selection is to calculate thecovariance of each feature with respect to the dependentvariable and then select features whose values are larger thansome specific value Yetmutual information is better than thecovariance method [21] in that mutual information measuresthe general dependence of random variables without makingany assumptions about the nature of their underlying rela-tionships

The most popular feature selection in credit scoring isdone by information value [15ndash19] To calculate informationbetween an independent variable and the dependent variablea binning algorithm is used to group similar attributes intoa bin Difference between the information of good accountsand that of bad accounts in each bin is then calculatedFinally information value is calculated as the sum of infor-mation differences of all bins Features with informationvalue larger than 002 are believed to have strong predictivepower However mutual information is a bettermeasure thaninformation value Information value focuses only on thelinear relationships of variables whereas mutual informationcan potentially offer some advantages information value fornonlinear models such as gradient boosting model [22]Moreover information value depends on binning algorithmsand the bin size Different binning algorithms andor differ-ent bin sizes will have different information value

6 Conclusions

In this paper we have presented a unified definition formutual information using random variables with differentprobability spaces Our idea is to define the joint distri-bution of two random variables by taking the marginalprobabilities into consideration With our new definition ofmutual information different joint distributions will resultin different values of mutual information After establishingsome properties of the new defined mutual informationwe proposed a method to calculate mutual information inmachine learning Finally we applied our newly definedmutual information to credit scoring

Conflict of Interests

The author declares that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The author has benefited from a brief discussion with DrZhigang Zhou and Dr Fuping Huang of Elevate about prob-ability theory

12 Mathematical Problems in Engineering

References

[1] G D Tourassi E D Frederick M K Markey and C E FloydJr ldquoApplication of the mutual information criterion for featureselection in computer-aided diagnosisrdquoMedical Physics vol 28no 12 pp 2394ndash2402 2001

[2] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[3] A Navot On the role of feature selection in machine learning[PhD thesis] Hebrew University 2006

[4] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 no 3 pp 379ndash423 1948

[5] S Kullback Information Theory and Statistics John Wiley ampSons New York NY USA 1959

[6] M S Pinsker Information and Information Stability of RandomVariables and Processes Academy of Science USSR 1960(English Translation by A Feinstein in 1964 and published byHolden-Day San Francisco USA)

[7] R B Ash Information Theory Interscience Publishers NewYork NY USA 1965

[8] T M Cover and J A Thomas Elements of Information TheoryJohn Wiley amp Sons New York NY USA 2nd edition 2006

[9] R M Fano Transmission of Information MIT Press Cam-bridge Mass USA John Wiley amp Sons New York NY USA1961

[10] N Abramson Information Theory and Coding McGraw-HillNew York NY USA 1963

[11] RGGallager InformationTheory andReliable CommunicationJohn Wiley amp Sons New York NY USA 1968

[12] R B Ash and C A Doleans-Dade Probability amp MeasureTheory Academic Press San Diego Calif USA 2nd edition2000

[13] I Braga ldquoA constructive density-ratio approach tomutual infor-mation estimation experiments in feature selectionrdquo Journal ofInformation and Data Management vol 5 no 1 pp 134ndash1432014

[14] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003

[15] M Refaat Credit Risk Scorecards Development and Implemen-tation Using SAS Lulucom New York NY USA 2011

[16] N Siddiqi Credit Risk Scorecards Developing and ImplementingIntelligent Credit Scoring John Wiley amp Sons New York NYUSA 2006

[17] G Zeng ldquoMetric divergence measures and information valuein credit scoringrdquo Journal of Mathematics vol 2013 Article ID848271 10 pages 2013

[18] G Zeng ldquoA rule of thumb for reject inference in credit scoringrdquoMathematical Finance Letters vol 2014 article 2 2014

[19] G Zeng ldquoA necessary condition for a good binning algorithmin credit scoringrdquo Applied Mathematical Sciences vol 8 no 65pp 3229ndash3242 2014

[20] K KennedyCredit scoring usingmachine learning [PhD thesis]School of Computing Dublin Institute of Technology DublinIreland 2013

[21] R J McEliece The Theory of Information and Coding Cam-bridge University Press Cambridge UK Student edition 2004

[22] J H Friedman ldquoGreedy function approximation a gradientboosting machinerdquo The Annals of Statistics vol 29 no 5 pp1189ndash1232 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article A Unified Definition of Mutual

10 Mathematical Problems in Engineering

= sum1199011198941 =0

1

119873log 1119873

119901sdot1119873

+ sum1199011198942 =0

1

119873log 1119873

119901sdot2119873

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1119873

119901sdot119871119873

= sum1199011198941 =0

1

119873log 1

119901sdot1

+ sum1199011198942 =0

1

119873log 1

119901sdot2

+ sdot sdot sdot + sum119901119894119871 =0

1

119873log 1

119901sdot119871

=119899sdot1

119873log 1

119901sdot1

+119899sdot2

119873log 1

119901sdot2

+ sdot sdot sdot +119899sdot119871

119873log 1

119901sdot119871

= 119901sdot1log 1

119901sdot1

+ 119901sdot2log 1

119901sdot2

+ sdot sdot sdot + 119901sdot119871log 1

119901sdot119871

= 119867 (119884)

(59)

52 Applications of Newly Defined Mutual Information inCredit Scoring Credit scoring is used to describe the processof evaluating the risk a customer poses of defaulting ona financial obligation [15ndash19] The objective is to assigncustomers to one of two groups good and bad Machinelearning has been successfully used to build models for creditscoring [20] In credit scoring119884 is a binary variable good andbad and may be represented by 0 and 1

To apply mutual information to credit scoring we firstcalculatemutual information for every pair of (119883 119884) and thendo feature selection based on values of mutual informationWe propose three ways

521 Absolute Values Method From Property 4 we seethat mutual information 119868(119883 119884) is nonnegative and upperbounded by log(119871) and that 119868(119883 119884) = 0 if and only if 119883 and119884 are independent In this sense high mutual informationindicates a large reduction in uncertainty while low mutualinformation indicates a small reduction In particular zeromutual information means the two random variables areindependent Hence we may select those features whosemutual information with 119884 is larger than some thresholdbased on needs

522 Relative Values From Property 4 we have 0 le

119868(119883 119884)119867(119884) le 1 Note that 119868(119883 119884)119867(119884) is relativemutual information which measures howmuch information119883 catches from 119884 Thus we may select those features whoserelativemutual information 119868(119883 119884)119867(119884) is larger than somethreshold between 0 and 1 based on needs

523 Chi-Square Test for Independency For convenience wewill use the natural logarithm in mutual information Wefirst state an approximation formula for the natural logarithmfunction It can be proved by the Taylor expansion like inKullbackrsquos book [5]

Lemma 29 Let 119901 and 119902 be two positive numbers less than orequal to 1 Then

119901 ln119901

119902asymp (119901 minus 119902) +

(119901 minus 119902)2

2119902 (60)

The equality holds if and only if 119901 = 119902 Moreover the close 119901 isto 119902 the better the approximation is

Now let us denote119873times119868(119883 119884) by 119868(119883 119884) Then applyingLemma 29 we obtain

2119868 (119883 119884) = 2119873

119870

sum119894=1

119871

sum119895=1

119901119894119895ln

119901119894119895

119901119894sdot119901sdot119895

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895ln

119874119894119895119873

(119899119894sdot119873) (119899

sdot119895119873)

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895ln

119874119894119895

119899119894sdot119899sdot119895119873

asymp 2

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus

119899119894sdot119899sdot119895

119873)

+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 2

119870

sum119894=1

119871

sum119895=1

119874119894119895minus 2

sum119894119899119894sdot

119873

sum119895119899sdot119895

119873

+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 2119873 minus 2119873

119873+

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

=

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

= 1205942

(61)

The last equation means the previous expressionsum119870

119894=1sum119871

119895=1((119874

119894119895minus 119899

119894sdot119899sdot119895119873)

2(119899

119894sdot119899sdot119895119873)) follows 1205942 distribu-

tion According to [5] it follows 1205942 distribution with adegree of freedom of (119870 minus 1)(119871 minus 1) Hence 2119873 times 119868(119883 119884)

approximately follows 1205942 distribution with a degree offreedom of (119870minus 1)(119871 minus 1) This is the well-known Chi-squaretest for independence of two random variables This allowsusing the Chi-square distribution to assign a significantlevel corresponding to the values of mutual information and(119870 minus 1)(119871 minus 1)

The null and alternative hypotheses are as follows1198670 119883 and 119884 are independent (ie there is no

relationship between them)1198671119883 and119884 are dependent (ie there is a relationship

between them)

Mathematical Problems in Engineering 11

The decision rule is to reject the null hypothesis at the 120572 levelof significance if the 1205942 statistic

119870

sum119894=1

119871

sum119895=1

(119874119894119895minus 119899

119894sdot119899sdot119895119873)

2

119899119894sdot119899sdot119895119873

asymp 2119873 times 119868 (119883 119884) (62)

is greater than 1205942119880 the upper-tail critical value from a Chi-

square distribution with (119870 minus 1)(119871 minus 1) degrees of freedomThat is

Select feature 119883 if 119868 (119883 119884) gt1205942119880

2119873 (63)

Take credit scoring for example In this case 119871 = 2 Assumefeature119883 has 10 different values that is119870 = 10 Using a levelof significance of 120572 = 005 we find 1205942

119880to be 169 from a Chi-

square table with (119870 minus 1)(119871 minus 1) = 9 and select this featureonly if 119868(119883 119884) gt 1692119873

Assume a training set has119873 examples We can do featureselection by the following procedure

(i) Step 1 Choose a level of significance of 120572 say 005

(ii) Step 2 Find 119870 the number of values of feature119883

(iii) Step 3 Build the contingency table for119883 and 119884

(iv) Step 4 Calculate 119868(119883 119884) from the contingency table

(v) Step 5 Find 1205942119880with (119870minus1)(119871minus1) degrees of freedom

from a Chi-square table or any other sources such asSAS

(vi) Step 6 Select 119883 if 119868(119883 119884) gt 1692119873 and discard itotherwise

(vii) Step 7 Repeat Steps 2ndash6 for all features

If the number of features selected from the above procedure issmaller or larger thanwhat youwant youmay adjust the levelof significant 120572 and reselect features using the procedure

53 Adjustment of Mutual Information in Feature SelectionIn Section 52 we have proposed 3 ways to select featurebased on mutual information It seems that the larger themutual information 119868(119883 119884) the more dependent 119883 on 119884However Proposition 28 says that if 119883 has all distinctvalues then 119868(119883 119884)will reach the maximum value119867(119884) and119868(119883 119884)119867(119884) will reach the maximum value 1

Therefore if 119883 has too many different values one maybin or group these values first Based on the binned valuesmutual information is calculated again For numerical vari-ables we may adopt a three-step process

(i) Step 1 select features by removing those with smallmutual information

(ii) Step 2 do binning for the rest of numerical features

(iii) Step 3 select features by mutual information

54 Comparison with Existing Feature Selection MethodsThere are many other feature selection methods in machinelearning and credit scoring An easy way is to build a logisticmodel for each feature with respect to the dependent variableand then select features with 119901 values less than some specificvalues However thismethod does not apply to any nonlinearmodels in machine learning

Another easy way of feature selection is to calculate thecovariance of each feature with respect to the dependentvariable and then select features whose values are larger thansome specific value Yetmutual information is better than thecovariance method [21] in that mutual information measuresthe general dependence of random variables without makingany assumptions about the nature of their underlying rela-tionships

The most popular feature selection in credit scoring isdone by information value [15ndash19] To calculate informationbetween an independent variable and the dependent variablea binning algorithm is used to group similar attributes intoa bin Difference between the information of good accountsand that of bad accounts in each bin is then calculatedFinally information value is calculated as the sum of infor-mation differences of all bins Features with informationvalue larger than 002 are believed to have strong predictivepower However mutual information is a bettermeasure thaninformation value Information value focuses only on thelinear relationships of variables whereas mutual informationcan potentially offer some advantages information value fornonlinear models such as gradient boosting model [22]Moreover information value depends on binning algorithmsand the bin size Different binning algorithms andor differ-ent bin sizes will have different information value

6 Conclusions

In this paper we have presented a unified definition formutual information using random variables with differentprobability spaces Our idea is to define the joint distri-bution of two random variables by taking the marginalprobabilities into consideration With our new definition ofmutual information different joint distributions will resultin different values of mutual information After establishingsome properties of the new defined mutual informationwe proposed a method to calculate mutual information inmachine learning Finally we applied our newly definedmutual information to credit scoring

Conflict of Interests

The author declares that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The author has benefited from a brief discussion with DrZhigang Zhou and Dr Fuping Huang of Elevate about prob-ability theory

12 Mathematical Problems in Engineering

References

[1] G D Tourassi E D Frederick M K Markey and C E FloydJr ldquoApplication of the mutual information criterion for featureselection in computer-aided diagnosisrdquoMedical Physics vol 28no 12 pp 2394ndash2402 2001

[2] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003

[3] A Navot On the role of feature selection in machine learning[PhD thesis] Hebrew University 2006

[4] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 no 3 pp 379ndash423 1948

[5] S Kullback Information Theory and Statistics John Wiley ampSons New York NY USA 1959

[6] M S Pinsker Information and Information Stability of RandomVariables and Processes Academy of Science USSR 1960(English Translation by A Feinstein in 1964 and published byHolden-Day San Francisco USA)


The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic

\[
\sum_{i=1}^{K}\sum_{j=1}^{L}\frac{\left(O_{ij}-n_{i\cdot}n_{\cdot j}/N\right)^{2}}{n_{i\cdot}n_{\cdot j}/N}\approx 2N \cdot I(X,Y) \tag{62}
\]

is greater than $\chi^{2}_{U}$, the upper-tail critical value from a Chi-square distribution with $(K-1)(L-1)$ degrees of freedom. That is,

\[
\text{Select feature } X \ \text{ if } \ I(X,Y) > \frac{\chi^{2}_{U}}{2N}. \tag{63}
\]

Take credit scoring for example. In this case $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^{2}_{U}$ to be 16.9 from a Chi-square table with $(K-1)(L-1) = 9$ degrees of freedom, and we select this feature only if $I(X,Y) > 16.9/(2N)$.

Assume a training set has $N$ examples. We can do feature selection by the following procedure.

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.

(ii) Step 2. Find $K$, the number of values of feature $X$.

(iii) Step 3. Build the contingency table for $X$ and $Y$.

(iv) Step 4. Calculate $I(X, Y)$ from the contingency table.

(v) Step 5. Find $\chi^{2}_{U}$ with $(K-1)(L-1)$ degrees of freedom from a Chi-square table or any other source such as SAS.

(vi) Step 6. Select $X$ if $I(X, Y) > \chi^{2}_{U}/(2N)$ and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by this procedure is smaller or larger than desired, one may adjust the level of significance $\alpha$ and reselect the features. A minimal code sketch of the procedure is given below.
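The following is a minimal Python sketch of this procedure, assuming mutual information is computed with the natural logarithm so that the approximation $\chi^{2}\approx 2N\cdot I(X,Y)$ in (62) applies; the helper names `mutual_info_from_table` and `select_feature` are illustrative and not from the paper.

```python
import numpy as np
from scipy.stats import chi2

def mutual_info_from_table(table):
    """Mutual information I(X, Y) in nats from a K x L contingency table."""
    table = np.asarray(table, dtype=float)
    N = table.sum()
    p_xy = table / N                         # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal distribution of X (K x 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal distribution of Y (1 x L)
    prod = p_x @ p_y                         # product of the marginals (K x L)
    mask = p_xy > 0                          # 0 * log 0 is treated as 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / prod[mask])))

def select_feature(x, y, alpha=0.05):
    """Select feature x if I(X, Y) > chi2_U / (2N), following the decision rule (63)."""
    _, x_idx = np.unique(x, return_inverse=True)          # Step 2: the K values of X
    _, y_idx = np.unique(y, return_inverse=True)
    K, L, N = x_idx.max() + 1, y_idx.max() + 1, len(x)
    table = np.zeros((K, L))
    np.add.at(table, (x_idx, y_idx), 1)                   # Step 3: contingency table
    I = mutual_info_from_table(table)                     # Step 4: mutual information
    chi2_U = chi2.ppf(1 - alpha, df=(K - 1) * (L - 1))    # Step 5: critical value
    return I > chi2_U / (2 * N)                           # Step 6: decision rule (63)

# Example: a feature with K = 10 values and a binary target (L = 2), as in the
# credit scoring illustration; chi2.ppf(0.95, 9) is approximately 16.9.
rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=1000)
y = ((x > 6) ^ (rng.random(1000) < 0.1)).astype(int)
print(select_feature(x, y))   # True: the feature is predictive of y
```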

5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X, Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ takes all distinct values, then $I(X, Y)$ reaches its maximum value $H(Y)$ and $I(X, Y)/H(Y)$ reaches its maximum value 1. In other words, a feature with very many distinct values can attain a large mutual information regardless of its true predictive power.

Therefore, if $X$ has too many different values, one may first bin or group these values and then recalculate mutual information from the binned values. For numerical variables we may adopt a three-step process (a binning sketch follows the list):

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Bin the remaining numerical features.

(iii) Step 3. Select features by mutual information again.
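As one possible illustration of Step 2, a simple quantile binning could be applied before recomputing mutual information; the helper `quantile_bin` below is a hypothetical choice, not a binning algorithm prescribed by this paper.

```python
import numpy as np

def quantile_bin(x, n_bins=10):
    """Map a numerical feature to at most n_bins bin indices via quantile cut points."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, np.unique(edges))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)            # a numerical feature with ~1000 distinct values
x_binned = quantile_bin(x, 10)       # now at most 10 distinct values
# I(X_binned, Y) can then be recomputed with the select_feature sketch above.
```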

5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy approach is to build a logistic regression model for each feature with respect to the dependent variable and then select the features whose $p$ values are less than some specified cutoff (a sketch of this screen follows below). However, this method does not carry over to nonlinear models in machine learning.
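For illustration, such a univariate logistic regression screen might be sketched as follows, using statsmodels; the 0.05 cutoff and the helper name `logistic_p_value` are assumptions for the example, not part of the paper.

```python
import numpy as np
import statsmodels.api as sm

def logistic_p_value(x, y):
    """p-value of a single feature in a univariate logistic regression of y on x."""
    X = sm.add_constant(np.asarray(x, dtype=float))   # intercept + feature
    result = sm.Logit(np.asarray(y), X).fit(disp=0)
    return result.pvalues[1]                          # p-value of the feature term

# Example: select the feature if its p-value is below the chosen cutoff, e.g. 0.05.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = (x + rng.normal(scale=2.0, size=500) > 0).astype(int)
print(logistic_p_value(x, y) < 0.05)
```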

Another easy way of feature selection is to calculate the covariance of each feature with the dependent variable and select the features whose covariances exceed some specified value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationship.

The most popular feature selection method in credit scoring is based on information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is first used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is computed as the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
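For comparison, the weight-of-evidence-based information value computation commonly described in the credit scoring literature [15, 16] might be sketched as follows; the bin counts in the example are made-up illustrative numbers.

```python
import numpy as np

def information_value(good_counts, bad_counts):
    """Information value over pre-binned counts of good and bad accounts."""
    good = np.asarray(good_counts, dtype=float)
    bad = np.asarray(bad_counts, dtype=float)
    p_good = good / good.sum()        # distribution of good accounts over bins
    p_bad = bad / bad.sum()           # distribution of bad accounts over bins
    woe = np.log(p_good / p_bad)      # weight of evidence of each bin
    return float(np.sum((p_good - p_bad) * woe))

# Example with 4 bins of made-up counts; IV > 0.02 is the usual rule of thumb.
print(information_value([200, 300, 250, 250], [40, 30, 20, 10]))
```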

6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied the newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.


References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[3] A. Navot, On the role of feature selection in machine learning [Ph.D. thesis], Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein in 1964, published by Holden-Day, San Francisco, USA).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.

[20] K. Kennedy, Credit scoring using machine learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
