Research Article: A Unified Definition of Mutual Information with Applications in Machine Learning
Guoping Zeng
Elevate, 4150 International Plaza, Fort Worth, TX 76109, USA
Correspondence should be addressed to Guoping Zeng; guopingtx@yahoo.com
Received 21 December 2014; Revised 16 March 2015; Accepted 17 March 2015
Academic Editor: Zexuan Zhu
Copyright © 2015 Guoping Zeng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
There are various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles. However, there are some mathematical flaws in these definitions. For instance, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Class 2 definitions redefine the marginal probabilities from the joint probabilities. In fact, the marginal probabilities are given from the ensembles and should not be redefined from the joint probabilities. Both Class 1 and Class 2 definitions assume a joint distribution exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique. In this paper, we first present a new unified definition of mutual information to cover all the various definitions and to fix their mathematical flaws. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.
1. Introduction
Mutual information has emerged in recent years as an important measure of statistical dependence. It has been used as a criterion for feature selection in engineering, especially in machine learning (see [1–3] and references therein).
Mutual information is a concept rooted in information theory. Its predecessor, called the rate of transmission, was first introduced by Shannon in 1948 in a classical paper [4] on the communication system. Shannon first introduced a concept called entropy for a single discrete chance variable. He then defined the joint entropy and conditional entropy for two discrete chance variables using the joint distribution. Finally, he defined the rate of transmission as the difference between the entropy and the conditional entropy. While Shannon did not define a chance variable in his paper, it is understood to be a synonym of a random variable.
Since Shannon's pioneering work [4], there have been various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.
Class 1 definitions of mutual information depend on the joint distribution of two random variables. More specifically, Kullback ([5], 1959) defined entropy, conditional entropy, and joint entropy using compact mathematical formulas. Pinsker ([6], 1960 and 1964) treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. His definition of mutual information was more general in that he implicitly assumed the two random variables had different probability spaces. Ash ([7], 1965) explicitly assumed the two random variables had the same probability space and followed Shannon's way to define mutual information. Cover and Thomas ([8], 2006) defined mutual information in a simple way by avoiding mentioning probability spaces.
Class 2 definitions depend on the joint probability measure of the joint sample space of two ensembles. Among such definitions, Fano ([9], 1961), Abramson ([10], 1963), and Gallager ([11], 1968) developed their definitions in a similar way. They first defined the entropy of an ensemble and the conditional entropy and joint entropy of two ensembles. Next, they defined the mutual information of a joint event. Noting that the mutual information of a joint event is a random variable, they calculated the mean value of this random variable and called the result the mean information of two ensembles.

Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2015, Article ID 201874, 12 pages. http://dx.doi.org/10.1155/2015/201874
However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine the marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Moreover, except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.

In this paper, we first present a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different mutual information. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.

The rest of the paper is organized as follows. In Section 2 we briefly review the basic concepts in probability theory. In Section 3 we examine various definitions of mutual information. In Section 4 we first propose a new unified definition of mutual information and then establish some properties of the newly defined mutual information. In Section 5 we first propose a method to calculate mutual information in machine learning; we then apply the newly defined mutual information to credit scoring. The paper is concluded in Section 6.

Throughout the paper, we restrict our focus to mutual information for finite discrete random variables.
2. Basic Concepts in Probability Theory
Let us review some basic concepts of probability theory. They can be found in many books on probability theory, such as [12].
Definition 1. A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

(1) $\Omega$ is a set called a sample space; elements of $\Omega$ are denoted by $\omega$ and are called outcomes;

(2) $\mathcal{F}$ is a $\sigma$-field consisting of subsets of $\Omega$; elements of $\mathcal{F}$ are called events;

(3) $P$ is called a probability measure; it is a mapping from $\mathcal{F}$ to $[0, 1]$ with $P(\Omega) = 1$ such that if $A_1, A_2, \ldots$ are pairwise disjoint,

$$P\left(\bigcup_i A_i\right) = \sum_i P(A_i). \quad (1)$$
Definition 2. A discrete probability space is a probability space $(\Omega, \mathcal{F}, P)$ such that $\Omega$ is finite or countable, $\Omega = \{\omega_1, \omega_2, \ldots\}$. In this case $\mathcal{F}$ is chosen to be all the subsets of $\Omega$, and the probability measure $P$ can be defined in terms of a series of nonnegative numbers $p_1, p_2, \ldots$ whose sum is 1. If $A$ is any subset of $\Omega$, then

$$P(A) = \sum_{\omega_i \in A} p_i. \quad (2)$$

In particular,

$$P(\{\omega_i\}) = p_i. \quad (3)$$
For simplicity, we will write $P(\{\omega\})$ as $P(\omega)$. From Definition 2 we see that for a discrete probability space the probability measure is characterized by the pointwise mapping $p: \{\omega_1, \omega_2, \ldots\} \to [0, 1]$ in (2). The probability of an event $A$ is computed simply by adding the probabilities of the individual points of $A$.
Definition 3. A random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $(-\infty, \infty)$ such that for every Borel set $B$, $X^{-1}(B) = \{X \in B\} \in \mathcal{F}$. Here we use the notation $\{X \in B\} = \{\omega \in \Omega : X(\omega) \in B\}$.
Definition 4. If $X$ is a random variable, then for every Borel subset $B$ of $\mathbb{R}$ we define a function by $\mu_X(B) = P(X \in B) = P(X^{-1}(B))$. Then $\mu_X$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and is called the probability distribution of $X$.
Definition 5. A random variable $X$ is discrete if its range is finite or countable. In particular, any random variable on a discrete probability space is discrete since $\Omega$ is countable.
Definition 6. A (discrete) random variable $X$ on a discrete probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $\mathbb{R}$, where $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $\mathbb{R}$ is the set of real numbers. If the range of $X$ is $\{x_1, x_2, \ldots\}$, then the function $f_X : \{x_1, x_2, \ldots\} \to [0, 1]$ defined by

$$f_X(x_i) = P(X = x_i) \quad (4)$$

is called the probability mass function of $X$, whereas the probabilities $P(X = x_1), P(X = x_2), \ldots$ are called the probability distribution of $X$.
Note that by Definition 2,

$$\sum_i P(X = x_i) = 1. \quad (5)$$

Thus a discrete random variable may be characterized by its probability mass function.
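As a concrete illustration of Definitions 2 and 6, the following Python sketch (the sample space, probabilities, and random variable are invented for illustration) represents a discrete probability space as a mapping $\omega \mapsto p_\omega$, computes $P(A)$ by summing point probabilities as in (2), and derives the probability mass function (4) of a random variable $X$.

```python
# A toy discrete probability space (Definition 2): Omega = {w1, ..., w4},
# with point probabilities p_i summing to 1 (values chosen for illustration).
P = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}

def prob(A):
    """P(A): add the probabilities of the individual points of A, as in (2)."""
    return sum(P[w] for w in A)

# A random variable X: Omega -> R (Definition 6); here X is not one-to-one.
X = {"w1": 0.0, "w2": 1.0, "w3": 1.0, "w4": 2.0}

def pmf(x):
    """f_X(x) = P(X = x), the probability mass function (4)."""
    return prob([w for w, v in X.items() if v == x])

print(prob({"w2", "w3"}))  # 0.5
print(pmf(1.0))            # P(X = 1) = 0.2 + 0.3 = 0.5
```

Note that $X$ maps two outcomes to the same value, so its probability mass function aggregates point probabilities, as (4) and (5) describe.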
3. Various Definitions of Mutual Information
Since Shannon's pioneering work [4], there have been various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.
3.1. Shannon's Original Definition

Definition 7. Let $x$ be a chance variable with probabilities $p_1, p_2, \ldots, p_n$ whose sum is 1. Then

$$H(x) = -\sum_{i=1}^{n} p_i \log p_i \quad (6)$$

is called the entropy of $x$.

Suppose two chance variables $x$ and $y$ have $m$ and $n$ possibilities, respectively. Let indices $i$ and $j$ range over all the $m$ possibilities and all the $n$ possibilities, respectively. Let $p(i)$ be the probability of $i$ and $p(i, j)$ the probability of the joint occurrence of $i$ and $j$. Denote the conditional probability of $i$ given $j$ by $p(i \mid j)$ and the conditional probability of $j$ given $i$ by $p(j \mid i)$.
Definition 8. The joint entropy of $x$ and $y$ is defined as

$$H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j). \quad (7)$$
Definition 9. The conditional entropy of $y$, $H_x(y)$, is defined as

$$H_x(y) = -\sum_{i,j} p(i, j) \log p(j \mid i) = -\sum_{i,j} p(i, j) \log \frac{p(i, j)}{\sum_j p(i, j)}. \quad (8)$$
The conditional entropy of $x$, $H_y(x)$, can be defined similarly. Then the following relations hold:

$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x),$$
$$H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \quad (9)$$
Definition 10. The rate of transmission of information, $R$, is defined as the difference between $H(x)$ and $H_y(x)$. Then $R$ can be written in two other forms:

$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \quad (10)$$
Remark 11. Shannon did not derive the explicit formula for $R$:

$$R = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\, p(j)}. \quad (11)$$

However, he did imply it in Appendix 7 of [4].
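Shannon's identity (10) and the implied formula (11) can be checked numerically. The sketch below (the joint distribution is invented for illustration) builds $H(x)$, $H(y)$, and $H(x, y)$ from a small $p(i, j)$ and confirms that $H(x) + H(y) - H(x, y)$ agrees with (11).

```python
from math import log2

# A small joint distribution p(i, j), invented for illustration (rows i, columns j).
p = [[0.125, 0.375],
     [0.375, 0.125]]

px = [sum(row) for row in p]                              # p(i)
py = [sum(p[i][j] for i in range(2)) for j in range(2)]   # p(j)

def H(probs):
    """Entropy (6) in bits; 0 log 0 is taken to be 0."""
    return -sum(q * log2(q) for q in probs if q > 0)

Hxy = H([p[i][j] for i in range(2) for j in range(2)])    # joint entropy (7)
R_entropy = H(px) + H(py) - Hxy                           # R via identity (10)
R_explicit = sum(p[i][j] * log2(p[i][j] / (px[i] * py[j]))
                 for i in range(2) for j in range(2))     # formula (11)

print(R_entropy, R_explicit)  # the two computations agree
```

The agreement of the two computations is exactly the algebraic content of (10) and (11).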
3.2. Class 1 Definitions
3.2.1. Kullback's Definition. Kullback [5] redefined entropy more mathematically, in a standalone homework question, as follows. Consider two discrete random variables $x, y$, where

$$p_{ij} = \operatorname{Prob}(x = x_i, y = y_j) > 0, \quad i = 1, 2, \ldots, m,\; j = 1, 2, \ldots, n,$$
$$p_{i\cdot} = \sum_{j=1}^{n} p_{ij}, \qquad p_{\cdot j} = \sum_{i=1}^{m} p_{ij},$$
$$\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = \sum_{i=1}^{m} p_{i\cdot} = \sum_{j=1}^{n} p_{\cdot j} = 1. \quad (12)$$
Define the joint entropy, entropy, and conditional entropy as follows:

$$H(x, y) = -\sum_i \sum_j p_{ij} \log p_{ij},$$
$$H(x) = -\sum_i p_{i\cdot} \log p_{i\cdot}, \qquad H(y) = -\sum_j p_{\cdot j} \log p_{\cdot j},$$
$$H(y \mid x_i) = -\sum_j \frac{p_{ij}}{p_{i\cdot}} \log \frac{p_{ij}}{p_{i\cdot}},$$
$$H(y \mid x) = \sum_i p_{i\cdot} H(y \mid x_i) = -\sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_{i\cdot}}. \quad (13)$$

Then $H(x, y) = H(x) + H(y \mid x) \le H(x) + H(y)$ and $H(y) \ge H(y \mid x)$.
3.2.2. Information Conveyed. Ash [7] began with two random variables $X$ and $Y$ and assumed $X$ and $Y$ had the same probability space. He systematically defined the entropy, conditional entropy, and joint entropy, following Shannon's path in [4]. At the end, he denoted $H(X) - H(X \mid Y)$ by $I(X \mid Y)$ and called it the information conveyed about $X$ by $Y$.
3.2.3. Information of One Variable with Respect to the Other. Pinsker [6] treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. Suppose $\xi$ is a random variable defined on a probability space $(\Omega, S_\omega, P_\xi)$ taking values in a measurable space $(X, S_x)$, and $\eta$ is a random variable defined on a probability space $(\Psi, S_\psi, P_\eta)$ taking values in a measurable space $(Y, S_y)$. Then the pair $\xi, \eta$ of random variables may be regarded as a single random variable $(\xi, \eta)$ with values in the product space $X \times Y$ of all pairs $(x, y)$ with $x \in X$, $y \in Y$. The distribution $P_{(\xi\eta)}(\cdot) = P_{\xi\eta}(\cdot)$ of $(\xi, \eta)$ is called the joint distribution of the random variables $\xi$ and $\eta$. By the product of the distributions $P_\xi(\cdot)$ and $P_\eta(\cdot)$, denoted by $P_{\xi\times\eta}(\cdot)$, we mean the distribution defined on $S_x \times S_y$ by

$$P_{\xi\times\eta}(E \times F) = P_\xi(E)\, P_\eta(F) \quad (14)$$

for $E \in S_x$ and $F \in S_y$. If the joint distribution $P_{\xi\eta}(\cdot)$ coincides with the product distribution $P_{\xi\times\eta}(\cdot)$, the random variables $\xi$ and $\eta$ are said to be independent. If $\xi$ and $\eta$ are discrete random variables, say $X$ and $Y$ contain countably many points $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, then

$$I(\xi, \eta) = \sum_{i,j} P_{\xi\eta}(x_i, y_j) \log \frac{P_{\xi\eta}(x_i, y_j)}{P_\xi(x_i)\, P_\eta(y_j)}. \quad (15)$$

$I$ is called the information of $\xi$ and $\eta$ with respect to each other.
3.2.4. A Modern Definition in Information Theory. Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$. Let $Y$ be a discrete random variable with alphabet $\mathcal{Y}$ and probability mass function $p(y) = \Pr\{Y = y\}$, $y \in \mathcal{Y}$. Suppose $X$ and $Y$ have a joint mass function (joint distribution) $p(x, y)$. Then the mutual information $I(X, Y)$ can be defined as

$$I(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \quad (16)$$
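A direct transcription of (16) into code (a sketch; the alphabet and probabilities are invented) also makes explicit the point criticized later in the paper: under this definition, the marginals $p(x)$ and $p(y)$ are recomputed from the joint mass function.

```python
from collections import defaultdict
from math import log2

def mutual_information(pxy):
    """I(X, Y) as in (16); pxy maps (x, y) -> p(x, y).
    Note that the marginals p(x), p(y) are derived from the joint."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# X determines Y completely here, so I(X, Y) = H(X) = 1 bit.
print(mutual_information({("a", 0): 0.5, ("b", 1): 0.5}))  # 1.0
```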
3.3. Class 2 Definitions. In Class 2 definitions, random variables are replaced by ensembles, and mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results from Fano [9] and Abramson [10].
Suppose that discrete ensemble $X$ has a sample space $\{a_1, a_2, \ldots, a_K\}$ and discrete ensemble $Y$ has a sample space $\{b_1, b_2, \ldots, b_J\}$. Consider the joint sample space $\{(a_k, b_j) : 1 \le k \le K, 1 \le j \le J\}$. A probability measure on the joint sample space is given by the joint probability $P_{XY}(a_k, b_j)$, defined for $1 \le k \le K$, $1 \le j \le J$. The combination of a joint sample space and a probability measure for outcomes $x$ and $y$ is called a joint $XY$ ensemble. Then the marginal probabilities can be found as

$$P_X(a_k) = \sum_{j=1}^{J} P_{XY}(a_k, b_j), \quad k = 1, 2, \ldots, K. \quad (17)$$

In more abbreviated notation this is written as

$$P(x) = \sum_y P(x, y). \quad (18)$$

Likewise,

$$P_Y(b_j) = \sum_{k=1}^{K} P_{XY}(a_k, b_j), \quad j = 1, 2, \ldots, J. \quad (19)$$

In more abbreviated notation this is written as

$$P(y) = \sum_x P(x, y). \quad (20)$$

If $P_X(a_k) > 0$, the conditional probability that the outcome of $y$ is $b_j$, given that the outcome of $x$ is $a_k$, is defined as

$$P_{Y|X}(b_j \mid a_k) = \frac{P_{XY}(a_k, b_j)}{P_X(a_k)}. \quad (21)$$
The mutual information between the events $x = a_k$ and $y = b_j$ is defined as

$$I_{X;Y}(a_k; b_j) = \log \frac{P_{X|Y}(a_k \mid b_j)}{P_X(a_k)} = \log \frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)} = \log \frac{P_{Y|X}(b_j \mid a_k)}{P_Y(b_j)} = I_{Y;X}(b_j; a_k). \quad (22)$$
Since the mutual information defined above is a random variable on the joint $XY$ ensemble, its mean value, called the average mutual information and denoted by $I(X, Y)$, is given by

$$I(X, Y) = \sum_{k=1}^{K} \sum_{j=1}^{J} P_{XY}(a_k, b_j) \log \frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)}. \quad (23)$$
Remark 12. By means of an information channel consisting of a transmitter with alphabet $A$, with elements $a_i$ and total number of elements $t$, and a receiver with alphabet $B$, with elements $b_j$ and total number of elements $r$, Abramson [10] denoted $H(A) - H(A \mid B) = \sum_{A,B} P(a, b) \log\left(P(a, b) / (P(a) P(b))\right)$ by $I(A, B)$ and called it the mutual information of $A$ and $B$.
The mutual information $I(X, Y)$ between two continuous random variables $X$ and $Y$ [8] (also called the rate of transmission in [1]) is defined as

$$I(X, Y) = \iint P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}\, dx\, dy, \quad (24)$$

where $P(x, y)$ is the joint probability density function of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal density functions associated with $X$ and $Y$, respectively. The mutual information between two continuous random variables is also called the differential mutual information.

However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.
4. A New Unified Definition of Mutual Information
In Section 3 we reviewed various definitions of mutual information. Shannon's original definition laid the foundation of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.
However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine the marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.
4.1. Unified Definition of Mutual Information. Let $X$ be a finite discrete random variable on a discrete probability space $(\Omega_1, \mathcal{F}_1, P_1)$ with $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\}$ and range $\{x_1, x_2, \ldots, x_K\}$ with $K \le n$. Let $Y$ be a discrete random variable on a probability space $(\Omega_2, \mathcal{F}_2, P_2)$ with $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\}$ and range $\{y_1, y_2, \ldots, y_L\}$ with $L \le m$.
If $X$ and $Y$ have the same probability space $(\Omega, \mathcal{F}, P)$, then the joint distribution is simply

$$P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x, Y(\omega) = y\}). \quad (25)$$

However, when $X$ and $Y$ have different probability spaces, and hence different probability measures, the joint distribution is more complicated.
Definition 13. The joint sample space of random variables $X$ and $Y$ is defined as the product $\Omega_1 \times \Omega_2$ of all pairs $(\omega_i, \rho_j)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is defined as the product of all pairs $(A_1, A_2)$, where $A_1$ and $A_2$ are elements of $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. A joint probability measure $P_{XY}$ of $P_1$ and $P_2$ is a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$, $P_{XY}(A \times B)$, such that for any $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$,

$$P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}),$$
$$P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B). \quad (26)$$

$(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})$ is called the joint probability space of $X$ and $Y$, and $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$, the joint distribution of $X$ and $Y$.
Combining Definitions 2 and 13, we immediately obtain the following result.
Proposition 14. A sequence of nonnegative numbers $p_{ij}$, $1 \le i \le K$, $1 \le j \le L$, whose sum is 1 can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ with $P_{XY}(\omega_i, \rho_j) = p_{ij}$. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. If, in addition, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$ the following hold:

$$\sum_{j=1}^{L} p_{ij} = P_1(\omega_i), \qquad \sum_{i=1}^{K} p_{ij} = P_2(\rho_j), \quad (27)$$

then $P_{XY}(\omega_i, \rho_j) = p_{ij}$ is a joint distribution of $X$ and $Y$.
For convenience, from now on we will shorten $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ to $P_{XY}(x_i, y_j)$.

This two-dimensional measure should not be confused with the one-dimensional joint distribution when $X$ and $Y$ have the same probability space.
Remark 15. If $(\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2)$, instead of using the two-dimensional measure $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ we may use the one-dimensional measure $P_1(\{X = x_i\} \cap \{Y = y_j\})$. Then (26) always holds. In this sense, our new definition of the joint distribution reduces to the definition of the joint distribution on a common probability space.
Definition 16. The conditional probability of $Y = y_j$ given $X = x_i$ is defined as

$$P_{Y|X}(Y = y_j \mid X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (28)$$
Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure or simply the product distribution.
Proof. Let random variables $X$ and $Y$ be defined as before. Define a function from $\Omega_1 \times \Omega_2$ to $[0, 1]$ as follows:

$$P_{XY}(\omega_i, \rho_j) = P_1(\omega_i)\, P_2(\rho_j). \quad (29)$$

Then

$$\sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(\omega_i)\, P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i) \sum_{j=1}^{m} P_2(\rho_j) = 1. \quad (30)$$

Hence $P_{XY}$ can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ by Definition 2. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. Moreover, for any $A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1$ of $s$ elements,

$$P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\omega_{i_u}, \rho_j) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u})\, P_2(\rho_j) = \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u}) = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A). \quad (31)$$

Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \subseteq \Omega_2$. Hence $P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i)\, P_2(Y = y_j)$ is a joint probability measure of $X$ and $Y$ by Definition 13.
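The construction in the proof of Theorem 17 can be checked numerically; the following sketch (the marginal values are chosen arbitrarily) forms the product measure (29) and verifies the total-mass identity (30) and the marginal conditions (26).

```python
# Marginal measures P1 on Omega1 and P2 on Omega2 (values are illustrative).
P1 = [0.2, 0.5, 0.3]   # P1(w_i)
P2 = [0.6, 0.4]        # P2(r_j)

# Product joint measure (29): P_XY(w_i, r_j) = P1(w_i) * P2(r_j).
PXY = [[p1 * p2 for p2 in P2] for p1 in P1]

# Total mass is 1, as in (30).
total = sum(sum(row) for row in PXY)

# Marginal conditions (26): row sums recover P1, column sums recover P2.
rows = [sum(row) for row in PXY]
cols = [sum(PXY[i][j] for i in range(len(P1))) for j in range(len(P2))]

print(total)  # 1.0, up to rounding
print(rows)   # recovers P1, up to rounding
print(cols)   # recovers P2, up to rounding
```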
Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $P_{XY}(\cdot)$ if $P_{XY}(\cdot)$ coincides with the product distribution $P_{X \times Y}(\cdot)$.
Definition 19. The joint entropy $H(X, Y)$ is defined as

$$H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \quad (32)$$
Definition 20. The conditional entropy $H(Y \mid X)$ is defined as follows:

$$H(Y \mid X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (33)$$
Definition 21. The mutual information $I(X, Y)$ between $X$ and $Y$ is defined as

$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)}. \quad (34)$$
As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, $I(X, Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be 0; this corresponds to the limit of $x \log x$ as $x$ goes to 0.
It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X: \Omega_1 \to \mathbb{R}$ and $Y: \Omega_2 \to \mathbb{R}$ as one-to-one mappings:

$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K, \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \quad (35)$$

Then we have

$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \quad (36)$$
It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used for feature selection, as seen later. In addition, our new definition leads to different values for different joint distributions, as demonstrated in the following example.
Example 22. Assume random variables $X$ and $Y$ have the following probability distributions:

$$P_1(X = 1) = \frac{1}{3}, \qquad P_1(X = 2) = \frac{1}{3}, \qquad P_1(X = 3) = \frac{1}{3},$$
$$P_2(Y = 0) = \frac{1}{3}, \qquad P_2(Y = 1) = \frac{2}{3}. \quad (37)$$

We can generate four different joint probability distributions, which do not all yield the same mutual information. However, under all the existing definitions, a joint distribution must be given in order to find the mutual information:

(1) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 1/3$, $P(2, 1) = 0$, $P(3, 0) = 0$, $P(3, 1) = 1/3$;

(2) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 1/3$, $P(3, 1) = 0$;

(3) $P(1, 0) = 1/3$, $P(1, 1) = 0$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 0$, $P(3, 1) = 1/3$;

(4) $P(1, 0) = 1/9$, $P(1, 1) = 2/9$, $P(2, 0) = 1/9$, $P(2, 1) = 2/9$, $P(3, 0) = 1/9$, $P(3, 1) = 2/9$.
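With the marginals fixed by (37), the mutual information (34) of each joint distribution in Example 22 can be computed directly. A sketch (base-2 logarithm assumed; the helper name `mutual_info` is ours):

```python
from math import log2

# Marginals from (37): P(X = 1, 2, 3) = 1/3 each; P(Y = 0) = 1/3, P(Y = 1) = 2/3.
px = {1: 1/3, 2: 1/3, 3: 1/3}
py = {0: 1/3, 1: 2/3}

# The four joint distributions of Example 22, as {(x, y): probability}.
joints = [
    {(1, 1): 1/3, (2, 0): 1/3, (3, 1): 1/3},          # (1)
    {(1, 1): 1/3, (2, 1): 1/3, (3, 0): 1/3},          # (2)
    {(1, 0): 1/3, (2, 1): 1/3, (3, 1): 1/3},          # (3)
    {(x, y): px[x] * py[y] for x in px for y in py},  # (4), the product measure
]

def mutual_info(pxy):
    """Mutual information (34) in bits; the marginals are given, not derived."""
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

for pxy in joints:
    print(round(mutual_info(pxy), 4))
```

Running this shows that the first three joints share the value $(1/3)\log_2(27/4) \approx 0.918$ bits, while the product distribution (4) gives 0: the value of the newly defined mutual information depends on which joint measure is chosen.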
4.2. Properties of the Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].

Definition 23. The relative entropy, or Kullback-Leibler distance, between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as

$$D(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}. \quad (38)$$

Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P \| Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.

Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
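Remark 25's caveat is easy to verify numerically; the following sketch (the distributions are invented) evaluates (38) in both directions:

```python
from math import log2

def kl(P, Q):
    """Kullback-Leibler distance (38) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

print(kl(P, Q))  # D(P||Q)
print(kl(Q, P))  # D(Q||P): a different value, so D is not symmetric
print(kl(P, P))  # 0.0, consistent with Lemma 24
```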
The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.
Proof. Using a mapping from two-dimensional indices to a one-dimensional index,

$$(i, j) \longmapsto (i - 1)L + j \triangleq n, \quad i = 1, \ldots, K,\; j = 1, 2, \ldots, L, \quad (39)$$

and another mapping from the one-dimensional index back to two-dimensional indices,

$$i = \left\lceil \frac{n}{L} \right\rceil, \qquad j = n - (i - 1)L, \quad n = 1, 2, \ldots, KL, \quad (40)$$

we rewrite $I(X, Y)$ as

$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right) \log \frac{P_{XY}\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right)}{P_1\left(X = x_{\lceil n/L \rceil}\right) P_2\left(Y = y_{n - (\lceil n/L \rceil - 1)L}\right)}. \quad (41)$$

Since

$$\sum_{n=1}^{KL} P_{XY}\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1,$$
$$\sum_{n=1}^{KL} P_1\left(X = x_{\lceil n/L \rceil}\right) P_2\left(Y = y_{n - (\lceil n/L \rceil - 1)L}\right) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \quad (42)$$

both sequences in (41) are probability distributions indexed by $n$, and we obtain

$$I(X, Y) = \sum_{n=1}^{KL} P_{XY}\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right) \log \frac{P_{XY}\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right)}{P_1\left(X = x_{\lceil n/L \rceil}\right) P_2\left(Y = y_{n - (\lceil n/L \rceil - 1)L}\right)} = D(P_{XY} \| P_1 P_2) \quad (43)$$

by Definition 23.
Property 2. Let X and Y be two discrete random variables. The mutual information between X and Y satisfies

I(X; Y) ≥ 0, (44)

with equality if and only if X and Y are independent.
Proof. Let us use the mappings between 2-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, I(X; Y) ≥ 0, with equality if and only if P_XY(x_{⌈n/L⌉}, y_{n−(⌈n/L⌉−1)L}) = P_1(X = x_{⌈n/L⌉}) P_2(Y = y_{n−(⌈n/L⌉−1)L}) for n = 1, 2, ..., KL; that is, P_XY(x_i, y_j) = P_1(X = x_i) P_2(Y = y_j) for i = 1, ..., K and j = 1, 2, ..., L, or X and Y are independent.
Corollary 26. If X is a constant random variable, that is, K = 1, then for any random variable Y,

I(X; Y) = 0. (45)

Proof. Suppose the range of X is a constant x and the sample space has only one point ω. Then P_1(X = x) = P_1(ω) = 1. For any j = 1, 2, ..., L,

P_XY(x, y_j) = Σ_{i=1}^{1} P_XY(x, y_j) = P_2(Y = y_j) = P_1(X = x) P_2(Y = y_j). (46)

Thus X and Y are independent. By Property 2, I(X; Y) = 0.
Lemma 27 (see [8]). Let X be a discrete random variable with K values. Then

0 ≤ H(X) ≤ log K, (47)

with equality if and only if the K values are equally probable.
Property 3. Let X and Y be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:

I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = I(Y; X). (48)
Proof. Consider

I(X; Y) = Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(x_i, y_j) log [P_XY(x_i, y_j) / (P_1(X = x_i) P_2(Y = y_j))]
        = Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(x_i, y_j) log [P_{Y|X}(x_i, y_j) / P_2(Y = y_j)]
        = −Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(x_i, y_j) log P_2(Y = y_j) − (−Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(x_i, y_j) log P_{Y|X}(x_i, y_j))
        = −Σ_{j=1}^{L} Σ_{i=1}^{K} P_XY(x_i, y_j) log P_2(Y = y_j) − (−Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(x_i, y_j) log P_{Y|X}(x_i, y_j))
        = −Σ_{j=1}^{L} P_2(Y = y_j) log P_2(Y = y_j) − (−Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(x_i, y_j) log P_{Y|X}(x_i, y_j))
        = H(Y) − H(Y | X). (49)

The identity I(X; Y) = H(X) − H(X | Y) = I(Y; X) follows in the same way, with the roles of X and Y interchanged.
Combining the above properties and noting that H(X | Y) and H(Y | X) are both nonnegative, we obtain the following property.

Property 4. Let X and Y be two discrete random variables with K and L values, respectively. Then

0 ≤ I(X; Y) ≤ H(Y) ≤ log L,
0 ≤ I(X; Y) ≤ H(X) ≤ log K. (50)

Moreover, I(X; Y) = 0 if and only if X and Y are independent.
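As a quick numerical sanity check (illustrative Python with a made-up 3 × 2 joint distribution, not an example from the paper), both chains of inequalities in (50) can be verified directly:

```python
import math

def entropy(dist):
    """H = -sum_i p_i * log(p_i), with 0*log0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def mutual_information(joint):
    """joint[i][j] holds p_ij; the marginals are recovered by summing rows/columns."""
    K, L = len(joint), len(joint[0])
    p_row = [sum(joint[i]) for i in range(K)]
    p_col = [sum(joint[i][j] for i in range(K)) for j in range(L)]
    mi = 0.0
    for i in range(K):
        for j in range(L):
            if joint[i][j] > 0:
                mi += joint[i][j] * math.log(joint[i][j] / (p_row[i] * p_col[j]))
    return mi, p_row, p_col

# Hypothetical joint distribution: K = 3 values of X, L = 2 values of Y.
joint = [[0.20, 0.10],
         [0.05, 0.25],
         [0.30, 0.10]]
mi, p_row, p_col = mutual_information(joint)
hx, hy = entropy(p_row), entropy(p_col)

assert 0 <= mi <= hy <= math.log(2)  # 0 <= I(X;Y) <= H(Y) <= log L
assert 0 <= mi <= hx <= math.log(3)  # 0 <= I(X;Y) <= H(X) <= log K
```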
5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set S contains N examples (also known as samples, observations, or records) from an input space X = {X_1, X_2, ..., X_M} and their associated output values y from an output space Y (i.e., the dependent variable). Here X_1, X_2, ..., X_M are called features, that is, independent variables. Hence, S can be expressed as

S = {(x_i1, x_i2, ..., x_iM, y_i) : i = 1, 2, ..., N}, (51)

where feature X_j has values x_1j, x_2j, ..., x_Nj for j = 1, 2, ..., M.

A fundamental objective in machine learning is to find a functional relationship between the input X and the output Y. In general, there is a very large number of features, many of which are not needed. Sometimes the output Y is not determined by the complete set of input features X_1, X_2, ..., X_M; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) evaluate the relationship between each individual input feature X_i and the output Y;
(2) select the best set of attributes according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures the dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature X and the output Y. Assume X has K different values ω_1, ω_2, ..., ω_K; if X has missing values, we will use ω_1 to represent all the missing values. Assume Y has L different values ρ_1, ρ_2, ..., ρ_L.

Let us build a two-way frequency, or contingency, table by making X the row variable and Y the column variable, as in [8]. Let O_ij be the frequency (which could be 0) of (ω_i, ρ_j) for i = 1 to K and j = 1 to L, and let the row and column marginal totals be n_i· and n_·j, respectively. Then

n_i· = Σ_j O_ij,
n_·j = Σ_i O_ij,
N = Σ_i Σ_j O_ij = Σ_i n_i· = Σ_j n_·j. (52)

Let us denote the relative frequency O_ij/N by p_ij. We then have the two-way relative frequency table; see Table 2. Since

Σ_{i=1}^{K} Σ_{j=1}^{L} p_ij = Σ_{i=1}^{K} p_i· = Σ_{j=1}^{L} p_·j = 1, (53)

{p_i·}_{i=1}^{K}, {p_·j}_{j=1}^{L}, and {p_ij} can each serve as a probability measure.
Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables:

X: (Ω_1, F_1, P_X) → R, (54)

with X(ω_i) = x_i, where Ω_1 = {ω_1, ω_2, ..., ω_K} and P_X(ω_i) = n_i·/N = p_i· for i = 1, 2, ..., K. Note that x_1, x_2, ..., x_K could be any real numbers, as long as they are distinct, to guarantee that X is a one-to-one mapping. In this case, P_X(X = x_i) = P_X(ω_i).
Table 1: Frequency table.

         ρ_1    ρ_2    ···    ρ_j    ···    ρ_L    Total
ω_1      O_11   O_12   ···    O_1j   ···    O_1L   n_1·
ω_2      O_21   O_22   ···    O_2j   ···    O_2L   n_2·
···      ···    ···    ···    ···    ···    ···    ···
ω_i      O_i1   O_i2   ···    O_ij   ···    O_iL   n_i·
···      ···    ···    ···    ···    ···    ···    ···
ω_K      O_K1   O_K2   ···    O_Kj   ···    O_KL   n_K·
Total    n_·1   n_·2   ···    n_·j   ···    n_·L   N
Table 2: Relative frequency table.

         ρ_1    ρ_2    ···    ρ_j    ···    ρ_L    Total
ω_1      p_11   p_12   ···    p_1j   ···    p_1L   p_1·
ω_2      p_21   p_22   ···    p_2j   ···    p_2L   p_2·
···      ···    ···    ···    ···    ···    ···    ···
ω_i      p_i1   p_i2   ···    p_ij   ···    p_iL   p_i·
···      ···    ···    ···    ···    ···    ···    ···
ω_K      p_K1   p_K2   ···    p_Kj   ···    p_KL   p_K·
Total    p_·1   p_·2   ···    p_·j   ···    p_·L   1
Similarly,

Y: (Ω_2, F_2, P_Y) → R, (55)

with Y(ρ_j) = y_j, where Ω_2 = {ρ_1, ρ_2, ..., ρ_L} and P_Y(ρ_j) = n_·j/N = p_·j for j = 1, 2, ..., L. Also, y_1, y_2, ..., y_L could be any real numbers, as long as they are distinct, to guarantee that Y is a one-to-one mapping. In this case, P_Y(Y = y_j) = P_Y(ρ_j).

Now define a mapping P_XY from Ω_1 × Ω_2 to R as follows:

P_XY(ω_i, ρ_j) = p_ij = O_ij / N. (56)

Since

Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(ω_i, ρ_j) = 1,
Σ_{j=1}^{L} P_XY(ω_i, ρ_j) = Σ_{j=1}^{L} p_ij = p_i· = P_X(ω_i),
Σ_{i=1}^{K} P_XY(ω_i, ρ_j) = Σ_{i=1}^{K} p_ij = p_·j = P_Y(ρ_j), (57)

{p_ij} is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:
I(X; Y) = Σ_{i=1}^{K} Σ_{j=1}^{L} P_XY(ω_i, ρ_j) log [P_XY(ω_i, ρ_j) / (P_X(ω_i) P_Y(ρ_j))]
        = Σ_{i=1}^{K} Σ_{j=1}^{L} p_ij log [p_ij / (p_i· p_·j)]. (58)
It follows from Corollary 26 that if X has only one value, then I(X; Y) = 0. On the other hand, if X has all distinct values, the following result shows that the mutual information reaches its maximum value.

Proposition 28. If all the values of X are distinct, then I(X; Y) = H(Y).
Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2, we observe the following:

(1) O_ij = 0 or 1 for all i = 1, 2, ..., K and j = 1, 2, ..., L;
(2) p_ij = O_ij/N = 0 or 1/N for all i = 1, 2, ..., K and j = 1, 2, ..., L;
(3) for each j = 1, 2, ..., L, since O_1j + O_2j + ··· + O_Kj = n_·j, there are n_·j nonzero O_ij's, or, equivalently, n_·j nonzero p_ij's;
(4) p_i· = 1/N for i = 1, 2, ..., K.

Using the above observations and the fact that 0 log 0 = 0, we have
I(X; Y) = Σ_{i=1}^{K} Σ_{j=1}^{L} p_ij log [p_ij / (p_i· p_·j)]
        = Σ_{i=1}^{K} p_i1 log [p_i1 / (p_i· p_·1)] + Σ_{i=1}^{K} p_i2 log [p_i2 / (p_i· p_·2)] + ··· + Σ_{i=1}^{K} p_iL log [p_iL / (p_i· p_·L)]
        = Σ_{p_i1 ≠ 0} (1/N) log [(1/N) / (p_·1/N)] + Σ_{p_i2 ≠ 0} (1/N) log [(1/N) / (p_·2/N)] + ··· + Σ_{p_iL ≠ 0} (1/N) log [(1/N) / (p_·L/N)]
        = Σ_{p_i1 ≠ 0} (1/N) log (1/p_·1) + Σ_{p_i2 ≠ 0} (1/N) log (1/p_·2) + ··· + Σ_{p_iL ≠ 0} (1/N) log (1/p_·L)
        = (n_·1/N) log (1/p_·1) + (n_·2/N) log (1/p_·2) + ··· + (n_·L/N) log (1/p_·L)
        = p_·1 log (1/p_·1) + p_·2 log (1/p_·2) + ··· + p_·L log (1/p_·L)
        = H(Y). (59)
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring is used to describe the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable (good and bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that the mutual information I(X; Y) is nonnegative and upper bounded by log L, and that I(X; Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with Y is larger than some threshold, based on needs.
5.2.2. Relative Values. From Property 4, we have 0 ≤ I(X; Y)/H(Y) ≤ 1. Note that I(X; Y)/H(Y) is the relative mutual information, which measures how much information X captures from Y. Thus, we may select those features whose relative mutual information I(X; Y)/H(Y) is larger than some threshold between 0 and 1, based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].

Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then

p ln (p/q) ≈ (p − q) + (p − q)² / (2q). (60)

The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation is.
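Lemma 29 can be checked numerically. The sketch below (illustrative Python, not from the paper) shows the approximation error shrinking as p approaches q:

```python
import math

def approx(p, q):
    """Right-hand side of (60): (p - q) + (p - q)**2 / (2*q)."""
    return (p - q) + (p - q) ** 2 / (2 * q)

# The closer p is to q, the better (p - q) + (p - q)^2 / (2q) tracks p*ln(p/q).
for p, q in [(0.50, 0.49), (0.50, 0.40), (0.50, 0.20)]:
    exact = p * math.log(p / q)
    print(p, q, exact, approx(p, q), abs(exact - approx(p, q)))
```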
Now let us denote N × I(X; Y) by Ī(X; Y). Then, applying Lemma 29, we obtain

2Ī(X; Y) = 2N Σ_{i=1}^{K} Σ_{j=1}^{L} p_ij ln [p_ij / (p_i· p_·j)]
         = 2 Σ_{i=1}^{K} Σ_{j=1}^{L} O_ij ln [(O_ij/N) / ((n_i·/N)(n_·j/N))]
         = 2 Σ_{i=1}^{K} Σ_{j=1}^{L} O_ij ln [O_ij / (n_i· n_·j / N)]
         ≈ 2 Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N) + Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N)² / (n_i· n_·j / N)
         = 2 Σ_{i=1}^{K} Σ_{j=1}^{L} O_ij − (2/N) (Σ_i n_i·) (Σ_j n_·j) + Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N)² / (n_i· n_·j / N)
         = 2N − 2N + Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N)² / (n_i· n_·j / N)
         = Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N)² / (n_i· n_·j / N)
         = χ². (61)
The last equation means that the expression Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N)² / (n_i· n_·j / N) follows a χ² distribution. According to [5], it follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. Hence 2N × I(X; Y) approximately follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows us to use the Chi-square distribution to assign a significance level corresponding to the values of the mutual information and (K − 1)(L − 1).

The null and alternative hypotheses are as follows:

H_0: X and Y are independent (i.e., there is no relationship between them).
H_1: X and Y are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic

Σ_{i=1}^{K} Σ_{j=1}^{L} (O_ij − n_i· n_·j / N)² / (n_i· n_·j / N) ≈ 2N × I(X; Y) (62)

is greater than χ²_U, the upper-tail critical value from a Chi-square distribution with (K − 1)(L − 1) degrees of freedom. That is:

Select feature X if I(X; Y) > χ²_U / (2N). (63)

Take credit scoring, for example. In this case L = 2. Assume feature X has 10 different values; that is, K = 10. Using a level of significance of α = 0.05, we find χ²_U to be 16.9 from a Chi-square table with (K − 1)(L − 1) = 9 degrees of freedom, and we select this feature only if I(X; Y) > 16.9/(2N).
Assume a training set has N examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance α, say 0.05.
(ii) Step 2. Find K, the number of values of feature X.
(iii) Step 3. Build the contingency table for X and Y.
(iv) Step 4. Calculate I(X; Y) from the contingency table.
(v) Step 5. Find χ²_U with (K − 1)(L − 1) degrees of freedom from a Chi-square table or any other source, such as SAS.
(vi) Step 6. Select X if I(X; Y) > χ²_U/(2N) and discard it otherwise.
(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance α and reselect features using the procedure.
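The steps above can be sketched end to end (illustrative Python with made-up toy data; χ²_U is supplied from a table, as in Step 5; for the toy feature K = L = 2, so df = 1 and χ²_U ≈ 3.84 at α = 0.05):

```python
import math
from collections import Counter

def mutual_information_nats(xs, ys):
    """Steps 3 and 4: build contingency counts and compute I(X;Y) with natural logs."""
    N = len(xs)
    joint = Counter(zip(xs, ys))
    nx, ny = Counter(xs), Counter(ys)
    return sum((o / N) * math.log((o / N) / ((nx[i] / N) * (ny[j] / N)))
               for (i, j), o in joint.items())

def select_feature(xs, ys, chi2_u):
    """Step 6: keep the feature iff I(X;Y) > chi2_U / (2N)."""
    return mutual_information_nats(xs, ys) > chi2_u / (2 * len(xs))

# Toy data: one feature that copies Y exactly versus a balanced pattern
# that is independent of Y; 3.84 is chi2_U for df = 1 at alpha = 0.05.
ys = [0, 1] * 50
informative = ys[:]
noise = [0, 1, 1, 0] * 25

print(select_feature(informative, ys, 3.84))  # selected
print(select_feature(noise, ys, 3.84))        # discarded
```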
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X; Y), the more dependent X and Y are. However, Proposition 28 says that if X has all distinct values, then I(X; Y) will reach the maximum value H(Y), and I(X; Y)/H(Y) will reach the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first. Based on the binned values, the mutual information is calculated again. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.
(ii) Step 2. Do binning for the rest of the numerical features.
(iii) Step 3. Select features by mutual information.
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with p values less than some specific value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into a bin. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models, such as the gradient boosting model [22]. Moreover, information value depends on binning algorithms and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
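For comparison, the weight-of-evidence-based information value referenced above can be sketched as follows (illustrative Python with hypothetical bin counts; this is the standard IV formula from the credit-scoring literature, not code from the paper):

```python
import math

def information_value(goods, bads):
    """IV = sum over bins of (g_i - b_i) * ln(g_i / b_i), where g_i and b_i are
    the bin's shares of all good and all bad accounts (standard WoE-based IV)."""
    G, B = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        if g > 0 and b > 0:
            gi, bi = g / G, b / B
            iv += (gi - bi) * math.log(gi / bi)
    return iv

# Hypothetical binned feature: counts of good and bad accounts per bin.
goods = [400, 300, 200, 100]
bads = [20, 40, 60, 80]
print(information_value(goods, bads))  # compare against the 0.02 rule of thumb
```

Note how the result depends entirely on the chosen bins: merging or splitting bins changes the IV, which is the binning sensitivity discussed above.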
6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References
[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, USA, 1964).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doléans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
they calculated the mean value of this random variable and called the result the mean information of two ensembles.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Moreover, except for Pinsker's definition, Class 2 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume that a joint distribution, or a joint probability measure, exists. Yet they all ignore an important fact: the joint distribution, or the joint probability measure, is not unique.

In this paper, we first present a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different mutual information. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.

The rest of the paper is organized as follows. In Section 2, we briefly review the basic concepts of probability theory. In Section 3, we examine various definitions of mutual information. In Section 4, we first propose a new unified definition of mutual information and then establish some properties of the newly defined mutual information. In Section 5, we first propose a method to calculate mutual information in machine learning; we then apply the newly defined mutual information to credit scoring. The paper is concluded in Section 6.

Throughout the paper, we will restrict our focus to mutual information for finite discrete random variables.
2. Basic Concepts in Probability Theory

Let us review some basic concepts of probability theory. They can be found in many books on probability theory, such as [12].
Definition 1. A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

(1) $\Omega$ is a set called a sample space; elements of $\Omega$ are denoted by $\omega$ and are called outcomes;

(2) $\mathcal{F}$ is a $\sigma$-field of subsets of $\Omega$; elements of $\mathcal{F}$ are called events;

(3) $P$ is called a probability measure; it is a mapping from $\mathcal{F}$ to $[0, 1]$ with $P(\Omega) = 1$ such that if $A_1, A_2, \ldots$ are pairwise disjoint, then

$$P\Big(\bigcup_i A_i\Big) = \sum_i P(A_i). \quad (1)$$
Definition 2. A discrete probability space is a probability space $(\Omega, \mathcal{F}, P)$ such that $\Omega$ is finite or countable: $\Omega = \{\omega_1, \omega_2, \ldots\}$. In this case, $\mathcal{F}$ is chosen to be all the subsets of $\Omega$, and the probability measure $P$ can be defined in terms of a sequence of nonnegative numbers $p_1, p_2, \ldots$ whose sum is 1. If $A$ is any subset of $\Omega$, then

$$P(A) = \sum_{\omega_i \in A} p_i. \quad (2)$$

In particular,

$$P(\{\omega_i\}) = p_i. \quad (3)$$
For simplicity, we will write $P(\{\omega\})$ as $P(\omega)$. From Definition 2, we see that for a discrete probability space the probability measure is characterized by the pointwise mapping $p: \{\omega_1, \omega_2, \ldots\} \to [0, 1]$ in (2). The probability of an event $A$ is computed simply by adding the probabilities of the individual points of $A$.
Definition 3. A random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $(-\infty, \infty)$ such that for every Borel set $B$, $X^{-1}(B) = \{X \in B\} \in \mathcal{F}$. Here we use the notation $\{X \in B\} = \{\omega \in \Omega : X(\omega) \in B\}$.
Definition 4. If $X$ is a random variable, then for every Borel subset $B$ of $\mathbb{R}$ we define a function by $\mu_X(B) = P(X \in B) = P(X^{-1}(B))$. Then $\mu_X$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and is called the probability distribution of $X$.
Definition 5. A random variable $X$ is discrete if its range is finite or countable. In particular, any random variable on a discrete probability space is discrete, since $\Omega$ is countable.
Definition 6. A (discrete) random variable $X$ on a discrete probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $\mathbb{R}$, where $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $\mathbb{R}$ is the set of real numbers. If the range of $X$ is $\{x_1, x_2, \ldots\}$, then the function $f_X: \{x_1, x_2, \ldots\} \to [0, 1]$ defined by

$$f_X(x_i) = P(X = x_i) \quad (4)$$

is called the probability mass function of $X$, whereas the probabilities $P(X = x_1), P(X = x_2), \ldots$ are called the probability distribution of $X$.
Note that, by Definition 2,

$$\sum_i P(X = x_i) = 1. \quad (5)$$

Thus a discrete random variable may be characterized by its probability mass function.
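As a small illustration of Definitions 2-6 (the sample space, the probabilities, and the helper names `prob_event` and `pmf` below are our own, not from the paper), a discrete probability measure and the induced probability mass function can be sketched as:

```python
# Sketch of a discrete probability space (Definition 2): pointwise
# probabilities p(omega) on a finite, hypothetical sample space.
p = {"w1": 0.25, "w2": 0.25, "w3": 0.5}

def prob_event(A):
    """P(A) for an event A (a subset of the sample space), per Eq. (2)."""
    return sum(p[w] for w in A)

# A random variable on this space (Definition 6) and its pmf (Eq. (4)).
X = {"w1": 1.0, "w2": 2.0, "w3": 2.0}

def pmf(x):
    """f_X(x) = P(X = x): sum p over the outcomes that X maps to x."""
    return prob_event({w for w in p if X[w] == x})

assert abs(sum(p.values()) - 1.0) < 1e-12   # the p_i sum to 1
print(prob_event({"w1", "w3"}))             # 0.75
print(pmf(2.0))                             # P(X = 2) = 0.25 + 0.5 = 0.75
```

Here `pmf` realizes (4) by adding the probabilities of the individual points, exactly as in (2).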
3. Various Definitions of Mutual Information
Since Shannon's pioneering work [4], there have been various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.
Mathematical Problems in Engineering 3
3.1. Shannon's Original Definition
Definition 7. Let $x$ be a chance variable with probabilities $p_1, p_2, \ldots, p_n$ whose sum is 1. Then

$$H(x) = -\sum_{i=1}^{n} p_i \log p_i \quad (6)$$

is called the entropy of $x$.

Suppose two chance variables $x$ and $y$ have $m$ and $n$ possibilities, respectively. Let the indices $i$ and $j$ range over all the $m$ possibilities and all the $n$ possibilities, respectively. Let $p(i)$ be the probability of $i$ and $p(i, j)$ the probability of the joint occurrence of $i$ and $j$. Denote the conditional probability of $i$ given $j$ by $p(i \mid j)$ and the conditional probability of $j$ given $i$ by $p(j \mid i)$.
Definition 8. The joint entropy of $x$ and $y$ is defined as

$$H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j). \quad (7)$$
Definition 9. The conditional entropy of $y$, $H_x(y)$, is defined as

$$H_x(y) = -\sum_{i,j} p(i, j) \log p(j \mid i) = -\sum_{i,j} p(i, j) \log \frac{p(i, j)}{\sum_j p(i, j)}. \quad (8)$$

The conditional entropy of $x$, $H_y(x)$, can be defined similarly.
Then the following relations hold:

$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x), \qquad H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \quad (9)$$
Definition 10. The rate of transmission of information, $R$, is defined as the difference between $H(x)$ and $H_y(x)$. Then $R$ can be written in two other forms:

$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \quad (10)$$
Remark 11. Shannon did not derive the explicit formula for $R$,

$$R = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i) p(j)}. \quad (11)$$

However, he did imply it in Appendix 7 of [4].
3.2. Class 1 Definitions
3.2.1. Kullback's Definition. Kullback [5] redefined entropy more mathematically, in a standalone homework question, as follows. Consider two discrete random variables $x$ and $y$, where

$$p_{ij} = \mathrm{Prob}(x = x_i, y = y_j) > 0, \quad i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, n,$$
$$p_{i\cdot} = \sum_{j=1}^{n} p_{ij}, \qquad p_{\cdot j} = \sum_{i=1}^{m} p_{ij}, \qquad \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = \sum_{i=1}^{m} p_{i\cdot} = \sum_{j=1}^{n} p_{\cdot j} = 1. \quad (12)$$

Define the joint entropy, entropy, and conditional entropy as follows:

$$H(x, y) = -\sum_i \sum_j p_{ij} \log p_{ij}, \qquad H(x) = -\sum_i p_{i\cdot} \log p_{i\cdot}, \qquad H(y) = -\sum_j p_{\cdot j} \log p_{\cdot j},$$
$$H(y \mid x_i) = -\sum_j \frac{p_{ij}}{p_{i\cdot}} \log \frac{p_{ij}}{p_{i\cdot}}, \qquad H(y \mid x) = \sum_i p_{i\cdot} H(y \mid x_i) = -\sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_{i\cdot}}. \quad (13)$$

Then $H(x, y) = H(x) + H(y \mid x) \le H(x) + H(y)$, and $H(y) \ge H(y \mid x)$.
3.2.2. Information Conveyed. Ash [7] began with two random variables $X$ and $Y$ and assumed $X$ and $Y$ had the same probability space. He systematically defined the entropy, conditional entropy, and joint entropy, following Shannon's path in [4]. At the end, he denoted $H(X) - H(X \mid Y)$ by $I(X \mid Y)$ and called it the information conveyed about $X$ by $Y$.
3.2.3. Information of One Variable with respect to the Other. Pinsker [6] treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. Suppose $\xi$ is a random variable defined on a probability space $(\Omega, S_\omega, P_\xi)$ taking values in a measurable space $(X, S_x)$, and $\eta$ is a random variable defined on a probability space $(\Psi, S_\psi, P_\eta)$ taking values in a measurable space $(Y, S_y)$. Then the pair $\xi, \eta$ of random variables may be regarded as a single random variable $(\xi, \eta)$ with values in the product space $X \times Y$ of all pairs $(x, y)$ with $x \in X$, $y \in Y$. The distribution $P_{(\xi,\eta)}(\cdot) = P_{\xi\eta}(\cdot)$ of $(\xi, \eta)$ is called the joint distribution of the random variables $\xi$ and $\eta$. By the product of the distributions $P_\xi(\cdot)$ and $P_\eta(\cdot)$, denoted by $P_{\xi\times\eta}(\cdot)$, we mean the distribution defined on $S_x \times S_y$ by

$$P_{\xi\times\eta}(E \times F) = P_\xi(E)\, P_\eta(F) \quad (14)$$

for $E \in S_x$ and $F \in S_y$. If the joint distribution $P_{\xi\eta}(\cdot)$ coincides with the product distribution $P_{\xi\times\eta}(\cdot)$, the random variables $\xi$ and $\eta$ are said to be independent. If $\xi$ and $\eta$ are discrete random variables, say $X$ and $Y$ contain countably many points $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, then

$$I(\xi, \eta) = \sum_{i,j} P_{\xi\eta}(x_i, y_j) \log \frac{P_{\xi\eta}(x_i, y_j)}{P_\xi(x_i)\, P_\eta(y_j)}. \quad (15)$$

$I$ is called the information of $\xi$ and $\eta$ with respect to each other.
3.2.4. A Modern Definition in Information Theory. Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$. Let $Y$ be a discrete random variable with alphabet $\mathcal{Y}$ and probability mass function $p(y) = \Pr\{Y = y\}$, $y \in \mathcal{Y}$. Suppose $X$ and $Y$ have a joint mass function (joint distribution) $p(x, y)$. Then the mutual information $I(X, Y)$ is defined as

$$I(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \quad (16)$$
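Definition (16) translates directly into code. The sketch below is our own helper (base-2 logarithms, so the result is in bits); following Cover and Thomas, the marginals are computed from the given joint mass function, and $0 \log 0 = 0$ is handled by skipping zero-probability pairs:

```python
import math

def mutual_information(joint):
    """I(X, Y) per Eq. (16), in bits. `joint` maps (x, y) pairs to
    probabilities; marginals are derived from the joint, and 0 log 0
    is taken as 0 by skipping zero entries."""
    px, py = {}, {}
    for (x, y), pxy in joint.items():
        px[x] = px.get(x, 0.0) + pxy
        py[y] = py.get(y, 0.0) + pxy
    return sum(pxy * math.log2(pxy / (px[x] * py[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

# Y a deterministic copy of X: I(X, Y) = H(X) = 1 bit
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))    # 1.0
# Independent fair bits: I(X, Y) = 0
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))  # 0.0
```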
3.3. Class 2 Definitions. In Class 2 definitions, random variables are replaced by ensembles, and mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results from Fano [9] and Abramson [10].
Suppose that discrete ensemble $X$ has sample space $\{a_1, a_2, \ldots, a_K\}$ and discrete ensemble $Y$ has sample space $\{b_1, b_2, \ldots, b_J\}$. Consider the joint sample space $\{(a_k, b_j) : 1 \le k \le K,\ 1 \le j \le J\}$. A probability measure on the joint sample space is given by the joint probability $P_{XY}(a_k, b_j)$, defined for $1 \le k \le K$ and $1 \le j \le J$. The combination of a joint sample space and a probability measure for outcomes $x$ and $y$ is called a joint $XY$ ensemble. Then the marginal probabilities can be found as

$$P_X(a_k) = \sum_{j=1}^{J} P_{XY}(a_k, b_j), \quad k = 1, 2, \ldots, K. \quad (17)$$

In more abbreviated notation, this is written as

$$P(x) = \sum_y P(x, y). \quad (18)$$

Likewise,

$$P_Y(b_j) = \sum_{k=1}^{K} P_{XY}(a_k, b_j), \quad j = 1, 2, \ldots, J. \quad (19)$$

In more abbreviated notation, this is written as

$$P(y) = \sum_x P(x, y). \quad (20)$$

If $P_X(a_k) > 0$, the conditional probability that the outcome of $y$ is $b_j$, given that the outcome of $x$ is $a_k$, is defined as

$$P_{Y|X}(b_j \mid a_k) = \frac{P_{XY}(a_k, b_j)}{P_X(a_k)}. \quad (21)$$
The mutual information between the events $x = a_k$ and $y = b_j$ is defined as

$$I_{XY}(a_k, b_j) = \log \frac{P_{X|Y}(a_k \mid b_j)}{P_X(a_k)} = \log \frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)} = \log \frac{P_{Y|X}(b_j \mid a_k)}{P_Y(b_j)} = I_{YX}(b_j, a_k). \quad (22)$$
Since the mutual information defined above is a random variable on the joint $XY$ ensemble, its mean value, called the average mutual information and denoted by $I(X, Y)$, is given by

$$I(X, Y) = \sum_{k=1}^{K} \sum_{j=1}^{J} P_{XY}(a_k, b_j) \log \frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)}. \quad (23)$$
Remark 12. By means of an information channel consisting of a transmitter with alphabet $A$ of elements $a_i$ and total number of elements $t$, and a receiver with alphabet $B$ of elements $b_j$ and total number of elements $r$, Abramson [10] denoted $H(A) - H(A \mid B) = \sum_{A,B} P(a, b) \log\big(P(a, b)/(P(a)P(b))\big)$ by $I(A, B)$ and called it the mutual information of $A$ and $B$.
The mutual information $I(X, Y)$ between two continuous random variables $X$ and $Y$ [8] (also called the rate of transmission in [1]) is defined as

$$I(X, Y) = \iint P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}\, dx\, dy, \quad (24)$$

where $P(x, y)$ is the joint probability density function of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal density functions associated with $X$ and $Y$, respectively. The mutual information between two continuous random variables is also called the differential mutual information.
However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.
4. A New Unified Definition of Mutual Information
In Section 3, we reviewed various definitions of mutual information. Shannon's original definition laid the foundation of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.
However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 and Class 2 definitions assume that a joint distribution, or a joint probability measure, exists. Yet they all ignore an important fact: the joint distribution, or the joint probability measure, is not unique.
4.1. Unified Definition of Mutual Information. Let $X$ be a finite discrete random variable on a discrete probability space $(\Omega_1, \mathcal{F}_1, P_1)$ with $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\}$ and range $\{x_1, x_2, \ldots, x_K\}$ with $K \le n$. Let $Y$ be a discrete random variable on a probability space $(\Omega_2, \mathcal{F}_2, P_2)$ with $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\}$ and range $\{y_1, y_2, \ldots, y_L\}$ with $L \le m$.
If $X$ and $Y$ have the same probability space $(\Omega, \mathcal{F}, P)$, then the joint distribution is simply

$$P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x,\ Y(\omega) = y\}). \quad (25)$$

However, when $X$ and $Y$ have different probability spaces, and so different probability measures, the joint distribution is more complicated.
Definition 13. The joint sample space of random variables $X$ and $Y$ is defined as the product $\Omega_1 \times \Omega_2$ of all pairs $(\omega_i, \rho_j)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is defined as the product of all pairs $(A_1, A_2)$, where $A_1$ and $A_2$ are elements of $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. A joint probability measure $P_{XY}$ of $P_1$ and $P_2$ is a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$, $P_{XY}(A \times B)$, such that for any $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$,

$$P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}), \qquad P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B). \quad (26)$$

$(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})$ is called the joint probability space of $X$ and $Y$, and $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$, the joint distribution of $X$ and $Y$.
Combining Definitions 2 and 13, we immediately obtain the following result.
Proposition 14. A sequence of nonnegative numbers $p_{ij}$, $1 \le i \le n$, $1 \le j \le m$, whose sum is 1 can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ via $P_{XY}(\omega_i, \rho_j) = p_{ij}$. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. If, in addition, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$,

$$\sum_{j=1}^{m} p_{ij} = P_1(\omega_i), \qquad \sum_{i=1}^{n} p_{ij} = P_2(\rho_j), \quad (27)$$

then $P_{XY}(\omega_i, \rho_j) = p_{ij}$ is a joint distribution of $X$ and $Y$.
For convenience, from now on we will shorten $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ to $P_{XY}(x_i, y_j)$.
This two-dimensional measure should not be confused with the one-dimensional joint distribution used when $X$ and $Y$ have the same probability space.
Remark 15. If $(\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2)$, then instead of using the two-dimensional measure $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ we may use the one-dimensional measure $P_1(\{X = x_i\} \cap \{Y = y_j\})$. Then (26) always holds. In this sense, our new definition of joint distribution reduces to the definition of joint distribution on a common probability space.
Definition 16. The conditional probability of $Y = y_j$ given $X = x_i$ is defined as

$$P_{Y|X}(Y = y_j \mid X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (28)$$
Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure, or simply the product distribution.
Proof. Let random variables $X$ and $Y$ be defined as before. Define a function from $\Omega_1 \times \Omega_2$ to $[0, 1]$ as follows:

$$P_{XY}(\omega_i, \rho_j) = P_1(\omega_i)\, P_2(\rho_j). \quad (29)$$

Then

$$\sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(\omega_i)\, P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i) \sum_{j=1}^{m} P_2(\rho_j) = 1. \quad (30)$$

Hence $P_{XY}$ can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ by Definition 2. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. Moreover, for any $A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1$ of $s$ elements,

$$P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\omega_{i_u}, \rho_j) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u})\, P_2(\rho_j) = \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u}) = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A). \quad (31)$$

Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \subseteq \Omega_2$. Hence $P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i)\, P_2(Y = y_j)$ is a joint probability measure of $X$ and $Y$ by Definition 13.
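Theorem 17's construction can be checked numerically. In this sketch the two sample spaces and measures are made-up examples; the assertions verify the total mass of (30) and the marginal conditions (26) for the product measure (29):

```python
import itertools

# Theorem 17: the product measure P_XY(w, r) = P1(w) * P2(r), Eq. (29).
# P1 and P2 below are hypothetical measures on two different sample spaces.
P1 = {"w1": 0.25, "w2": 0.75}
P2 = {"r1": 0.5, "r2": 0.25, "r3": 0.25}

PXY = {(w, r): P1[w] * P2[r] for w, r in itertools.product(P1, P2)}

# Total mass is 1, as in Eq. (30) ...
assert abs(sum(PXY.values()) - 1.0) < 1e-12
# ... and the marginal conditions of Eq. (26) hold, as computed in Eq. (31).
for w in P1:
    assert abs(sum(PXY[w, r] for r in P2) - P1[w]) < 1e-12
for r in P2:
    assert abs(sum(PXY[w, r] for w in P1) - P2[r]) < 1e-12
print("product measure satisfies the marginal conditions")
```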
Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $P_{XY}(\cdot)$ if $P_{XY}(\cdot)$ coincides with the product distribution $P_{X \times Y}(\cdot)$.
Definition 19. The joint entropy $H(X, Y)$ is defined as

$$H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \quad (32)$$
Definition 20. The conditional entropy $H(Y \mid X)$ is defined as

$$H(Y \mid X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (33)$$
Definition 21. The mutual information $I(X, Y)$ between $X$ and $Y$ is defined as

$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)}. \quad (34)$$
As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, $I(X, Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be 0, which corresponds to the limit of $x \log x$ as $x$ goes to 0.
It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X: \Omega_1 \to \mathbb{R}$ and $Y: \Omega_2 \to \mathbb{R}$ as one-to-one mappings:

$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K, \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \quad (35)$$

Then we have

$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \quad (36)$$
It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used for feature selection, as seen later. In addition, our new definition leads to different values for different joint distributions, as demonstrated in the following example.
Example 22. Assume random variables $Y$ and $X$ have the following probability distributions:

$$P_1(Y = 0) = \frac{1}{3}, \quad P_1(Y = 1) = \frac{2}{3}, \qquad P_2(X = 1) = \frac{1}{3}, \quad P_2(X = 2) = \frac{1}{3}, \quad P_2(X = 3) = \frac{1}{3}. \quad (37)$$

We can generate four different joint probability distributions, which can lead to different values of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find mutual information.

(1) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 1/3$, $P(2, 1) = 0$, $P(3, 0) = 0$, $P(3, 1) = 1/3$.

(2) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 1/3$, $P(3, 1) = 0$.

(3) $P(1, 0) = 1/3$, $P(1, 1) = 0$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 0$, $P(3, 1) = 1/3$.

(4) $P(1, 0) = 1/9$, $P(1, 1) = 2/9$, $P(2, 0) = 1/9$, $P(2, 1) = 2/9$, $P(3, 0) = 1/9$, $P(3, 1) = 2/9$.
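As a numerical check (our own sketch, using base-2 logarithms), the code below evaluates the newly defined mutual information (34) for joint distributions (1) and (4) of Example 22, with the marginals $P_1$ and $P_2$ fixed by (37). Distribution (4) is the product distribution, so its mutual information is 0, while distribution (1) yields a positive value:

```python
import math

def mutual_info(joint, p2_x, p1_y):
    """Eq. (34) in bits, with the marginals P2 (for X) and P1 (for Y)
    given, not recomputed from the joint; 0 log 0 is taken as 0."""
    return sum(p * math.log2(p / (p2_x[x] * p1_y[y]))
               for (x, y), p in joint.items() if p > 0)

p1_y = {0: 1/3, 1: 2/3}              # P1(Y = y), Eq. (37)
p2_x = {1: 1/3, 2: 1/3, 3: 1/3}      # P2(X = x), Eq. (37)

# Joint distribution (1) of Example 22
d1 = {(1, 0): 0, (1, 1): 1/3, (2, 0): 1/3,
      (2, 1): 0, (3, 0): 0, (3, 1): 1/3}
# Joint distribution (4): the product distribution
d4 = {(x, y): p2_x[x] * p1_y[y] for x in p2_x for y in p1_y}

print(mutual_info(d1, p2_x, p1_y))   # positive: X and Y are dependent
print(mutual_info(d4, p2_x, p1_y))   # 0.0: X and Y are independent
```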
4.2. Properties of Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].
Definition 23. The relative entropy, or Kullback-Leibler distance, between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as

$$D(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}. \quad (38)$$
Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P \| Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.
Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
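A minimal sketch of Definition 23 (our own helper, base-2 logarithms, assuming $q_i > 0$ wherever $p_i > 0$) also illustrates Lemma 24 and the asymmetry noted in Remark 25:

```python
import math

def kl_distance(P, Q):
    """D(P || Q) per Eq. (38), in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl_distance(P, Q))   # > 0 (Lemma 24)
print(kl_distance(Q, P))   # differs from D(P || Q): not symmetric (Remark 25)
print(kl_distance(P, P))   # 0.0: the equality case of Lemma 24
```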
The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.
Proof. Using the mapping from two-dimensional indices to a one-dimensional index,

$$(i, j) \longmapsto (i - 1)L + j \triangleq n, \quad i = 1, \ldots, K,\ j = 1, 2, \ldots, L, \quad (39)$$

and the inverse mapping from the one-dimensional index back to two-dimensional indices,

$$i = \lceil n/L \rceil, \quad j = n - (i - 1)L, \quad n = 1, 2, \ldots, KL, \quad (40)$$

we rewrite $I(X, Y)$ as

$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\big(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\big) \log \frac{P_{XY}\big(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\big)}{P_1\big(X = x_{\lceil n/L \rceil}\big)\, P_2\big(Y = y_{n - (\lceil n/L \rceil - 1)L}\big)}. \quad (41)$$

Since

$$\sum_{n=1}^{KL} P_{XY}\big(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\big) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1,$$
$$\sum_{n=1}^{KL} P_1\big(X = x_{\lceil n/L \rceil}\big)\, P_2\big(Y = y_{n - (\lceil n/L \rceil - 1)L}\big) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \quad (42)$$

both sequences are discrete probability distributions, and by Definition 23 the sum in (41) is exactly the Kullback-Leibler distance between them:

$$I(X, Y) = D(P_{XY} \| P_1 P_2). \quad (43)$$
Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies

$$I(X, Y) \ge 0, \quad (44)$$

with equality if and only if $X$ and $Y$ are independent.
Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, $I(X, Y) \ge 0$, with equality if and only if $P_{XY}\big(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\big) = P_1\big(X = x_{\lceil n/L \rceil}\big)\, P_2\big(Y = y_{n - (\lceil n/L \rceil - 1)L}\big)$ for $n = 1, 2, \ldots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\, P_2(Y = y_j)$ for $i = 1, \ldots, K$ and $j = 1, 2, \ldots, L$, which is precisely the statement that $X$ and $Y$ are independent.
Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,

$$I(X, Y) = 0. \quad (45)$$

Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \ldots, L$,

$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\, P_2(Y = y_j). \quad (46)$$

Thus $X$ and $Y$ are independent. By Property 2, $I(X, Y) = 0$.
Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then

$$0 \le H(X) \le \log K, \quad (47)$$

with equality in the upper bound if and only if the $K$ values are equally probable.
Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:

$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X). \quad (48)$$
Proof. Consider

$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(Y = y_j \mid X = x_i)}{P_2(Y = y_j)}$$
$$= -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) - \Big(-\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i)\Big)$$
$$= -\sum_{j=1}^{L} \Big(\sum_{i=1}^{K} P_{XY}(x_i, y_j)\Big) \log P_2(Y = y_j) - H(Y \mid X)$$
$$= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - H(Y \mid X) = H(Y) - H(Y \mid X). \quad (49)$$

Here the second equality uses Definition 16, and the second-to-last step uses the marginal condition (26), namely $\sum_{i=1}^{K} P_{XY}(x_i, y_j) = P_2(Y = y_j)$. The identity $I(X, Y) = H(X) - H(X \mid Y) = I(Y, X)$ follows by the symmetry of (34) in $X$ and $Y$.
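Property 3 can also be verified numerically. In the sketch below the joint measure is a made-up example, and $P_1$, $P_2$ are taken to be its marginals so that the compatibility condition (26) of Definition 13 holds; the assertion checks $I(X, Y) = H(Y) - H(Y \mid X)$ with base-2 logarithms:

```python
import math

def entropy(probs):
    """Entropy in bits, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical joint measure P_XY; P1 and P2 are taken to be its
# marginals, so the compatibility condition (26) holds by construction.
pxy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
p1 = {x: sum(q for (xx, _), q in pxy.items() if xx == x) for x in (0, 1)}
p2 = {y: sum(q for (_, yy), q in pxy.items() if yy == y) for y in (0, 1)}

mi = sum(q * math.log2(q / (p1[x] * p2[y]))            # Eq. (34)
         for (x, y), q in pxy.items() if q > 0)
h_y = entropy(p2.values())                             # H(Y)
h_y_given_x = -sum(q * math.log2(q / p1[x])            # Eq. (33)
                   for (x, y), q in pxy.items() if q > 0)

assert abs(mi - (h_y - h_y_given_x)) < 1e-9            # Eq. (48)
print(mi)
```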
Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following property.

Property 4. Let $X$ and $Y$ be two discrete random variables with $K$ and $L$ values, respectively. Then

$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \quad (50)$$

Moreover, $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
5. Newly Defined Mutual Information in Machine Learning
Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set $S$ contains $N$ examples (also known as samples, observations, or records) from an input space $X = (X_1, X_2, \ldots, X_M)$ and their associated output values $y$ from an output space $Y$ (i.e., the dependent variable). Here $X_1, X_2, \ldots, X_M$ are called features, that is, independent variables. Hence $S$ can be expressed as

$$S = \{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i) : i = 1, 2, \ldots, N\}, \quad (51)$$

where feature $X_j$ has values $x_{1j}, x_{2j}, \ldots, x_{Nj}$ for $j = 1, 2, \ldots, M$.
A fundamental objective in machine learning is to find a functional relationship between the input $X$ and the output $Y$. In general, there are a very large number of features, many of which are not needed. Sometimes the output $Y$ is not determined by the complete set of input features $X_1, X_2, \ldots, X_M$; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature $X_j$ and the output $Y$.

(2) Select the best set of features according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures the dependency between random variables, we may use it for feature selection in machine learning. Let us calculate the mutual information between an input feature X and the output Y. Assume X has K different values ω_1, ω_2, …, ω_K. If X has missing values, we will use ω_1 to represent all the missing values. Assume Y has L different values ρ_1, ρ_2, …, ρ_L.
Let us build a two-way frequency (contingency) table by taking X as the row variable and Y as the column variable, as in [8]. Let O_ij be the frequency (possibly 0) of (ω_i, ρ_j) for i = 1 to K and j = 1 to L. Let the row and column marginal totals be n_i· and n_·j, respectively. Then
\[
n_{i\cdot} = \sum_{j} O_{ij}, \qquad n_{\cdot j} = \sum_{i} O_{ij}, \qquad
N = \sum_{i}\sum_{j} O_{ij} = \sum_{i} n_{i\cdot} = \sum_{j} n_{\cdot j}. \tag{52}
\]
Let us denote the relative frequency O_ij/N by p_ij. We obtain the two-way relative frequency table; see Table 2. Since
\[
\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \tag{53}
\]
{p_i·}_{i=1}^{K}, {p_·j}_{j=1}^{L}, and {p_ij} can each serve as a probability measure.
Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables.
\[
X : (\Omega_1, \mathcal{F}_1, P_X) \longrightarrow \mathbb{R} \tag{54}
\]
with X(ω_i) = x_i, where Ω_1 = {ω_1, ω_2, …, ω_K} and P_X(ω_i) = n_i·/N = p_i· for i = 1, 2, …, K. Note that x_1, x_2, …, x_K could be any real numbers as long as they are distinct, to guarantee that X is a one-to-one mapping. In this case, P_X(X = x_i) = P_X(ω_i).
Mathematical Problems in Engineering 9
Table 1: Frequency table.

          ρ_1    ρ_2    ⋯    ρ_j    ⋯    ρ_L    Total
  ω_1     O_11   O_12   ⋯    O_1j   ⋯    O_1L   n_1·
  ω_2     O_21   O_22   ⋯    O_2j   ⋯    O_2L   n_2·
  ⋮       ⋮      ⋮           ⋮           ⋮      ⋮
  ω_i     O_i1   O_i2   ⋯    O_ij   ⋯    O_iL   n_i·
  ⋮       ⋮      ⋮           ⋮           ⋮      ⋮
  ω_K     O_K1   O_K2   ⋯    O_Kj   ⋯    O_KL   n_K·
  Total   n_·1   n_·2   ⋯    n_·j   ⋯    n_·L   N
Table 2: Relative frequency table.

          ρ_1    ρ_2    ⋯    ρ_j    ⋯    ρ_L    Total
  ω_1     p_11   p_12   ⋯    p_1j   ⋯    p_1L   p_1·
  ω_2     p_21   p_22   ⋯    p_2j   ⋯    p_2L   p_2·
  ⋮       ⋮      ⋮           ⋮           ⋮      ⋮
  ω_i     p_i1   p_i2   ⋯    p_ij   ⋯    p_iL   p_i·
  ⋮       ⋮      ⋮           ⋮           ⋮      ⋮
  ω_K     p_K1   p_K2   ⋯    p_Kj   ⋯    p_KL   p_K·
  Total   p_·1   p_·2   ⋯    p_·j   ⋯    p_·L   1
Similarly,
\[
Y : (\Omega_2, \mathcal{F}_2, P_Y) \longrightarrow \mathbb{R} \tag{55}
\]
with Y(ρ_j) = y_j, where Ω_2 = {ρ_1, ρ_2, …, ρ_L} and P_Y(ρ_j) = n_·j/N = p_·j for j = 1, 2, …, L. Also, y_1, y_2, …, y_L could be any real numbers as long as they are distinct, to guarantee that Y is a one-to-one mapping. In this case, P_Y(Y = y_j) = P_Y(ρ_j).
Now define a mapping P_XY from Ω_1 × Ω_2 to \mathbb{R} as follows:
\[
P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \tag{56}
\]
Since
\[
\begin{gathered}
\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1,\\
\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i),\\
\sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j),
\end{gathered}
\tag{57}
\]
{p_ij} is a joint probability measure by Proposition 14. Finally, we can calculate mutual information as follows:
\[
I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j)\log\frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\, P_Y(\rho_j)}
= \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\log\frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}. \tag{58}
\]
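The sum in (58) is straightforward to compute directly from a raw contingency table. The sketch below is illustrative (the function name and example table are ours, not from the paper); it uses base-2 logarithms and the convention 0 log 0 = 0:

```python
import math

def mutual_information(table, log=math.log2):
    """I(X, Y) of (58) computed from a K x L contingency table O_ij."""
    N = sum(sum(row) for row in table)
    n_row = [sum(row) for row in table]           # row totals n_i.
    n_col = [sum(col) for col in zip(*table)]     # column totals n_.j
    I = 0.0
    for i, row in enumerate(table):
        for j, O in enumerate(row):
            if O > 0:                             # convention: 0 log 0 = 0
                p_ij = O / N
                I += p_ij * log(p_ij / ((n_row[i] / N) * (n_col[j] / N)))
    return I

# Illustrative example: a feature X with 3 values against a binary Y.
O = [[30, 10],
     [20, 20],
     [5, 15]]
print(mutual_information(O))   # approximately 0.106 bits
```

Passing `log=math.log` instead gives the natural-logarithm version used in the Chi-square discussion of Section 5.2.3.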
It follows from Corollary 26 that if X has only one value, then I(X, Y) = 0. On the other hand, if X has all distinct values, the following result shows that the mutual information reaches its maximum value.

Proposition 28. If all the values of X are distinct, then I(X, Y) = H(Y).
Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2, we observe the following:

(1) O_ij = 0 or 1 for all i = 1, 2, …, K and j = 1, 2, …, L.

(2) p_ij = O_ij/N = 0 or 1/N for all i = 1, 2, …, K and j = 1, 2, …, L.

(3) For each j = 1, 2, …, L, since O_1j + O_2j + ⋯ + O_Kj = n_·j, there are n_·j nonzero O_ij's, or equivalently n_·j nonzero p_ij's.

(4) p_i· = 1/N for i = 1, 2, …, K.
Using the above observations and the fact that 0 log 0 = 0, we have
\[
\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\log\frac{p_{ij}}{p_{i\cdot}p_{\cdot j}}\\
&= \sum_{i=1}^{K} p_{i1}\log\frac{p_{i1}}{p_{i\cdot}p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2}\log\frac{p_{i2}}{p_{i\cdot}p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL}\log\frac{p_{iL}}{p_{i\cdot}p_{\cdot L}}\\
&= \sum_{p_{i1}\neq 0}\frac{1}{N}\log\frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2}\neq 0}\frac{1}{N}\log\frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL}\neq 0}\frac{1}{N}\log\frac{1/N}{p_{\cdot L}/N}\\
&= \sum_{p_{i1}\neq 0}\frac{1}{N}\log\frac{1}{p_{\cdot 1}} + \sum_{p_{i2}\neq 0}\frac{1}{N}\log\frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL}\neq 0}\frac{1}{N}\log\frac{1}{p_{\cdot L}}\\
&= \frac{n_{\cdot 1}}{N}\log\frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N}\log\frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N}\log\frac{1}{p_{\cdot L}}\\
&= p_{\cdot 1}\log\frac{1}{p_{\cdot 1}} + p_{\cdot 2}\log\frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L}\log\frac{1}{p_{\cdot L}}\\
&= H(Y).
\end{aligned}
\tag{59}
\]
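Proposition 28 can also be confirmed numerically: when K = N (one row per observation, so every row total n_i· equals 1), the computed I(X, Y) coincides with H(Y). A small made-up check in Python:

```python
import math

# 6 observations; X takes 6 distinct values (one row per observation),
# Y is binary with 4 outcomes in the first class and 2 in the second.
O = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
N = 6

n_col = [sum(col) for col in zip(*O)]            # n_.j = [4, 2]
H_Y = -sum((n / N) * math.log2(n / N) for n in n_col)

I = 0.0
for row in O:
    n_i = sum(row)                               # = 1 for every row
    for j, Oij in enumerate(row):
        if Oij:
            p_ij, p_i, p_j = Oij / N, n_i / N, n_col[j] / N
            I += p_ij * math.log2(p_ij / (p_i * p_j))

assert abs(I - H_Y) < 1e-9                       # I(X, Y) = H(Y)
```

This is exactly the degenerate case that motivates the binning adjustment of Section 5.3: a feature with all distinct values looks maximally informative regardless of any real relationship with Y.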
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable (good or bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that the mutual information I(X, Y) is nonnegative and upper bounded by log L, and that I(X, Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with Y is larger than some threshold, chosen based on needs.
5.2.2. Relative Values. From Property 4, we have 0 ≤ I(X, Y)/H(Y) ≤ 1. Note that I(X, Y)/H(Y) is the relative mutual information, which measures how much information X captures from Y. Thus, we may select those features whose relative mutual information I(X, Y)/H(Y) is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function. It can be proved by Taylor expansion, as in Kullback's book [5].
Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then
\[
p \ln\frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \tag{60}
\]
The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation is.
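The quality of the approximation in (60) is easy to inspect numerically; the (p, q) pairs below are arbitrary illustrative values, not from the paper:

```python
import math

def approx(p, q):
    # Right-hand side of (60): (p - q) + (p - q)^2 / (2q)
    return (p - q) + (p - q) ** 2 / (2 * q)

for p, q in [(0.5, 0.5), (0.5, 0.45), (0.5, 0.3)]:
    exact = p * math.log(p / q)        # natural log, as in Section 5.2.3
    print(p, q, exact, approx(p, q))
```

At p = q both sides vanish exactly; as p moves away from q, the gap between the exact value and the approximation grows, as Lemma 29 indicates.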
Now let us denote N × I(X, Y) by Ĩ(X, Y). Then, applying Lemma 29, we obtain
\[
\begin{aligned}
2\tilde{I}(X, Y) &= 2N\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\ln\frac{p_{ij}}{p_{i\cdot}p_{\cdot j}}
= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij}\ln\frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)}
= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij}\ln\frac{O_{ij}}{n_{i\cdot}n_{\cdot j}/N}\\
&\approx 2\sum_{i=1}^{K}\sum_{j=1}^{L}\Bigl(O_{ij}-\frac{n_{i\cdot}n_{\cdot j}}{N}\Bigr)
+ \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N}\\
&= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} - \frac{2\sum_{i} n_{i\cdot}\sum_{j} n_{\cdot j}}{N}
+ \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N}\\
&= 2N - \frac{2N\cdot N}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N}
= \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N} = \chi^2.
\end{aligned}
\tag{61}
\]
The last equation means that the expression ∑_{i=1}^{K}∑_{j=1}^{L} (O_ij − n_i·n_·j/N)²/(n_i·n_·j/N) follows a χ² distribution. According to [5], it follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. Hence, 2N × I(X, Y) approximately follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. This is the well-known Chi-square test for the independence of two random variables. It allows us to use the Chi-square distribution to assign a significance level corresponding to the values of the mutual information and (K − 1)(L − 1).
The null and alternative hypotheses are as follows:

H0: X and Y are independent (i.e., there is no relationship between them).

H1: X and Y are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic
\[
\sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N} \approx 2N \times I(X, Y) \tag{62}
\]
is greater than χ²_U, the upper-tail critical value from a Chi-square distribution with (K − 1)(L − 1) degrees of freedom. That is:
\[
\text{Select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \tag{63}
\]
Take credit scoring for example. In this case, L = 2. Assume feature X has 10 different values; that is, K = 10. Using a level of significance of α = 0.05, we find χ²_U to be 16.9 from a Chi-square table with (K − 1)(L − 1) = 9 degrees of freedom, and we select this feature only if I(X, Y) > 16.9/(2N).
Assume a training set has N examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance α, say 0.05.

(ii) Step 2. Find K, the number of values of feature X.

(iii) Step 3. Build the contingency table for X and Y.

(iv) Step 4. Calculate I(X, Y) from the contingency table.

(v) Step 5. Find χ²_U with (K − 1)(L − 1) degrees of freedom from a Chi-square table or any other source, such as SAS.

(vi) Step 6. Select X if I(X, Y) > χ²_U/(2N), and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance α and reselect features using the procedure.
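Steps 2–6 of the procedure can be sketched as one function. To keep the sketch dependency-free, the critical value χ²_U is passed in by the caller (in practice it could come from a Chi-square table or a statistics library); the function name and example tables are illustrative, not from the paper:

```python
import math

def select_by_mi(table, chi2_crit):
    """Steps 2-6: keep feature X iff I(X, Y) > chi2_crit / (2N).

    `table` is the K x L contingency table of X against Y;
    `chi2_crit` is the upper-tail Chi-square critical value for
    (K - 1)(L - 1) degrees of freedom, looked up by the caller.
    Natural log is used, as the Chi-square approximation requires.
    """
    N = sum(sum(row) for row in table)
    n_row = [sum(row) for row in table]
    n_col = [sum(col) for col in zip(*table)]
    I = sum((O / N) * math.log((O / N) / ((n_row[i] / N) * (n_col[j] / N)))
            for i, row in enumerate(table)
            for j, O in enumerate(row) if O > 0)
    return I > chi2_crit / (2 * N)

# Example: binary Y (L = 2) and K = 2, hence 1 degree of freedom;
# the Chi-square critical value at alpha = 0.05 is 3.84.
dependent   = [[90, 10], [30, 70]]   # strongly associated with Y
independent = [[50, 50], [50, 50]]   # no association at all
print(select_by_mi(dependent, 3.84))    # True
print(select_by_mi(independent, 3.84))  # False
```

With scipy available, `chi2_crit` could be obtained as `scipy.stats.chi2.ppf(1 - alpha, (K - 1) * (L - 1))`.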
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X, Y), the more dependent X is on Y. However, Proposition 28 says that if X has all distinct values, then I(X, Y) will reach the maximum value H(Y), and I(X, Y)/H(Y) will reach the maximum value 1.
Therefore, if X has too many different values, one may bin or group these values first. Based on the binned values, the mutual information is calculated again. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the rest of the numerical features.

(iii) Step 3. Select features by mutual information.
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with p values less than some specific value. However, this method does not apply to nonlinear models in machine learning.
Another easy way of doing feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.
The most popular feature selection method in credit scoring uses information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is computed as the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References
[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, USA, 1964).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doléans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
3.1. Shannon's Original Definition

Definition 7. Let x be a chance variable with probabilities p_1, p_2, …, p_n whose sum is 1. Then
\[
H(x) = -\sum_{i=1}^{n} p_i \log p_i \tag{6}
\]
is called the entropy of x.

Suppose two chance variables x and y have m and n possibilities, respectively. Let the indices i and j range over the m possibilities and the n possibilities, respectively. Let p(i) be the probability of i and p(i, j) the probability of the joint occurrence of i and j. Denote the conditional probability of i given j by p(i | j) and the conditional probability of j given i by p(j | i).
Definition 8. The joint entropy of x and y is defined as
\[
H(x, y) = -\sum_{i,j} p(i, j)\log p(i, j). \tag{7}
\]
Definition 9. The conditional entropy of y, H_x(y), is defined as
\[
H_x(y) = -\sum_{i,j} p(i, j)\log p(j \mid i) = -\sum_{i,j} p(i, j)\log\frac{p(i, j)}{\sum_{j} p(i, j)}. \tag{8}
\]
The conditional entropy of x, H_y(x), can be defined similarly.
Then the following relations hold:
\[
\begin{gathered}
H(x, y) = H(x) + H_x(y) = H(y) + H_y(x),\\
H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y).
\end{gathered}
\tag{9}
\]
Definition 10. The rate of transmission of information, R, is defined as the difference between H(x) and H_y(x). Then R can be written in two other forms:
\[
R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y). \tag{10}
\]
Remark 11. Shannon did not derive the explicit formula for R:
\[
R = \sum_{i,j} p(i, j)\log\frac{p(i, j)}{p(i)\, p(j)}. \tag{11}
\]
However, he did imply it in Appendix 7 of [4].
3.2. Class 1 Definitions

3.2.1. Kullback's Definition. Kullback [5] redefined entropy more mathematically, in a standalone homework question, as follows. Consider two discrete random variables x and y, where
\[
\begin{gathered}
p_{ij} = \operatorname{Prob}(x = x_i, y = y_j) > 0, \quad i = 1, 2, \ldots, m,\; j = 1, 2, \ldots, n,\\
p_{i\cdot} = \sum_{j=1}^{n} p_{ij}, \qquad p_{\cdot j} = \sum_{i=1}^{m} p_{ij},\\
\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ij} = \sum_{i=1}^{m} p_{i\cdot} = \sum_{j=1}^{n} p_{\cdot j} = 1.
\end{gathered}
\tag{12}
\]
Define the joint entropy, entropy, and conditional entropy as follows:
\[
\begin{gathered}
H(x, y) = -\sum_{i}\sum_{j} p_{ij}\log p_{ij},\\
H(x) = -\sum_{i} p_{i\cdot}\log p_{i\cdot}, \qquad H(y) = -\sum_{j} p_{\cdot j}\log p_{\cdot j},\\
H(y \mid x_i) = -\sum_{j}\frac{p_{ij}}{p_{i\cdot}}\log\frac{p_{ij}}{p_{i\cdot}},\\
H(y \mid x) = \sum_{i} p_{i\cdot} H(y \mid x_i) = -\sum_{i}\sum_{j} p_{ij}\log\frac{p_{ij}}{p_{i\cdot}}.
\end{gathered}
\tag{13}
\]
Then H(x, y) = H(x) + H(y | x) ≤ H(x) + H(y), and H(y) ≥ H(y | x).
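Kullback's quantities in (13), together with the stated relations, are easy to check numerically; the 2 × 2 joint table below is a made-up example (base-2 logarithms):

```python
import math

# Made-up 2 x 2 joint probabilities p_ij, all positive, summing to 1.
p = [[0.4, 0.1],
     [0.2, 0.3]]
p_row = [sum(r) for r in p]                  # p_i.
p_col = [sum(c) for c in zip(*p)]            # p_.j

H_xy = -sum(pij * math.log2(pij) for r in p for pij in r)
H_x  = -sum(q * math.log2(q) for q in p_row)
H_y  = -sum(q * math.log2(q) for q in p_col)
H_y_given_x = -sum(p[i][j] * math.log2(p[i][j] / p_row[i])
                   for i in range(2) for j in range(2))

assert abs(H_xy - (H_x + H_y_given_x)) < 1e-12   # chain rule
assert H_xy <= H_x + H_y + 1e-12                 # subadditivity
assert H_y >= H_y_given_x                        # conditioning reduces entropy
```

All three relations at the end of (13) hold for this table, with strict inequalities because x and y are not independent here.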
3.2.2. Information Conveyed. Ash [7] began with two random variables X and Y and assumed that X and Y had the same probability space. He systematically defined the entropy, conditional entropy, and joint entropy, following Shannon's path in [4]. At the end, he denoted H(X) − H(X | Y) by I(X | Y) and called it the information conveyed about X by Y.
3.2.3. Information of One Variable with Respect to the Other. Pinsker [6] treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. Suppose ξ is a random variable defined on a probability space (Ω, S_ω, P_ξ) and taking values in a measurable space (X, S_x), and η is a random variable defined on a probability space (Ψ, S_ψ, P_η) and taking values in a measurable space (Y, S_y). Then the pair ξ, η of random variables may be regarded as a single random variable (ξ, η) with values in the product space X × Y of all pairs (x, y) with x ∈ X, y ∈ Y. The distribution P_(ξ,η)(·) = P_ξη(·) of (ξ, η) is called the joint distribution of the random variables ξ and η. By the product of the distributions P_ξ(·) and P_η(·), denoted by P_ξ×η(·), we mean the distribution defined on S_x × S_y by
\[
P_{\xi\times\eta}(E \times F) = P_{\xi}(E) \times P_{\eta}(F) \tag{14}
\]
for E ∈ S_x and F ∈ S_y. If the joint distribution P_ξη(·) coincides with the product distribution P_ξ×η(·), the random variables ξ and η are said to be independent. If ξ and η are discrete random variables, say X and Y contain countably many points x_1, x_2, … and y_1, y_2, …, then
\[
I(\xi, \eta) = \sum_{i,j} P_{\xi\eta}(x_i, y_j)\log\frac{P_{\xi\eta}(x_i, y_j)}{P_{\xi}(x_i)\, P_{\eta}(y_j)}. \tag{15}
\]
I is called the information of each of ξ and η with respect to the other.
3.2.4. A Modern Definition in Information Theory. Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let X be a discrete random variable with alphabet χ and probability mass function p(x) = Pr{X = x}, x ∈ χ. Let Y be a discrete random variable with alphabet Υ and probability mass function p(y) = Pr{Y = y}, y ∈ Υ. Suppose X and Y have a joint mass function (joint distribution) p(x, y). Then the mutual information I(X, Y) can be defined as
\[
I(X, Y) = \sum_{x\in\chi}\sum_{y\in\Upsilon} p(x, y)\log\frac{p(x, y)}{p(x)\, p(y)}. \tag{16}
\]
3.3. Class 2 Definitions. In Class 2 definitions, random variables are replaced by ensembles, and mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results from Fano [9] and Abramson [10].

Suppose that a discrete ensemble X has a sample space {a_1, a_2, …, a_K} and a discrete ensemble Y has a sample space {b_1, b_2, …, b_J}. Consider the joint sample space {(a_k, b_j) : 1 ≤ k ≤ K, 1 ≤ j ≤ J}. A probability measure on the joint sample space is given by the joint probability P_XY(a_k, b_j), defined for 1 ≤ k ≤ K, 1 ≤ j ≤ J. The combination of a joint sample space and a probability measure for the outcomes x and y is called a joint XY ensemble. Then the marginal probabilities can be found as
\[
P_X(a_k) = \sum_{j=1}^{J} P_{XY}(a_k, b_j), \quad k = 1, 2, \ldots, K. \tag{17}
\]
In more abbreviated notation, this is written as
\[
P(x) = \sum_{y} P(x, y). \tag{18}
\]
Likewise,
\[
P_Y(b_j) = \sum_{k=1}^{K} P_{XY}(a_k, b_j), \quad j = 1, 2, \ldots, J. \tag{19}
\]
In more abbreviated notation, this is written as
\[
P(y) = \sum_{x} P(x, y). \tag{20}
\]
If P_X(a_k) > 0, the conditional probability that the outcome y is b_j, given that the outcome of x is a_k, is defined as
\[
P_{Y|X}(b_j \mid a_k) = \frac{P_{XY}(a_k, b_j)}{P_X(a_k)}. \tag{21}
\]
The mutual information between the events x = a_k and y = b_j is defined as
\[
I_{XY}(a_k; b_j) = \log\frac{P_{X|Y}(a_k \mid b_j)}{P_X(a_k)}
= \log\frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)}
= \log\frac{P_{Y|X}(b_j \mid a_k)}{P_Y(b_j)}
= I_{YX}(b_j; a_k). \tag{22}
\]
Since the mutual information defined above is a random variable on the joint XY ensemble, its mean value, which is called the average mutual information and denoted by I(X, Y), is given by
\[
I(X, Y) = \sum_{k=1}^{K}\sum_{j=1}^{J} P_{XY}(a_k, b_j)\log\frac{P_{XY}(a_k, b_j)}{P_X(a_k)\, P_Y(b_j)}. \tag{23}
\]
Remark 12. By means of an information channel consisting of a transmitter with alphabet A (elements a_i, with t elements in total) and a receiver with alphabet B (elements b_j, with r elements in total), Abramson [10] denoted H(A) − H(A | B) = ∑_{A,B} P(a, b) log(P(a, b)/(P(a)P(b))) by I(A, B) and called it the mutual information of A and B.
The mutual information \(I(X; Y)\) between two continuous random variables \(X\) and \(Y\) [8] (also called the rate of transmission in [1]) is defined as
\[
I(X; Y) = \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)} \, dx \, dy, \tag{24}
\]
where \(P(x, y)\) is the joint probability density function of \(X\) and \(Y\), and \(P(x)\) and \(P(y)\) are the marginal density functions associated with \(X\) and \(Y\), respectively. The mutual information between two continuous random variables is also called the differential mutual information.
However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.
4. A New Unified Definition of Mutual Information
In Section 3 we reviewed various definitions of mutual information. Shannon's original definition laid the foundation
Mathematical Problems in Engineering 5
of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.
However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine the marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume that a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.
4.1. Unified Definition of Mutual Information. Let \(X\) be a finite discrete random variable on discrete probability space \((\Omega_1, \mathcal{F}_1, P_1)\) with \(\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\}\) and range \(\{x_1, x_2, \ldots, x_K\}\) with \(K \le n\). Let \(Y\) be a discrete random variable on probability space \((\Omega_2, \mathcal{F}_2, P_2)\) with \(\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\}\) and range \(\{y_1, y_2, \ldots, y_L\}\) with \(L \le m\).
If \(X\) and \(Y\) have the same probability space \((\Omega, \mathcal{F}, P)\), then the joint distribution is simply
\[
P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x, Y(\omega) = y\}). \tag{25}
\]
However, when \(X\) and \(Y\) have different probability spaces, and so different probability measures, the joint distribution is more complicated.
Definition 13. The joint sample space of random variables \(X\) and \(Y\) is defined as the product \(\Omega_1 \times \Omega_2\) of all pairs \((\omega_i, \rho_j)\), \(i = 1, 2, \ldots, n\) and \(j = 1, 2, \ldots, m\). The joint \(\sigma\)-field \(\mathcal{F}_1 \times \mathcal{F}_2\) is defined as the product of all pairs \((A_1, A_2)\), where \(A_1\) and \(A_2\) are elements of \(\mathcal{F}_1\) and \(\mathcal{F}_2\), respectively. A joint probability measure \(P_{XY}\) of \(P_1\) and \(P_2\) is a probability measure on \(\mathcal{F}_1 \times \mathcal{F}_2\), \(P_{XY}(A \times B)\), such that for any \(A \subseteq \Omega_1\) and \(B \subseteq \Omega_2\),
\[
P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}),
\]
\[
P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B). \tag{26}
\]
\((\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})\) is called the joint probability space of \(X\) and \(Y\), and \(P_{XY}(\{X = x_i\} \times \{Y = y_j\})\), for \(i = 1, 2, \ldots, K\) and \(j = 1, 2, \ldots, L\), the joint distribution of \(X\) and \(Y\).
Combining Definitions 2 and 13, we immediately obtain the following results.
Proposition 14. A sequence of nonnegative numbers \(p_{ij}\), \(1 \le i \le K\), \(1 \le j \le L\), whose sum is 1 can serve as a probability measure on \(\mathcal{F}_1 \times \mathcal{F}_2\): \(P_{XY}(\omega_i, \rho_j) = p_{ij}\). The probability of any event \(A \times B \subseteq \Omega_1 \times \Omega_2\) is computed simply by adding the probabilities of the individual points \((\omega, \rho) \in A \times B\). If, in addition, for \(i = 1, 2, \ldots, K\) and \(j = 1, 2, \ldots, L\) the following hold:
\[
\sum_{j=1}^{L} p_{ij} = P_X(\omega_i), \qquad \sum_{i=1}^{K} p_{ij} = P_Y(\rho_j), \tag{27}
\]
then \(P_{XY}(\omega_i, \rho_j) = p_{ij}\) is a joint distribution of \(X\) and \(Y\).
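Proposition 14 amounts to two checks on a table of numbers: the entries are nonnegative and sum to 1, and the row and column sums reproduce the prescribed marginals (27). A minimal sketch with made-up marginals and a made-up candidate table:

```python
# Hypothetical prescribed marginals P_X(omega_i) and P_Y(rho_j).
P_X = [0.5, 0.5]
P_Y = [0.3, 0.7]

# A candidate K-by-L table p_ij (nonnegative, sums to 1).
p = [[0.2, 0.3],
     [0.1, 0.4]]

total = sum(sum(row) for row in p)
rows_match = all(abs(sum(p[i]) - P_X[i]) < 1e-12 for i in range(len(P_X)))
cols_match = all(abs(sum(p[i][j] for i in range(len(P_X))) - P_Y[j]) < 1e-12
                 for j in range(len(P_Y)))

# By Proposition 14: a probability measure when total == 1, and a joint
# distribution of X and Y when the marginal conditions (27) also hold.
print(total, rows_match, cols_match)
```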
For convenience, from now on we will shorten \(P_{XY}(\{X = x_i\} \times \{Y = y_j\})\) as \(P_{XY}(x_i, y_j)\).
This two-dimensional measure should not be confused with the one-dimensional joint distribution used when \(X\) and \(Y\) have the same probability space.
Remark 15. If \((\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2)\), instead of using the two-dimensional measure \(P_{XY}(\{X = x_i\} \times \{Y = y_j\})\), we may use the one-dimensional measure \(P_1(X = x_i \text{ and } Y = y_j)\). Then (26) always holds. In this sense, our new definition of joint distribution reduces to the definition of joint distribution with the same probability space.
Definition 16. The conditional probability of \(Y = y_j\), given \(X = x_i\), is defined as
\[
P_{Y|X}(Y = y_j \mid X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \tag{28}
\]
Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure or simply the product distribution.
Proof. Let random variables \(X\) and \(Y\) be defined as before. Define a function from \(\Omega_1 \times \Omega_2\) to \([0, 1]\) as follows:
\[
P_{XY}(\omega_i, \rho_j) = P_1(\omega_i) P_2(\rho_j). \tag{29}
\]
Then
\[
\sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(\omega_i) P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i) \sum_{j=1}^{m} P_2(\rho_j) = 1. \tag{30}
\]
Hence \(P_{XY}\) can serve as a probability measure on \(\mathcal{F}_1 \times \mathcal{F}_2\) by Definition 2. The probability of any event \(A \times B \subseteq \Omega_1 \times \Omega_2\) is computed simply by adding the probabilities of
the individual points \((\omega, \rho) \in A \times B\). Moreover, for any \(A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1\) of \(s\) elements,
\[
P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\{\omega_{i_u}\} \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u}) P_2(\rho_j)
\]
\[
= \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u}) = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A). \tag{31}
\]
Similarly, \(P_{XY}(\Omega_1 \times B) = P_2(B)\) for any \(B \subseteq \Omega_2\). Hence \(P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i) P_2(Y = y_j)\) is a joint probability measure of \(X\) and \(Y\) by Definition 13.
Definition 18. Random variables \(X\) and \(Y\) are said to be independent under a joint distribution \(P_{XY}(\cdot)\) if \(P_{XY}(\cdot)\) coincides with the product distribution \(P_{X \times Y}(\cdot)\).
Definition 19. The joint entropy \(H(X, Y)\) is defined as
\[
H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \tag{32}
\]
Definition 20. The conditional entropy \(H(Y \mid X)\) is defined as
\[
H(Y \mid X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \tag{33}
\]
Definition 21. The mutual information \(I(X; Y)\) between \(X\) and \(Y\) is defined as
\[
I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i) P_2(Y = y_j)}. \tag{34}
\]
As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, \(I(X; Y)\) under one base is proportional to that under another base by the change-of-base formula. Moreover, we take \(0 \log 0\) to be 0; this corresponds to the limit of \(x \log x\) as \(x\) goes to 0.
It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let \(\Omega_1 = \{a_1, a_2, \ldots, a_K\}\) and \(\Omega_2 = \{b_1, b_2, \ldots, b_L\}\). Define random variables \(X: \Omega_1 \to \mathbb{R}\) and \(Y: \Omega_2 \to \mathbb{R}\) as one-to-one mappings:
\[
X(a_i) = x_i, \quad i = 1, 2, \ldots, K, \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \tag{35}
\]
Then we have
\[
P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \tag{36}
\]
It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used for feature selection, as seen later. In addition, our new definition leads to different values for different joint distributions, as demonstrated in the following example.
Example 22. Assume random variables \(X\) and \(Y\) have the following probability distributions:
\[
P_1(X = 1) = \frac{1}{3}, \quad P_1(X = 2) = \frac{1}{3}, \quad P_1(X = 3) = \frac{1}{3}, \qquad P_2(Y = 0) = \frac{1}{3}, \quad P_2(Y = 1) = \frac{2}{3}. \tag{37}
\]
We can generate four different joint probability distributions, leading to different values of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find the mutual information:
(1) \(P(1, 0) = 0\), \(P(1, 1) = 1/3\), \(P(2, 0) = 1/3\), \(P(2, 1) = 0\), \(P(3, 0) = 0\), \(P(3, 1) = 1/3\).
(2) \(P(1, 0) = 0\), \(P(1, 1) = 1/3\), \(P(2, 0) = 0\), \(P(2, 1) = 1/3\), \(P(3, 0) = 1/3\), \(P(3, 1) = 0\).
(3) \(P(1, 0) = 1/3\), \(P(1, 1) = 0\), \(P(2, 0) = 0\), \(P(2, 1) = 1/3\), \(P(3, 0) = 0\), \(P(3, 1) = 1/3\).
(4) \(P(1, 0) = 1/9\), \(P(1, 1) = 2/9\), \(P(2, 0) = 1/9\), \(P(2, 1) = 2/9\), \(P(3, 0) = 1/9\), \(P(3, 1) = 2/9\).
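Example 22 can be checked numerically. The sketch below computes the mutual information under each of the four joint distributions; all four share the same marginals, yet in joints (1)-(3), where \(Y\) is a deterministic function of \(X\), the value equals \(H(Y)\), while joint (4) is the product distribution and gives 0.

```python
import math

def mutual_info(joint):
    """I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

t = 1.0 / 3.0
joints = [
    {(1, 0): 0, (1, 1): t, (2, 0): t, (2, 1): 0, (3, 0): 0, (3, 1): t},
    {(1, 0): 0, (1, 1): t, (2, 0): 0, (2, 1): t, (3, 0): t, (3, 1): 0},
    {(1, 0): t, (1, 1): 0, (2, 0): 0, (2, 1): t, (3, 0): 0, (3, 1): t},
    {(1, 0): 1/9, (1, 1): 2/9, (2, 0): 1/9, (2, 1): 2/9, (3, 0): 1/9, (3, 1): 2/9},
]
for k, j in enumerate(joints, 1):
    print(k, round(mutual_info(j), 4))
```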
4.2. Properties of Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].
Definition 23. The relative entropy, or Kullback-Leibler distance, between two discrete probability distributions \(P = \{p_1, p_2, \ldots, p_n\}\) and \(Q = \{q_1, q_2, \ldots, q_n\}\) is defined as
\[
D(P \parallel Q) = \sum_{i} p_i \log \frac{p_i}{q_i}. \tag{38}
\]
Lemma 24 (see [8]). Let \(P\) and \(Q\) be two discrete probability distributions. Then \(D(P \parallel Q) \ge 0\), with equality if and only if \(p_i = q_i\) for all \(i\).
Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
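The asymmetry noted in Remark 25 is easy to exhibit numerically; a minimal sketch with two made-up distributions:

```python
import math

def kl(P, Q):
    # Relative entropy D(P || Q), equation (38), in bits; 0 log 0 taken as 0.
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
# D(P||Q) and D(Q||P) are both nonnegative but differ, so D is not symmetric.
print(round(kl(P, Q), 4), round(kl(Q, P), 4))
```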
The following property shows that the mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution \(P_{XY}\) and the product distribution \(P_1 P_2\).
Property 1. The mutual information of random variables \(X\) and \(Y\) is the Kullback-Leibler distance between the joint distribution \(P_{XY}\) and the product distribution \(P_1 P_2\).
Proof. Using a mapping from two-dimensional indices to a one-dimensional index,
\[
(i, j) \mapsto (i - 1) L + j \triangleq n, \quad i = 1, \ldots, K, \; j = 1, 2, \ldots, L, \tag{39}
\]
and another mapping from the one-dimensional index back to two-dimensional indices,
\[
i = \left\lceil \frac{n}{L} \right\rceil, \quad j = n - (i - 1) L, \quad n = 1, 2, \ldots, KL, \tag{40}
\]
we rewrite \(I(X; Y)\) as
\[
I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i) P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}) \log \frac{P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L})}{P_1(X = x_{\lceil n/L \rceil}) P_2(Y = y_{n - (\lceil n/L \rceil - 1)L})}. \tag{41}
\]
Since
\[
\sum_{n=1}^{KL} P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1,
\]
\[
\sum_{n=1}^{KL} P_1(X = x_{\lceil n/L \rceil}) P_2(Y = y_{n - (\lceil n/L \rceil - 1)L}) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i) P_2(Y = y_j) = 1, \tag{42}
\]
we obtain
\[
I(X; Y) = \sum_{n=1}^{KL} P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}) \log \frac{P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L})}{P_1(X = x_{\lceil n/L \rceil}) P_2(Y = y_{n - (\lceil n/L \rceil - 1)L})}, \tag{43}
\]
which is the Kullback-Leibler distance between the joint distribution and the product distribution, each viewed as a one-dimensional distribution over the \(KL\) index values.
Property 2. Let \(X\) and \(Y\) be two discrete random variables. The mutual information between \(X\) and \(Y\) satisfies
\[
I(X; Y) \ge 0, \tag{44}
\]
with equality if and only if \(X\) and \(Y\) are independent.
Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, \(I(X; Y) \ge 0\), with equality if and only if \(P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}) = P_1(X = x_{\lceil n/L \rceil}) P_2(Y = y_{n - (\lceil n/L \rceil - 1)L})\) for \(n = 1, 2, \ldots, KL\); that is, \(P_{XY}(x_i, y_j) = P_1(X = x_i) P_2(Y = y_j)\) for \(i = 1, \ldots, K\) and \(j = 1, 2, \ldots, L\); that is, \(X\) and \(Y\) are independent.
Corollary 26. If \(X\) is a constant random variable, that is, \(K = 1\), then for any random variable \(Y\),
\[
I(X; Y) = 0. \tag{45}
\]
Proof. Suppose the range of \(X\) is a constant \(x\) and the sample space has only one point \(\omega\). Then \(P_1(X = x) = P_1(\omega) = 1\). For any \(j = 1, 2, \ldots, L\),
\[
P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x) P_2(Y = y_j). \tag{46}
\]
Thus \(X\) and \(Y\) are independent. By Property 2, \(I(X; Y) = 0\).
Lemma 27 (see [8]). Let \(X\) be a discrete random variable with \(K\) values. Then
\[
0 \le H(X) \le \log K, \tag{47}
\]
with the upper equality holding if and only if the \(K\) values are equally probable.
Property 3. Let \(X\) and \(Y\) be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
\[
I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y; X). \tag{48}
\]
Proof. Consider
\[
I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i) P_2(Y = y_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(Y = y_j \mid X = x_i)}{P_2(Y = y_j)}
\]
\[
= -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) + \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i)
\]
\[
= -\sum_{j=1}^{L} \sum_{i=1}^{K} P_{XY}(x_i, y_j) \log P_2(Y = y_j) - \left( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) \right)
\]
\[
= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - \left( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) \right) = H(Y) - H(Y \mid X). \tag{49}
\]
The equality \(I(X; Y) = H(X) - H(X \mid Y)\) follows in the same way with the roles of \(X\) and \(Y\) exchanged, and \(I(Y; X) = I(X; Y)\) is immediate from the symmetry of (34).
Combining the above properties and noting that \(H(X \mid Y)\) and \(H(Y \mid X)\) are both nonnegative, we obtain the following property.
Property 4. Let \(X\) and \(Y\) be two discrete random variables with \(K\) and \(L\) values, respectively. Then
\[
0 \le I(X; Y) \le H(Y) \le \log L, \qquad 0 \le I(X; Y) \le H(X) \le \log K. \tag{50}
\]
Moreover, \(I(X; Y) = 0\) if and only if \(X\) and \(Y\) are independent.
5. Newly Defined Mutual Information in Machine Learning
Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set \(S\) contains \(N\) examples (also known as samples, observations, or records) from an input space \(X = \{X_1, X_2, \ldots, X_M\}\) and their associated output values \(y\) from an output space \(Y\) (i.e., the dependent variable). Here \(X_1, X_2, \ldots, X_M\) are called features, that is, independent variables. Hence \(S\) can be expressed as
\[
S = \{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i)\}, \quad i = 1, 2, \ldots, N, \tag{51}
\]
where feature \(X_j\) has values \(x_{1j}, x_{2j}, \ldots, x_{Nj}\) for \(j = 1, 2, \ldots, M\).
A fundamental objective in machine learning is to find a functional relationship between input \(X\) and output \(Y\). In general, there are a very large number of features, many of which are not needed. Sometimes the output \(Y\) is not determined by the complete set of input features \(X_1, X_2, \ldots, X_M\); rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:
(1) Evaluate the relationship between each individual input feature \(X_i\) and the output \(Y\).
(2) Select the best set of attributes according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature \(X\) and the output \(Y\). Assume \(X\) has \(K\) different values \(\omega_1, \omega_2, \ldots, \omega_K\). If \(X\) has missing values, we will use \(\omega_1\) to represent all the missing values. Assume \(Y\) has \(L\) different values \(\rho_1, \rho_2, \ldots, \rho_L\).
Let us build a two-way frequency, or contingency, table by making \(X\) the row variable and \(Y\) the column variable, as in [8]. Let \(O_{ij}\) be the frequency (which could be 0) of \((\omega_i, \rho_j)\) for \(i = 1\) to \(K\) and \(j = 1\) to \(L\). Let the row and column marginal totals be \(n_{i\cdot}\) and \(n_{\cdot j}\), respectively. Then
\[
n_{i\cdot} = \sum_{j} O_{ij}, \qquad n_{\cdot j} = \sum_{i} O_{ij}, \qquad N = \sum_{i} \sum_{j} O_{ij} = \sum_{i} n_{i\cdot} = \sum_{j} n_{\cdot j}. \tag{52}
\]
Let us denote the relative frequency \(O_{ij}/N\) by \(p_{ij}\); we then have the two-way relative frequency table, Table 2. Since
\[
\sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \tag{53}
\]
\(\{p_{i\cdot}\}_{i=1}^{K}\), \(\{p_{\cdot j}\}_{j=1}^{L}\), and \(\{p_{ij}\}\) can each serve as a probability measure.
Now we can define random variables for \(X\) and \(Y\) as follows. For convenience, we will use the same names \(X\) and \(Y\) for the random variables:
\[
X: (\Omega_1, \mathcal{F}_1, P_X) \to \mathbb{R} \tag{54}
\]
with \(X(\omega_i) = x_i\), where \(\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_K\}\) and \(P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot}\) for \(i = 1, 2, \ldots, K\). Note that \(x_1, x_2, \ldots, x_K\) could be any real numbers as long as they are distinct, to guarantee that \(X\) is a one-to-one mapping. In this case, \(P_X(X = x_i) = P_X(\omega_i)\).
Table 1: Frequency table.

          ρ_1    ρ_2    ···   ρ_j    ···   ρ_L  | Total
  ω_1     O_11   O_12   ···   O_1j   ···   O_1L | n_1·
  ω_2     O_21   O_22   ···   O_2j   ···   O_2L | n_2·
  ···     ···    ···    ···   ···    ···   ···  | ···
  ω_i     O_i1   O_i2   ···   O_ij   ···   O_iL | n_i·
  ···     ···    ···    ···   ···    ···   ···  | ···
  ω_K     O_K1   O_K2   ···   O_Kj   ···   O_KL | n_K·
  Total   n_·1   n_·2   ···   n_·j   ···   n_·L | N
Table 2: Relative frequency table.

          ρ_1    ρ_2    ···   ρ_j    ···   ρ_L  | Total
  ω_1     p_11   p_12   ···   p_1j   ···   p_1L | p_1·
  ω_2     p_21   p_22   ···   p_2j   ···   p_2L | p_2·
  ···     ···    ···    ···   ···    ···   ···  | ···
  ω_i     p_i1   p_i2   ···   p_ij   ···   p_iL | p_i·
  ···     ···    ···    ···   ···    ···   ···  | ···
  ω_K     p_K1   p_K2   ···   p_Kj   ···   p_KL | p_K·
  Total   p_·1   p_·2   ···   p_·j   ···   p_·L | 1
Similarly,
\[
Y: (\Omega_2, \mathcal{F}_2, P_Y) \to \mathbb{R} \tag{55}
\]
with \(Y(\rho_j) = y_j\), where \(\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_L\}\) and \(P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j}\) for \(j = 1, 2, \ldots, L\). Also, \(y_1, y_2, \ldots, y_L\) could be any real numbers as long as they are distinct, to guarantee that \(Y\) is a one-to-one mapping. In this case, \(P_Y(Y = y_j) = P_Y(\rho_j)\).
Now define a mapping \(P_{XY}\) from \(\Omega_1 \times \Omega_2\) to \(\mathbb{R}\) as follows:
\[
P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \tag{56}
\]
Since
\[
\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \tag{57}
\]
\(\{p_{ij}\}\) is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:
\[
I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i) P_Y(\rho_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}. \tag{58}
\]
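Equation (58) can be computed directly from the observed counts \(O_{ij}\). A sketch, assuming a small hypothetical contingency table:

```python
import math

def mutual_info_from_counts(O):
    """I(X;Y) in bits per equation (58), from a K-by-L table of frequencies O_ij."""
    N = sum(sum(row) for row in O)
    n_row = [sum(row) for row in O]            # row marginal totals n_i.
    n_col = [sum(col) for col in zip(*O)]      # column marginal totals n_.j
    I = 0.0
    for i, row in enumerate(O):
        for j, o in enumerate(row):
            if o > 0:                          # 0 log 0 = 0
                p_ij = o / N
                I += p_ij * math.log2(p_ij / ((n_row[i] / N) * (n_col[j] / N)))
    return I

# Hypothetical counts for a feature with K = 3 values and a binary Y (L = 2).
O = [[30, 10],
     [10, 30],
     [10, 10]]
print(round(mutual_info_from_counts(O), 4))
```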
It follows from Corollary 26 that if \(X\) has only one value, then \(I(X; Y) = 0\). On the other hand, if \(X\) has all distinct values, the following result shows that the mutual information reaches its maximum value.
Proposition 28. If all the values of \(X\) are distinct, then \(I(X; Y) = H(Y)\).
Proof. If all the values of \(X\) are distinct, then the number of different values of \(X\) equals the number of observations, that is, \(K = N\). From Tables 1 and 2 we observe the following:
(1) \(O_{ij} = 0\) or 1 for all \(i = 1, 2, \ldots, K\) and \(j = 1, 2, \ldots, L\).
(2) \(p_{ij} = O_{ij}/N = 0\) or \(1/N\) for all \(i = 1, 2, \ldots, K\) and \(j = 1, 2, \ldots, L\).
(3) For each \(j = 1, 2, \ldots, L\), since \(O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}\), there are \(n_{\cdot j}\) nonzero \(O_{ij}\)'s, or equivalently \(n_{\cdot j}\) nonzero \(p_{ij}\)'s.
(4) \(p_{i\cdot} = 1/N\) for \(i = 1, 2, \ldots, K\).
Using the above observations and the fact that \(0 \log 0 = 0\), we have
\[
I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}} = \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot} p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot} p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot} p_{\cdot L}}
\]
\[
= \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N}
\]
\[
= \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}}
\]
\[
= \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}} = p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}} = H(Y). \tag{59}
\]
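A quick numeric check of Proposition 28, using a made-up table in which every \(X\) value occurs exactly once (\(K = N\)), so each \(O_{ij}\) is 0 or 1:

```python
import math

# Each of the N = 4 examples has a distinct X value (K = N); Y is binary.
O = [[1, 0], [0, 1], [1, 0], [0, 1]]   # O_ij is 0 or 1
N = 4
n_col = [sum(col) for col in zip(*O)]  # n_.j = [2, 2]

# H(Y) from the column marginals, and I(X;Y) via equation (58) with p_i. = 1/N.
H_Y = -sum((n / N) * math.log2(n / N) for n in n_col if n > 0)
I = sum((o / N) * math.log2((o / N) / ((1 / N) * (n_col[j] / N)))
        for row in O for j, o in enumerate(row) if o > 0)
print(round(I, 6), round(H_Y, 6))
```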
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15-19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, \(Y\) is a binary variable (good and bad) and may be represented by 0 and 1.
To apply mutual information to credit scoring, we first calculate the mutual information for every pair \((X, Y)\) and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that the mutual information \(I(X; Y)\) is nonnegative and upper bounded by \(\log L\), and that \(I(X; Y) = 0\) if and only if \(X\) and \(Y\) are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with \(Y\) is larger than some threshold, based on needs.
5.2.2. Relative Values. From Property 4 we have \(0 \le I(X; Y)/H(Y) \le 1\). Note that \(I(X; Y)/H(Y)\) is the relative mutual information, which measures how much information \(X\) catches from \(Y\). Thus we may select those features whose relative mutual information \(I(X; Y)/H(Y)\) is larger than some threshold between 0 and 1, based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].
Lemma 29. Let \(p\) and \(q\) be two positive numbers less than or equal to 1. Then
\[
p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \tag{60}
\]
The equality holds if and only if \(p = q\). Moreover, the closer \(p\) is to \(q\), the better the approximation is.
Now let us denote \(N \times I(X; Y)\) by \(\tilde{I}(X; Y)\). Then, applying Lemma 29 with \(p = p_{ij}\) and \(q = p_{i\cdot} p_{\cdot j}\), we obtain
\[
2\tilde{I}(X; Y) = 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N}
\]
\[
\approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \left( O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N} \right) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
\]
\[
= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - 2 \frac{\sum_{i} n_{i\cdot} \sum_{j} n_{\cdot j}}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
\]
\[
= \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2. \tag{61}
\]
The last equation means that the expression \(\sum_{i=1}^{K} \sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N)\) follows a \(\chi^2\) distribution; according to [5], it follows a \(\chi^2\) distribution with \((K - 1)(L - 1)\) degrees of freedom. Hence \(2N \times I(X; Y)\) approximately follows a \(\chi^2\) distribution with \((K - 1)(L - 1)\) degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows using the Chi-square distribution to assign a significance level corresponding to the values of the mutual information and \((K - 1)(L - 1)\).
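The approximation \(2N \times I(X; Y) \approx \chi^2\) (with the natural logarithm) can be checked on a made-up contingency table whose counts are close to independence:

```python
import math

O = [[40, 60],
     [55, 45],
     [50, 50]]                       # hypothetical K = 3 by L = 2 counts
N = sum(sum(row) for row in O)
n_row = [sum(row) for row in O]
n_col = [sum(col) for col in zip(*O)]

# Mutual information with the natural logarithm, as in Section 5.2.3.
I_nat = sum((o / N) * math.log((o / N) / ((n_row[i] / N) * (n_col[j] / N)))
            for i, row in enumerate(O) for j, o in enumerate(row) if o > 0)

# Pearson chi-square statistic with expected counts e_ij = n_i. n_.j / N.
chi2 = sum((o - n_row[i] * n_col[j] / N) ** 2 / (n_row[i] * n_col[j] / N)
           for i, row in enumerate(O) for j, o in enumerate(row))

# The two statistics are close near independence, per equation (61).
print(round(2 * N * I_nat, 3), round(chi2, 3))
```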
The null and alternative hypotheses are as follows:
\(H_0\): \(X\) and \(Y\) are independent (i.e., there is no relationship between them).
\(H_1\): \(X\) and \(Y\) are dependent (i.e., there is a relationship between them).
Mathematical Problems in Engineering 11
The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic
$$\sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij}-n_{i\cdot}n_{\cdot j}/N)^{2}}{n_{i\cdot}n_{\cdot j}/N} \approx 2N \times I(X,Y) \tag{62}$$
is greater than $\chi^2_U$, the upper-tail critical value from a Chi-square distribution with $(K-1)(L-1)$ degrees of freedom. That is:
$$\text{Select feature } X \text{ if } I(X,Y) > \frac{\chi^2_U}{2N}. \tag{63}$$
Take credit scoring for example. In this case, $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_U$ to be 16.9 from a Chi-square table with $(K-1)(L-1) = 9$ degrees of freedom, and we select this feature only if $I(X,Y) > 16.9/(2N)$.

Assume a training set has $N$ examples. We can do feature selection by the following procedure:
(i) Step 1. Choose a level of significance $\alpha$, say 0.05.

(ii) Step 2. Find $K$, the number of values of feature $X$.

(iii) Step 3. Build the contingency table for $X$ and $Y$.

(iv) Step 4. Calculate $I(X,Y)$ from the contingency table.

(v) Step 5. Find $\chi^2_U$ with $(K-1)(L-1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.

(vi) Step 6. Select $X$ if $I(X,Y) > \chi^2_U/(2N)$, and discard it otherwise.

(vii) Step 7. Repeat Steps 2 to 6 for all features.
If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure.
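The steps above can be sketched in code. This is a minimal illustration, assuming natural logarithms and a small hard-coded table of $\chi^2$ critical values at $\alpha = 0.05$; the helper names (`mutual_info`, `select_features`) and the toy data are illustrative, not part of the paper.

```python
import math

# Chi-square upper critical values at alpha = 0.05, indexed by degrees of
# freedom (from a standard table; df = 9 gives the 16.9 used in the text).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
           6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919}

def mutual_info(xs, ys):
    """I(X,Y) in nats, estimated from paired samples via the contingency table."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def select_features(data, target):
    """Steps 2-7: keep feature X if I(X,Y) > chi2_U / (2N)."""
    n = len(target)
    kept = []
    for name, values in data.items():
        k, l = len(set(values)), len(set(target))
        df = (k - 1) * (l - 1)                 # Step 5: degrees of freedom
        threshold = CHI2_95[df] / (2 * n)      # Step 6 cutoff
        if mutual_info(values, target) > threshold:
            kept.append(name)
    return kept

# Hypothetical tiny training set: f1 tracks the target exactly, f2 is independent.
y  = [0, 1, 0, 1] * 25
f1 = [0, 1, 0, 1] * 25
f2 = [0, 0, 1, 1] * 25
print(select_features({"f1": f1, "f2": f2}, y))   # ['f1']
```

In practice the critical-value table would be replaced by a statistical library call, and $\alpha$ would be tuned as the text suggests.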
5.3. Adjustment of Mutual Information in Feature Selection

In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X,Y)$, the more dependent $X$ and $Y$ are. However, Proposition 28 says that if $X$ has all distinct values, then $I(X,Y)$ will reach the maximum value $H(Y)$, and $I(X,Y)/H(Y)$ will reach the maximum value 1.
Therefore, if $X$ has too many different values, one may bin or group these values first. Mutual information is then calculated again from the binned values. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the remaining numerical features.

(iii) Step 3. Select features by mutual information.
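Step 2 can be sketched with a simple equal-frequency scheme; the paper does not prescribe a particular binning algorithm, so `equal_frequency_bins` below is an illustrative helper, not the author's method.

```python
def equal_frequency_bins(values, n_bins):
    """Replace a numeric feature by the index of its equal-frequency bin."""
    order = sorted(values)
    # Bin edges at the empirical quantiles (a simple sketch; heavy ties can
    # make the bins uneven in practice).
    edges = [order[int(len(order) * q / n_bins)] for q in range(1, n_bins)]
    def bin_of(v):
        b = 0
        for e in edges:
            if v >= e:
                b += 1
        return b
    return [bin_of(v) for v in values]

raw = list(range(100))                 # a feature with 100 distinct values
binned = equal_frequency_bins(raw, 5)  # ...reduced to 5 grouped values
print(sorted(set(binned)))             # [0, 1, 2, 3, 4]
```

After binning, $K$ drops from 100 to 5, so $I(X,Y)$ can no longer trivially reach $H(Y)$ through sheer cardinality.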
5.4. Comparison with Existing Feature Selection Methods

There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with $p$ values less than some specific value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, information value is computed as the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.
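The information-value calculation described above can be sketched as follows, assuming the usual weight-of-evidence form $IV = \sum_{\text{bins}} (\%\text{good} - \%\text{bad})\ln(\%\text{good}/\%\text{bad})$; the bin assignments and good/bad labels below are made up for illustration.

```python
import math

def information_value(bins, good_bad):
    """IV = sum over bins of (pct_good - pct_bad) * ln(pct_good / pct_bad)."""
    goods = sum(1 for g in good_bad if g == 1)
    bads  = len(good_bad) - goods
    per_bin = {}
    for b, g in zip(bins, good_bad):
        gd, bd = per_bin.get(b, (0, 0))
        per_bin[b] = (gd + (g == 1), bd + (g == 0))
    iv = 0.0
    for gd, bd in per_bin.values():
        if gd and bd:                 # skip degenerate bins in this sketch
            pg, pb = gd / goods, bd / bads
            iv += (pg - pb) * math.log(pg / pb)
    return iv

bins     = [0, 0, 0, 1, 1, 1, 1, 1]   # hypothetical 2-bin feature
good_bad = [1, 1, 0, 0, 0, 0, 1, 0]   # 1 = good account, 0 = bad account
print(round(information_value(bins, good_bad), 3))
```

Note how the result depends entirely on the bin assignments, which is exactly the sensitivity to binning algorithm and bin size criticized in the text.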
6. Conclusions
In this paper we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions may result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References
[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[3] A. Navot, On the role of feature selection in machine learning [Ph.D. thesis], Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein in 1964, published by Holden-Day, San Francisco, USA).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.

[20] K. Kennedy, Credit scoring using machine learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
random variables $\xi$ and $\eta$. By the product of the distributions $P_\xi(\cdot)$ and $P_\eta(\cdot)$, denoted by $P_{\xi\times\eta}(\cdot)$, we mean the distribution defined on $S_x \times S_y$ by
$$P_{\xi\times\eta}(E \times F) = P_\xi(E) \times P_\eta(F) \tag{14}$$
for $E \in S_x$ and $F \in S_y$. If the joint distribution $P_{\xi\eta}(\cdot)$ coincides with the product distribution $P_{\xi\times\eta}(\cdot)$, the random variables $\xi$ and $\eta$ are said to be independent. If $\xi$ and $\eta$ are discrete random variables, say $X$ and $Y$ contain countably many points $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, then
$$I(\xi,\eta) = \sum_{i,j} P_{\xi\eta}(x_i, y_j)\log\frac{P_{\xi\eta}(x_i, y_j)}{P_\xi(x_i)\,P_\eta(y_j)}. \tag{15}$$
$I$ is called the information of $\xi$ and $\eta$ with respect to one another.
3.2.4. A Modern Definition in Information Theory. Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$. Let $Y$ be a discrete random variable with alphabet $\mathcal{Y}$ and probability mass function $p(y) = \Pr\{Y = y\}$, $y \in \mathcal{Y}$. Suppose $X$ and $Y$ have a joint mass function (joint distribution) $p(x,y)$. Then the mutual information $I(X,Y)$ can be defined as
$$I(X,Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}. \tag{16}$$
3.3. Class 2 Definitions. In Class 2 definitions, random variables are replaced by ensembles, and mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results from Fano [9] and Abramson [10].
Suppose that discrete ensemble $X$ has a sample space $\{a_1, a_2, \ldots, a_K\}$ and discrete ensemble $Y$ has a sample space $\{b_1, b_2, \ldots, b_L\}$. Consider the joint sample space $\{(a_k, b_j): 1 \le k \le K,\ 1 \le j \le L\}$. A probability measure on the joint sample space is given by the joint probability $P_{XY}(a_k, b_j)$, defined for $1 \le k \le K$ and $1 \le j \le L$. The combination of a joint sample space and a probability measure for outcomes $x$ and $y$ is called a joint $XY$ ensemble. Then the marginal probabilities can be found as
$$P_X(a_k) = \sum_{j=1}^{L} P_{XY}(a_k, b_j), \quad k = 1, 2, \ldots, K. \tag{17}$$
In more abbreviated notation, this is written as
$$P(x) = \sum_{y} P(x, y). \tag{18}$$
Likewise,
$$P_Y(b_j) = \sum_{k=1}^{K} P_{XY}(a_k, b_j), \quad j = 1, 2, \ldots, L. \tag{19}$$
In more abbreviated notation, this is written as
$$P(y) = \sum_{x} P(x, y). \tag{20}$$
If $P_X(a_k) > 0$, the conditional probability that outcome $y$ is $b_j$, given that outcome $x$ is $a_k$, is defined as
$$P_{Y|X}(b_j \mid a_k) = \frac{P_{XY}(a_k, b_j)}{P_X(a_k)}. \tag{21}$$
The mutual information between events $x = a_k$ and $y = b_j$ is defined as
$$I_{X;Y}(a_k; b_j) = \log\frac{P_{X|Y}(a_k \mid b_j)}{P_X(a_k)} = \log\frac{P_{XY}(a_k, b_j)}{P_X(a_k)\,P_Y(b_j)} = \log\frac{P_{Y|X}(b_j \mid a_k)}{P_Y(b_j)} = I_{Y;X}(b_j; a_k). \tag{22}$$
Since the mutual information defined above is a random variable on the joint $XY$ ensemble, its mean value, called the average mutual information and denoted by $I(X,Y)$, is given by
$$I(X,Y) = \sum_{k=1}^{K}\sum_{j=1}^{L} P_{XY}(a_k, b_j)\log\frac{P_{XY}(a_k, b_j)}{P_X(a_k)\,P_Y(b_j)}. \tag{23}$$
Remark 12. By means of an information channel consisting of a transmitter with alphabet $A$ of elements $a_i$ ($t$ elements in total) and a receiver with alphabet $B$ of elements $b_j$ ($r$ elements in total), Abramson [10] denoted $H(A) - H(A \mid B) = \sum_{A,B} P(a,b)\log\bigl(P(a,b)/(P(a)P(b))\bigr)$ by $I(A;B)$ and called it the mutual information of $A$ and $B$.
The mutual information $I(X,Y)$ between two continuous random variables $X$ and $Y$ [8] (also called the rate of transmission in [4]) is defined as
$$I(X,Y) = \iint P(x,y)\log\frac{P(x,y)}{P(x)\,P(y)}\,dx\,dy, \tag{24}$$
where $P(x,y)$ is the joint probability density function of $X$ and $Y$, and $P(x)$ and $P(y)$ are the marginal density functions associated with $X$ and $Y$, respectively. The mutual information between two continuous random variables is also called the differential mutual information.
However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.
4. A New Unified Definition of Mutual Information

In Section 3 we reviewed various definitions of mutual information. Shannon's original definition laid the foundation of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.
However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 and Class 2 definitions assume a joint distribution or a joint probability measure exists. Yet they all ignore an important fact: the joint distribution or the joint probability measure is not unique.
4.1. Unified Definition of Mutual Information. Let $X$ be a finite discrete random variable on a discrete probability space $(\Omega_1, \mathcal{F}_1, P_1)$ with $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\}$ and range $\{x_1, x_2, \ldots, x_K\}$ with $K \le n$. Let $Y$ be a discrete random variable on probability space $(\Omega_2, \mathcal{F}_2, P_2)$ with $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\}$ and range $\{y_1, y_2, \ldots, y_L\}$ with $L \le m$.
If $X$ and $Y$ have the same probability space $(\Omega, \mathcal{F}, P)$, then the joint distribution is simply
$$P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x, Y(\omega) = y\}). \tag{25}$$
However, when $X$ and $Y$ have different probability spaces, and hence different probability measures, the joint distribution is more complicated.
Definition 13. The joint sample space of random variables $X$ and $Y$ is defined as the product $\Omega_1 \times \Omega_2$ of all pairs $(\omega_i, \rho_j)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. The joint $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is defined as the product of all pairs $(A_1, A_2)$, where $A_1$ and $A_2$ are elements of $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. A joint probability measure $P_{XY}$ of $P_1$ and $P_2$ is a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$, $P_{XY}(A \times B)$, such that for any $A \subseteq \Omega_1$ and $B \subseteq \Omega_2$,
$$P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}), \qquad P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B). \tag{26}$$
$(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})$ is called the joint probability space of $X$ and $Y$, and $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$, the joint distribution of $X$ and $Y$.
Combining Definitions 2 and 13, we immediately obtain the following result.
Proposition 14. A sequence of nonnegative numbers $p_{ij}$, $1 \le i \le K$, $1 \le j \le L$, whose sum is 1 can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$: $P_{XY}(\omega_i, \rho_j) = p_{ij}$. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. If, in addition, for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$ the following hold:
$$\sum_{j=1}^{L} p_{ij} = P_X(\omega_i), \qquad \sum_{i=1}^{K} p_{ij} = P_Y(\rho_j), \tag{27}$$
then $P_{XY}(\omega_i, \rho_j) = p_{ij}$ is a joint distribution of $X$ and $Y$.
For convenience, from now on we will shorten $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ as $P_{XY}(x_i, y_j)$.
This two-dimensional measure should not be confused with the one-dimensional joint distribution when $X$ and $Y$ have the same probability space.
Remark 15. If $(\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2)$, instead of using the two-dimensional measure $P_{XY}(\{X = x_i\} \times \{Y = y_j\})$ we may use the one-dimensional measure $P_1(\{X = x_i\} \cap \{Y = y_j\})$. Then (26) always holds. In this sense, our new definition of joint distribution reduces to the definition of joint distribution with the same probability space.
Definition 16. The conditional probability of $Y = y_j$ given $X = x_i$ is defined as
$$P_{Y|X}(Y = y_j \mid X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \tag{28}$$
Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure or simply the product distribution.
Proof. Let random variables $X$ and $Y$ be defined as before. Define a function from $\Omega_1 \times \Omega_2$ to $[0,1]$ as follows:
$$P_{XY}(\omega_i, \rho_j) = P_1(\omega_i)\,P_2(\rho_j). \tag{29}$$
Then
$$\sum_{i=1}^{n}\sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n}\sum_{j=1}^{m} P_1(\omega_i)\,P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i)\sum_{j=1}^{m} P_2(\rho_j) = 1. \tag{30}$$
Hence $P_{XY}$ can serve as a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ by Definition 2. The probability of any event $A \times B \subseteq \Omega_1 \times \Omega_2$ is computed simply by adding the probabilities of the individual points $(\omega, \rho) \in A \times B$. Moreover, for any $A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1$ of $s$ elements,
$$\begin{aligned}
P_{XY}(A \times \Omega_2) &= \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m}\sum_{u=1}^{s} P_{XY}(\{\omega_{i_u}\} \times \{\rho_j\}) \\
&= \sum_{j=1}^{m}\sum_{u=1}^{s} P_1(\omega_{i_u})\,P_2(\rho_j) = \sum_{j=1}^{m} P_2(\rho_j)\sum_{u=1}^{s} P_1(\omega_{i_u}) \\
&= \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A).
\end{aligned} \tag{31}$$
Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \subseteq \Omega_2$. Hence $P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i)\,P_2(Y = y_j)$ is a joint probability measure of $X$ and $Y$ by Definition 13.
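Theorem 17's construction is easy to check numerically. The sketch below builds the product measure of (29) for hypothetical marginals $P_1$ and $P_2$ and verifies that the total mass is 1 and that both marginal conditions in (26) hold.

```python
p1 = [0.2, 0.3, 0.5]     # hypothetical P1 on Omega_1 = {w1, w2, w3}
p2 = [0.4, 0.6]          # hypothetical P2 on Omega_2 = {r1, r2}

# The product probability measure of Theorem 17: P_XY(w_i, r_j) = P1(w_i) P2(r_j).
pxy = [[a * b for b in p2] for a in p1]

total = sum(map(sum, pxy))                                        # should be 1
marg1 = [sum(row) for row in pxy]                                 # P_XY({w_i} x Omega_2)
marg2 = [sum(pxy[i][j] for i in range(len(p1))) for j in range(len(p2))]
print(total, marg1, marg2)   # marginals recover p1 and p2, as (26) requires
```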
Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $P_{XY}(\cdot)$ if $P_{XY}(\cdot)$ coincides with the product distribution $P_{X\times Y}(\cdot)$.
Definition 19. The joint entropy $H(X,Y)$ is defined as
$$H(X,Y) = -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{XY}(x_i, y_j). \tag{32}$$
Definition 20. The conditional entropy $H(Y \mid X)$ is defined as
$$\begin{aligned}
H(Y \mid X) &= -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y|X}(Y = y_j \mid X = x_i) \\
&= -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log\frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}.
\end{aligned} \tag{33}$$
Definition 21. The mutual information $I(X,Y)$ between $X$ and $Y$ is defined as
$$I(X,Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log\frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\,P_2(Y = y_j)}. \tag{34}$$
As with other measures in information theory, the base of the logarithm in (34) is left unspecified. Indeed, $I(X,Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0\log 0$ to be 0; this corresponds to the limit of $x\log x$ as $x$ goes to 0.
It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X: \Omega_1 \to \mathbb{R}$ and $Y: \Omega_2 \to \mathbb{R}$ as one-to-one mappings:
$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K; \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \tag{35}$$
Then we have
$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \tag{36}$$
It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used to do feature selection, as seen later. In addition, our new definition leads to different values for different joint distributions, as demonstrated in the following example.
Example 22. Assume random variables $X$ and $Y$ have the following probability distributions:
$$P_1(X = 1) = P_1(X = 2) = P_1(X = 3) = \frac{1}{3}, \qquad P_2(Y = 0) = \frac{1}{3}, \quad P_2(Y = 1) = \frac{2}{3}. \tag{37}$$
We can generate four different joint probability distributions, which do not all give the same value of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find the mutual information:

(1) $P(1,0) = 0$, $P(1,1) = 1/3$, $P(2,0) = 1/3$, $P(2,1) = 0$, $P(3,0) = 0$, $P(3,1) = 1/3$;

(2) $P(1,0) = 0$, $P(1,1) = 1/3$, $P(2,0) = 0$, $P(2,1) = 1/3$, $P(3,0) = 1/3$, $P(3,1) = 0$;

(3) $P(1,0) = 1/3$, $P(1,1) = 0$, $P(2,0) = 0$, $P(2,1) = 1/3$, $P(3,0) = 0$, $P(3,1) = 1/3$;

(4) $P(1,0) = 1/9$, $P(1,1) = 2/9$, $P(2,0) = 1/9$, $P(2,1) = 2/9$, $P(3,0) = 1/9$, $P(3,1) = 2/9$.
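The four joint measures of Example 22 can be checked directly. In the sketch below (our own helper `mi`, natural logarithm), the product measure (4) gives mutual information 0, while the three other couplings happen to give the same positive value; the point is that the value depends on which joint measure is chosen.

```python
import math

px = {1: 1/3, 2: 1/3, 3: 1/3}      # marginal of X, as in (37)
py = {0: 1/3, 1: 2/3}              # marginal of Y, as in (37)

# The four joint measures of Example 22 (zero entries omitted).
joints = [
    {(1, 1): 1/3, (2, 0): 1/3, (3, 1): 1/3},
    {(1, 1): 1/3, (2, 1): 1/3, (3, 0): 1/3},
    {(1, 0): 1/3, (2, 1): 1/3, (3, 1): 1/3},
    {(x, y): px[x] * py[y] for x in px for y in py},   # (4): product measure
]

def mi(joint):
    """Mutual information (34) in nats for a joint measure with the marginals above."""
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

vals = [round(mi(j), 4) for j in joints]
print(vals)   # the product measure gives 0.0; the other couplings are positive
```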
4.2. Properties of Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].

Definition 23. The relative entropy, or Kullback-Leibler distance, between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as
$$D(P \,\|\, Q) = \sum_{i} p_i \log\frac{p_i}{q_i}. \tag{38}$$

Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P \,\|\, Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.
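A minimal sketch of Definition 23 and Lemma 24 (our own helper `kl`, natural logarithm, made-up distributions), which also illustrates the asymmetry noted in Remark 25:

```python
import math

def kl(p, q):
    """Kullback-Leibler distance D(P||Q) of Definition 23, natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]        # made-up distributions
q = [1/3, 1/3, 1/3]
print(round(kl(p, q), 4), round(kl(q, p), 4), kl(p, p))
# D(P||Q) and D(Q||P) are both positive but unequal; D(P||P) = 0
```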
Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
The following property shows that the mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.
Proof. Using a mapping from two-dimensional indices to a one-dimensional index,
$$(i, j) \longmapsto (i-1)L + j \triangleq n, \quad i = 1, \ldots, K,\ j = 1, 2, \ldots, L, \tag{39}$$
and another mapping from the one-dimensional index back to two-dimensional indices,
$$i = \left\lceil \frac{n}{L} \right\rceil, \quad j = n - (i-1)L, \quad n = 1, 2, \ldots, KL, \tag{40}$$
we rewrite $I(X,Y)$ as
$$\begin{aligned}
I(X,Y) &= \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log\frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\,P_2(Y = y_j)} \\
&= \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr)\log\frac{P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L\rceil}\bigr)\,P_2\bigl(Y = y_{n-(\lceil n/L\rceil-1)L}\bigr)}.
\end{aligned} \tag{41}$$
Since
$$\begin{aligned}
\sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}\bigr) &= \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1, \\
\sum_{n=1}^{KL} P_1\bigl(X = x_{\lceil n/L\rceil}\bigr)\,P_2\bigl(Y = y_{n-(\lceil n/L\rceil-1)L}\bigr) &= \sum_{i=1}^{K}\sum_{j=1}^{L} P_1(X = x_i)\,P_2(Y = y_j) = 1,
\end{aligned} \tag{42}$$
both sums in (41) run over two probability distributions on $\{1, 2, \ldots, KL\}$, and by Definition 23 we obtain
$$I(X,Y) = D\bigl(P_{XY} \,\|\, P_1 P_2\bigr). \tag{43}$$
Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies
$$I(X,Y) \ge 0, \tag{44}$$
with equality if and only if $X$ and $Y$ are independent.

Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, $I(X,Y) \ge 0$, with equality if and only if $P_{XY}(x_{\lceil n/L\rceil}, y_{n-(\lceil n/L\rceil-1)L}) = P_1(X = x_{\lceil n/L\rceil})\,P_2(Y = y_{n-(\lceil n/L\rceil-1)L})$ for $n = 1, 2, \ldots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\,P_2(Y = y_j)$ for $i = 1, \ldots, K$ and $j = 1, 2, \ldots, L$, which means $X$ and $Y$ are independent.
Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,
$$I(X,Y) = 0. \tag{45}$$

Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \ldots, L$,
$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\,P_2(Y = y_j). \tag{46}$$
Thus $X$ and $Y$ are independent. By Property 2, $I(X,Y) = 0$.
Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then
$$0 \le H(X) \le \log K, \tag{47}$$
with equality on the right if and only if the $K$ values are equally probable.
Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
$$I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y,X). \tag{48}$$
Proof. Consider
$$\begin{aligned}
I(X,Y) &= \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log\frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\,P_2(Y = y_j)} \\
&= \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log\frac{P_{Y|X}(Y = y_j \mid X = x_i)}{P_2(Y = y_j)} \\
&= -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y|X}(Y = y_j \mid X = x_i)\right) \\
&= -\sum_{j=1}^{L} P_2(Y = y_j)\log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y|X}(Y = y_j \mid X = x_i)\right) \\
&= H(Y) - H(Y \mid X),
\end{aligned} \tag{49}$$
where the second equality uses Definition 16 and the fourth uses $\sum_{i=1}^{K} P_{XY}(x_i, y_j) = P_2(Y = y_j)$ from (26). The identity $I(X,Y) = H(X) - H(X \mid Y)$ follows by symmetry.
Combining the above properties and noting that H(X | Y) and H(Y | X) are both nonnegative, we obtain the following properties.

Property 4. Let X and Y be two discrete random variables with K and L values, respectively. Then

    0 \le I(X; Y) \le H(Y) \le \log L,
    0 \le I(X; Y) \le H(X) \le \log K.   (50)

Moreover, I(X; Y) = 0 if and only if X and Y are independent.
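As a quick numerical sanity check of Property 4, the sketch below verifies both chains of inequalities in (50) on a made-up 2 x 3 joint distribution; the helper names and the example table are ours for illustration, not part of the paper.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a K x L joint probability table p_ij."""
    row = [sum(r) for r in joint]            # marginals p_i.
    col = [sum(c) for c in zip(*joint)]      # marginals p_.j
    return sum(p * math.log2(p / (row[i] * col[j]))
               for i, r in enumerate(joint)
               for j, p in enumerate(r) if p > 0)

# A made-up 2 x 3 joint distribution (K = 2, L = 3).
joint = [[0.2, 0.1, 0.1],
         [0.1, 0.2, 0.3]]
I = mutual_information(joint)
HX = entropy([sum(r) for r in joint])
HY = entropy([sum(c) for c in zip(*joint)])
```

On this example, 0 \le I \le HX \le \log_2 2 and 0 \le I \le HY \le \log_2 3, as (50) requires.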
5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set S contains N examples (also known as samples, observations, or records) from an input space X = \{X_1, X_2, \ldots, X_M\} and their associated output values y from an output space Y (i.e., the dependent variable). Here X_1, X_2, \ldots, X_M are called features, that is, independent variables. Hence, S can be expressed as

    S = \{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i) : i = 1, 2, \ldots, N\},   (51)

where feature X_j has values x_{1j}, x_{2j}, \ldots, x_{Nj} for j = 1, 2, \ldots, M.

A fundamental objective in machine learning is to find a functional relationship between the input X and the output Y. In general, there is a very large number of features, many of which are not needed. Sometimes the output Y is not determined by the complete set of input features X_1, X_2, \ldots, X_M; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features to capture the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature X_j and the output Y.
(2) Select the best set of attributes according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature X and the output Y. Assume X has K different values \omega_1, \omega_2, \ldots, \omega_K. If X has missing values, we will use \omega_1 to represent all the missing values. Assume Y has L different values \rho_1, \rho_2, \ldots, \rho_L.

Let us build a two-way frequency or contingency table by making X the row variable and Y the column variable, as in [8]. Let O_{ij} be the frequency (which could be 0) of (\omega_i, \rho_j) for i = 1 to K and j = 1 to L. Let the row and column marginal totals be n_{i\cdot} and n_{\cdot j}, respectively. Then

    n_{i\cdot} = \sum_j O_{ij},
    n_{\cdot j} = \sum_i O_{ij},
    N = \sum_i \sum_j O_{ij} = \sum_i n_{i\cdot} = \sum_j n_{\cdot j}.   (52)

Let us denote the relative frequency O_{ij}/N by p_{ij}. We then have the two-way relative frequency table; see Table 2. Since

    \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1,   (53)

\{p_{i\cdot}\}_{i=1}^{K}, \{p_{\cdot j}\}_{j=1}^{L}, and \{p_{ij}\} can each serve as a probability measure.
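The construction of Tables 1 and 2 can be sketched in Python as follows; the sample data and the helper name `contingency` are hypothetical, chosen only to illustrate how O_ij, n_i., n_.j, and p_ij are obtained from paired observations.

```python
from collections import Counter

def contingency(xs, ys):
    """Build the K x L frequency table O_ij of Table 1 together with the
    row and column marginal totals n_i. and n_.j from paired samples."""
    rows = sorted(set(xs), key=str)          # the K distinct values of X
    cols = sorted(set(ys), key=str)          # the L distinct values of Y
    counts = Counter(zip(xs, ys))
    O = [[counts[(r, c)] for c in cols] for r in rows]
    n_row = [sum(r) for r in O]              # n_i.
    n_col = [sum(c) for c in zip(*O)]        # n_.j
    return rows, cols, O, n_row, n_col

# Hypothetical paired samples of a categorical feature X and output Y.
xs = ["a", "a", "b", "b", "b", "c"]
ys = ["good", "bad", "good", "good", "bad", "good"]
rows, cols, O, n_row, n_col = contingency(xs, ys)
N = sum(n_row)
p = [[Oij / N for Oij in row] for row in O]  # relative frequencies p_ij (Table 2)
```

All p_ij sum to 1, so, as noted above, they can serve as a probability measure.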
Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables. Define

    X : (\Omega_1, \mathcal{F}_1, P_X) \to R   (54)

as X(\omega_i) = x_i, where \Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_K\} and P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot} for i = 1, 2, \ldots, K. Note that x_1, x_2, \ldots, x_K could be any real numbers as long as they are distinct, to guarantee that X is a one-to-one mapping. In this case, P_X(X = x_i) = P_X(\omega_i).
Table 1: Frequency table.

             ρ_1    ρ_2    ...   ρ_j    ...   ρ_L    Total
    ω_1      O_11   O_12   ...   O_1j   ...   O_1L   n_1.
    ω_2      O_21   O_22   ...   O_2j   ...   O_2L   n_2.
    ...      ...    ...    ...   ...    ...   ...    ...
    ω_i      O_i1   O_i2   ...   O_ij   ...   O_iL   n_i.
    ...      ...    ...    ...   ...    ...   ...    ...
    ω_K      O_K1   O_K2   ...   O_Kj   ...   O_KL   n_K.
    Total    n_.1   n_.2   ...   n_.j   ...   n_.L   N
Table 2: Relative frequency table.

             ρ_1    ρ_2    ...   ρ_j    ...   ρ_L    Total
    ω_1      p_11   p_12   ...   p_1j   ...   p_1L   p_1.
    ω_2      p_21   p_22   ...   p_2j   ...   p_2L   p_2.
    ...      ...    ...    ...   ...    ...   ...    ...
    ω_i      p_i1   p_i2   ...   p_ij   ...   p_iL   p_i.
    ...      ...    ...    ...   ...    ...   ...    ...
    ω_K      p_K1   p_K2   ...   p_Kj   ...   p_KL   p_K.
    Total    p_.1   p_.2   ...   p_.j   ...   p_.L   1
Similarly, define

    Y : (\Omega_2, \mathcal{F}_2, P_Y) \to R   (55)

as Y(\rho_j) = y_j, where \Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_L\} and P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j} for j = 1, 2, \ldots, L. Also, y_1, y_2, \ldots, y_L could be any real numbers as long as they are distinct, to guarantee that Y is a one-to-one mapping. In this case, P_Y(Y = y_j) = P_Y(\rho_j).
Now define a mapping P_{XY} from \Omega_1 \times \Omega_2 to R as follows:

    P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}.   (56)

Since

    \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1,
    \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i),
    \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j),   (57)

\{p_{ij}\} is a joint probability measure by Proposition 14. Finally, we can calculate mutual information as follows:
    I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i) P_Y(\rho_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}.   (58)
It follows from Corollary 26 that if X has only one value, then I(X; Y) = 0. On the other hand, if X has all distinct values, the following result shows that mutual information reaches its maximum value.

Proposition 28. If all the values of X are distinct, then I(X; Y) = H(Y).

Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2, we observe that

(1) O_{ij} = 0 or 1 for all i = 1, 2, \ldots, K and j = 1, 2, \ldots, L;
(2) p_{ij} = O_{ij}/N = 0 or 1/N for all i = 1, 2, \ldots, K and j = 1, 2, \ldots, L;
(3) for each j = 1, 2, \ldots, L, since O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}, there are n_{\cdot j} nonzero O_{ij}'s, or equivalently n_{\cdot j} nonzero p_{ij}'s;
(4) p_{i\cdot} = 1/N for i = 1, 2, \ldots, K.

Using the above observations and the fact that 0 \log 0 = 0, we have
    I(X; Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}
            = \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot} p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot} p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot} p_{\cdot L}}
            = \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N}
            = \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}}
            = \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}}
            = p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}}
            = H(Y).   (59)
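Proposition 28 is easy to confirm numerically. In the sketch below (helper name ours), a table with K = N = 5, where every row of O_ij contains a single 1, yields I(X; Y) equal to H(Y) up to floating-point error.

```python
import math

def mi_and_hy(O):
    """Return (I(X;Y), H(Y)) in bits from a contingency table O_ij."""
    N = sum(sum(r) for r in O)
    n_row = [sum(r) for r in O]
    n_col = [sum(c) for c in zip(*O)]
    I = sum((Oij / N) * math.log2(Oij * N / (n_row[i] * n_col[j]))
            for i, r in enumerate(O) for j, Oij in enumerate(r) if Oij)
    H = -sum((nj / N) * math.log2(nj / N) for nj in n_col if nj)
    return I, H

# K = N = 5: every row of O_ij holds a single 1, i.e., all values of X distinct.
O = [[1, 0], [0, 1], [1, 0], [1, 0], [0, 1]]
I, H = mi_and_hy(O)
```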
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring is used to describe the process of evaluating the risk a customer poses of defaulting on a financial obligation [15-19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable with values good and bad, which may be represented by 0 and 1.
To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that mutual information I(X; Y) is nonnegative and upper bounded by \log L, and that I(X; Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction; in particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with Y is larger than some threshold, chosen based on needs.
5.2.2. Relative Values. From Property 4, we have 0 \le I(X; Y)/H(Y) \le 1. Note that I(X; Y)/H(Y) is the relative mutual information, which measures how much information X captures from Y. Thus, we may select those features whose relative mutual information I(X; Y)/H(Y) is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by a Taylor expansion, as in Kullback's book [5].

Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then

    p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}.   (60)

The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation.
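A small numerical experiment illustrates Lemma 29 and the remark that the closer p is to q, the better the approximation; the particular values 0.50, 0.52, and 0.90 are ours, chosen for illustration.

```python
import math

def rhs(p, q):
    """Right-hand side of (60): (p - q) + (p - q)**2 / (2q)."""
    return (p - q) + (p - q) ** 2 / (2 * q)

# Approximation error of (60) when p is near q versus far from q.
err_near = abs(0.50 * math.log(0.50 / 0.52) - rhs(0.50, 0.52))
err_far = abs(0.50 * math.log(0.50 / 0.90) - rhs(0.50, 0.90))
```

For p = 0.50 and q = 0.52 the error is on the order of 10^{-6}, while for q = 0.90 it is about 0.02.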
Now let us denote N \times I(X; Y) by \tilde{I}(X; Y). Then, applying Lemma 29 with p = p_{ij} and q = p_{i\cdot} p_{\cdot j}, we obtain

    2\tilde{I}(X; Y) = 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot} p_{\cdot j}}
    = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)}
    = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N}
    \approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \left( O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N} \right) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
    = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - 2 \frac{\sum_i n_{i\cdot} \sum_j n_{\cdot j}}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
    = 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}
    = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2.   (61)
The last equality means that the expression \sum_{i=1}^{K} \sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N) follows a \chi^2 distribution. According to [5], it follows a \chi^2 distribution with (K - 1)(L - 1) degrees of freedom. Hence, 2N \times I(X; Y) approximately follows a \chi^2 distribution with (K - 1)(L - 1) degrees of freedom. This is the well-known chi-square test for independence of two random variables. It allows using the chi-square distribution to assign a significance level corresponding to the values of mutual information and (K - 1)(L - 1).
The null and alternative hypotheses are as follows:

H_0: X and Y are independent (i.e., there is no relationship between them).
H_1: X and Y are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the \alpha level of significance if the \chi^2 statistic

    \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X; Y)   (62)

is greater than \chi^2_U, the upper-tail critical value from a chi-square distribution with (K - 1)(L - 1) degrees of freedom. That is,

    select feature X if I(X; Y) > \frac{\chi^2_U}{2N}.   (63)

Take credit scoring, for example. In this case, L = 2. Assume feature X has 10 different values; that is, K = 10. Using a level of significance of \alpha = 0.05, we find \chi^2_U to be 16.9 from a chi-square table with (K - 1)(L - 1) = 9 degrees of freedom, and select this feature only if I(X; Y) > 16.9/(2N).
Assume a training set has N examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance \alpha, say 0.05.
(ii) Step 2. Find K, the number of values of feature X.
(iii) Step 3. Build the contingency table for X and Y.
(iv) Step 4. Calculate I(X; Y) from the contingency table.
(v) Step 5. Find \chi^2_U with (K - 1)(L - 1) degrees of freedom from a chi-square table or any other source, such as SAS.
(vi) Step 6. Select X if I(X; Y) > \chi^2_U/(2N) and discard it otherwise.
(vii) Step 7. Repeat Steps 2-6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance \alpha and reselect features using the procedure.
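Steps 1-7 above can be sketched as follows. The critical values are hardcoded from a standard chi-square table at \alpha = 0.05, the feature data are invented for illustration, and `select` and `mutual_info_nat` are our names, not the paper's.

```python
import math
from collections import Counter

# Upper-tail critical values of the chi-square distribution at alpha = 0.05,
# keyed by degrees of freedom (copied from a standard chi-square table).
CHI2_05 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 9: 16.92}

def mutual_info_nat(xs, ys):
    """Steps 3-4: I(X;Y) in nats, computed from the contingency counts."""
    N = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((n / N) * math.log(n * N / (cx[x] * cy[y]))
               for (x, y), n in cxy.items())

def select(features, y, crit=CHI2_05):
    """Steps 2-6, repeated for all features (Step 7):
    keep X if I(X;Y) > chi2_U / (2N)."""
    N, L = len(y), len(set(y))
    kept = []
    for name, xs in features.items():
        df = (len(set(xs)) - 1) * (L - 1)    # (K-1)(L-1) degrees of freedom
        if mutual_info_nat(xs, y) > crit[df] / (2 * N):
            kept.append(name)
    return kept

# Hypothetical data: "copy" reproduces Y exactly, "noise" is independent of Y.
y = [0, 1] * 20
features = {"copy": list(y), "noise": [0, 0, 1, 1] * 10}
kept = select(features, y)
```

As expected, only the dependent feature survives the test.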
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X; Y), the more dependent X and Y. However, Proposition 28 says that if X has all distinct values, then I(X; Y) reaches the maximum value H(Y), and I(X; Y)/H(Y) reaches the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first. Based on the binned values, mutual information is calculated again. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.
(ii) Step 2. Do binning for the remaining numerical features.
(iii) Step 3. Select features by mutual information.
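The binning step for a numerical feature might look like the following minimal sketch; equal-width binning is only one of many possible choices, used here for illustration.

```python
def equal_width_bins(values, n_bins):
    """Replace each numeric value by an equal-width bin index in 0..n_bins-1,
    a simple stand-in for the binning step before recomputing I(X;Y)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0        # guard against a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

After binning, X has at most n_bins distinct values, so I(X; Y) can no longer be inflated to H(Y) merely by X having one value per observation.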
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with p values less than some specific value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of doing feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is first used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is computed as the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science, USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, USA, 1964).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
of information theory. Kullback's definition used random variables for the first time and was more mathematical and more compact. Although Ash's definition followed Shannon's path, it was more systematic. Pinsker's definition was the most mathematical in that it employed probability theory. Gallager's definition was more general and more rigorous in communication theory. Cover and Thomas's definition is so succinct that it is now a standard definition in information theory.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given from the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker's definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 and Class 2 definitions assume a joint distribution, or a joint probability measure, exists. Yet they all ignore an important fact: the joint distribution, or the joint probability measure, is not unique.
4.1. Unified Definition of Mutual Information. Let X be a finite discrete random variable on a discrete probability space (\Omega_1, \mathcal{F}_1, P_1) with \Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_n\} and range \{x_1, x_2, \ldots, x_K\} with K \le n. Let Y be a discrete random variable on probability space (\Omega_2, \mathcal{F}_2, P_2) with \Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_m\} and range \{y_1, y_2, \ldots, y_L\} with L \le m.
If X and Y have the same probability space (\Omega, \mathcal{F}, P), then the joint distribution is simply

    P_{XY}(X = x, Y = y) = P(\{\omega \in \Omega : X(\omega) = x, Y(\omega) = y\}).   (25)

However, when X and Y have different probability spaces, and so different probability measures, the joint distribution is more complicated.
Definition 13. The joint sample space of random variables X and Y is defined as the product \Omega_1 \times \Omega_2 of all pairs (\omega_i, \rho_j), i = 1, 2, \ldots, n and j = 1, 2, \ldots, m. The joint \sigma-field \mathcal{F}_1 \times \mathcal{F}_2 is defined as the product of all pairs (A_1, A_2), where A_1 and A_2 are elements of \mathcal{F}_1 and \mathcal{F}_2, respectively. A joint probability measure P_{XY} of P_1 and P_2 is a probability measure on \mathcal{F}_1 \times \mathcal{F}_2, P_{XY}(A \times B), such that for any A \subseteq \Omega_1 and B \subseteq \Omega_2,

    P_1(A) = P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}),
    P_2(B) = P_{XY}(\Omega_1 \times B) = \sum_{i=1}^{n} P_{XY}(\{\omega_i\} \times B).   (26)

(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY}) is called the joint probability space of X and Y, and P_{XY}(\{X = x_i\} \times \{Y = y_j\}), for i = 1, 2, \ldots, n and j = 1, 2, \ldots, m, the joint distribution of X and Y.
Combining Definitions 2 and 13, we immediately obtain the following results.

Proposition 14. A sequence of nonnegative numbers p_{ij}, 1 \le i \le K, 1 \le j \le L, whose sum is 1 can serve as a probability measure on \mathcal{F}_1 \times \mathcal{F}_2: P_{XY}(\omega_i, \rho_j) = p_{ij}. The probability of any event A \times B \subseteq \Omega_1 \times \Omega_2 is computed simply by adding the probabilities of the individual points (\omega, \rho) \in A \times B. If, in addition, for i = 1, 2, \ldots, K and j = 1, 2, \ldots, L the following hold:

    \sum_{j=1}^{L} p_{ij} = P_X(\omega_i),
    \sum_{i=1}^{K} p_{ij} = P_Y(\rho_j),   (27)

then P_{XY}(\omega_i, \rho_j) = p_{ij} is a joint distribution of X and Y.
For convenience, from now on we will shorten P_{XY}(\{X = x_i\} \times \{Y = y_j\}) as P_{XY}(x_i, y_j).

This two-dimensional measure should not be confused with the one-dimensional joint distribution when X and Y have the same probability space.
Remark 15. If (\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2), instead of using the two-dimensional measure P_{XY}(\{X = x_i\} \times \{Y = y_j\}), we may use the one-dimensional measure P_1(X = x_i \wedge Y = y_j). Then (26) always holds. In this sense, our new definition of joint distribution reduces to the definition of joint distribution with the same probability space.
Definition 16. The conditional probability of Y = y_j given X = x_i is defined as

    P_{Y|X}(Y = y_j | X = x_i) = \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}.   (28)
Theorem 17. For any two discrete random variables, there is at least one joint probability measure, called the product probability measure, or simply the product distribution.

Proof. Let random variables X and Y be defined as before. Define a function from \Omega_1 \times \Omega_2 to [0, 1] as follows:

    P_{XY}(\omega_i, \rho_j) = P_1(\omega_i) P_2(\rho_j).   (29)
Then

    \sum_{i=1}^{n} \sum_{j=1}^{m} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} P_1(\omega_i) P_2(\rho_j) = \sum_{i=1}^{n} P_1(\omega_i) \sum_{j=1}^{m} P_2(\rho_j) = 1.   (30)
Hence, P_{XY} can serve as a probability measure on \mathcal{F}_1 \times \mathcal{F}_2 by Definition 2. The probability of any event A \times B \subseteq \Omega_1 \times \Omega_2 is computed simply by adding the probabilities of the individual points (\omega, \rho) \in A \times B. Moreover, for any A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1 of s elements,

    P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\})
    = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\{\omega_{i_u}\} \times \{\rho_j\})
    = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u}) P_2(\rho_j)
    = \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u})
    = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A).   (31)

Similarly, P_{XY}(\Omega_1 \times B) = P_2(B) for any B \subseteq \Omega_2. Hence, P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i) P_2(Y = y_j) is a joint probability measure of X and Y by Definition 13.
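Theorem 17's product measure and the marginal identities in (31) can be verified numerically; the marginals `p1` and `p2` below are arbitrary examples on two different sample spaces.

```python
def product_measure(p1, p2):
    """The product distribution of (29): P_XY(w_i, r_j) = P1(w_i) * P2(r_j)."""
    return [[a * b for b in p2] for a in p1]

# Arbitrary marginals on two different sample spaces.
p1 = [0.5, 0.3, 0.2]
p2 = [0.6, 0.4]
P = product_measure(p1, p2)
row_marg = [sum(r) for r in P]            # should recover P1, as in (31)
col_marg = [sum(c) for c in zip(*P)]      # should recover P2
```

Summing out either coordinate recovers the corresponding marginal, so the product measure is indeed a joint probability measure in the sense of Definition 13.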
Definition 18. Random variables X and Y are said to be independent under a joint distribution P_{XY}(\cdot) if P_{XY}(\cdot) coincides with the product distribution P_{X \times Y}(\cdot).
Definition 19. The joint entropy $H(X, Y)$ is defined as

$$H(X, Y) = -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \quad (32)$$
Definition 20. The conditional entropy $H(Y \mid X)$ is as follows:

$$H(Y \mid X) = -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (33)$$
Definition 21. The mutual information $I(X, Y)$ between $X$ and $Y$ is defined as

$$I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)}. \quad (34)$$
As with other measures in information theory, the base of the logarithm in (34) is left unspecified; indeed, $I(X, Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be $0$, which corresponds to the limit of $x \log x$ as $x$ goes to $0$.
It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X: \Omega_1 \to \mathbb{R}$ and $Y: \Omega_2 \to \mathbb{R}$ as one-to-one mappings:

$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K; \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \quad (35)$$

Then we have

$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \quad (36)$$
It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used for feature selection, as seen later. In addition, our new definition can lead to different values of mutual information for different joint distributions, as demonstrated in the following example.
Example 22. Assume random variables $X$ and $Y$ have the following marginal probability distributions:

$$P_1(X = 1) = P_1(X = 2) = P_1(X = 3) = \frac{1}{3}, \qquad P_2(Y = 0) = \frac{1}{3}, \quad P_2(Y = 1) = \frac{2}{3}. \quad (37)$$

We can generate many different joint probability distributions with these marginals, and different joint distributions can give different values of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find mutual information. Four such joint distributions are:

(1) $P(1,0) = 0$, $P(1,1) = 1/3$, $P(2,0) = 1/3$, $P(2,1) = 0$, $P(3,0) = 0$, $P(3,1) = 1/3$;
(2) $P(1,0) = 0$, $P(1,1) = 1/3$, $P(2,0) = 0$, $P(2,1) = 1/3$, $P(3,0) = 1/3$, $P(3,1) = 0$;
(3) $P(1,0) = 1/3$, $P(1,1) = 0$, $P(2,0) = 0$, $P(2,1) = 1/3$, $P(3,0) = 0$, $P(3,1) = 1/3$;
(4) $P(1,0) = 1/9$, $P(1,1) = 2/9$, $P(2,0) = 1/9$, $P(2,1) = 2/9$, $P(3,0) = 1/9$, $P(3,1) = 2/9$.
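The four joint distributions above can be compared under (34); a sketch using natural logarithms (the helper `mutual_information` is ours). Joints (1)-(3) each make $Y$ a deterministic function of the three distinct values of $X$, so all three give $I(X, Y) = H(Y)$, while the product joint (4) gives 0:

```python
import numpy as np

def mutual_information(P_XY):
    """I(X, Y) per (34), with the marginals read off the given joint matrix."""
    px = P_XY.sum(axis=1, keepdims=True)
    py = P_XY.sum(axis=0, keepdims=True)
    mask = P_XY > 0                        # 0 log 0 = 0 convention
    return float((P_XY[mask] * np.log(P_XY[mask] / (px * py)[mask])).sum())

t = 1.0 / 3.0
joints = [
    np.array([[0, t], [t, 0], [0, t]]),        # joint (1)
    np.array([[0, t], [0, t], [t, 0]]),        # joint (2)
    np.array([[t, 0], [0, t], [0, t]]),        # joint (3)
    np.array([[1, 2], [1, 2], [1, 2]]) / 9.0,  # joint (4), the product distribution
]
for k, P in enumerate(joints, 1):
    print(k, round(mutual_information(P), 4))
```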
4.2. Properties of Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].
Definition 23. The relative entropy, or Kullback-Leibler distance, between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as

$$D(P, Q) = \sum_i p_i \log \frac{p_i}{q_i}. \quad (38)$$
Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P, Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.
Remark 25. The Kullback-Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
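Definition 23 and Lemma 24 translate directly into code; a minimal sketch (the helper name `kl_distance` is ours, and the two example distributions are made up):

```python
import numpy as np

def kl_distance(P, Q):
    """D(P, Q) = sum_i p_i log(p_i / q_i), per (38); assumes q_i > 0 wherever p_i > 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0                          # 0 log 0 = 0 convention
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.5, 0.25]
print(kl_distance(P, Q) >= 0)             # True: Lemma 24
print(kl_distance(P, P) == 0.0)           # True: equality iff P = Q
```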
The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.
Proof. Using a mapping from two-dimensional indices to a one-dimensional index,

$$(i, j) \longmapsto (i - 1)L + j \triangleq n \quad \text{for } i = 1, \ldots, K, \; j = 1, 2, \ldots, L, \quad (39)$$

and using the inverse mapping from the one-dimensional index back to two-dimensional indices,

$$i = \left\lceil \frac{n}{L} \right\rceil, \quad j = n - (i - 1)L \quad \text{for } n = 1, 2, \ldots, KL, \quad (40)$$

we rewrite $I(X, Y)$ as

$$I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\!\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right) \log \frac{P_{XY}\!\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right)}{P_1\!\left(X = x_{\lceil n/L \rceil}\right) P_2\!\left(Y = y_{n - (\lceil n/L \rceil - 1)L}\right)}. \quad (41)$$

Since

$$\sum_{n=1}^{KL} P_{XY}\!\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1, \qquad \sum_{n=1}^{KL} P_1\!\left(X = x_{\lceil n/L \rceil}\right) P_2\!\left(Y = y_{n - (\lceil n/L \rceil - 1)L}\right) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \quad (42)$$

both reindexed sequences are probability distributions, and we obtain

$$I(X, Y) = \sum_{n=1}^{KL} P_{XY}\!\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right) \log \frac{P_{XY}\!\left(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\right)}{P_1\!\left(X = x_{\lceil n/L \rceil}\right) P_2\!\left(Y = y_{n - (\lceil n/L \rceil - 1)L}\right)}, \quad (43)$$

which is exactly $D(P_{XY}, P_1 P_2)$ by Definition 23.
Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies

$$I(X, Y) \ge 0, \quad (44)$$

with equality if and only if $X$ and $Y$ are independent.
Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, $I(X, Y) \ge 0$, with equality if and only if $P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}) = P_1(X = x_{\lceil n/L \rceil})\, P_2(Y = y_{n - (\lceil n/L \rceil - 1)L})$ for $n = 1, 2, \ldots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\, P_2(Y = y_j)$ for $i = 1, \ldots, K$ and $j = 1, 2, \ldots, L$, which means $X$ and $Y$ are independent.
Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,

$$I(X, Y) = 0. \quad (45)$$

Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \ldots, L$,

$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\, P_2(Y = y_j). \quad (46)$$

Thus $X$ and $Y$ are independent. By Property 2, $I(X, Y) = 0$.
Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then

$$0 \le H(X) \le \log K, \quad (47)$$

with equality on the right if and only if the $K$ values are equally probable.
Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:

$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X). \quad (48)$$
Proof. Consider

$$I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(y_j \mid x_i)}{P_2(Y = y_j)}$$

$$= -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) + \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(y_j \mid x_i)$$

$$= -\sum_{j=1}^{L}\sum_{i=1}^{K} P_{XY}(x_i, y_j) \log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(y_j \mid x_i)\right)$$

$$= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(y_j \mid x_i)\right)$$

$$= H(Y) - H(Y \mid X). \quad (49)$$

By the symmetry of (34) in $X$ and $Y$, the same argument gives $I(X, Y) = I(Y, X) = H(X) - H(X \mid Y)$.
Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following property.
Property 4. Let $X$ and $Y$ be two discrete random variables with $K$ and $L$ values, respectively. Then

$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \quad (50)$$

Moreover, $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
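Properties 3 and 4 can be verified numerically on any joint distribution; a sketch with a made-up $3 \times 2$ joint (the entropy helper is ours):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                            # 0 log 0 = 0 convention
    return float(-(p * np.log(p)).sum())

# A made-up joint distribution P_XY (rows: values of X, columns: values of Y).
P = np.array([[0.10, 0.20],
              [0.25, 0.05],
              [0.10, 0.30]])
px, py = P.sum(axis=1), P.sum(axis=0)

I = float((P * np.log(P / np.outer(px, py))).sum())         # (34)
H_Y_given_X = -float((P * np.log(P / px[:, None])).sum())   # (33)
H_X_given_Y = -float((P * np.log(P / py[None, :])).sum())

print(np.isclose(I, entropy(py) - H_Y_given_X))   # I(X,Y) = H(Y) - H(Y|X), per (48)
print(np.isclose(I, entropy(px) - H_X_given_Y))   # I(X,Y) = H(X) - H(X|Y)
print(0.0 <= I <= min(entropy(px), entropy(py)))  # bounds in (50)
```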
5. Newly Defined Mutual Information in Machine Learning
Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set $S$ contains $N$ examples (also known as samples, observations, or records) from an input space $X = \{X_1, X_2, \ldots, X_M\}$ and their associated output values $y$ from an output space $Y$ (i.e., the dependent variable). Here $X_1, X_2, \ldots, X_M$ are called features, that is, independent variables. Hence $S$ can be expressed as

$$S = \{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i)\}, \quad i = 1, 2, \ldots, N, \quad (51)$$

where feature $X_j$ has values $x_{1j}, x_{2j}, \ldots, x_{Nj}$ for $j = 1, 2, \ldots, M$.
A fundamental objective in machine learning is to find a functional relationship between the input $X$ and the output $Y$. In general, there is a very large number of features, many of which are not needed. Sometimes the output $Y$ is not determined by the complete set of input features $X_1, X_2, \ldots, X_M$; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature $X_i$ and the output $Y$.
(2) Select the best set of features according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it for feature selection in machine learning. Let us calculate the mutual information between an input feature $X$ and the output $Y$. Assume $X$ has $K$ different values $\omega_1, \omega_2, \ldots, \omega_K$. If $X$ has missing values, we will use $\omega_1$ to represent all the missing values. Assume $Y$ has $L$ different values $\rho_1, \rho_2, \ldots, \rho_L$.
Let us build a two-way frequency, or contingency, table by making $X$ the row variable and $Y$ the column variable, as in [8]; see Table 1. Let $O_{ij}$ be the frequency (which could be 0) of $(\omega_i, \rho_j)$ for $i = 1$ to $K$ and $j = 1$ to $L$. Let the row and column marginal totals be $n_{i\cdot}$ and $n_{\cdot j}$, respectively. Then

$$n_{i\cdot} = \sum_j O_{ij}, \qquad n_{\cdot j} = \sum_i O_{ij}, \qquad N = \sum_i \sum_j O_{ij} = \sum_i n_{i\cdot} = \sum_j n_{\cdot j}. \quad (52)$$
Let us denote the relative frequency $O_{ij}/N$ by $p_{ij}$. We then have the two-way relative frequency table; see Table 2. Since

$$\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \quad (53)$$

$\{p_{i\cdot}\}_{i=1}^{K}$, $\{p_{\cdot j}\}_{j=1}^{L}$, and $\{p_{ij}\}$ can each serve as a probability measure.
Now we can define random variables for $X$ and $Y$ as follows. For convenience, we will use the same names $X$ and $Y$ for the random variables:

$$X: (\Omega_1, \mathcal{F}_1, P_X) \longrightarrow \mathbb{R} \quad (54)$$

with $X(\omega_i) = x_i$, where $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_K\}$ and $P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot}$ for $i = 1, 2, \ldots, K$. Note that $x_1, x_2, \ldots, x_K$ could be any real numbers as long as they are distinct, to guarantee that $X$ is a one-to-one mapping. In this case $P_X(X = x_i) = P_X(\omega_i)$.
Table 1: Frequency table.

            ρ_1     ρ_2     ...     ρ_j     ...     ρ_L     Total
  ω_1       O_11    O_12    ...     O_1j    ...     O_1L    n_1.
  ω_2       O_21    O_22    ...     O_2j    ...     O_2L    n_2.
  ...       ...     ...     ...     ...     ...     ...     ...
  ω_i       O_i1    O_i2    ...     O_ij    ...     O_iL    n_i.
  ...       ...     ...     ...     ...     ...     ...     ...
  ω_K       O_K1    O_K2    ...     O_Kj    ...     O_KL    n_K.
  Total     n_.1    n_.2    ...     n_.j    ...     n_.L    N

Table 2: Relative frequency table.

            ρ_1     ρ_2     ...     ρ_j     ...     ρ_L     Total
  ω_1       p_11    p_12    ...     p_1j    ...     p_1L    p_1.
  ω_2       p_21    p_22    ...     p_2j    ...     p_2L    p_2.
  ...       ...     ...     ...     ...     ...     ...     ...
  ω_i       p_i1    p_i2    ...     p_ij    ...     p_iL    p_i.
  ...       ...     ...     ...     ...     ...     ...     ...
  ω_K       p_K1    p_K2    ...     p_Kj    ...     p_KL    p_K.
  Total     p_.1    p_.2    ...     p_.j    ...     p_.L    1
Similarly,

$$Y: (\Omega_2, \mathcal{F}_2, P_Y) \longrightarrow \mathbb{R} \quad (55)$$

with $Y(\rho_j) = y_j$, where $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_L\}$ and $P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j}$ for $j = 1, 2, \ldots, L$. Also, $y_1, y_2, \ldots, y_L$ could be any real numbers as long as they are distinct, to guarantee that $Y$ is a one-to-one mapping. In this case $P_Y(Y = y_j) = P_Y(\rho_j)$.
Now define a mapping $P_{XY}$ from $\Omega_1 \times \Omega_2$ to $\mathbb{R}$ as follows:

$$P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \quad (56)$$

Since

$$\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \quad (57)$$

$\{p_{ij}\}$ is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:
$$I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\, P_Y(\rho_j)} = \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}. \quad (58)$$

It follows from Corollary 26 that if $X$ has only one value, then $I(X, Y) = 0$. On the other hand, if $X$ has all distinct values, the following result shows that the mutual information reaches its maximum value.
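Equation (58) translates directly into code; a sketch computing $I(X, Y)$ in nats from a table of counts $O_{ij}$ (the $3 \times 2$ table below is a made-up example):

```python
import numpy as np

def mutual_information_from_counts(O):
    """I(X, Y) = sum_ij p_ij log(p_ij / (p_i. p_.j)), per (58), in nats."""
    O = np.asarray(O, float)
    p = O / O.sum()                          # p_ij = O_ij / N
    pi = p.sum(axis=1, keepdims=True)        # row marginals p_i.
    pj = p.sum(axis=0, keepdims=True)        # column marginals p_.j
    mask = p > 0                             # 0 log 0 = 0 convention
    return float((p[mask] * np.log(p[mask] / (pi * pj)[mask])).sum())

# Hypothetical K = 3 by L = 2 contingency table of frequencies.
O = [[30, 10],
     [20, 20],
     [5, 15]]
print(mutual_information_from_counts(O) >= 0)            # True, by Property 2
print(abs(mutual_information_from_counts([[10, 10],
                                          [20, 20]])) < 1e-12)  # True: independent table
```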
Proposition 28. If all the values of $X$ are distinct, then $I(X, Y) = H(Y)$.
Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations, that is, $K = N$. From Tables 1 and 2 we observe the following:

(1) $O_{ij} = 0$ or $1$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;
(2) $p_{ij} = O_{ij}/N = 0$ or $1/N$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;
(3) for each $j = 1, 2, \ldots, L$, since $O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}$, there are $n_{\cdot j}$ nonzero $O_{ij}$'s, or equivalently $n_{\cdot j}$ nonzero $p_{ij}$'s;
(4) $p_{i\cdot} = 1/N$ for $i = 1, 2, \ldots, K$.
Using the above observations and the fact that $0 \log 0 = 0$, we have

$$I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot}\, p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot}\, p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot}\, p_{\cdot L}}$$

$$= \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N}$$

$$= \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}}$$

$$= \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}} = p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}} = H(Y). \quad (59)$$
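A minimal numerical check of Proposition 28, with a made-up table in which $X$ takes $N = 5$ distinct values (one observation per row), so each row of $O$ contains a single 1:

```python
import numpy as np

# Each of the N = 5 observations has its own distinct X value; Y takes 2 values.
O = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)
p = O / O.sum()                                 # p_ij, with every p_i. = 1/N
pi = p.sum(axis=1, keepdims=True)
pj = p.sum(axis=0, keepdims=True)
mask = p > 0                                    # 0 log 0 = 0 convention
I = float((p[mask] * np.log(p[mask] / (pi * pj)[mask])).sum())
H_Y = float(-(pj * np.log(pj)).sum())

print(np.isclose(I, H_Y))   # True: I(X, Y) = H(Y) when all values of X are distinct
```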
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15-19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable (good and bad) and may be represented by 0 and 1.
To apply mutual information to credit scoring, we first calculate the mutual information for every pair $(X, Y)$ and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that the mutual information $I(X, Y)$ is nonnegative and upper bounded by $\log L$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction; in particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with $Y$ is larger than some threshold, chosen based on needs.
5.2.2. Relative Values Method. From Property 4, we have $0 \le I(X, Y)/H(Y) \le 1$. Note that $I(X, Y)/H(Y)$ is the relative mutual information, which measures how much information $X$ captures from $Y$. Thus we may select those features whose relative mutual information $I(X, Y)/H(Y)$ is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].
Lemma 29. Let $p$ and $q$ be two positive numbers less than or equal to 1. Then

$$p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \quad (60)$$

Equality holds if and only if $p = q$. Moreover, the closer $p$ is to $q$, the better the approximation.
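A quick numerical look at Lemma 29 (the probe values are arbitrary); the error shrinks as $p$ approaches $q$:

```python
import math

def lhs(p, q):
    return p * math.log(p / q)                  # left side of (60)

def rhs(p, q):
    return (p - q) + (p - q) ** 2 / (2 * q)     # right side of (60)

errors = [abs(lhs(0.5, q) - rhs(0.5, q)) for q in (0.4, 0.45, 0.49)]
print(errors[0] > errors[1] > errors[2])        # True: closer q, better approximation
print(lhs(0.5, 0.5) == rhs(0.5, 0.5) == 0.0)    # True: equality at p = q
```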
Now let us denote $N \times I(X, Y)$ by $\tilde{I}(X, Y)$. Then, applying Lemma 29, we obtain

$$2\tilde{I}(X, Y) = 2N \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} = 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N}$$

$$\approx 2 \sum_{i=1}^{K}\sum_{j=1}^{L} \left(O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N}\right) + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}$$

$$= 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} - 2\, \frac{\sum_i n_{i\cdot} \sum_j n_{\cdot j}}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N}$$

$$= 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2. \quad (61)$$
The last equality means that the expression $\sum_{i=1}^{K}\sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N)$ follows a $\chi^2$ distribution. According to [5], it follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. Hence $2N \times I(X, Y)$ approximately follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows using the Chi-square distribution to assign a significance level corresponding to the values of mutual information and $(K - 1)(L - 1)$.
The null and alternative hypotheses are as follows:

$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them).
$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic

$$\sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X, Y) \quad (62)$$

is greater than $\chi^2_U$, the upper-tail critical value from a Chi-square distribution with $(K - 1)(L - 1)$ degrees of freedom. That is,

$$\text{Select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \quad (63)$$
Take credit scoring, for example. In this case $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_U$ to be 16.9 from a Chi-square table with $(K - 1)(L - 1) = 9$ degrees of freedom and select this feature only if $I(X, Y) > 16.9/(2N)$.
Assume a training set has $N$ examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.
(ii) Step 2. Find $K$, the number of values of feature $X$.
(iii) Step 3. Build the contingency table for $X$ and $Y$.
(iv) Step 4. Calculate $I(X, Y)$ from the contingency table.
(v) Step 5. Find $\chi^2_U$ with $(K - 1)(L - 1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.
(vi) Step 6. Select $X$ if $I(X, Y) > \chi^2_U/(2N)$ and discard it otherwise.
(vii) Step 7. Repeat Steps 2-6 for all features.
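The steps above can be sketched in a few lines of Python; the critical values below are the standard upper-tail $\chi^2$ values at $\alpha = 0.05$ for degrees of freedom 1 through 9 (including the 16.9 used earlier), and the two example features are made-up data:

```python
import math
from collections import Counter

# Upper-tail chi-square critical values at alpha = 0.05, indexed by degrees of freedom.
CHI2_U_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
             6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919}

def select_feature(xs, ys, chi2_u=CHI2_U_05):
    """Steps 2-6: contingency table, I(X, Y) in nats via (58), then criterion (63)."""
    N = len(xs)
    O = Counter(zip(xs, ys))                      # Step 3: contingency counts O_ij
    n_row, n_col = Counter(xs), Counter(ys)       # marginal totals n_i., n_.j
    I = sum((o / N) * math.log((o / N) / ((n_row[x] / N) * (n_col[y] / N)))
            for (x, y), o in O.items())           # Step 4
    df = (len(n_row) - 1) * (len(n_col) - 1)      # Step 5: (K - 1)(L - 1)
    return I > chi2_u[df] / (2 * N)               # Step 6

ys = [0, 1] * 50                                  # binary output, N = 100
x1 = ['a' if y == 0 else 'b' for y in ys]         # tracks Y exactly
x2 = ['a', 'a', 'b', 'b'] * 25                    # independent of Y
print(select_feature(x1, ys), select_feature(x2, ys))   # True False
```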
If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure.
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X, Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X, Y)$ reaches the maximum value $H(Y)$, and $I(X, Y)/H(Y)$ reaches the maximum value 1.
Therefore, if $X$ has too many different values, one may bin or group these values first; the mutual information is then calculated again from the binned values. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.
(ii) Step 2. Do binning for the remaining numerical features.
(iii) Step 3. Select features by mutual information.
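One simple way to carry out Step 2 for a numerical feature is equal-frequency (quantile) binning; a minimal sketch (the helper is ours, not a method prescribed by the paper):

```python
def quantile_bin(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-frequency cut points."""
    order = sorted(values)
    cuts = [order[round(k * (len(order) - 1) / n_bins)] for k in range(1, n_bins)]
    return [sum(v > c for c in cuts) for v in values]

x = list(range(100))          # a numerical feature with 100 distinct values
binned = quantile_bin(x, 5)   # now only 5 values, so I(X, Y) <= log 5 by Property 4
print(sorted(set(binned)))    # [0, 1, 2, 3, 4]
```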
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with $p$ values less than some specific value. However, this method does not apply to nonlinear models in machine learning.
Another easy way of doing feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.
The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is computed as the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size: different binning algorithms and/or different bin sizes will give different information values.
6. Conclusions
In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions can result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The author has benefited from a brief discussion about probability theory with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate.
References
[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, 1964, Holden-Day, San Francisco, USA).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
6 Mathematical Problems in Engineering
the individual points of $(\omega, \rho) \in A \times B$. Moreover, for any $A = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_s}\} \subseteq \Omega_1$ of $s$ elements,
$$P_{XY}(A \times \Omega_2) = \sum_{j=1}^{m} P_{XY}(A \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_{XY}(\{\omega_{i_u}\} \times \{\rho_j\}) = \sum_{j=1}^{m} \sum_{u=1}^{s} P_1(\omega_{i_u})\, P_2(\rho_j) = \sum_{j=1}^{m} P_2(\rho_j) \sum_{u=1}^{s} P_1(\omega_{i_u}) = \sum_{u=1}^{s} P_1(\omega_{i_u}) = P_1(A). \quad (31)$$
Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \subseteq \Omega_2$. Hence $P_{XY}(\{X = x_i\} \times \{Y = y_j\}) = P_1(X = x_i)\, P_2(Y = y_j)$ is a joint probability measure of $X$ and $Y$ by Definition 13.
Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $P_{XY}(\cdot)$ if $P_{XY}(\cdot)$ coincides with the product distribution $P_{X \times Y}(\cdot)$.
Definition 19. The joint entropy $H(X, Y)$ is defined as
$$H(X, Y) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{XY}(x_i, y_j). \quad (32)$$
Definition 20. The conditional entropy $H(Y \mid X)$ is defined as
$$H(Y \mid X) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(Y = y_j \mid X = x_i) = -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)}. \quad (33)$$
Definition 21. The mutual information $I(X, Y)$ between $X$ and $Y$ is defined as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)}. \quad (34)$$
As with other measures in information theory, the base of the logarithm in (34) is left unspecified; indeed, $I(X, Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be $0$, which corresponds to the limit of $x \log x$ as $x \to 0$.
It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{a_1, a_2, \ldots, a_K\}$ and $\Omega_2 = \{b_1, b_2, \ldots, b_L\}$. Define random variables $X: \Omega_1 \to \mathbb{R}$ and $Y: \Omega_2 \to \mathbb{R}$ as one-to-one mappings:
$$X(a_i) = x_i, \quad i = 1, 2, \ldots, K, \qquad Y(b_j) = y_j, \quad j = 1, 2, \ldots, L. \quad (35)$$
Then we have
$$P_{XY}(x_i, y_j) = P_{XY}(a_i, b_j). \quad (36)$$
It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used for feature selection, as seen later. In addition, our new definition leads to different values for different joint distributions, as demonstrated in the following example.
Example 22. Assume random variables $X$ and $Y$ have the following marginal probability distributions:
$$P_2(Y = 0) = \frac{1}{3}, \quad P_2(Y = 1) = \frac{2}{3}, \qquad P_1(X = 1) = \frac{1}{3}, \quad P_1(X = 2) = \frac{1}{3}, \quad P_1(X = 3) = \frac{1}{3}. \quad (37)$$
We can generate four different joint probability distributions, and they do not all yield the same value of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find mutual information:
(1) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 1/3$, $P(2, 1) = 0$, $P(3, 0) = 0$, $P(3, 1) = 1/3$;
(2) $P(1, 0) = 0$, $P(1, 1) = 1/3$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 1/3$, $P(3, 1) = 0$;
(3) $P(1, 0) = 1/3$, $P(1, 1) = 0$, $P(2, 0) = 0$, $P(2, 1) = 1/3$, $P(3, 0) = 0$, $P(3, 1) = 1/3$;
(4) $P(1, 0) = 1/9$, $P(1, 1) = 2/9$, $P(2, 0) = 1/9$, $P(2, 1) = 2/9$, $P(3, 0) = 1/9$, $P(3, 1) = 2/9$.
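The mutual information under each of these joint distributions can be computed directly from (34). The following sketch (Python, natural logarithm; the helper name `mutual_information` is ours, not from the paper) evaluates all four cases. Distribution (4) is exactly the product distribution, so its mutual information is 0, while (1)–(3) give positive values.

```python
import math

def mutual_information(joint, px, py):
    """I(X, Y) = sum_ij P_XY(x_i, y_j) log(P_XY / (P1 * P2)), taking 0 log 0 = 0."""
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

px = {1: 1/3, 2: 1/3, 3: 1/3}   # marginal distribution P1 of X
py = {0: 1/3, 1: 2/3}           # marginal distribution P2 of Y

# The four joint distributions of Example 22, keyed by (x, y)
joints = [
    {(1, 0): 0, (1, 1): 1/3, (2, 0): 1/3, (2, 1): 0, (3, 0): 0, (3, 1): 1/3},
    {(1, 0): 0, (1, 1): 1/3, (2, 0): 0, (2, 1): 1/3, (3, 0): 1/3, (3, 1): 0},
    {(1, 0): 1/3, (1, 1): 0, (2, 0): 0, (2, 1): 1/3, (3, 0): 0, (3, 1): 1/3},
    {(1, 0): 1/9, (1, 1): 2/9, (2, 0): 1/9, (2, 1): 2/9, (3, 0): 1/9, (3, 1): 2/9},
]

values = [mutual_information(j, px, py) for j in joints]
# Distribution (4) is the product of the marginals, so its mutual information is 0.
```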
4.2. Properties of the Newly Defined Mutual Information. Before we discuss some properties of mutual information, we first introduce the Kullback–Leibler distance [8].
Definition 23. The relative entropy, or Kullback–Leibler distance, between two discrete probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ is defined as
$$D(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}. \quad (38)$$
Lemma 24 (see [8]). Let $P$ and $Q$ be two discrete probability distributions. Then $D(P \| Q) \ge 0$, with equality if and only if $p_i = q_i$ for all $i$.
Remark 25. The Kullback–Leibler distance is not a true distance between distributions, since it is not symmetric and does not satisfy the triangle inequality. Nevertheless, it is often useful to think of relative entropy as a "distance" between distributions.
The following property shows that mutual information under a joint probability measure is the Kullback–Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback–Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.
Proof. Using a mapping from two-dimensional indices to a one-dimensional index,
$$(i, j) \longmapsto (i - 1)L + j \triangleq n, \quad i = 1, \ldots, K, \; j = 1, 2, \ldots, L, \quad (39)$$
and the inverse mapping from the one-dimensional index back to two-dimensional indices,
$$i = \left\lceil \frac{n}{L} \right\rceil, \quad j = n - (i - 1)L, \quad n = 1, 2, \ldots, KL, \quad (40)$$
we rewrite $I(X, Y)$ as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\left(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\right) \log \frac{P_{XY}\left(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\right)}{P_1\left(X = x_{\lceil n/L \rceil}\right) P_2\left(Y = y_{n-(\lceil n/L \rceil - 1)L}\right)}. \quad (41)$$
Since
$$\sum_{n=1}^{KL} P_{XY}\left(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\right) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1, \qquad \sum_{n=1}^{KL} P_1\left(X = x_{\lceil n/L \rceil}\right) P_2\left(Y = y_{n-(\lceil n/L \rceil - 1)L}\right) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \quad (42)$$
both flattened sequences are probability distributions, and we obtain
$$I(X, Y) = \sum_{n=1}^{KL} P_{XY}\left(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\right) \log \frac{P_{XY}\left(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}\right)}{P_1\left(X = x_{\lceil n/L \rceil}\right) P_2\left(Y = y_{n-(\lceil n/L \rceil - 1)L}\right)}, \quad (43)$$
which is exactly the Kullback–Leibler distance between the two distributions in (42).
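Property 1 can be checked numerically: summing the Kullback–Leibler terms over the flattened index gives the same number as the double sum in (34). A minimal sketch, with an arbitrary illustrative joint table (all names are ours):

```python
import math

# Illustrative joint table for (X, Y): rows are values of X, columns values of Y
joint = [[0.10, 0.20],
         [0.25, 0.05],
         [0.15, 0.25]]

p1 = [sum(row) for row in joint]        # marginal distribution P1 of X
p2 = [sum(col) for col in zip(*joint)]  # marginal distribution P2 of Y

# I(X, Y) as the double sum in (34)
mi = sum(p * math.log(p / (p1[i] * p2[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# The same number as D(P_XY || P1 P2) over the flattened index n = (i - 1)L + j
flat_joint = [p for row in joint for p in row]
flat_prod = [a * b for a in p1 for b in p2]
kl = sum(p * math.log(p / q) for p, q in zip(flat_joint, flat_prod) if p > 0)
```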
Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies
$$I(X, Y) \ge 0, \quad (44)$$
with equality if and only if $X$ and $Y$ are independent.
Proof. Let us use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, $I(X, Y) \ge 0$, with equality if and only if $P_{XY}(x_{\lceil n/L \rceil}, y_{n-(\lceil n/L \rceil - 1)L}) = P_1(X = x_{\lceil n/L \rceil})\, P_2(Y = y_{n-(\lceil n/L \rceil - 1)L})$ for $n = 1, 2, \ldots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\, P_2(Y = y_j)$ for $i = 1, \ldots, K$ and $j = 1, 2, \ldots, L$, that is, $X$ and $Y$ are independent.
Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,
$$I(X, Y) = 0. \quad (45)$$
Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \ldots, L$,
$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\, P_2(Y = y_j). \quad (46)$$
Thus $X$ and $Y$ are independent. By Property 2, $I(X, Y) = 0$.
Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then
$$0 \le H(X) \le \log K, \quad (47)$$
with equality on the right if and only if the $K$ values are equally probable.
Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X). \quad (48)$$
Proof. Consider
$$
\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} \\
&= \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(x_i, y_j)}{P_2(Y = y_j)} \\
&= -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) + \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(x_i, y_j) \\
&= -\sum_{j=1}^{L} \sum_{i=1}^{K} P_{XY}(x_i, y_j) \log P_2(Y = y_j) - \left(-\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(x_i, y_j)\right) \\
&= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - \left(-\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(x_i, y_j)\right) \\
&= H(Y) - H(Y \mid X).
\end{aligned} \quad (49)
$$
By symmetry, $I(X, Y) = H(X) - H(X \mid Y) = I(Y, X)$.
Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following property.
Property 4. Let $X$ and $Y$ be two discrete random variables with $K$ and $L$ values, respectively. Then
$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \quad (50)$$
Moreover, $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
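Properties 3 and 4 are easy to sanity-check on a small joint table. The sketch below (our own helper names; natural logarithm; the joint table is illustrative) verifies $I(X, Y) = H(Y) - H(Y \mid X)$ and the bounds $0 \le I(X, Y) \le H(Y) \le \log L$:

```python
import math

def entropy(dist):
    """H(P) = -sum p log p, taking 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Illustrative joint table: K = 3 values of X (rows), L = 2 values of Y (columns)
joint = [[0.20, 0.10],
         [0.10, 0.30],
         [0.05, 0.25]]

p1 = [sum(row) for row in joint]        # marginal of X
p2 = [sum(col) for col in zip(*joint)]  # marginal of Y

mi = sum(p * math.log(p / (p1[i] * p2[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# H(Y | X) = -sum_ij P_XY log(P_XY / P1), as in (33)
h_y_given_x = -sum(p * math.log(p / p1[i])
                   for i, row in enumerate(joint)
                   for p in row if p > 0)
h_y = entropy(p2)
```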
5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to learn automatically from data. In a typical learning setting, a training set $S$ contains $N$ examples (also known as samples, observations, or records) from an input space $X = \{X_1, X_2, \ldots, X_M\}$, together with their associated output values $y$ from an output space $Y$ (i.e., the dependent variable). Here $X_1, X_2, \ldots, X_M$ are called features, that is, independent variables. Hence $S$ can be expressed as
$$S = \{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i) : i = 1, 2, \ldots, N\}, \quad (51)$$
where feature $X_j$ has values $x_{1j}, x_{2j}, \ldots, x_{Nj}$ for $j = 1, 2, \ldots, M$.

A fundamental objective in machine learning is to find a functional relationship between the input $X$ and the output $Y$. In general, there is a very large number of features, many of which are not needed. Sometimes the output $Y$ is not determined by the complete set of input features $X_1, X_2, \ldots, X_M$; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:
(1) evaluate the relationship between each individual input feature $X_i$ and the output $Y$;
(2) select the best set of attributes according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it for feature selection in machine learning. Let us calculate the mutual information between an input feature $X$ and the output $Y$. Assume $X$ has $K$ different values $\omega_1, \omega_2, \ldots, \omega_K$; if $X$ has missing values, we will use $\omega_1$ to represent all the missing values. Assume $Y$ has $L$ different values $\rho_1, \rho_2, \ldots, \rho_L$.
Let us build a two-way frequency, or contingency, table with $X$ as the row variable and $Y$ as the column variable, as in [8]. Let $O_{ij}$ be the frequency (possibly 0) of $(\omega_i, \rho_j)$ for $i = 1$ to $K$ and $j = 1$ to $L$, and let the row and column marginal totals be $n_{i\cdot}$ and $n_{\cdot j}$, respectively. Then
$$n_{i\cdot} = \sum_j O_{ij}, \qquad n_{\cdot j} = \sum_i O_{ij}, \qquad N = \sum_i \sum_j O_{ij} = \sum_i n_{i\cdot} = \sum_j n_{\cdot j}. \quad (52)$$
Let us denote the relative frequency $O_{ij}/N$ by $p_{ij}$. We then have the two-way relative frequency table; see Table 2. Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \quad (53)$$
$\{p_{i\cdot}\}_{i=1}^{K}$, $\{p_{\cdot j}\}_{j=1}^{L}$, and $\{p_{ij}\}$ can each serve as a probability measure.
Now we can define random variables for $X$ and $Y$ as follows. For convenience, we will use the same names $X$ and $Y$ for the random variables:
$$X: (\Omega_1, \mathcal{F}_1, P_X) \longrightarrow \mathbb{R} \quad (54)$$
with $X(\omega_i) = x_i$, where $\Omega_1 = \{\omega_1, \omega_2, \ldots, \omega_K\}$ and $P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot}$ for $i = 1, 2, \ldots, K$. Note that $x_1, x_2, \ldots, x_K$ could be any real numbers as long as they are distinct, to guarantee that $X$ is a one-to-one mapping. In this case, $P_X(X = x_i) = P_X(\omega_i)$.
Table 1: Frequency table.

         ρ_1    ρ_2    ···   ρ_j    ···   ρ_L    Total
ω_1      O_11   O_12   ···   O_1j   ···   O_1L   n_1·
ω_2      O_21   O_22   ···   O_2j   ···   O_2L   n_2·
···      ···    ···    ···   ···    ···   ···    ···
ω_i      O_i1   O_i2   ···   O_ij   ···   O_iL   n_i·
···      ···    ···    ···   ···    ···   ···    ···
ω_K      O_K1   O_K2   ···   O_Kj   ···   O_KL   n_K·
Total    n_·1   n_·2   ···   n_·j   ···   n_·L   N
Table 2: Relative frequency table.

         ρ_1    ρ_2    ···   ρ_j    ···   ρ_L    Total
ω_1      p_11   p_12   ···   p_1j   ···   p_1L   p_1·
ω_2      p_21   p_22   ···   p_2j   ···   p_2L   p_2·
···      ···    ···    ···   ···    ···   ···    ···
ω_i      p_i1   p_i2   ···   p_ij   ···   p_iL   p_i·
···      ···    ···    ···   ···    ···   ···    ···
ω_K      p_K1   p_K2   ···   p_Kj   ···   p_KL   p_K·
Total    p_·1   p_·2   ···   p_·j   ···   p_·L   1
Similarly,
$$Y: (\Omega_2, \mathcal{F}_2, P_Y) \longrightarrow \mathbb{R} \quad (55)$$
with $Y(\rho_j) = y_j$, where $\Omega_2 = \{\rho_1, \rho_2, \ldots, \rho_L\}$ and $P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j}$ for $j = 1, 2, \ldots, L$. Also, $y_1, y_2, \ldots, y_L$ could be any real numbers as long as they are distinct, to guarantee that $Y$ is a one-to-one mapping. In this case, $P_Y(Y = y_j) = P_Y(\rho_j)$.
Now define a mapping $P_{XY}$ from $\Omega_1 \times \Omega_2$ to $\mathbb{R}$ as follows:
$$P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \quad (56)$$
Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \quad (57)$$
$\{p_{ij}\}$ is a joint probability measure by Proposition 14. Finally, we can calculate mutual information as follows:
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\, P_Y(\rho_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}. \quad (58)$$
It follows from Corollary 26 that if $X$ has only one value, then $I(X, Y) = 0$. On the other hand, if $X$ has all distinct values, the following result shows that mutual information reaches its maximum value.

Proposition 28. If all the values of $X$ are distinct, then $I(X, Y) = H(Y)$.
Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations, that is, $K = N$. From Tables 1 and 2, we observe the following:
(1) $O_{ij} = 0$ or $1$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;
(2) $p_{ij} = O_{ij}/N = 0$ or $1/N$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;
(3) for each $j = 1, 2, \ldots, L$, since $O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}$, there are $n_{\cdot j}$ nonzero $O_{ij}$'s, or equivalently $n_{\cdot j}$ nonzero $p_{ij}$'s;
(4) $p_{i\cdot} = 1/N$ for $i = 1, 2, \ldots, K$.
Using the above observations and the fact that $0 \log 0 = 0$, we have
$$
\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot}\, p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot}\, p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot}\, p_{\cdot L}} \\
&= \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N} \\
&= \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}} \\
&= \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}} \\
&= p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}} = H(Y).
\end{aligned} \quad (59)
$$
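Proposition 28 has a practical reading: an ID-like feature whose values are all distinct attains the maximum $I(X, Y) = H(Y)$ on the training set while generalizing to nothing, which motivates the binning adjustment discussed later. A quick numerical illustration (data hypothetical):

```python
import math
from collections import Counter

# An ID-like feature: every value of X is distinct, so K = N.
xs = [101, 102, 103, 104, 105, 106]   # hypothetical customer IDs, all distinct
ys = [0, 0, 1, 1, 1, 1]               # hypothetical binary outcomes

n = len(xs)
o, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
mi = sum((c / n) * math.log((c / n) / ((row[x] / n) * (col[y] / n)))
         for (x, y), c in o.items())
h_y = -sum((c / n) * math.log(c / n) for c in col.values())   # H(Y)
# Proposition 28: mi coincides with h_y, the maximum possible value.
```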
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups, good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable, good and bad, which may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate mutual information for every pair $(X, Y)$ and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that mutual information $I(X, Y)$ is nonnegative and bounded above by $\log L$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction; in particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with $Y$ is larger than some threshold, chosen based on needs.
5.2.2. Relative Values. From Property 4, we have $0 \le I(X, Y)/H(Y) \le 1$. Note that $I(X, Y)/H(Y)$ is relative mutual information, which measures how much information $X$ captures from $Y$. Thus we may select those features whose relative mutual information $I(X, Y)/H(Y)$ is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].
Lemma 29. Let $p$ and $q$ be two positive numbers less than or equal to 1. Then
$$p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \quad (60)$$
The approximation is exact if and only if $p = q$. Moreover, the closer $p$ is to $q$, the better the approximation.
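The quality of the approximation in (60) is easy to probe numerically; for $p$ near $q$ the error is of third order in $p - q$. A minimal check (the numeric values are arbitrary):

```python
import math

def approx(p, q):
    """Right-hand side of (60): (p - q) + (p - q)^2 / (2q)."""
    return (p - q) + (p - q) ** 2 / (2 * q)

p, q = 0.30, 0.25
exact = p * math.log(p / q)
err = abs(exact - approx(p, q))   # small, since p is close to q
```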
Now let us denote $N \times I(X, Y)$ by $\tilde{I}(X, Y)$. Then, applying Lemma 29, we obtain
$$
\begin{aligned}
2\tilde{I}(X, Y) &= 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N} \\
&\approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \left(O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N}\right) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\left(O_{ij} - n_{i\cdot} n_{\cdot j}/N\right)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - 2\, \frac{\sum_i n_{i\cdot} \sum_j n_{\cdot j}}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\left(O_{ij} - n_{i\cdot} n_{\cdot j}/N\right)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\left(O_{ij} - n_{i\cdot} n_{\cdot j}/N\right)^2}{n_{i\cdot} n_{\cdot j}/N} = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\left(O_{ij} - n_{i\cdot} n_{\cdot j}/N\right)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2.
\end{aligned} \quad (61)
$$
The last equality means that the expression $\sum_{i=1}^{K} \sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N)$ follows a $\chi^2$ distribution; according to [5], it has $(K - 1)(L - 1)$ degrees of freedom. Hence $2N \times I(X, Y)$ approximately follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows using the Chi-square distribution to assign a significance level corresponding to the values of mutual information and $(K - 1)(L - 1)$.
The null and alternative hypotheses are as follows:
$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them);
$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic
$$\sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\left(O_{ij} - n_{i\cdot} n_{\cdot j}/N\right)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X, Y) \quad (62)$$
is greater than $\chi^2_U$, the upper-tail critical value from a Chi-square distribution with $(K - 1)(L - 1)$ degrees of freedom. That is,
$$\text{select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \quad (63)$$
Take credit scoring, for example. In this case, $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_U$ to be 16.9 from a Chi-square table with $(K - 1)(L - 1) = 9$ degrees of freedom, and we select this feature only if $I(X, Y) > 16.9/(2N)$.
Assume a training set has $N$ examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.
(ii) Step 2. Find $K$, the number of values of feature $X$.
(iii) Step 3. Build the contingency table for $X$ and $Y$.
(iv) Step 4. Calculate $I(X, Y)$ from the contingency table.
(v) Step 5. Find $\chi^2_U$ with $(K - 1)(L - 1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.
(vi) Step 6. Select $X$ if $I(X, Y) > \chi^2_U/(2N)$ and discard it otherwise.
(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure.
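The steps above can be sketched as follows. This is a sketch, not production code: the $\alpha = 0.05$ upper-tail critical values are hardcoded from a standard Chi-square table (in practice they would come from SAS or a statistics library), and all function names are ours:

```python
import math
from collections import Counter

# Upper-tail chi-square critical values at alpha = 0.05, keyed by degrees of freedom
# (from a standard chi-square table; extend or compute as needed).
CHI2_U_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
             6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919}

def mutual_info_nats(xs, ys):
    """Steps 3-4: I(X, Y) from the contingency table, natural logarithm."""
    n = len(xs)
    o, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (row[x] * col[y]))
               for (x, y), c in o.items())

def select_features(features, ys):
    """Steps 2-7: keep feature X when I(X, Y) > chi2_U / (2N), as in (63)."""
    n, l = len(ys), len(set(ys))
    selected = []
    for name, xs in features.items():
        k = len(set(xs))                        # Step 2: number of values of X
        mi = mutual_info_nats(xs, ys)           # Steps 3-4
        chi2_u = CHI2_U_05[(k - 1) * (l - 1)]   # Step 5: critical value
        if mi > chi2_u / (2 * n):               # Step 6: select or discard
            selected.append(name)
    return selected
```

For a binary target ($L = 2$) and a 10-valued feature, the lookup would use 16.919 at 9 degrees of freedom, matching the worked example above.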
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X, Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X, Y)$ reaches the maximum value $H(Y)$ and $I(X, Y)/H(Y)$ reaches the maximum value 1.

Therefore, if $X$ has too many different values, one may bin or group these values first and recalculate mutual information from the binned values. For numerical variables, we may adopt a three-step process:
(i) Step 1. Select features by removing those with small mutual information.
(ii) Step 2. Do binning for the remaining numerical features.
(iii) Step 3. Select features by mutual information.
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy one is to build a logistic model for each feature with respect to the dependent variable and then select features with $p$ values less than some specified value. However, this method does not apply to nonlinear models in machine learning.
Another easy way of doing feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specified value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.
The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated, and the information value is the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value: information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.
6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions may result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The author has benefited from a brief discussion with DrZhigang Zhou and Dr Fuping Huang of Elevate about prob-ability theory
12 Mathematical Problems in Engineering
References
[1] G D Tourassi E D Frederick M K Markey and C E FloydJr ldquoApplication of the mutual information criterion for featureselection in computer-aided diagnosisrdquoMedical Physics vol 28no 12 pp 2394ndash2402 2001
[2] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003
[3] A Navot On the role of feature selection in machine learning[PhD thesis] Hebrew University 2006
[4] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 no 3 pp 379ndash423 1948
[5] S Kullback Information Theory and Statistics John Wiley ampSons New York NY USA 1959
[6] M S Pinsker Information and Information Stability of RandomVariables and Processes Academy of Science USSR 1960(English Translation by A Feinstein in 1964 and published byHolden-Day San Francisco USA)
[7] R B Ash Information Theory Interscience Publishers NewYork NY USA 1965
[8] T M Cover and J A Thomas Elements of Information TheoryJohn Wiley amp Sons New York NY USA 2nd edition 2006
[9] R M Fano Transmission of Information MIT Press Cam-bridge Mass USA John Wiley amp Sons New York NY USA1961
[10] N Abramson Information Theory and Coding McGraw-HillNew York NY USA 1963
[11] RGGallager InformationTheory andReliable CommunicationJohn Wiley amp Sons New York NY USA 1968
[12] R B Ash and C A Doleans-Dade Probability amp MeasureTheory Academic Press San Diego Calif USA 2nd edition2000
[13] I Braga ldquoA constructive density-ratio approach tomutual infor-mation estimation experiments in feature selectionrdquo Journal ofInformation and Data Management vol 5 no 1 pp 134ndash1432014
[14] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003
[15] M Refaat Credit Risk Scorecards Development and Implemen-tation Using SAS Lulucom New York NY USA 2011
[16] N Siddiqi Credit Risk Scorecards Developing and ImplementingIntelligent Credit Scoring John Wiley amp Sons New York NYUSA 2006
[17] G Zeng ldquoMetric divergence measures and information valuein credit scoringrdquo Journal of Mathematics vol 2013 Article ID848271 10 pages 2013
[18] G Zeng ldquoA rule of thumb for reject inference in credit scoringrdquoMathematical Finance Letters vol 2014 article 2 2014
[19] G Zeng ldquoA necessary condition for a good binning algorithmin credit scoringrdquo Applied Mathematical Sciences vol 8 no 65pp 3229ndash3242 2014
[20] K KennedyCredit scoring usingmachine learning [PhD thesis]School of Computing Dublin Institute of Technology DublinIreland 2013
[21] R J McEliece The Theory of Information and Coding Cam-bridge University Press Cambridge UK Student edition 2004
[22] J H Friedman ldquoGreedy function approximation a gradientboosting machinerdquo The Annals of Statistics vol 29 no 5 pp1189ndash1232 2001
Mathematical Problems in Engineering 7
The following property shows that mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.

Property 1. The mutual information of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $P_{XY}$ and the product distribution $P_1 P_2$.
Proof. Using a mapping from two-dimensional indices to a one-dimensional index,
$$(i, j) \mapsto (i - 1)L + j \triangleq n, \quad i = 1, \dots, K,\ j = 1, 2, \dots, L, \quad (39)$$
and another mapping from the one-dimensional index back to two-dimensional indices,
$$i = \left\lceil \frac{n}{L} \right\rceil, \quad j = n - (i - 1)L, \quad n = 1, 2, \dots, KL, \quad (40)$$
we rewrite $I(X, Y)$ as
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} = \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\bigr) \log \frac{P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L \rceil}\bigr)\, P_2\bigl(Y = y_{n - (\lceil n/L \rceil - 1)L}\bigr)}. \quad (41)$$
Since
$$\sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\bigr) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) = 1, \qquad \sum_{n=1}^{KL} P_1\bigl(X = x_{\lceil n/L \rceil}\bigr)\, P_2\bigl(Y = y_{n - (\lceil n/L \rceil - 1)L}\bigr) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_1(X = x_i)\, P_2(Y = y_j) = 1, \quad (42)$$
we obtain the single sum
$$I(X, Y) = \sum_{n=1}^{KL} P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\bigr) \log \frac{P_{XY}\bigl(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}\bigr)}{P_1\bigl(X = x_{\lceil n/L \rceil}\bigr)\, P_2\bigl(Y = y_{n - (\lceil n/L \rceil - 1)L}\bigr)}, \quad (43)$$
which is the Kullback-Leibler distance between the distributions $P_{XY}$ and $P_1 P_2$.
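The re-indexing in (39) and (40) is simply a bijection between index pairs $(i, j)$ and the flat index $n$. A quick sketch (with hypothetical sizes $K$ and $L$) confirms that the two mappings invert each other:

```python
import math

K, L = 3, 4  # hypothetical sizes

def to_n(i, j):
    # (39): (i, j) -> (i - 1) * L + j
    return (i - 1) * L + j

def from_n(n):
    # (40): i = ceil(n / L), j = n - (i - 1) * L
    i = math.ceil(n / L)
    return i, n - (i - 1) * L

# The mappings are mutually inverse between {1..K} x {1..L} and {1..KL}.
print(all(from_n(to_n(i, j)) == (i, j)
          for i in range(1, K + 1) for j in range(1, L + 1)))  # -> True
```

This is the step that lets the double sum in (41) be rewritten as the single sum in (43).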
Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies
$$I(X, Y) \ge 0, \quad (44)$$
with equality if and only if $X$ and $Y$ are independent.
Proof. Use the mappings between two-dimensional indices and the one-dimensional index from the proof of Property 1. By Lemma 24, $I(X, Y) \ge 0$, with equality if and only if $P_{XY}(x_{\lceil n/L \rceil}, y_{n - (\lceil n/L \rceil - 1)L}) = P_1(X = x_{\lceil n/L \rceil})\, P_2(Y = y_{n - (\lceil n/L \rceil - 1)L})$ for $n = 1, 2, \dots, KL$; that is, $P_{XY}(x_i, y_j) = P_1(X = x_i)\, P_2(Y = y_j)$ for $i = 1, \dots, K$ and $j = 1, 2, \dots, L$, which means $X$ and $Y$ are independent.
Corollary 26. If $X$ is a constant random variable, that is, $K = 1$, then for any random variable $Y$,
$$I(X, Y) = 0. \quad (45)$$

Proof. Suppose the range of $X$ is a constant $x$ and the sample space has only one point $\omega$. Then $P_1(X = x) = P_1(\omega) = 1$. For any $j = 1, 2, \dots, L$,
$$P_{XY}(x, y_j) = \sum_{i=1}^{1} P_{XY}(x, y_j) = P_2(Y = y_j) = P_1(X = x)\, P_2(Y = y_j). \quad (46)$$
Thus $X$ and $Y$ are independent. By Property 2, $I(X, Y) = 0$.
Lemma 27 (see [8]). Let $X$ be a discrete random variable with $K$ values. Then
$$0 \le H(X) \le \log K, \quad (47)$$
with equality if and only if the $K$ values are equally probable.
Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X). \quad (48)$$
Proof. Consider
$$\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_1(X = x_i)\, P_2(Y = y_j)} \\
&= \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log \frac{P_{Y|X}(x_i, y_j)}{P_2(Y = y_j)} \\
&= -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_2(Y = y_j) - \Bigl( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(x_i, y_j) \Bigr) \\
&= -\sum_{j=1}^{L} \Bigl( \sum_{i=1}^{K} P_{XY}(x_i, y_j) \Bigr) \log P_2(Y = y_j) - \Bigl( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(x_i, y_j) \Bigr) \\
&= -\sum_{j=1}^{L} P_2(Y = y_j) \log P_2(Y = y_j) - \Bigl( -\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(x_i, y_j) \log P_{Y|X}(x_i, y_j) \Bigr) \\
&= H(Y) - H(Y \mid X). \quad (49)
\end{aligned}$$
The identity $I(X, Y) = H(X) - H(X \mid Y)$ follows in the same way with the roles of $X$ and $Y$ exchanged, and $I(Y, X) = I(X, Y)$ because the defining double sum is symmetric in $X$ and $Y$.
Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following property.

Property 4. Let $X$ and $Y$ be two discrete random variables with $K$ and $L$ values, respectively. Then
$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \quad (50)$$
Moreover, $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
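Properties 3 and 4 are easy to verify numerically. The following sketch (a hypothetical $2 \times 3$ joint distribution, not from the paper) checks the identity $I(X, Y) = H(Y) - H(Y \mid X)$ and the bounds in (50), using the chain rule $H(Y \mid X) = H(X, Y) - H(X)$:

```python
import math

# Hypothetical 2 x 3 joint distribution p[i][j] (rows: X values, columns: Y values).
p = [[0.2, 0.1, 0.1],
     [0.1, 0.3, 0.2]]

px = [sum(row) for row in p]                          # marginal of X
py = [sum(row[j] for row in p) for j in range(3)]     # marginal of Y

def H(dist):
    """Entropy (natural log), with the convention 0 log 0 = 0."""
    return -sum(q * math.log(q) for q in dist if q > 0)

# Mutual information, as in (48) and (50).
I = sum(p[i][j] * math.log(p[i][j] / (px[i] * py[j]))
        for i in range(2) for j in range(3) if p[i][j] > 0)

H_joint = H([q for row in p for q in row])
H_Y_given_X = H_joint - H(px)                         # chain rule

print(abs(I - (H(py) - H_Y_given_X)) < 1e-12)   # Property 3 -> True
print(0 <= I <= H(py) <= math.log(3))           # Property 4 -> True
```

Both checks hold for any joint distribution, since they restate algebraic identities rather than properties of this particular table.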
5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set $S$ contains $N$ examples (also known as samples, observations, or records) from an input space $X = \{X_1, X_2, \dots, X_M\}$ and their associated output values $y$ from an output space $Y$ (i.e., the dependent variable). Here $X_1, X_2, \dots, X_M$ are called features, that is, independent variables. Hence $S$ can be expressed as
$$S = \{(x_{i1}, x_{i2}, \dots, x_{iM}, y_i) : i = 1, 2, \dots, N\}, \quad (51)$$
where feature $X_j$ has values $x_{1j}, x_{2j}, \dots, x_{Nj}$ for $j = 1, 2, \dots, M$.

A fundamental objective in machine learning is to find a functional relationship between the input $X$ and the output $Y$. In general there is a very large number of features, many of which are not needed; sometimes the output $Y$ is not determined by the complete set of input features $X_1, X_2, \dots, X_M$ but only by a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature $X_i$ and the output $Y$.
(2) Select the best set of attributes according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures the dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature $X$ and the output $Y$. Assume $X$ has $K$ different values $\omega_1, \omega_2, \dots, \omega_K$; if $X$ has missing values, we will use $\omega_1$ to represent all the missing values. Assume $Y$ has $L$ different values $\rho_1, \rho_2, \dots, \rho_L$.

Let us build a two-way frequency (contingency) table with $X$ as the row variable and $Y$ as the column variable, as in [8]. Let $O_{ij}$ be the frequency (which could be 0) of $(\omega_i, \rho_j)$ for $i = 1$ to $K$ and $j = 1$ to $L$, and let the row and column marginal totals be $n_{i\cdot}$ and $n_{\cdot j}$, respectively. Then
$$n_{i\cdot} = \sum_j O_{ij}, \qquad n_{\cdot j} = \sum_i O_{ij}, \qquad N = \sum_i \sum_j O_{ij} = \sum_i n_{i\cdot} = \sum_j n_{\cdot j}. \quad (52)$$
Let us denote the relative frequency $O_{ij}/N$ by $p_{ij}$; we then have the two-way relative frequency table, see Table 2. Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \quad (53)$$
$\{p_{i\cdot}\}_{i=1}^{K}$, $\{p_{\cdot j}\}_{j=1}^{L}$, and $\{p_{ij}\}$ can each serve as a probability measure.
Now we can define random variables for $X$ and $Y$ as follows; for convenience, we will use the same names $X$ and $Y$ for the random variables:
$$X : (\Omega_1, \mathcal{F}_1, P_X) \to R \quad (54)$$
with $X(\omega_i) = x_i$, where $\Omega_1 = \{\omega_1, \omega_2, \dots, \omega_K\}$ and $P_X(\omega_i) = n_{i\cdot}/N = p_{i\cdot}$ for $i = 1, 2, \dots, K$. Note that $x_1, x_2, \dots, x_K$ could be any real numbers as long as they are distinct, to guarantee that $X$ is a one-to-one mapping. In this case $P_X(X = x_i) = P_X(\omega_i)$.
Table 1: Frequency table.

        ρ_1    ρ_2    ...    ρ_j    ...    ρ_L    Total
ω_1     O_11   O_12   ...    O_1j   ...    O_1L   n_1.
ω_2     O_21   O_22   ...    O_2j   ...    O_2L   n_2.
...
ω_i     O_i1   O_i2   ...    O_ij   ...    O_iL   n_i.
...
ω_K     O_K1   O_K2   ...    O_Kj   ...    O_KL   n_K.
Total   n_.1   n_.2   ...    n_.j   ...    n_.L   N

Table 2: Relative frequency table.

        ρ_1    ρ_2    ...    ρ_j    ...    ρ_L    Total
ω_1     p_11   p_12   ...    p_1j   ...    p_1L   p_1.
ω_2     p_21   p_22   ...    p_2j   ...    p_2L   p_2.
...
ω_i     p_i1   p_i2   ...    p_ij   ...    p_iL   p_i.
...
ω_K     p_K1   p_K2   ...    p_Kj   ...    p_KL   p_K.
Total   p_.1   p_.2   ...    p_.j   ...    p_.L   1
Similarly,
$$Y : (\Omega_2, \mathcal{F}_2, P_Y) \to R \quad (55)$$
with $Y(\rho_j) = y_j$, where $\Omega_2 = \{\rho_1, \rho_2, \dots, \rho_L\}$ and $P_Y(\rho_j) = n_{\cdot j}/N = p_{\cdot j}$ for $j = 1, 2, \dots, L$. Also, $y_1, y_2, \dots, y_L$ could be any real numbers as long as they are distinct, to guarantee that $Y$ is a one-to-one mapping. In this case $P_Y(Y = y_j) = P_Y(\rho_j)$.
Now define a mapping $P_{XY}$ from $\Omega_1 \times \Omega_2$ to $R$ as follows:
$$P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \quad (56)$$
Since
$$\sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \quad (57)$$
$\{p_{ij}\}$ is a joint probability measure by Proposition 14. Finally, we can calculate mutual information as follows:
$$I(X, Y) = \sum_{i=1}^{K} \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) \log \frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\, P_Y(\rho_j)} = \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}. \quad (58)$$
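As a sketch of how (58) is computed in practice, the Python function below (all counts hypothetical) evaluates $I(X, Y)$ directly from a frequency table $O_{ij}$, using the natural logarithm and the convention $0 \log 0 = 0$:

```python
import math

def mutual_information(table):
    """I(X, Y) from a K x L frequency table [O_ij], as in (58)."""
    N = sum(sum(row) for row in table)
    n_i = [sum(row) for row in table]                      # row totals n_i.
    n_j = [sum(row[j] for row in table)
           for j in range(len(table[0]))]                  # column totals n_.j
    return sum((O / N) * math.log((O / N) / ((n_i[i] / N) * (n_j[j] / N)))
               for i, row in enumerate(table)
               for j, O in enumerate(row) if O > 0)

# Hypothetical counts: a 3-valued feature against a binary outcome.
print(mutual_information([[30, 10], [20, 20], [10, 30]]))  # positive
# Proportional rows mean independence, so the result is (numerically) zero.
print(mutual_information([[6, 4], [3, 2]]))
```

The second call illustrates Property 2: when every row is proportional to the column totals, each log ratio is zero and so is the mutual information.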
It follows from Corollary 26 that if $X$ has only one value, then $I(X, Y) = 0$. On the other hand, if $X$ has all distinct values, the following result shows that mutual information reaches its maximum value.
Proposition 28. If all the values of $X$ are distinct, then $I(X, Y) = H(Y)$.

Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations, that is, $K = N$. From Tables 1 and 2, we observe the following:

(1) $O_{ij} = 0$ or $1$ for all $i = 1, 2, \dots, K$ and $j = 1, 2, \dots, L$;
(2) $p_{ij} = O_{ij}/N = 0$ or $1/N$ for all $i = 1, 2, \dots, K$ and $j = 1, 2, \dots, L$;
(3) for each $j = 1, 2, \dots, L$, since $O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}$, there are $n_{\cdot j}$ nonzero $O_{ij}$'s, or equivalently $n_{\cdot j}$ nonzero $p_{ij}$'s;
(4) $p_{i\cdot} = 1/N$ for $i = 1, 2, \dots, K$.
Using the above observations and the fact that $0 \log 0 = 0$, we have
$$\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot}\, p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot}\, p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot}\, p_{\cdot L}} \\
&= \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N} \\
&= \sum_{p_{i1} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \ne 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}} \\
&= \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}} \\
&= p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}} = H(Y). \quad (59)
\end{aligned}$$
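Proposition 28 can be sanity-checked with a short script. In the sketch below (hypothetical data), $X$ takes six distinct values over $N = 6$ examples, so every row of the contingency table holds a single 1, and the computed $I(X, Y)$ matches $H(Y)$:

```python
import math

# Six examples, X all distinct (K = N = 6), binary Y.
O = [[1, 0], [0, 1], [1, 0], [1, 0], [0, 1], [0, 1]]
N = len(O)
n_j = [sum(row[j] for row in O) for j in range(2)]     # column totals n_.j

# I(X, Y) with p_ij in {0, 1/N} and p_i. = 1/N, as in the proof of (59).
I = sum((o / N) * math.log((o / N) / ((1 / N) * (n_j[j] / N)))
        for row in O for j, o in enumerate(row) if o)
H_Y = -sum((n / N) * math.log(n / N) for n in n_j)     # entropy of Y

print(abs(I - H_Y) < 1e-12)  # -> True
```

This is the degenerate case motivating the binning adjustment of Section 5.3: a unique-identifier feature looks maximally informative despite predicting nothing.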
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15-19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable (good and bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair $(X, Y)$ and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that mutual information $I(X, Y)$ is nonnegative and upper bounded by $\log L$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction; in particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with $Y$ is larger than some threshold, chosen based on needs.
5.2.2. Relative Values Method. From Property 4, we have $0 \le I(X, Y)/H(Y) \le 1$. Note that $I(X, Y)/H(Y)$ is the relative mutual information, which measures how much information $X$ captures from $Y$. Thus we may select those features whose relative mutual information $I(X, Y)/H(Y)$ is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by a Taylor expansion, as in Kullback's book [5].

Lemma 29. Let $p$ and $q$ be two positive numbers less than or equal to 1. Then
$$p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \quad (60)$$
The equality holds if and only if $p = q$. Moreover, the closer $p$ is to $q$, the better the approximation is.
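A quick numerical check of Lemma 29 (the probed values are arbitrary) shows the approximation error vanishing as $p$ approaches $q$:

```python
import math

def kl_term(p, q):
    """Left-hand side of (60)."""
    return p * math.log(p / q)

def taylor(p, q):
    """Right-hand side of (60): second-order Taylor approximation."""
    return (p - q) + (p - q) ** 2 / (2 * q)

# The error shrinks as p gets closer to q and is exactly 0 at p = q.
for p, q in [(0.5, 0.3), (0.5, 0.45), (0.5, 0.5)]:
    print(abs(kl_term(p, q) - taylor(p, q)))
```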
Now let us denote $N \times I(X, Y)$ by $\tilde{I}(X, Y)$. Then, applying Lemma 29, we obtain
$$\begin{aligned}
2\tilde{I}(X, Y) &= 2N \sum_{i=1}^{K} \sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} = 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N} \\
&\approx 2 \sum_{i=1}^{K} \sum_{j=1}^{L} \Bigl( O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N} \Bigr) + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2 \sum_{i=1}^{K} \sum_{j=1}^{L} O_{ij} - \frac{2}{N} \sum_{i} n_{i\cdot} \sum_{j} n_{\cdot j} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2. \quad (61)
\end{aligned}$$
The last equality in (61) states that $2\tilde{I}(X, Y)$ is approximately the statistic $\sum_{i=1}^{K} \sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N)$, which, according to [5], follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. Hence $2N \times I(X, Y)$ approximately follows a $\chi^2$ distribution with $(K - 1)(L - 1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables, and it allows us to use the Chi-square distribution to assign a significance level corresponding to the values of mutual information and $(K - 1)(L - 1)$.
The null and alternative hypotheses are as follows:

$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them).
$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic
$$\sum_{i=1}^{K} \sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X, Y) \quad (62)$$
is greater than $\chi^2_U$, the upper-tail critical value of a Chi-square distribution with $(K - 1)(L - 1)$ degrees of freedom. That is:
$$\text{Select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \quad (63)$$
Take credit scoring, for example. In this case $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_U$ to be 16.9 from a Chi-square table with $(K - 1)(L - 1) = 9$ degrees of freedom, and we select this feature only if $I(X, Y) > 16.9/(2N)$.
Assume a training set has $N$ examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.
(ii) Step 2. Find $K$, the number of values of feature $X$.
(iii) Step 3. Build the contingency table for $X$ and $Y$.
(iv) Step 4. Calculate $I(X, Y)$ from the contingency table.
(v) Step 5. Find $\chi^2_U$ with $(K - 1)(L - 1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.
(vi) Step 6. Select $X$ if $I(X, Y) > \chi^2_U/(2N)$ and discard it otherwise.
(vii) Step 7. Repeat Steps 2-6 for all features.

If the number of features selected by this procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure.
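The seven steps can be sketched in Python. Everything below is illustrative: the feature data, the helper names, and the small critical-value lookup (upper-tail $\chi^2$ values at $\alpha = 0.05$, copied from a standard table) are assumptions, not part of the paper:

```python
import math

# Upper-tail chi-square critical values at alpha = 0.05, keyed by the
# degrees of freedom (K - 1)(L - 1); values copied from a chi-square table.
CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 9: 16.919}

def contingency(x, y):
    """Step 3: build the K x L frequency table O_ij for one feature."""
    xs, ys = sorted(set(x)), sorted(set(y))
    O = [[0] * len(ys) for _ in xs]
    for xv, yv in zip(x, y):
        O[xs.index(xv)][ys.index(yv)] += 1
    return O

def mutual_information(O):
    """Step 4: I(X, Y) from the contingency table (natural log), per (58)."""
    N = sum(map(sum, O))
    n_i = [sum(row) for row in O]
    n_j = [sum(row[j] for row in O) for j in range(len(O[0]))]
    return sum((o / N) * math.log(o * N / (n_i[i] * n_j[j]))
               for i, row in enumerate(O) for j, o in enumerate(row) if o)

def select_features(features, y, chi2_table=CHI2_05):
    """Steps 2-7: keep a feature X when I(X, Y) > chi2_U / (2N), per (63)."""
    kept = []
    for name, x in features.items():
        O = contingency(x, y)                       # Step 3
        df = (len(O) - 1) * (len(O[0]) - 1)         # (K - 1)(L - 1)
        if mutual_information(O) > chi2_table[df] / (2 * len(y)):  # Steps 4-6
            kept.append(name)
    return kept

# Hypothetical training data: one informative feature, one pure-noise feature.
y = [0] * 10 + [1] * 10
features = {"good": [0] * 10 + [1] * 10, "noise": [0, 1] * 10}
print(select_features(features, y))  # -> ['good']
```

Note that the critical value is looked up per feature, since the degrees of freedom depend on that feature's number of values $K$.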
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X, Y)$, the more dependent $X$ and $Y$ are. However, Proposition 28 says that if $X$ has all distinct values, then $I(X, Y)$ reaches the maximum value $H(Y)$ and $I(X, Y)/H(Y)$ reaches the maximum value 1.

Therefore, if $X$ has too many different values, one may bin or group these values first; mutual information is then recalculated from the binned values. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.
(ii) Step 2. Do binning for the rest of the numerical features.
(iii) Step 3. Select features by mutual information again.
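Step 2 is commonly implemented with quantile binning, which keeps the bins roughly equal in size. A minimal sketch (the function name and the default bin count are illustrative, not from the paper):

```python
def quantile_bin(values, n_bins=5):
    """Map a numeric feature onto n_bins roughly equal-sized bins."""
    ranked = sorted(values)
    # Interior bin edges taken at the empirical quantiles of the data.
    edges = [ranked[len(ranked) * k // n_bins] for k in range(1, n_bins)]
    # Bin label = number of edges the value exceeds.
    return [sum(v > e for e in edges) for v in values]

binned = quantile_bin(list(range(100)))
print(sorted(set(binned)))  # -> [0, 1, 2, 3, 4]
```

After binning, the mutual information between the binned feature and $Y$ is recomputed via (58) and the selection criterion is applied again.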
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select the features whose $p$ values are less than some specific value. However, this method does not carry over to nonlinear models in machine learning.
Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.
The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated, and the information value is the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can offer advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
6. Conclusions

In this paper we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments

The author has benefited from a brief discussion about probability theory with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate.
References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, Calif, USA, 1964).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
8 Mathematical Problems in Engineering
$$
\begin{aligned}
&= -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_2(Y = y_j) + \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y\mid X}(x_i, y_j)\\
&= -\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y\mid X}(x_i, y_j)\right)\\
&= -\sum_{j=1}^{L}\sum_{i=1}^{K} P_{XY}(x_i, y_j)\log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y\mid X}(x_i, y_j)\right)\\
&= -\sum_{j=1}^{L} P_2(Y = y_j)\log P_2(Y = y_j) - \left(-\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(x_i, y_j)\log P_{Y\mid X}(x_i, y_j)\right)\\
&= H(Y) - H(Y \mid X).
\end{aligned}
\tag{49}
$$
Combining the above properties and noting that H(X | Y) and H(Y | X) are both nonnegative, we obtain the following property.

Property 4. Let X and Y be two discrete random variables with K and L values, respectively. Then

$$0 \le I(X, Y) \le H(Y) \le \log L, \qquad 0 \le I(X, Y) \le H(X) \le \log K. \tag{50}$$

Moreover, I(X, Y) = 0 if and only if X and Y are independent.
5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to learn automatically from data. In a typical learning setting, a training set S contains N examples (also known as samples, observations, or records) from an input space X = {X_1, X_2, ..., X_M} and their associated output values y from an output space Y (i.e., the dependent variable). Here X_1, X_2, ..., X_M are called features, that is, independent variables. Hence S can be expressed as

$$S = \{(x_{i1}, x_{i2}, \dots, x_{iM}, y_i) : i = 1, 2, \dots, N\}, \tag{51}$$

where feature X_j has values x_{1j}, x_{2j}, ..., x_{Nj} for j = 1, 2, ..., M.

A fundamental objective in machine learning is to find a functional relationship between the input X and the output Y. In general, there is a very large number of features, many of which are not needed. Sometimes the output Y is not determined by the complete set of input features X_1, X_2, ..., X_M; rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features that captures the relevant information. An easy and natural way to do feature selection is as follows:

(1) Evaluate the relationship between each individual input feature X_i and the output Y.

(2) Select the best set of attributes according to some criterion.
5.1. Calculation of Newly Defined Mutual Information. Since mutual information measures dependency between random variables, we may use it for feature selection in machine learning. Let us calculate the mutual information between an input feature X and the output Y. Assume X has K different values ω_1, ω_2, ..., ω_K. If X has missing values, we will use ω_1 to represent all the missing values. Assume Y has L different values ρ_1, ρ_2, ..., ρ_L.

Let us build a two-way frequency (contingency) table with X as the row variable and Y as the column variable, as in [8]. Let O_ij be the frequency (which could be 0) of (ω_i, ρ_j) for i = 1 to K and j = 1 to L, and let the row and column marginal totals be n_{i·} and n_{·j}, respectively. Then

$$n_{i\cdot} = \sum_j O_{ij}, \qquad n_{\cdot j} = \sum_i O_{ij}, \qquad N = \sum_i \sum_j O_{ij} = \sum_i n_{i\cdot} = \sum_j n_{\cdot j}. \tag{52}$$

Let us denote the relative frequency O_{ij}/N by p_{ij}. We then have the two-way relative frequency table; see Table 2. Since

$$\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} = \sum_{i=1}^{K} p_{i\cdot} = \sum_{j=1}^{L} p_{\cdot j} = 1, \tag{53}$$

$\{p_{i\cdot}\}_{i=1}^{K}$, $\{p_{\cdot j}\}_{j=1}^{L}$, and $\{p_{ij}\}$ can each serve as a probability measure.
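As a sanity check, the frequency table O_ij and its marginals can be built directly from raw samples; here is a minimal NumPy sketch (the toy data is illustrative, not from the paper):

```python
import numpy as np

# Toy feature/output samples (illustrative only).
x = np.array(["a", "a", "b", "b", "b", "c"])
y = np.array([0, 1, 0, 0, 1, 1])

xs, xi = np.unique(x, return_inverse=True)   # K distinct values, like ω_1..ω_K
ys, yj = np.unique(y, return_inverse=True)   # L distinct values, like ρ_1..ρ_L

# Two-way frequency table O_ij (Table 1).
O = np.zeros((len(xs), len(ys)), dtype=int)
np.add.at(O, (xi, yj), 1)

N = O.sum()
n_i = O.sum(axis=1)          # row marginal totals n_i.
n_j = O.sum(axis=0)          # column marginal totals n_.j

# Relative frequency table p_ij (Table 2) and its marginals, per (52)-(53).
p = O / N
p_i, p_j = p.sum(axis=1), p.sum(axis=0)
assert np.isclose(p.sum(), 1.0) and np.isclose(p_i.sum(), 1.0)
```

Each of `p_i`, `p_j`, and `p` sums to 1, matching (53).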
Now we can define random variables for X and Y as follows. For convenience, we will use the same names X and Y for the random variables. Define

$$X : (\Omega_1, \mathcal{F}_1, P_X) \to \mathbb{R} \tag{54}$$

by X(ω_i) = x_i, where Ω_1 = {ω_1, ω_2, ..., ω_K} and P_X(ω_i) = n_{i·}/N = p_{i·} for i = 1, 2, ..., K. Note that x_1, x_2, ..., x_K could be any real numbers as long as they are distinct, to guarantee that X is a one-to-one mapping. In this case, P_X(X = x_i) = P_X(ω_i).
Table 1: Frequency table.

         ρ_1    ρ_2    ...   ρ_j    ...   ρ_L    Total
ω_1      O_11   O_12   ...   O_1j   ...   O_1L   n_1·
ω_2      O_21   O_22   ...   O_2j   ...   O_2L   n_2·
...      ...    ...    ...   ...    ...   ...    ...
ω_i      O_i1   O_i2   ...   O_ij   ...   O_iL   n_i·
...      ...    ...    ...   ...    ...   ...    ...
ω_K      O_K1   O_K2   ...   O_Kj   ...   O_KL   n_K·
Total    n_·1   n_·2   ...   n_·j   ...   n_·L   N
Table 2: Relative frequency table.

         ρ_1    ρ_2    ...   ρ_j    ...   ρ_L    Total
ω_1      p_11   p_12   ...   p_1j   ...   p_1L   p_1·
ω_2      p_21   p_22   ...   p_2j   ...   p_2L   p_2·
...      ...    ...    ...   ...    ...   ...    ...
ω_i      p_i1   p_i2   ...   p_ij   ...   p_iL   p_i·
...      ...    ...    ...   ...    ...   ...    ...
ω_K      p_K1   p_K2   ...   p_Kj   ...   p_KL   p_K·
Total    p_·1   p_·2   ...   p_·j   ...   p_·L   1
Similarly, define

$$Y : (\Omega_2, \mathcal{F}_2, P_Y) \to \mathbb{R} \tag{55}$$

by Y(ρ_j) = y_j, where Ω_2 = {ρ_1, ρ_2, ..., ρ_L} and P_Y(ρ_j) = n_{·j}/N = p_{·j} for j = 1, 2, ..., L. Again, y_1, y_2, ..., y_L could be any real numbers as long as they are distinct, to guarantee that Y is a one-to-one mapping. In this case, P_Y(Y = y_j) = P_Y(ρ_j).
Now define a mapping P_XY from Ω_1 × Ω_2 to ℝ as follows:

$$P_{XY}(\omega_i, \rho_j) = p_{ij} = \frac{O_{ij}}{N}. \tag{56}$$

Since

$$\sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = 1, \qquad \sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j) = \sum_{j=1}^{L} p_{ij} = p_{i\cdot} = P_X(\omega_i), \qquad \sum_{i=1}^{K} P_{XY}(\omega_i, \rho_j) = \sum_{i=1}^{K} p_{ij} = p_{\cdot j} = P_Y(\rho_j), \tag{57}$$

$\{p_{ij}\}$ is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:
$$I(X, Y) = \sum_{i=1}^{K}\sum_{j=1}^{L} P_{XY}(\omega_i, \rho_j)\log\frac{P_{XY}(\omega_i, \rho_j)}{P_X(\omega_i)\,P_Y(\rho_j)} = \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\log\frac{p_{ij}}{p_{i\cdot}\,p_{\cdot j}}. \tag{58}$$
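Formula (58) translates directly into code. Below is a small sketch (the function name is ours), with the convention 0 log 0 = 0 handled by masking the zero cells of the table:

```python
import numpy as np

def mutual_information(O, base=2):
    """I(X,Y) = sum_ij p_ij log(p_ij / (p_i. p_.j)) as in (58), with 0 log 0 = 0."""
    O = np.asarray(O, dtype=float)
    p = O / O.sum()
    p_i = p.sum(axis=1, keepdims=True)    # row marginals p_i.
    p_j = p.sum(axis=0, keepdims=True)    # column marginals p_.j
    mask = p > 0                          # zero cells contribute nothing
    ratio = p[mask] / (p_i * p_j)[mask]
    return float(np.sum(p[mask] * np.log(ratio)) / np.log(base))

print(mutual_information([[1, 1], [1, 1]]))  # independent table -> 0.0
print(mutual_information([[2, 0], [0, 2]]))  # deterministic table -> 1.0 bit
```

The two printed cases illustrate the extremes of Property 4: an independent table gives zero mutual information, and a deterministic one gives H(Y).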
It follows from Corollary 26 that if X has only one value, then I(X, Y) = 0. On the other hand, if X has all distinct values, the following result shows that mutual information reaches its maximum value.

Proposition 28. If all the values of X are distinct, then I(X, Y) = H(Y).
Proof. If all the values of X are distinct, then the number of different values of X equals the number of observations; that is, K = N. From Tables 1 and 2 we observe the following:

(1) O_ij = 0 or 1 for all i = 1, 2, ..., K and j = 1, 2, ..., L;

(2) p_ij = O_ij/N = 0 or 1/N for all i = 1, 2, ..., K and j = 1, 2, ..., L;

(3) for each j = 1, 2, ..., L, since O_1j + O_2j + ... + O_Kj = n_{·j}, there are n_{·j} nonzero O_ij's, or equivalently n_{·j} nonzero p_ij's;

(4) p_{i·} = 1/N for i = 1, 2, ..., K.

Using the above observations and the fact that 0 log 0 = 0, we have
$$
\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\log\frac{p_{ij}}{p_{i\cdot}p_{\cdot j}}\\
&= \sum_{i=1}^{K} p_{i1}\log\frac{p_{i1}}{p_{i\cdot}p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2}\log\frac{p_{i2}}{p_{i\cdot}p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL}\log\frac{p_{iL}}{p_{i\cdot}p_{\cdot L}}\\
&= \sum_{p_{i1}\neq 0} \frac{1}{N}\log\frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2}\neq 0} \frac{1}{N}\log\frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL}\neq 0} \frac{1}{N}\log\frac{1/N}{p_{\cdot L}/N}\\
&= \sum_{p_{i1}\neq 0} \frac{1}{N}\log\frac{1}{p_{\cdot 1}} + \sum_{p_{i2}\neq 0} \frac{1}{N}\log\frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL}\neq 0} \frac{1}{N}\log\frac{1}{p_{\cdot L}}\\
&= \frac{n_{\cdot 1}}{N}\log\frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N}\log\frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N}\log\frac{1}{p_{\cdot L}}\\
&= p_{\cdot 1}\log\frac{1}{p_{\cdot 1}} + p_{\cdot 2}\log\frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L}\log\frac{1}{p_{\cdot L}}\\
&= H(Y).
\end{aligned}
\tag{59}
$$
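Proposition 28 can be checked numerically: when every value of X is distinct (K = N), each row of the contingency table contains a single 1, and the computed I(X, Y) coincides with H(Y). A short sketch with toy data of our choosing:

```python
import numpy as np

# K = N = 5: every value of X occurs once, so each row of O has a single 1.
O = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
N = O.sum()
p = O / N
p_i = p.sum(axis=1, keepdims=True)   # each p_i. = 1/N, as in observation (4)
p_j = p.sum(axis=0)                  # p_.1 = 3/5, p_.2 = 2/5

mask = p > 0                         # 0 log 0 = 0 convention
I = float(np.sum(p[mask] * np.log2((p / (p_i * p_j))[mask])))
H_Y = float(-np.sum(p_j * np.log2(p_j)))
print(I, H_Y)   # the two values agree, as Proposition 28 states
```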
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, Y is a binary variable (good and bad) and may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information for every pair (X, Y) and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that mutual information I(X, Y) is nonnegative and bounded above by log L, and that I(X, Y) = 0 if and only if X and Y are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence we may select those features whose mutual information with Y is larger than some threshold, chosen based on needs.
5.2.2. Relative Values. From Property 4, we have 0 ≤ I(X, Y)/H(Y) ≤ 1. Note that I(X, Y)/H(Y) is the relative mutual information, which measures how much information X captures from Y. Thus we may select those features whose relative mutual information I(X, Y)/H(Y) is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function; it can be proved by Taylor expansion, as in Kullback's book [5].

Lemma 29. Let p and q be two positive numbers less than or equal to 1. Then

$$p\ln\frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \tag{60}$$

The equality holds if and only if p = q. Moreover, the closer p is to q, the better the approximation.
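A quick numerical check of Lemma 29, with test values of our choosing:

```python
import math

def rhs(p, q):
    # Right-hand side of (60): (p - q) + (p - q)^2 / (2q)
    return (p - q) + (p - q) ** 2 / (2 * q)

for p, q in [(0.50, 0.5), (0.52, 0.5), (0.70, 0.5)]:
    exact = p * math.log(p / q)
    print(f"p={p}, q={q}: exact={exact:.6f}, approx={rhs(p, q):.6f}")
# The error grows as p moves away from q, consistent with the lemma.
```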
Now let us denote N × I(X, Y) by Ĩ(X, Y). Then, applying Lemma 29, we obtain

$$
\begin{aligned}
2\tilde{I}(X, Y) &= 2N\sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij}\ln\frac{p_{ij}}{p_{i\cdot}p_{\cdot j}}
= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij}\ln\frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)}
= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij}\ln\frac{O_{ij}}{n_{i\cdot}n_{\cdot j}/N}\\
&\approx 2\sum_{i=1}^{K}\sum_{j=1}^{L}\left(O_{ij} - \frac{n_{i\cdot}n_{\cdot j}}{N}\right) + \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij} - n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N}\\
&= 2\sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} - 2\,\frac{\sum_i n_{i\cdot}\sum_j n_{\cdot j}}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij} - n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N}\\
&= 2N - \frac{2N\cdot N}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij} - n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N}\\
&= \sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij} - n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N} = \chi^2.
\end{aligned}
\tag{61}
$$
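The approximation in (61) can be verified on a toy contingency table: 2N·I(X, Y), computed with the natural logarithm, lands close to Pearson's χ² statistic (the numbers below are illustrative):

```python
import numpy as np

O = np.array([[30.0, 10.0], [20.0, 40.0]])   # toy O_ij, N = 100
N = O.sum()
p = O / N
p_i = p.sum(axis=1, keepdims=True)
p_j = p.sum(axis=0, keepdims=True)

# Natural-log mutual information; all cells are positive here, so no masking needed.
I_nat = float(np.sum(p * np.log(p / (p_i * p_j))))

# Expected counts n_i. n_.j / N and the chi-square statistic.
E = O.sum(axis=1, keepdims=True) * O.sum(axis=0, keepdims=True) / N
chi2 = float(np.sum((O - E) ** 2 / E))

print(2 * N * I_nat, chi2)   # two close values, per (61)
```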
The last equality means that the expression $\sum_{i=1}^{K}\sum_{j=1}^{L}((O_{ij} - n_{i\cdot}n_{\cdot j}/N)^2/(n_{i\cdot}n_{\cdot j}/N))$ follows a χ² distribution. According to [5], it follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. Hence 2N × I(X, Y) approximately follows a χ² distribution with (K − 1)(L − 1) degrees of freedom. This is the well-known chi-square test for independence of two random variables. It allows using the chi-square distribution to assign a significance level corresponding to the values of mutual information and (K − 1)(L − 1).

The null and alternative hypotheses are as follows:

H0: X and Y are independent (i.e., there is no relationship between them).

H1: X and Y are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the α level of significance if the χ² statistic

$$\sum_{i=1}^{K}\sum_{j=1}^{L}\frac{(O_{ij} - n_{i\cdot}n_{\cdot j}/N)^2}{n_{i\cdot}n_{\cdot j}/N} \approx 2N \times I(X, Y) \tag{62}$$

is greater than χ²_U, the upper-tail critical value from a chi-square distribution with (K − 1)(L − 1) degrees of freedom. That is,

$$\text{Select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \tag{63}$$

Take credit scoring for example. In this case L = 2. Assume feature X has 10 different values, that is, K = 10. Using a level of significance of α = 0.05, we find χ²_U to be 16.9 from a chi-square table with (K − 1)(L − 1) = 9 degrees of freedom, and we select this feature only if I(X, Y) > 16.9/(2N).
Assume a training set has N examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance α, say 0.05.

(ii) Step 2. Find K, the number of values of feature X.

(iii) Step 3. Build the contingency table for X and Y.

(iv) Step 4. Calculate I(X, Y) from the contingency table.

(v) Step 5. Find χ²_U with (K − 1)(L − 1) degrees of freedom from a chi-square table or any other source, such as SAS.

(vi) Step 6. Select X if I(X, Y) > χ²_U/(2N) and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance α and reselect features using the procedure.
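The steps above can be sketched end to end for a single feature. The critical values χ²_U are hard-coded from a standard α = 0.05 chi-square table, and the helper name is ours:

```python
import numpy as np

# Upper-tail chi-square critical values at alpha = 0.05, by degrees of freedom
# (standard table values; e.g. df = 9 gives 16.919, the 16.9 used in the text).
CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
           6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}

def select_feature(x, y):
    """Steps 2-6: contingency table -> I(X,Y) with natural log -> compare to chi2_U/(2N)."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yj = np.unique(y, return_inverse=True)
    K, L = len(xs), len(ys)
    O = np.zeros((K, L))
    np.add.at(O, (xi, yj), 1)            # Step 3: contingency table
    N = O.sum()
    p = O / N
    p_i = p.sum(axis=1, keepdims=True)
    p_j = p.sum(axis=0, keepdims=True)
    mask = p > 0                         # 0 log 0 = 0
    I = float(np.sum(p[mask] * np.log((p / (p_i * p_j))[mask])))  # Step 4
    chi2_u = CHI2_05[(K - 1) * (L - 1)]  # Step 5
    return I > chi2_u / (2 * N)          # Step 6

# A feature that determines y is selected; an unrelated one is not.
print(select_feature(["a"] * 30 + ["b"] * 30, [0] * 30 + [1] * 30))
print(select_feature(["a", "b"] * 30, [0] * 30 + [1] * 30))
```

Step 7 is just a loop of `select_feature` over all features; in practice a statistical library's chi-square quantile function would replace the hard-coded table.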
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information I(X, Y), the more dependent X is on Y. However, Proposition 28 says that if X has all distinct values, then I(X, Y) reaches the maximum value H(Y) and I(X, Y)/H(Y) reaches the maximum value 1.

Therefore, if X has too many different values, one may bin or group these values first; mutual information is then recalculated from the binned values. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the remaining numerical features.

(iii) Step 3. Select features by mutual information.
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with p values less than some specified value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of doing feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specified value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated, and the information value is computed as the sum of the information differences over all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will give different information values.
6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning [Ph.D. thesis], Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, 1964, Holden-Day, San Francisco, Calif, USA).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
Mathematical Problems in Engineering 9
Table 1 Frequency table
1205881
1205882
sdot sdot sdot 120588119895
sdot sdot sdot 120588119871
Total1205961
11987411
11987412
sdot sdot sdot 1198741119895
sdot sdot sdot 1198741119871
1198991∙
1205962
11987421
11987422
sdot sdot sdot 1198742119895
sdot sdot sdot 1198742119871
1198992∙
sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot
120596119894
1198741198941
1198741198942
sdot sdot sdot 119874119894119895
sdot sdot sdot 119874119894119871
119899119894∙
sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot
120596119870
1198741198701
1198741198702
sdot sdot sdot 119874119870119895
sdot sdot sdot 119874119870119871
119899119870∙
Total 119899∙1
119899∙2
sdot sdot sdot 119899∙119895
sdot sdot sdot 119899∙119871
119873
Table 2 Relative frequency table
1205881
1205882
sdot sdot sdot 120588119895
sdot sdot sdot 120588119871
Total1205961
11990111
11990112
sdot sdot sdot 1199011119895
sdot sdot sdot 1199011119871
1199011∙
1205962
11990121
11990122
sdot sdot sdot 1199012119895
sdot sdot sdot 1199012119871
1199012∙
sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot
120596119894
1199011198941
1199011198942
sdot sdot sdot 119901119894119895
sdot sdot sdot 119901119894119871
119901119894∙
sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot sdot
120596119870
1199011198701
1199011198702
sdot sdot sdot 119901119870119895
sdot sdot sdot 119901119870119871
119901119870∙
Total 119901∙1
119901∙2
sdot sdot sdot 119901∙119895
sdot sdot sdot 119901∙119871
1
Similarly
119884 (Ω2F
2 119875
2) 997888rarr 119877 (55)
as 119884(120588119895) = 119910
119895 where Ω
2= 120588
1 120588
2 120588
119871 and 119875
119884(120588
119895) =
119899sdot119895119873 = 119901
sdot119895for 119895 = 1 2 119871 Also 119910
1 119910
2 119910
119870could be
any real numbers as long as they are distinct to guaranteethat 119884 is a one-to-one mapping In this case 119875
119884(119884 =
119910119895) = 119875
119884(120588
119895)
Now define a mapping 119875119883119884
fromΩ1timesΩ
2to 119877 as follows
119875119883119884
(120596119894 120588
119895) = 119901
119894119895=
119874119894119895
119873 (56)
Since119870
sum119894=1
119871
sum119895=1
119875119883119884
(120596119894 120588
119895) = 1
119871
sum119895=1
119875119883119884
(120596119894 120588
119895) =
119871
sum119895=1
119901119894119895= 119901
119894sdot= 119875
119883(120596
119894)
119870
sum119894=1
119875119883119884
(120596119894 120588
119895) =
119870
sum119894=1
119901119894119895= 119901
sdot119895= 119875
119884(120588
119895)
(57)
119901119894119895119870119894=1
is a joint probability measure by Proposition 14Finally we can calculate mutual information as follows
119868 (119883 119884) =
119870
sum119894=1
119871
sum119895=1
119875119883119884
(120596119894 120588
119895) log
119875119883119884
(120596119894 120588
119895)
119875119883(120596
119894) 119875
119884(120588
119895)
=
119870
sum119894=1
119871
sum119895=1
119901119894119895log
119901119894119895
119901119894sdot119901sdot119895
(58)
It follows fromCorollary 26 that if119883 has only one value then119868(119883 119884) = 0 On the other hand if 119883 has all distinct valuesthe following result shows that mutual information will reachthe maximum value
Proposition 28 If all the values of 119883 are distinct then119868(119883 119884) = 119867(119884)
Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations, that is, $K = N$. From Tables 1 and 2 we observe that:

(1) $O_{ij} = 0$ or $1$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;

(2) $p_{ij} = O_{ij}/N = 0$ or $1/N$ for all $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, L$;

(3) for each $j = 1, 2, \ldots, L$, since $O_{1j} + O_{2j} + \cdots + O_{Kj} = n_{\cdot j}$, there are $n_{\cdot j}$ nonzero $O_{ij}$'s, or equivalently $n_{\cdot j}$ nonzero $p_{ij}$'s;

(4) $p_{i\cdot} = 1/N$, $i = 1, 2, \ldots, K$.

Using the above observations and the fact that $0 \log 0 = 0$, we have
$$\begin{aligned}
I(X, Y) &= \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} \\
&= \sum_{i=1}^{K} p_{i1} \log \frac{p_{i1}}{p_{i\cdot}\, p_{\cdot 1}} + \sum_{i=1}^{K} p_{i2} \log \frac{p_{i2}}{p_{i\cdot}\, p_{\cdot 2}} + \cdots + \sum_{i=1}^{K} p_{iL} \log \frac{p_{iL}}{p_{i\cdot}\, p_{\cdot L}}
\end{aligned}$$
10 Mathematical Problems in Engineering
$$\begin{aligned}
&= \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 1}/N} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot 2}/N} + \cdots + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1/N}{p_{\cdot L}/N} \\
&= \sum_{p_{i1} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot 1}} + \sum_{p_{i2} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \sum_{p_{iL} \neq 0} \frac{1}{N} \log \frac{1}{p_{\cdot L}} \\
&= \frac{n_{\cdot 1}}{N} \log \frac{1}{p_{\cdot 1}} + \frac{n_{\cdot 2}}{N} \log \frac{1}{p_{\cdot 2}} + \cdots + \frac{n_{\cdot L}}{N} \log \frac{1}{p_{\cdot L}} \\
&= p_{\cdot 1} \log \frac{1}{p_{\cdot 1}} + p_{\cdot 2} \log \frac{1}{p_{\cdot 2}} + \cdots + p_{\cdot L} \log \frac{1}{p_{\cdot L}} \\
&= H(Y). \quad (59)
\end{aligned}$$
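As a quick numerical check of Proposition 28 (with a small hypothetical table of my own), when every value of $X$ occurs exactly once, the computed mutual information coincides with $H(Y)$:

```python
import math

# Each of the N = 5 rows holds a single observation: K = N = 5, L = 2
O = [[1, 0], [1, 0], [0, 1], [1, 0], [0, 1]]
N = 5
p_col = [3 / 5, 2 / 5]                          # p_.j from column sums
H_Y = -sum(p * math.log(p) for p in p_col)      # entropy H(Y)

I = 0.0
for row in O:
    for j, Oij in enumerate(row):
        if Oij:
            p = Oij / N                         # p_ij = 1/N, and p_i. = 1/N
            I += p * math.log(p / ((1 / N) * p_col[j]))

print(abs(I - H_Y) < 1e-12)   # True
```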
5.2. Applications of Newly Defined Mutual Information in Credit Scoring. Credit scoring describes the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable (good and bad) and may be represented by 0 and 1.
To apply mutual information to credit scoring, we first calculate mutual information for every pair $(X, Y)$ and then do feature selection based on the values of mutual information. We propose three ways.
5.2.1. Absolute Values Method. From Property 4, we see that mutual information $I(X, Y)$ is nonnegative and upper bounded by $\log(L)$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with $Y$ is larger than some threshold, chosen based on needs.
5.2.2. Relative Values. From Property 4, we have $0 \le I(X, Y)/H(Y) \le 1$. Note that $I(X, Y)/H(Y)$ is relative mutual information, which measures how much information $X$ captures from $Y$. Thus, we may select those features whose relative mutual information $I(X, Y)/H(Y)$ is larger than some threshold between 0 and 1, chosen based on needs.
5.2.3. Chi-Square Test for Independence. For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function. It can be proved by Taylor expansion, as in Kullback's book [5].
Lemma 29. Let $p$ and $q$ be two positive numbers less than or equal to 1. Then
$$p \ln \frac{p}{q} \approx (p - q) + \frac{(p - q)^2}{2q}. \quad (60)$$
The equality holds if and only if $p = q$. Moreover, the closer $p$ is to $q$, the better the approximation.
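Lemma 29 is easy to check numerically. A short sketch (helper names are mine) confirms the exact equality at $p = q$ and shows the error growing as $p$ moves away from $q$:

```python
import math

def taylor_approx(p, q):
    # Right-hand side of (60): (p - q) + (p - q)^2 / (2q)
    return (p - q) + (p - q) ** 2 / (2 * q)

def exact(p, q):
    # Left-hand side of (60): p ln(p/q)
    return p * math.log(p / q)

print(taylor_approx(0.5, 0.5))                             # 0.0 (equality at p = q)
print(abs(exact(0.55, 0.5) - taylor_approx(0.55, 0.5)))    # small error near q
print(abs(exact(0.9, 0.5) - taylor_approx(0.9, 0.5)))      # larger error far from q
```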
Now let us denote $N \times I(X, Y)$ by $\tilde{I}(X, Y)$. Then, applying Lemma 29, we obtain
$$\begin{aligned}
2\tilde{I}(X, Y) &= 2N \sum_{i=1}^{K}\sum_{j=1}^{L} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} = 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}/N}{(n_{i\cdot}/N)(n_{\cdot j}/N)} \\
&= 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} \ln \frac{O_{ij}}{n_{i\cdot} n_{\cdot j}/N} \\
&\approx 2 \sum_{i=1}^{K}\sum_{j=1}^{L} \left(O_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{N}\right) + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2 \sum_{i=1}^{K}\sum_{j=1}^{L} O_{ij} - \frac{2}{N} \sum_{i} n_{i\cdot} \sum_{j} n_{\cdot j} + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= 2N - \frac{2N \cdot N}{N} + \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \\
&= \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} = \chi^2. \quad (61)
\end{aligned}$$
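The approximation $2N \cdot I(X, Y) \approx \chi^2$ can be checked on a small hypothetical table (both sides computed directly, with natural logarithms):

```python
import math

# Hypothetical 2x2 contingency table O_ij with mild dependence
O = [[30, 20], [25, 25]]
N = 100
row = [sum(r) for r in O]                      # n_i.
col = [O[0][j] + O[1][j] for j in range(2)]    # n_.j

# Left side of (61): 2N * I(X,Y) = 2 * sum O_ij ln(O_ij N / (n_i. n_.j))
mi2N = 2 * sum(Oij * math.log(Oij * N / (row[i] * col[j]))
               for i, r in enumerate(O) for j, Oij in enumerate(r) if Oij)

# Right side of (61): the chi-square statistic
chi2 = sum((Oij - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
           for i, r in enumerate(O) for j, Oij in enumerate(r))

print(mi2N, chi2)   # close but not identical, as the Taylor step predicts
```

For this table both values come out near 1.01, with a discrepancy well under one percent.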
The last equality defines $\chi^2$ as the expression $\sum_{i=1}^{K}\sum_{j=1}^{L} (O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2 / (n_{i\cdot} n_{\cdot j}/N)$. According to [5], this statistic follows a $\chi^2$ distribution with $(K-1)(L-1)$ degrees of freedom. Hence, $2N \times I(X, Y)$ approximately follows a $\chi^2$ distribution with $(K-1)(L-1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables. It allows using the Chi-square distribution to assign a significance level corresponding to the values of mutual information and $(K-1)(L-1)$.
The null and alternative hypotheses are as follows:

$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them).

$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).
The decision rule is to reject the null hypothesis at the $\alpha$ level of significance if the $\chi^2$ statistic
$$\sum_{i=1}^{K}\sum_{j=1}^{L} \frac{(O_{ij} - n_{i\cdot} n_{\cdot j}/N)^2}{n_{i\cdot} n_{\cdot j}/N} \approx 2N \times I(X, Y) \quad (62)$$
is greater than $\chi^2_U$, the upper-tail critical value from a Chi-square distribution with $(K-1)(L-1)$ degrees of freedom. That is:
$$\text{Select feature } X \text{ if } I(X, Y) > \frac{\chi^2_U}{2N}. \quad (63)$$
Take credit scoring for example. In this case $L = 2$. Assume feature $X$ has 10 different values, that is, $K = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_U$ to be 16.9 from a Chi-square table with $(K-1)(L-1) = 9$ degrees of freedom, and select this feature only if $I(X, Y) > 16.9/(2N)$.
Assume a training set has $N$ examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.

(ii) Step 2. Find $K$, the number of values of feature $X$.

(iii) Step 3. Build the contingency table for $X$ and $Y$.

(iv) Step 4. Calculate $I(X, Y)$ from the contingency table.

(v) Step 5. Find $\chi^2_U$ with $(K-1)(L-1)$ degrees of freedom from a Chi-square table or any other source, such as SAS.

(vi) Step 6. Select $X$ if $I(X, Y) > \chi^2_U/(2N)$ and discard it otherwise.

(vii) Step 7. Repeat Steps 2–6 for all features.
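Steps 2–6 for a single feature can be sketched as follows. The helper name and the hard-coded $\alpha = 0.05$ critical values (copied from a standard Chi-square table) are my own packaging of rule (63), not code from the paper:

```python
import math

# Upper-tail chi-square critical values at alpha = 0.05, keyed by
# degrees of freedom (from a standard chi-square table)
CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
           6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919}

def select_feature(O):
    """Given the K-by-L contingency table of feature X against Y,
    return (I, threshold, selected) per rule (63)."""
    K, L = len(O), len(O[0])
    N = sum(map(sum, O))
    row = [sum(r) for r in O]                              # n_i.
    col = [sum(O[i][j] for i in range(K)) for j in range(L)]  # n_.j
    I = sum(Oij / N * math.log(Oij * N / (row[i] * col[j]))
            for i, r in enumerate(O) for j, Oij in enumerate(r) if Oij)
    threshold = CHI2_05[(K - 1) * (L - 1)] / (2 * N)       # chi2_U / (2N)
    return I, threshold, I > threshold

I, t, keep = select_feature([[40, 10], [20, 30]])   # K = L = 2, N = 100
print(keep)   # strongly dependent table, so the feature is selected
```

An independent table such as `[[25, 25], [25, 25]]` gives $I = 0$ and is discarded.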
If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure.
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2, we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X, Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X, Y)$ will reach the maximum value $H(Y)$, and $I(X, Y)/H(Y)$ will reach the maximum value 1.
Therefore, if $X$ has too many different values, one may bin or group these values first and calculate mutual information again from the binned values. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Bin the remaining numerical features.

(iii) Step 3. Select features by mutual information again.
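Step 2 above can use any binning scheme. A minimal equal-frequency sketch (the helper name and bin rule are illustrative, not from the paper) maps a many-valued numerical feature to a few rank-based bins, so a feature with all distinct values can no longer trivially reach $I(X, Y) = H(Y)$:

```python
def equal_frequency_bins(values, n_bins):
    """Map raw numerical values to n_bins labels by rank, so each bin
    receives roughly the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    binned = [0] * len(values)
    for rank, i in enumerate(order):
        binned[i] = rank * n_bins // len(values)
    return binned

x = [3.2, 7.5, 1.1, 9.9, 4.4, 6.0, 2.8, 8.1]
print(equal_frequency_bins(x, 2))   # [0, 1, 0, 1, 0, 1, 0, 1]
```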
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with $p$ values less than some specified level. However, this method does not carry over to nonlinear models in machine learning.
Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose values are larger than some specified level. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.
The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into bins. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is the sum of the information differences over all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size: different binning algorithms and/or different bin sizes will yield different information values.
6. Conclusions

In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References

[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.

[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[3] A. Navot, On the Role of Feature Selection in Machine Learning, Ph.D. thesis, Hebrew University, 2006.

[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.

[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science USSR, 1960 (English translation by A. Feinstein, Holden-Day, San Francisco, USA, 1964).

[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.

[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.

[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.

[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.

[12] R. B. Ash and C. A. Doléans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.

[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134–143, 2014.

[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.

[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.

[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.

[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.

[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229–3242, 2014.

[20] K. Kennedy, Credit Scoring Using Machine Learning, Ph.D. thesis, School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.

[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.

[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
10 Mathematical Problems in Engineering
= sum1199011198941 =0
1
119873log 1119873
119901sdot1119873
+ sum1199011198942 =0
1
119873log 1119873
119901sdot2119873
+ sdot sdot sdot + sum119901119894119871 =0
1
119873log 1119873
119901sdot119871119873
= sum1199011198941 =0
1
119873log 1
119901sdot1
+ sum1199011198942 =0
1
119873log 1
119901sdot2
+ sdot sdot sdot + sum119901119894119871 =0
1
119873log 1
119901sdot119871
=119899sdot1
119873log 1
119901sdot1
+119899sdot2
119873log 1
119901sdot2
+ sdot sdot sdot +119899sdot119871
119873log 1
119901sdot119871
= 119901sdot1log 1
119901sdot1
+ 119901sdot2log 1
119901sdot2
+ sdot sdot sdot + 119901sdot119871log 1
119901sdot119871
= 119867 (119884)
(59)
52 Applications of Newly Defined Mutual Information inCredit Scoring Credit scoring is used to describe the processof evaluating the risk a customer poses of defaulting ona financial obligation [15ndash19] The objective is to assigncustomers to one of two groups good and bad Machinelearning has been successfully used to build models for creditscoring [20] In credit scoring119884 is a binary variable good andbad and may be represented by 0 and 1
To apply mutual information to credit scoring we firstcalculatemutual information for every pair of (119883 119884) and thendo feature selection based on values of mutual informationWe propose three ways
521 Absolute Values Method From Property 4 we seethat mutual information 119868(119883 119884) is nonnegative and upperbounded by log(119871) and that 119868(119883 119884) = 0 if and only if 119883 and119884 are independent In this sense high mutual informationindicates a large reduction in uncertainty while low mutualinformation indicates a small reduction In particular zeromutual information means the two random variables areindependent Hence we may select those features whosemutual information with 119884 is larger than some thresholdbased on needs
522 Relative Values From Property 4 we have 0 le
119868(119883 119884)119867(119884) le 1 Note that 119868(119883 119884)119867(119884) is relativemutual information which measures howmuch information119883 catches from 119884 Thus we may select those features whoserelativemutual information 119868(119883 119884)119867(119884) is larger than somethreshold between 0 and 1 based on needs
523 Chi-Square Test for Independency For convenience wewill use the natural logarithm in mutual information Wefirst state an approximation formula for the natural logarithmfunction It can be proved by the Taylor expansion like inKullbackrsquos book [5]
Lemma 29 Let 119901 and 119902 be two positive numbers less than orequal to 1 Then
119901 ln119901
119902asymp (119901 minus 119902) +
(119901 minus 119902)2
2119902 (60)
The equality holds if and only if 119901 = 119902 Moreover the close 119901 isto 119902 the better the approximation is
Now let us denote119873times119868(119883 119884) by 119868(119883 119884) Then applyingLemma 29 we obtain
2119868 (119883 119884) = 2119873
119870
sum119894=1
119871
sum119895=1
119901119894119895ln
119901119894119895
119901119894sdot119901sdot119895
= 2
119870
sum119894=1
119871
sum119895=1
119874119894119895ln
119874119894119895119873
(119899119894sdot119873) (119899
sdot119895119873)
= 2
119870
sum119894=1
119871
sum119895=1
119874119894119895ln
119874119894119895
119899119894sdot119899sdot119895119873
asymp 2
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus
119899119894sdot119899sdot119895
119873)
+
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus 119899
119894sdot119899sdot119895119873)
2
119899119894sdot119899sdot119895119873
= 2
119870
sum119894=1
119871
sum119895=1
119874119894119895minus 2
sum119894119899119894sdot
119873
sum119895119899sdot119895
119873
+
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus 119899
119894sdot119899sdot119895119873)
2
119899119894sdot119899sdot119895119873
= 2119873 minus 2119873
119873+
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus 119899
119894sdot119899sdot119895119873)
2
119899119894sdot119899sdot119895119873
=
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus 119899
119894sdot119899sdot119895119873)
2
119899119894sdot119899sdot119895119873
= 1205942
(61)
The last equation means the previous expressionsum119870
119894=1sum119871
119895=1((119874
119894119895minus 119899
119894sdot119899sdot119895119873)
2(119899
119894sdot119899sdot119895119873)) follows 1205942 distribu-
tion According to [5] it follows 1205942 distribution with adegree of freedom of (119870 minus 1)(119871 minus 1) Hence 2119873 times 119868(119883 119884)
approximately follows 1205942 distribution with a degree offreedom of (119870minus 1)(119871 minus 1) This is the well-known Chi-squaretest for independence of two random variables This allowsusing the Chi-square distribution to assign a significantlevel corresponding to the values of mutual information and(119870 minus 1)(119871 minus 1)
The null and alternative hypotheses are as follows1198670 119883 and 119884 are independent (ie there is no
relationship between them)1198671119883 and119884 are dependent (ie there is a relationship
between them)
Mathematical Problems in Engineering 11
The decision rule is to reject the null hypothesis at the 120572 levelof significance if the 1205942 statistic
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus 119899
119894sdot119899sdot119895119873)
2
119899119894sdot119899sdot119895119873
asymp 2119873 times 119868 (119883 119884) (62)
is greater than 1205942119880 the upper-tail critical value from a Chi-
square distribution with (119870 minus 1)(119871 minus 1) degrees of freedomThat is
Select feature 119883 if 119868 (119883 119884) gt1205942119880
2119873 (63)
Take credit scoring for example In this case 119871 = 2 Assumefeature119883 has 10 different values that is119870 = 10 Using a levelof significance of 120572 = 005 we find 1205942
119880to be 169 from a Chi-
square table with (119870 minus 1)(119871 minus 1) = 9 and select this featureonly if 119868(119883 119884) gt 1692119873
Assume a training set has119873 examples We can do featureselection by the following procedure
(i) Step 1 Choose a level of significance of 120572 say 005
(ii) Step 2 Find 119870 the number of values of feature119883
(iii) Step 3 Build the contingency table for119883 and 119884
(iv) Step 4 Calculate 119868(119883 119884) from the contingency table
(v) Step 5 Find 1205942119880with (119870minus1)(119871minus1) degrees of freedom
from a Chi-square table or any other sources such asSAS
(vi) Step 6 Select 119883 if 119868(119883 119884) gt 1692119873 and discard itotherwise
(vii) Step 7 Repeat Steps 2ndash6 for all features
If the number of features selected from the above procedure issmaller or larger thanwhat youwant youmay adjust the levelof significant 120572 and reselect features using the procedure
53 Adjustment of Mutual Information in Feature SelectionIn Section 52 we have proposed 3 ways to select featurebased on mutual information It seems that the larger themutual information 119868(119883 119884) the more dependent 119883 on 119884However Proposition 28 says that if 119883 has all distinctvalues then 119868(119883 119884)will reach the maximum value119867(119884) and119868(119883 119884)119867(119884) will reach the maximum value 1
Therefore if 119883 has too many different values one maybin or group these values first Based on the binned valuesmutual information is calculated again For numerical vari-ables we may adopt a three-step process
(i) Step 1 select features by removing those with smallmutual information
(ii) Step 2 do binning for the rest of numerical features
(iii) Step 3 select features by mutual information
54 Comparison with Existing Feature Selection MethodsThere are many other feature selection methods in machinelearning and credit scoring An easy way is to build a logisticmodel for each feature with respect to the dependent variableand then select features with 119901 values less than some specificvalues However thismethod does not apply to any nonlinearmodels in machine learning
Another easy way of feature selection is to calculate thecovariance of each feature with respect to the dependentvariable and then select features whose values are larger thansome specific value Yetmutual information is better than thecovariance method [21] in that mutual information measuresthe general dependence of random variables without makingany assumptions about the nature of their underlying rela-tionships
The most popular feature selection in credit scoring isdone by information value [15ndash19] To calculate informationbetween an independent variable and the dependent variablea binning algorithm is used to group similar attributes intoa bin Difference between the information of good accountsand that of bad accounts in each bin is then calculatedFinally information value is calculated as the sum of infor-mation differences of all bins Features with informationvalue larger than 002 are believed to have strong predictivepower However mutual information is a bettermeasure thaninformation value Information value focuses only on thelinear relationships of variables whereas mutual informationcan potentially offer some advantages information value fornonlinear models such as gradient boosting model [22]Moreover information value depends on binning algorithmsand the bin size Different binning algorithms andor differ-ent bin sizes will have different information value
6 Conclusions
In this paper we have presented a unified definition formutual information using random variables with differentprobability spaces Our idea is to define the joint distri-bution of two random variables by taking the marginalprobabilities into consideration With our new definition ofmutual information different joint distributions will resultin different values of mutual information After establishingsome properties of the new defined mutual informationwe proposed a method to calculate mutual information inmachine learning Finally we applied our newly definedmutual information to credit scoring
Conflict of Interests
The author declares that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The author has benefited from a brief discussion with DrZhigang Zhou and Dr Fuping Huang of Elevate about prob-ability theory
12 Mathematical Problems in Engineering
References
[1] G D Tourassi E D Frederick M K Markey and C E FloydJr ldquoApplication of the mutual information criterion for featureselection in computer-aided diagnosisrdquoMedical Physics vol 28no 12 pp 2394ndash2402 2001
[2] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Research vol 3pp 1157ndash1182 2003
[3] A Navot On the role of feature selection in machine learning[PhD thesis] Hebrew University 2006
[4] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 no 3 pp 379ndash423 1948
[5] S Kullback Information Theory and Statistics John Wiley ampSons New York NY USA 1959
[6] M S Pinsker Information and Information Stability of RandomVariables and Processes Academy of Science USSR 1960(English Translation by A Feinstein in 1964 and published byHolden-Day San Francisco USA)
[7] R B Ash Information Theory Interscience Publishers NewYork NY USA 1965
[8] T M Cover and J A Thomas Elements of Information TheoryJohn Wiley amp Sons New York NY USA 2nd edition 2006
[9] R M Fano Transmission of Information MIT Press Cam-bridge Mass USA John Wiley amp Sons New York NY USA1961
[10] N Abramson Information Theory and Coding McGraw-HillNew York NY USA 1963
[11] RGGallager InformationTheory andReliable CommunicationJohn Wiley amp Sons New York NY USA 1968
[12] R B Ash and C A Doleans-Dade Probability amp MeasureTheory Academic Press San Diego Calif USA 2nd edition2000
[13] I Braga ldquoA constructive density-ratio approach tomutual infor-mation estimation experiments in feature selectionrdquo Journal ofInformation and Data Management vol 5 no 1 pp 134ndash1432014
[14] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003
[15] M Refaat Credit Risk Scorecards Development and Implemen-tation Using SAS Lulucom New York NY USA 2011
[16] N Siddiqi Credit Risk Scorecards Developing and ImplementingIntelligent Credit Scoring John Wiley amp Sons New York NYUSA 2006
[17] G Zeng ldquoMetric divergence measures and information valuein credit scoringrdquo Journal of Mathematics vol 2013 Article ID848271 10 pages 2013
[18] G Zeng ldquoA rule of thumb for reject inference in credit scoringrdquoMathematical Finance Letters vol 2014 article 2 2014
[19] G Zeng ldquoA necessary condition for a good binning algorithmin credit scoringrdquo Applied Mathematical Sciences vol 8 no 65pp 3229ndash3242 2014
[20] K KennedyCredit scoring usingmachine learning [PhD thesis]School of Computing Dublin Institute of Technology DublinIreland 2013
[21] R J McEliece The Theory of Information and Coding Cam-bridge University Press Cambridge UK Student edition 2004
[22] J H Friedman ldquoGreedy function approximation a gradientboosting machinerdquo The Annals of Statistics vol 29 no 5 pp1189ndash1232 2001
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 11
The decision rule is to reject the null hypothesis at the 120572 levelof significance if the 1205942 statistic
119870
sum119894=1
119871
sum119895=1
(119874119894119895minus 119899
119894sdot119899sdot119895119873)
2
119899119894sdot119899sdot119895119873
asymp 2119873 times 119868 (119883 119884) (62)
is greater than 1205942119880 the upper-tail critical value from a Chi-
square distribution with (119870 minus 1)(119871 minus 1) degrees of freedomThat is
Select feature 119883 if 119868 (119883 119884) gt1205942119880
2119873 (63)
Take credit scoring for example In this case 119871 = 2 Assumefeature119883 has 10 different values that is119870 = 10 Using a levelof significance of 120572 = 005 we find 1205942
119880to be 169 from a Chi-
square table with (119870 minus 1)(119871 minus 1) = 9 and select this featureonly if 119868(119883 119884) gt 1692119873
Assume a training set has $N$ examples. We can do feature selection by the following procedure:

(i) Step 1. Choose a level of significance $\alpha$, say 0.05.

(ii) Step 2. Find $K$, the number of values of feature $X$.

(iii) Step 3. Build the contingency table for $X$ and $Y$.

(iv) Step 4. Calculate $I(X,Y)$ from the contingency table.

(v) Step 5. Find $\chi^{2}_{U}$ with $(K-1)(L-1)$ degrees of freedom from a Chi-square table or any other source such as SAS.

(vi) Step 6. Select $X$ if $I(X,Y)>\chi^{2}_{U}/(2N)$ and discard it otherwise.

(vii) Step 7. Repeat Steps 2-6 for all features.
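The seven steps above can be sketched as follows. This is a minimal illustration, not the author's code: it substitutes SciPy's `chi2.ppf` for a Chi-square table in Step 5, uses natural logarithms for $I(X,Y)$, and the feature names in the usage are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def select_features(features, y, alpha=0.05):
    """features: dict mapping feature name -> array of values; y: labels.
    Keep feature X if I(X, Y) > chi2_U / (2N), per (63)."""
    y = np.asarray(y)
    N = len(y)
    ys = np.unique(y)
    selected = []
    for name, x in features.items():
        x = np.asarray(x)
        xs = np.unique(x)
        K = len(xs)                                   # Step 2
        # Step 3: contingency table for X and Y
        O = np.array([[np.sum((x == a) & (y == b)) for b in ys]
                      for a in xs], dtype=float)
        # Step 4: I(X, Y) from the contingency table (natural log)
        P = O / N
        px, py = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
        m = P > 0
        I = float((P[m] * np.log(P[m] / (px @ py)[m])).sum())
        # Step 5: upper-tail critical value, (K-1)(L-1) degrees of freedom
        chi2_U = chi2.ppf(1 - alpha, (K - 1) * (len(ys) - 1))
        # Step 6: select or discard
        if I > chi2_U / (2 * N):
            selected.append(name)
    return selected
```

For instance, with $\alpha=0.05$ and $K=10$, $L=2$, `chi2.ppf(0.95, 9)` returns about 16.92, matching the 16.9 read from a Chi-square table in the credit scoring example above.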
If the number of features selected by the above procedure is smaller or larger than what you want, you may adjust the level of significance $\alpha$ and reselect features using the procedure.
5.3. Adjustment of Mutual Information in Feature Selection. In Section 5.2 we proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X,Y)$, the more dependent $X$ is on $Y$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X,Y)$ will reach the maximum value $H(Y)$, and $I(X,Y)/H(Y)$ will reach the maximum value 1.
Therefore, if $X$ has too many different values, one may bin or group these values first. Based on the binned values, mutual information is calculated again. For numerical variables, we may adopt a three-step process:

(i) Step 1. Select features by removing those with small mutual information.

(ii) Step 2. Do binning for the remaining numerical features.

(iii) Step 3. Select features by mutual information.
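Step 2 can be sketched with equal-frequency (quantile) binning, one possible choice of binning algorithm; the function name and the default bin count below are illustrative, not prescribed by the paper:

```python
import numpy as np

def bin_feature(x, n_bins=10):
    """Equal-frequency binning for a numerical feature, so that K stays
    small and I(X, Y) is not inflated by many distinct values."""
    x = np.asarray(x, dtype=float)
    # interior quantile cut points at 1/n_bins, 2/n_bins, ...
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    # bin index 0 .. n_bins-1 for each value
    return np.searchsorted(edges, x, side="right")
```

Mutual information is then recomputed on the binned values in Step 3, exactly as for any other categorical feature.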
5.4. Comparison with Existing Feature Selection Methods. There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select the features with $p$ values less than some specific value. However, this method does not apply to nonlinear models in machine learning.
Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select the features whose values are larger than some specific value. Yet mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.
The most popular feature selection in credit scoring is done by information value [15-19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into a bin. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences of all bins. Features with an information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value captures only the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on the binning algorithm and the bin size; different binning algorithms and/or different bin sizes will yield different information values.
6. Conclusions
In this paper, we have presented a unified definition of mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.
References
[1] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394-2402, 2001.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[3] A. Navot, On the Role of Feature Selection in Machine Learning [Ph.D. thesis], Hebrew University, 2006.
[4] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948.
[5] S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
[6] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Academy of Science, USSR, 1960 (English translation by A. Feinstein in 1964, published by Holden-Day, San Francisco, USA).
[7] R. B. Ash, Information Theory, Interscience Publishers, New York, NY, USA, 1965.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2006.
[9] R. M. Fano, Transmission of Information, MIT Press, Cambridge, Mass, USA; John Wiley & Sons, New York, NY, USA, 1961.
[10] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY, USA, 1963.
[11] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, NY, USA, 1968.
[12] R. B. Ash and C. A. Doleans-Dade, Probability & Measure Theory, Academic Press, San Diego, Calif, USA, 2nd edition, 2000.
[13] I. Braga, "A constructive density-ratio approach to mutual information estimation: experiments in feature selection," Journal of Information and Data Management, vol. 5, no. 1, pp. 134-143, 2014.
[14] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.
[15] M. Refaat, Credit Risk Scorecards: Development and Implementation Using SAS, Lulu.com, New York, NY, USA, 2011.
[16] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, NY, USA, 2006.
[17] G. Zeng, "Metric divergence measures and information value in credit scoring," Journal of Mathematics, vol. 2013, Article ID 848271, 10 pages, 2013.
[18] G. Zeng, "A rule of thumb for reject inference in credit scoring," Mathematical Finance Letters, vol. 2014, article 2, 2014.
[19] G. Zeng, "A necessary condition for a good binning algorithm in credit scoring," Applied Mathematical Sciences, vol. 8, no. 65, pp. 3229-3242, 2014.
[20] K. Kennedy, Credit Scoring Using Machine Learning [Ph.D. thesis], School of Computing, Dublin Institute of Technology, Dublin, Ireland, 2013.
[21] R. J. McEliece, The Theory of Information and Coding, Cambridge University Press, Cambridge, UK, student edition, 2004.
[22] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.