Fuzzy Optim Decis Making
DOI 10.1007/s10700-014-9178-0

Estimation of mutual information by the fuzzy histogram
Maryam Amir Haeri · Mohammad Mehdi Ebadzadeh
© Springer Science+Business Media New York 2014
Abstract Mutual information (MI) is an important dependency measure between random variables, due to its tight connection with information theory. It has numerous applications, both in theory and practice. However, when employed in practice, it is often necessary to estimate the MI from available data. There are several methods to approximate the MI, but arguably one of the simplest and most widespread techniques is the histogram-based approach. This paper suggests the use of fuzzy partitioning for histogram-based MI estimation. It uses a general form of fuzzy membership functions, which includes the class of crisp membership functions as a special case. It is accordingly shown that the average absolute error of the fuzzy-histogram method is less than that of the naïve histogram method. Moreover, the accuracy of our technique is comparable, and in some cases superior, to the accuracy of the kernel density estimation (KDE) method, which is one of the best MI estimation methods. Furthermore, the computational cost of our technique is significantly less than that of the KDE. The new estimation method is investigated from different aspects, such as average error, bias and variance. Moreover, we explore the usefulness of the fuzzy-histogram MI estimator in a real-world bioinformatics application. Our experiments show that, in contrast to the naïve histogram MI estimator, the fuzzy-histogram MI estimator is able to reveal all dependencies in the gene-expression data.
Keywords Mutual information · Information theory · Estimation · Fuzzy mutual information · Fuzzy histogram
M. Amir Haeri · M. M. Ebadzadeh (B)
Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran
e-mail: [email protected]

M. Amir Haeri
e-mail: [email protected]
1 Introduction
Finding dependencies between random variables is an important task in many problems (Karasuyama and Sugiyama 2012; Ang et al. 2012; Steuer et al. 2002; Tenekedjiev and Nikolova 2008), such as independent component analysis and feature selection. There are several measures that quantify the linear dependency between random variables, such as the Pearson correlation coefficient and the Spearman correlation coefficient. Such measures are not sufficient to quantify the general dependency between two random variables. On the other hand, mutual information (MI) provides a general dependency measure between two random variables. MI can measure any kind of relationship between random variables.

The MI of two random variables depends on their distributions. However, most of the time, it is required to find the MI of two variables whose distributions are unknown, and only some samples from them are available. To estimate the MI, one has to estimate the entropies or probability density functions (pdf's) from the data samples.

There are several methods to estimate the MI from finite samples. The most popular method for MI estimation is the histogram-based method (Moddemeijer 1989), which partitions the space into several bins, and counts the number of elements in each bin. This method is simple and efficient from the computational point of view. However, the approximation given by the counting process is discontinuous, and the estimation is very sensitive to the number of bins (Loquin and Strauss 2006; Schaffernicht et al. 2010).

Moon et al. (1995) presented another MI estimation approach called kernel density estimation (KDE). KDE utilizes kernels to approximate pdf's: probability density functions are estimated by the superposition of a set of kernel functions. In general, KDE provides a high-quality estimation of the MI. However, it is very time-consuming and computationally intensive (Steuer et al. 2002; Kraskov et al. 2004; Loquin and Strauss 2006).

Kraskov et al. (2004) suggested the k-nearest neighbors (KNN) method to estimate the MI. This method is based on estimating entropies from KNN distances.

Another method of estimating the MI is adaptive partitioning, which was introduced by Darbellay and Vajda (1999). This method is based on the histogram approach, but it is not parametric. In their approach, the partition is refined until conditional independence is achieved in the bins.

Wang et al. (2005) suggested a nonlinear correlation measure called the nonlinear correlation coefficient (NCC). Their measure is based on the MI carried by the rank sequences of the original data. Unlike the mutual information, the NCC takes values from the closed interval [0, 1].

The accumulation process in the histogram-based method depends on the answer to the question "does a sample x belong to the bin a_i or not?" However, because of the vagueness in the boundaries of histogram bins, it is not possible to answer this question exactly. A reasonable solution to overcome this problem is using fuzziness in the partitioning (Loquin and Strauss 2006; Crouzet and Strauss 2011).
Loquin and Strauss (2006, 2008) suggested a histogram density estimator based on a fuzzy partition. They proved the consistency of this estimator based on the mean square error (MSE) (Loquin and Strauss 2008). Moreover, they showed that
the main advantage of this estimator is the enhancement of the robustness of the histogram density estimator with respect to arbitrary partitioning. Since the MI of two variables is a function of the density of the variables, using a histogram estimator based on fuzzy partitioning (a fuzzy histogram) can improve the histogram MI estimator.
This paper introduces the fuzzy-histogram mutual-information estimator. The fuzzy-histogram MI estimator uses fuzzy partitioning. We consider a general form of fuzzy membership functions whose shapes are controlled by a parameter. By increasing this parameter, the membership functions tend from fuzzy towards crisp. Using these general membership functions, it is demonstrated that the histogram method with fuzzy partitioning outperforms the naïve histogram method (based on the average absolute error).
The rest of this paper is organized as follows: Section 2 is dedicated to the histogram MI estimator. In Sect. 3, the fuzzy-histogram method for estimating the MI is introduced. Section 4 investigates different aspects of the fuzzy-histogram MI estimator. Section 5 compares the fuzzy-histogram MI estimator, the histogram MI estimator, and the KDE in a bioinformatics application. Finally, Sect. 6 concludes the paper.
2 Preliminaries
In this section, the histogram MI estimation method is explained briefly. Moreover, the bias and variance of this method are studied.
2.1 The histogram MI estimator
The MI of two continuous random variables X and Y is defined as

I(X, Y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy.

Here, p(x, y) is the joint pdf of X and Y, and p(x) and p(y) are the marginal pdf's of X and Y, respectively.

Suppose that we have N simultaneous samples of X and Y. To estimate the MI by the histogram method, the variable X is partitioned into M_X bins, and the variable Y is partitioned into M_Y bins. We call the i-th bin of X a_i (where 1 ≤ i ≤ M_X), and the j-th bin of Y b_j (where 1 ≤ j ≤ M_Y). Furthermore, p(a_i) is defined as the probability of observing a sample of X in the bin a_i. The probability p(a_i) is estimated by the relative frequency of samples of X observed in the bin a_i, and is equal to k_i / N. Here, k_i is the number of samples of X observed in the bin a_i. Moreover, p(a_i, b_j) is defined as the probability of observing a sample of (X, Y) in the bin (a_i, b_j) (i.e., x lies in the bin a_i and y in the bin b_j), and is approximated by k_ij / N. Here, k_ij is the number of samples observed in the bin (a_i, b_j). Hence, the MI of X and Y is estimated as follows:

Î(X, Y) = Σ_{i,j} (k_ij / N) log ( k_ij · N / (k_i · k_j) ).  (1)
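As a concrete illustration, Eq. 1 can be sketched with NumPy as follows. This is a minimal sketch, not the paper's code: the function name and the use of equal-width bins from `numpy.histogram2d` are our own choices.

```python
import numpy as np

def histogram_mi(x, y, bins=10):
    """Naive histogram MI estimate (Eq. 1):
    sum over bins of (k_ij / N) * log(k_ij * N / (k_i * k_j))."""
    k_ij, _, _ = np.histogram2d(x, y, bins=bins)   # joint counts k_ij
    n = len(x)
    k_i = k_ij.sum(axis=1, keepdims=True)          # marginal counts of X
    k_j = k_ij.sum(axis=0, keepdims=True)          # marginal counts of Y
    outer = k_i * k_j                              # broadcasts to k_i * k_j per bin
    nz = k_ij > 0                                  # empty bins contribute zero
    return float(np.sum((k_ij[nz] / n) * np.log(k_ij[nz] * n / outer[nz])))
```

For strongly correlated Gaussian samples, the estimate approaches the exact value (1/2) log(1/(1 − ρ²)) as N grows, up to the bias terms discussed in Sect. 2.2.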
2.2 Bias of the histogram MI estimator
Based on Moddemeijer (1989), the histogram-based estimator is a biased estimator. The total bias is the sum of the N-bias and the R-bias. Here, these two types of bias are explained briefly.

– N-bias: This bias is caused by the finite sample size, and depends on the sample size N. When the MI I(X, Y) is estimated from a finite sample of size N by the histogram estimator, the N-bias is as follows (Moddemeijer 1989):

ΔI(X, Y)_{N-bias} = (M_X · M_Y − M_X − M_Y + 1) / (2N),  (2)

where M_X and M_Y are the numbers of histogram bins. Note that the N-bias does not depend on the probability distribution of the variables, and only depends on the sample size and the number of bins. According to Eq. 2, when N tends to infinity, the N-bias tends to zero.
– R-bias: Insufficient representation of the probability density function (pdf) by the histogram method leads to the R-bias. The R-bias is specific to the estimation method and the pdf's of the variables, and it is caused by two separate sources: (1) the limited integration area, and (2) the finite resolution. Moddemeijer (1989) showed that the bias caused by the limited integration area is negligible in comparison with the bias caused by the finite resolution, and it can be ignored, and demonstrated that for the histogram MI estimator the R-bias caused by the finite resolution is as follows:

ΔI(X, Y)_{R-bias} = ∫ [1 / (24 p(x))] (∂p(x)/∂x)² (Δx)² dx
                  + ∫ [1 / (24 p(y))] (∂p(y)/∂y)² (Δy)² dy
                  − ∫∫ [1 / (24 p(x, y))] [ (∂p(x, y)/∂x)² (Δx)² + (∂p(x, y)/∂y)² (Δy)² ] dx dy,  (3)

where all integrals run from −∞ to +∞. The integrals of Eq. 3 measure the smoothness of the probability density functions. When the pdf's are smooth, the first derivatives are approximately equal to zero. Hence, the squared derivatives are almost zero, and the R-bias is minimized.
The N-bias of the histogram MI estimator leads to overestimation, and the R-biasleads to underestimation. By increasing the number of bins (decreasing the bin length)the R-bias is decreased, and the N-bias is increased. Hence, the number of bins makesa trade-off between the N-bias and the R-bias.
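The N-bias side of this trade-off (Eq. 2) is cheap to evaluate in closed form. The helper below is a minimal sketch of Eq. 2; the function name is ours.

```python
def n_bias(m_x, m_y, n):
    """Theoretical N-bias of the histogram MI estimator, Eq. 2."""
    return (m_x * m_y - m_x - m_y + 1) / (2 * n)

# At fixed N, the N-bias grows roughly quadratically with the number of
# bins, while the R-bias of Eq. 3 shrinks as the bin widths decrease.
```

For instance, with M_X = M_Y = 10 and N = 100 this gives 81/200 ≈ 0.41, the value used in the experiment of Sect. 4.1.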
2.3 Variance of the histogram MI estimator
A good estimator must have a low variance. The variance of the histogram estimator of the MI can be written as follows (Moddemeijer 1989):

VAR[ Î(X, Y) ] ≈ (1/N) VAR[ log ( p(x, y) / (p(x) p(y)) ) ],  (4)

where x and y are the vectors of N simultaneous samples of X and Y. The variance of the histogram MI estimator is approximately independent of the cell sizes, except in the following cases: (1) the number of bins is one, or (2) the number of bins tends to infinity. In these cases, the variance is equal to zero.
3 Fuzzy-histogram method for estimating mutual information
In this section, we present the fuzzy-histogram MI estimator. As mentioned in the introduction, Loquin and Strauss (2008) showed that the fuzzy-histogram density estimator can improve the robustness of the histogram density estimator. Since estimating the MI depends on the estimation of the probabilities p(x), p(y) and p(x, y), utilizing fuzzy partitioning can improve the histogram MI estimator. In this section, a general form for the membership functions is suggested. The shape of the membership functions is controlled by a parameter called β. As β tends to infinity, the membership functions tend to crisp functions. By using this general form, it is possible to test whether fuzzification can improve the histogram method, and which membership functions are better for estimating the MI by the fuzzy-histogram method.
3.1 Fuzzy partitioning
Let D = [a, b] ⊂ ℝ be the domain of the variable X. We want to partition D as follows. Let γ_1 < γ_2 < · · · < γ_n be n ≥ 3 points of D such that γ_1 = a and γ_n = b. Let the length of the bins be equal to h. Therefore, γ_k = a + (k − 1)h. Now define two other points γ_0 = a − h and γ_{n+1} = b + h. Consider the extended domain D′ = [γ_0, γ_{n+1}] ⊂ ℝ. Define n fuzzy subsets A_1, A_2, ..., A_n on the extended domain D′, with membership functions μ_{A_1}(x), μ_{A_2}(x), ..., μ_{A_n}(x). These fuzzy sets should satisfy the following properties:

1. μ_{A_k}(γ_k) = 1;
2. μ_{A_k}(x) monotonically increases on [γ_0, γ_k], and monotonically decreases on [γ_k, γ_{n+1}];
3. ∀x ∈ D′, ∃k such that μ_{A_k}(x) > 0.
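As a quick sanity check of these properties, the following sketch builds the centers γ_k and evaluates triangular membership functions on them (the function name and vectorized layout are our own). On the original domain [a, b], such a partition is Ruspini-like: the membership values of each point sum to one.

```python
import numpy as np

def triangular_memberships(x, a, b, n_bins):
    """Membership values mu_{A_k}(x) for k = 1..n, with centers
    gamma_k = a + (k - 1) * h and h = (b - a) / (n - 1) as in Sect. 3.1."""
    h = (b - a) / (n_bins - 1)
    centers = a + np.arange(n_bins) * h            # gamma_1 .. gamma_n
    x = np.asarray(x, dtype=float)
    # triangular shape: 1 at the center, linearly decreasing to 0 at distance h
    return np.maximum(0.0, 1.0 - np.abs(x[..., None] - centers) / h)
```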
Some examples of membership functions with the mentioned properties are listed below:

– the crisp partition: K_A(x) = 1_{[−1/2, 1/2]}(x);
– the cosine partition: (1/2)(cos(πx) + 1) · 1_{[−1,1]}(x);
– the triangular partition: (1 − |x|) · 1_{[−1,1]}(x);
– the generalized normal partition, as described next.

Fig. 1 Fuzzy partitioning with triangular membership functions. Here, the number of bins is equal to six

Fig. 2 a Generalized normal function (GNF), b normalized GNF, which can be used as a membership function. Here, α = 2

Figure 1 illustrates a fuzzy partitioning of the interval [−10, 10] with triangular membership functions.
A good choice for the membership functions is the generalized normal function (GNF), which provides a general form for the membership functions.

The GNF is a parametric continuous function; it is the pdf of the generalized normal distribution. This type of function adds a shape parameter to the normal function. The formula of the GNF is as follows:

f(x) = [ β / (2α Γ(1/β)) ] · e^{ −(|x − μ| / α)^β },  (5)

where Γ is the gamma function (Γ(z) = ∫_0^∞ e^{−t} t^{z−1} dt).

By changing the parameter β of this function, its shape is changed. When β = 2, the shape of the function is like the pdf of the normal distribution. Furthermore, when β = 1, it is like the pdf of the Laplace distribution, and its shape is similar to the triangular function. By increasing the parameter β, the shape of the function gradually becomes similar to the pulse function. Moreover, α is the scale parameter.
Figure 2a demonstrates the GNF. To use the generalized normal function as a membership function, its outputs must be normalized over the interval [0, 1]. Figure 2b shows the normalized GNF. As shown in the figure, the GNF is capable of generating a wide range of membership functions (triangular, normal, ..., crisp). For example, when the membership functions are GNFs with β ≥ 10, they are similar to crisp membership functions.

Fig. 3 Fuzzy partitioning using the GNF membership functions; the number of bins is equal to six. a β = 10, b β = 2

Figure 3 demonstrates two fuzzy partitionings of the interval [−10, 10] by the GNF with β = 10 and β = 2. Here, α is equal to h/2 = 2.
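A minimal sketch of the normalized GNF of Eq. 5 (Fig. 2b): dividing f by its peak value f(μ) leaves exp(−(|x − μ|/α)^β), so the normalization needs no gamma function. The function name is ours.

```python
import numpy as np

def gnf_membership(x, mu=0.0, alpha=2.0, beta=2.0):
    """Normalized GNF (Eq. 5 divided by its maximum f(mu)),
    which maps onto [0, 1] and can serve as a membership function."""
    x = np.asarray(x, dtype=float)
    return np.exp(-(np.abs(x - mu) / alpha) ** beta)
```

With β = 1 the shape is Laplace-like (close to triangular), β = 2 is Gaussian-like, and a large β (e.g. β ≥ 10) approaches a crisp pulse.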
3.2 Estimating the mutual information
In estimating the MI by the fuzzy-histogram method, the probabilities are calculated differently from the crisp histogram method. Suppose that we have N simultaneous measurements of two continuous variables X and Y. The measurements of X and Y are partitioned into M_X and M_Y bins respectively, as described in Sect. 3.1. For each bin a_i belonging to X, a fuzzy membership function μ_{A_i}(x) is defined. Similarly, for each bin b_j belonging to Y, a fuzzy membership function μ_{B_j}(y) is considered. The MI of X and Y is given by I(X, Y) = H(X) + H(Y) − H(X, Y), where the entropies are estimated as follows:
H(X) = − Σ_{i=1}^{M_X} p(a_i) log p(a_i),  (6)

H(Y) = − Σ_{j=1}^{M_Y} p(b_j) log p(b_j),  (7)

H(X, Y) = − Σ_{i=1}^{M_X} Σ_{j=1}^{M_Y} p(a_i, b_j) log p(a_i, b_j).  (8)
In the fuzzy-histogram approach, the probability of the state (bin) a_i of the data X is calculated as follows:

p(a_i) = M_{a_i} / Σ_{l=1}^{M_X} M_{a_l},  (9)

where M_{a_i} is the sum of the membership values of the samples of X belonging to the fuzzy set A_i:

M_{a_i} = Σ_{k=1}^{N} μ_{A_i}(x_k).  (10)

In the crisp histogram MI estimation method, the probability of observing a sample of X in the bin a_i is estimated by the relative frequency of samples of X observed in the bin a_i, and it equals k_i / N. In the fuzzy-histogram method, this probability is estimated by the fraction in Eq. 9: the numerator is the sum of the membership values of the samples of X belonging to the fuzzy set A_i, and the denominator is the sum of the membership values of the samples of X belonging to all fuzzy sets. Hence, the crisp histogram is a special case of the fuzzy histogram, where the membership value of each sample belonging to a bin is either one or zero. Thus, in this case, Σ_{k=1}^{N} μ_{A_i}(x_k) is equal to the frequency of samples of X observed in the cell a_i, and Σ_{l=1}^{M_X} M_{a_l} is equal to N. Therefore, Eq. 9 is equivalent to the relative frequency of the crisp case.
Similarly, the probability of the state (bin) b_j of the data Y and the quantity M_{b_j} are as follows:

p(b_j) = M_{b_j} / Σ_{s=1}^{M_Y} M_{b_s},  (11)

M_{b_j} = Σ_{k=1}^{N} μ_{B_j}(y_k).  (12)
In the crisp histogram method, p(a_i, b_j) is defined as the probability of observing a sample (x, y) of (X, Y) in the bin (a_i, b_j) (where x lies in the bin a_i and y in the bin b_j). Let k_ij be the number of samples which lie in the bin (a_i, b_j). Based on the frequentist approach to probability, p(a_i, b_j) ≈ k_ij / N.

Now denote by (A_i, B_j) the fuzzy set associated with the bin (a_i, b_j), and let μ_{(A_i, B_j)} be its membership function. We use the product for defining the membership function; that is, for any data point (x_k, y_k), we have μ_{(A_i, B_j)}(x_k, y_k) = μ_{A_i}(x_k) · μ_{B_j}(y_k).

Following the analogy of the crisp case, the frequentist approach suggests that the probability p(a_i, b_j) can be estimated by the relative sum of the membership values of the samples of (X, Y) belonging to the fuzzy set (A_i, B_j). Therefore, the joint probability of the bin (a_i, b_j) is computed by:

p(a_i, b_j) = M_{a_i b_j} / [ Σ_{l=1}^{M_X} Σ_{s=1}^{M_Y} M_{a_l b_s} ],  (13)

where M_{a_i b_j} is obtained by the following equation:

M_{a_i b_j} = Σ_{k=1}^{N} μ_{(A_i, B_j)}(x_k, y_k) = Σ_{k=1}^{N} μ_{A_i}(x_k) · μ_{B_j}(y_k).  (14)
Using the sum-product instead of max-min or max-product is natural, due to the analogy between the fuzzy and crisp cases and the way one counts the data points lying in each bin. In other words, the summation of membership values in the fuzzy method plays a role similar to counting the number of samples observed in the bin (a_i, b_j) in the crisp case. Additionally, as with p(a_i), the crisp histogram is a special case of the fuzzy histogram, because in the crisp case, if a sample belongs to a bin (a_i, b_j), its membership value is 1, and it is 0 otherwise. Thus, in the crisp case, p(a_i, b_j) as computed by Eq. 13 is equal to the relative frequency of samples belonging to the bin (a_i, b_j).
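Putting Eqs. 6–14 together, a compact sketch of the fuzzy-histogram MI estimator might look as follows. It uses normalized-GNF memberships with α = h/2 (Eq. 23 in Sect. 4.2); all function names are ours, and equal-width bins over the sample range are an implementation choice.

```python
import numpy as np

def _memberships(v, n_bins, beta):
    """Normalized-GNF membership values of each sample in each bin,
    with centers gamma_k as in Sect. 3.1 and alpha = h / 2."""
    v = np.asarray(v, dtype=float)
    h = (v.max() - v.min()) / (n_bins - 1)
    centers = v.min() + np.arange(n_bins) * h
    alpha = h / 2.0
    return np.exp(-(np.abs(v[:, None] - centers) / alpha) ** beta)

def fuzzy_histogram_mi(x, y, n_bins=10, beta=2.0):
    """Fuzzy-histogram MI estimate: I = H(X) + H(Y) - H(X, Y), with the
    bin probabilities of Eqs. 9, 11 and 13 (product membership, Eq. 14)."""
    mu_x = _memberships(x, n_bins, beta)
    mu_y = _memberships(y, n_bins, beta)
    p_x = mu_x.sum(axis=0)
    p_x /= p_x.sum()                    # Eq. 9
    p_y = mu_y.sum(axis=0)
    p_y /= p_y.sum()                    # Eq. 11
    p_xy = mu_x.T @ mu_y                # Eq. 14: sum_k mu_Ai(x_k) * mu_Bj(y_k)
    p_xy /= p_xy.sum()                  # Eq. 13
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))   # Eqs. 6-8
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
```

Setting a large β (e.g. β ≥ 10) makes the memberships nearly crisp, recovering behavior close to the naïve histogram estimator.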
4 Experimental results
In this section, the fuzzy-histogram MI estimator is investigated from different aspects. Firstly, in a simple experiment, the fuzzy-histogram MI estimator is compared with the histogram estimator on independent variables. In the second part, the effects of the parameters of the fuzzy-histogram estimator, including the shape of the membership functions and the number of bins, are investigated for three different distributions. In the third part, the accuracy and the running time of the fuzzy-histogram estimator are compared with those of the histogram estimator and the KDE. The fourth experiment compares the bias of the fuzzy-histogram estimator with that of the histogram estimator. The fifth experiment investigates the variance of the fuzzy-histogram estimator. The sixth experiment compares different dependency measures on data with different degrees of dependency. The final experiment is devoted to comparing the histogram estimator, the fuzzy-histogram estimator, and the KDE in a real-world application.
4.1 Mutual information of independent variables
In this simple experiment, the difference between the accuracy of the fuzzy-histogram MI estimator and that of the histogram estimator for independent variables is shown. We have two independent and uniformly distributed variables X and Y. Since X and Y are independent, the true value of their MI is zero: I_exact(X, Y) = 0. In Fig. 4, the two estimators of the MI are compared for different sample sizes N and different numbers of bins M (M = M_X = M_Y). The experiment was repeated 1,000 times with independent realizations of X and Y. The estimations of the MI reported in the graphs are averaged over these 1,000 trials, and the error bars show the standard deviations. In this experiment, the triangular membership function is used for partitioning (similar to Fig. 1). Both estimations of the MI are overestimated. However, the overestimation of the fuzzy-histogram method is less than that of the histogram method. For smaller N and more bins, the difference between the two methods is more considerable. Thus, the fuzzy-histogram estimator provides a more accurate estimation for independent variables, especially when the number of samples is small.

Fig. 4 The fuzzy-histogram estimation and the histogram estimation of the MI of two independent variables (estimated MI versus N), for a number of bins equal to a 5, b 10, and c 15
Additionally, the experimental overestimation of the histogram estimator is consistent with the theoretical N-bias. For example, when N = 100 and the number of bins is equal to 10, the theoretical overestimation is as follows:

E[ Î − I_exact ] = (10 − 1)(10 − 1) / (2 · 100) ≈ 0.41,

and the experimental N-bias for the histogram estimator is equal to 0.47. However, the N-bias of the fuzzy-histogram estimator is less; in this case (N = 100), it is equal to 0.18. In Sect. 4.4, more investigations are conducted on the bias of the fuzzy-histogram estimator.
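The independent-uniform experiment can be reproduced at small scale with the crisp estimator. The sketch below is our own code (equal-width `numpy.histogram2d` bins); averaging the estimates over repeated trials gives an empirical estimate of the N-bias, since the true MI is zero.

```python
import numpy as np

def histogram_mi(x, y, bins):
    """Naive histogram MI estimate of Eq. 1."""
    k_ij, _, _ = np.histogram2d(x, y, bins=bins)
    n = len(x)
    outer = k_ij.sum(axis=1, keepdims=True) * k_ij.sum(axis=0, keepdims=True)
    nz = k_ij > 0
    return float(np.sum((k_ij[nz] / n) * np.log(k_ij[nz] * n / outer[nz])))

rng = np.random.default_rng(0)
estimates = [histogram_mi(rng.uniform(size=100), rng.uniform(size=100), bins=10)
             for _ in range(200)]
empirical_bias = float(np.mean(estimates))   # true MI is zero for independent X, Y
```

With N = 100 and 10 bins, the theoretical N-bias of Eq. 2 is 81/200 ≈ 0.41; the empirical mean lands in the same range (the paper reports 0.47 for the histogram estimator).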
4.2 The effects of the parameters of the fuzzy-histogram method
In this experiment, the effects of the parameters of the fuzzy-histogram MI estimator are investigated by experimental analysis for three distributions. To find an appropriate membership function, the GNF (Eq. 5) is used. As mentioned in Sect. 3, the GNF is a parametric continuous function (see Fig. 2). By changing the parameter β of this function, its shape is changed. When β = 2, the shape of the function is like the pdf of the normal distribution. Furthermore, when β = 1, it is like the pdf of the Laplace distribution, and its shape is similar to the triangular function. By increasing the parameter β, the shape of the function gradually becomes similar to the pulse function. Moreover, α is the scale parameter.
Hence, the fuzzy-histogram estimator with the generalized normal membership function has three important parameters: β and α, which indicate the shape and scale of the membership function, and the number of bins.
Here, the impacts of the parameters of the fuzzy-histogram MI estimator are investigated for data with three distributions: bivariate normal, bivariate gamma-exponential, and bivariate ordered Weinman exponential. In the following, these distributions and the exact MI between their variates are presented.
1. Bivariate normal distribution: the pdf of this distribution is as follows:

p(x, y) = [ 1 / (2π σ_1 σ_2 √(1 − ρ²)) ] · e^{ −z / (2(1 − ρ²)) },  (15)

where z = (x − μ_1)² / σ_1² − 2ρ(x − μ_1)(y − μ_2) / (σ_1 σ_2) + (y − μ_2)² / σ_2², and ρ = corr(X, Y).
The exact MI between the variates X and Y of the bivariate normal distribution with the correlation coefficient ρ is (Moddemeijer 1989):

I(X, Y) = (1/2) log ( 1 / (1 − ρ²) ).  (16)
2. Bivariate gamma-exponential distribution: the pdf of this distribution is (Darbellay 2000; Zografos and Nadarajah 2005):

p(x, y) = [ θ_1^{θ_2} θ_3 / Γ(θ_2) ] · x^{θ_2} e^{ −θ_1 x − θ_3 x y }.  (17)

The exact MI between the variates of this distribution is (Darbellay 2000):

I(X, Y) = ψ(θ_2) − ln(θ_2) + 1/θ_2,  (18)

where ψ(z) is the digamma function, ψ(z) = (d/dz) ln Γ(z) = Γ′(z) / Γ(z).

3. Bivariate ordered Weinman exponential distribution: the pdf of the two-dimensional ordered Weinman exponential distribution is as follows (Darbellay 2000; Zografos and Nadarajah 2005):

p(x, y) = [ (2/θ_0) e^{ −(2/θ_0)(x − x_0) } ] × [ (1/θ_1) e^{ −(1/θ_1)(y − x) } ],  (19)

with x_0 ≤ x ≤ y, and θ_0, θ_1 > 0. The exact MI between the variates of the bivariate ordered Weinman exponential distribution is (Darbellay 2000):
I(X, Y) = ln( (1/θ_1)(θ_0/2 − θ_1) ) + Ψ( θ_0 / (θ_0 − 2θ_1) ) − Ψ(1)   if θ_1 < θ_0/2,
I(X, Y) = −Ψ(1)                                                          if θ_1 = θ_0/2,
I(X, Y) = ln( (1/θ_1)(θ_1 − θ_0/2) ) + Ψ( 2θ_1 / (2θ_1 − θ_0) ) − Ψ(1)   if θ_1 > θ_0/2.  (20)
In each of these distributions, the exact MI depends on some parameters of the distribution. For the bivariate normal distribution, the MI between its variates depends only on the correlation coefficient ρ. For the gamma-exponential distribution, the MI depends on the parameter θ_2, and for the ordered Weinman exponential distribution, it depends on θ_0 and θ_1, or more precisely on the ratio θ_1/θ_0.

Here, we want to find appropriate parameters of the fuzzy-histogram MI estimator for each distribution, such that for different data drawn from that distribution with different exact MI, the average absolute error is minimized. For example, for the normal distribution, we want to find appropriate parameter values among several candidate values, such that for different bivariate normal data with different ρ, the average absolute error is minimized.
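For reference, the closed forms of Eqs. 16 and 18 are easy to evaluate. This is a sketch assuming SciPy for the digamma function; the piecewise Weinman expression of Eq. 20 is omitted here, and the function names are ours.

```python
import numpy as np
from scipy.special import digamma

def mi_bivariate_normal(rho):
    """Exact MI of the bivariate normal distribution, Eq. 16."""
    return 0.5 * np.log(1.0 / (1.0 - rho ** 2))

def mi_gamma_exponential(theta2):
    """Exact MI of the bivariate gamma-exponential distribution, Eq. 18."""
    return digamma(theta2) - np.log(theta2) + 1.0 / theta2
```

These exact values serve as the reference I_i^exact in the error measures of Eqs. 21 and 22 below.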
Fig. 5 Data with bivariate normal distribution with different correlation coefficients ρ
Thus, for this experiment, different realizations were generated from each distribution with different parameter settings. Bivariate normal samples were generated with 10 different correlation coefficients ρ = {0, 0.1, ..., 0.8, 0.9}. The mean vector and the covariance matrix are set to Mean = (0, 0)ᵀ and Σ = [1 ρ; ρ 1], respectively.

Various realizations of the bivariate normal distribution with different ρ are shown in Fig. 5, to illustrate graphically the relation between the variates of this distribution. Here, the sample size N is 500.
For the bivariate gamma-exponential distribution, samples were generated with θ_1 = θ_3 = 1 and θ_2 = {1, 2, ..., 19, 20}. Figure 6 illustrates several realizations of this distribution with different parameter values.
Finally, for the bivariate ordered Weinman exponential distribution, samples were generated with θ_0 = 100 and θ_1 = {10, 20, ..., 90, 100}. Figure 7 illustrates several realizations of this distribution with different parameter values.
As mentioned above, for finding appropriate parameters of the fuzzy histogram for each of these distributions, the average absolute error between the estimated MI and the exact MI is used. This average is computed over different realizations with different parameter settings. Formally:
1. P is the number of different parameter settings of the underlying probability distribution for which the sampling took place. For instance, consider the bivariate normal distribution in Table 1. Since the values of μ_1, μ_2, σ_1, and σ_2 are fixed, and only 10 different values of ρ are used, we have 10 different settings. Therefore, P = 10.
2. T is the number of trials; that is, the number of realizations over which the error is computed.
Fig. 6 Data with the gamma-exponential distribution with different values of θ_2
Fig. 7 Data with the bivariate ordered Weinman exponential distribution with different ratios θ_1/θ_0. The parameter θ_0 is equal to 100, and the parameter θ_1 is changed from 10 to 100 in steps of 10
Table 1 The parameter settings of the distributions used in the experiments

Distribution                           | Parameters
Bivariate normal                       | μ_1 = μ_2 = 0, σ_1 = σ_2 = 1, ρ = {0.0, 0.1, ..., 0.8, 0.9}
Bivariate gamma-exponential            | θ_1 = θ_3 = 1, θ_2 = {1, 2, ..., 19, 20}
Bivariate ordered Weinman exponential  | θ_0 = 100, θ_1/θ_0 = {0.1, 0.2, ..., 0.9, 1}
Define AvgError_i^T as the average error in the i-th setting, where the number of trials is T:

AvgError_i^T = (1/T) Σ_{j=1}^{T} | Î_ij − I_i^exact |.  (21)

Moreover, define AvgError^P as the average error over the P different parameter settings of the underlying distribution:

AvgError^P = (1/P) Σ_{i=1}^{P} AvgError_i^T.  (22)
The effects of the parameter values of the fuzzy histogram are evaluated based on AvgError^P, as explored next. In the experiments, the distributions and settings mentioned in Table 1 are used.
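Eqs. 21 and 22 amount to a mean of absolute errors, first within and then across settings. A minimal sketch (the function name and array layout are ours):

```python
import numpy as np

def avg_error(estimates, exact):
    """AvgError^P of Eq. 22: estimates[i][j] is the estimated MI in trial j
    of parameter setting i, and exact[i] is the exact MI for setting i."""
    estimates = np.asarray(estimates, dtype=float)                 # shape (P, T)
    exact = np.asarray(exact, dtype=float)                         # shape (P,)
    per_setting = np.abs(estimates - exact[:, None]).mean(axis=1)  # Eq. 21
    return float(per_setting.mean())                               # Eq. 22
```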
4.2.1 Parameter β
First, we examine the effect of the parameter β, which determines the shape of the membership function. For this purpose, the parameter α and the number of bins are fixed. α is fixed to the following values:
α_X = h_X / 2,  h_X = (max(X) − min(X)) / (M_X − 1),
α_Y = h_Y / 2,  h_Y = (max(Y) − min(Y)) / (M_Y − 1).  (23)
In summary, α is fixed to h/2. Moreover, the number of bins M is identical for X and Y (M = M_X = M_Y). For the bivariate normal and the bivariate gamma-exponential distributions, the number of bins is fixed to 10 (M = M_X = M_Y = 10), and for the bivariate ordered Weinman exponential distribution, the number of bins is fixed to M = M_X = M_Y = 30. In this experiment, N = 500.
Figure 8 shows the average error AvgError^P (see Eq. 22) versus β for each of the three distributions. The number of trials T is 50. For the normal distribution, the average error is minimized when β = 2. Moreover, for the gamma-exponential distribution and the ordered Weinman exponential distribution, the average error takes its minimum when β = 1 and β = 2, respectively.

Fig. 8 The average error AvgError^P (see Eq. 22) versus β, for the normal, gamma-exponential, and ordered Weinman distributions

This experiment indicates that fuzzification can improve the average error. Based on the results of Fig. 8, the average error is smallest when β is small, and it increases as β increases. Moreover, by increasing β, the membership functions tend to the crisp membership functions. Thus, on average, the fuzzy-histogram MI estimator provides a more accurate estimation than the histogram estimator, for data sampled from these three distributions.
4.2.2 Number of bins
To investigate the impact of the number of bins M (M = M_X = M_Y), α and β are fixed. α is fixed to h/2 (see Eq. 23), and β is set to the specific values which led to the minimum average error in the previous experiment.
Figure 9 illustrates the average absolute error (AvgError^P) over 50 trials versus the number of bins.
As can be seen in the graphs, when M is greater or less than a certain value, the error increases. The reason is that when the number of bins is increased, the N-bias increases and the R-bias decreases, and when the number of bins is decreased, the R-bias increases and the N-bias decreases. Hence, the number of bins makes a trade-off between the N-bias and the R-bias.
Furthermore, as shown in the graphs, by changing the number of bins M, the variation of the average error of the fuzzy-histogram method is less than that of the histogram method. Thus, the sensitivity of the fuzzy histogram to M is somewhat less than that of the histogram method.
4.2.3 Parameter α
Here, the impact of α and the relations among α, β, and the number of bins are studied. In the previous subsections, α was fixed to h/2 (see Eq. 23). Here, we seek an appropriate scaling parameter α among different coefficients of h.
Figure 10 shows the average absolute error AvgErrorP of the fuzzy-histogram estimator with the GNF membership functions, for different values of the parameters α, β, and the number of bins M (M = MX = MY), for each distribution. Here, the sample size N is 500.
Fig. 9 The averages of the absolute error (AvgErrorP) over 50 trials versus the number of bins, for data with: a bivariate normal distribution, b bivariate gamma-exponential, and c bivariate ordered Weinman distribution [panels for N = 100, 500, and 1,000, comparing the histogram and fuzzy-histogram estimators]
As can be seen in the graphs, the results for the three distributions are similar. For the appropriate values of β and M found previously, appropriate values of α lie around h/2. Thus, h/2 is a proper value for α. Additionally, these results are consistent with those of Sects. 4.2.1 and 4.2.2, and illustrate the relations between α, β, and M.
4.3 The accuracy and the running time of the fuzzy-histogram MI estimator
After studying the impact of the parameters of the fuzzy-histogram method and finding appropriate values for them, the accuracy of the fuzzy-histogram estimator is compared with that of the histogram estimator, for the three aforementioned distributions (see Table 1).
Figure 11 shows the histogram estimate, the fuzzy-histogram estimate, and the exact MI for the three distributions. The values were averaged over 50 trials and the error bars denote the standard deviation. The parameters of the methods were set to the values that led to the minimum average absolute errors in Sect. 4.2.
Fig. 10 The averages of the absolute error (AvgErrorP) of the fuzzy-histogram estimator, with the GNF membership functions with different parameters α, β, and the number of bins. The data is distributed according to a bivariate normal distribution, b bivariate gamma-exponential, and c bivariate ordered Weinman distribution [panels for β = 1, 2, 3, with α ranging over 0.3h to 0.9h]
As shown in the graphs, the fuzzy-histogram estimate is usually closer to the exact MI. Based on this experiment, for both estimators, increasing the sample size improves the accuracy of the estimate. However, when the number of samples is small, the fuzzy-histogram estimate is considerably better than that of the histogram method. Hence, this experiment indicates that the fuzzy-histogram method provides a good estimate even when the available sample is not large enough.
Kernel density estimation (KDE) is known as an effective algorithm for estimating MI (Schaffernicht et al. 2010). KDE outperforms the naïve histogram estimator in terms of accuracy and provides a high-quality estimate; however, it is very time-consuming. Here, we also compare the fuzzy-histogram method with KDE.
Fig. 11 The fuzzy-histogram estimation and the histogram estimation of the MI between the variates of: a bivariate normal data, b bivariate gamma-exponential, and c bivariate ordered Weinman distribution, for N = 100, 500, and 1,000. The values were averaged over 50 trials and the error bars denote the standard deviation. The black solid lines indicate the exact MI obtained from Eqs. 16, 18, and 20
The averages and standard deviations (over 50 trials) of the absolute error (|Iexact − Î|) of the fuzzy-histogram method, the histogram method, and the KDE are compared in Tables 2, 3 and 4. Note that the parameters of the fuzzy and crisp histogram methods were set to the appropriate values obtained in the previous section. For the KDE, the optimal value of the smoothing parameter (Moon et al. 1995) was used. In Tables 2, 3 and 4, when the error of the fuzzy method is lower than the errors of both the histogram and the KDE, it is written in boldface. When the error of the fuzzy-histogram method is lower only than that of the histogram method, it is underlined. When it is lower only than the error of the KDE, it is italicized.
Table 2 The absolute errors of the estimation methods for the data with the bivariate normal distribution

ρ 0 0.1 0.2 0.3 0.4
N = 100 KDE 0.0802 ± 0.0113  0.0871 ± 0.031  0.0902 ± 0.0286  0.1117 ± 0.0347  0.1034 ± 0.0493
Histogram 0.253 ± 0.0577  0.2521 ± 0.0387  0.2293 ± 0.0479  0.2294 ± 0.0518  0.2264 ± 0.0504
Fuzzy-histogram 0.0926 ± 0.029  0.0951 ± 0.0323  0.0992 ± 0.0332  0.0917 ± 0.0398  0.0779 ± 0.0368
N = 500 KDE 0.041 ± 0.0046  0.0402 ± 0.0069  0.0431 ± 0.0217  0.04 ± 0.017  0.0628 ± 0.0105
Histogram 0.2202 ± 0.0303  0.2269 ± 0.0302  0.2171 ± 0.0346  0.2086 ± 0.036  0.2002 ± 0.0465
Fuzzy-histogram 0.0088 ± 0.0442  0.0093 ± 0.0407  0.011 ± 0.0383  0.0137 ± 0.0356  0.0167 ± 0.0327
N = 1,000 KDE 0.0316 ± 0.0051  0.0288 ± 0.0072  0.0312 ± 0.0076  0.0316 ± 0.0076  0.0275 ± 0.0168
Histogram 0.0386 ± 0.0058  0.0371 ± 0.0064  0.0337 ± 0.0092  0.0355 ± 0.0107  0.0282 ± 0.0125
Fuzzy-histogram 0.0208 ± 0.0044  0.0194 ± 0.0058  0.017 ± 0.0083  0.0165 ± 0.0094  0.0098 ± 0.0117

ρ 0.5 0.6 0.7 0.8 0.9
N = 100 KDE 0.0482 ± 0.0336  0.0598 ± 0.0474  0.1423 ± 0.0557  0.0112 ± 0.0541  0.1307 ± 0.0519
Histogram 0.2064 ± 0.056  0.1621 ± 0.0571  0.1343 ± 0.0699  0.088 ± 0.0726  0.0579 ± 0.0796
Fuzzy-histogram 0.0611 ± 0.0382  0.032 ± 0.0566  0.0123 ± 0.0538  0.0583 ± 0.0613  0.1825 ± 0.0709
N = 500 KDE 0.0243 ± 0.0208  0.0427 ± 0.0483  0.0351 ± 0.027  0.0169 ± 0.0317  0.0563 ± 0.0204
Histogram 0.1834 ± 0.0461  0.156 ± 0.0503  0.1214 ± 0.0569  0.0744 ± 0.0637  0.0409 ± 0.0584
Fuzzy-histogram 0.0198 ± 0.0229  0.0235 ± 0.0098  0.0275 ± 0.0194  0.0308 ± 0.0438  0.0312 ± 0.1455
N = 1,000 KDE 0.0296 ± 0.0272  0.0333 ± 0.011  0.0391 ± 0.0355  0.0257 ± 0.0322  0.0379 ± 0.0218
Histogram 0.0219 ± 0.0143  0.014 ± 0.0191  0.005 ± 0.0202  0.0356 ± 0.0234  0.1103 ± 0.0309
Fuzzy-histogram 0.001 ± 0.0127  0.0086 ± 0.0169  0.0327 ± 0.0172  0.0693 ± 0.021  0.1676 ± 0.0276
Table 3 The absolute errors of the estimation methods for the data with the bivariate gamma-exponential distribution

θ2 1 3 5 7 9
N = 100 KDE 0.2896 ± 0.049  0.0012 ± 0.0323  0.0535 ± 0.0432  0.0605 ± 0.057  0.0781 ± 0.0204
Histogram 0.3437 ± 0.0507  0.0398 ± 0.0616  0.0996 ± 0.0597  0.1114 ± 0.0515  0.1319 ± 0.0386
Fuzzy-histogram 0.2415 ± 0.0699  0.0216 ± 0.0319  0.0021 ± 0.0366  0.0019 ± 0.0269  0.0004 ± 0.024
N = 500 KDE 0.3546 ± 0.0574  0.0119 ± 0.0092  0.0125 ± 0.0164  0.0276 ± 0.0081  0.0291 ± 0.008
Histogram 0.4028 ± 0.0111  0.0463 ± 0.0314  0.0144 ± 0.017  0.0265 ± 0.0196  0.0414 ± 0.0192
Fuzzy-histogram 0.3144 ± 0.0449  0.0138 ± 0.0265  0.0108 ± 0.0183  0.0088 ± 0.0195  0.0154 ± 0.0134
N = 1,000 KDE 0.3396 ± 0.0574  0.0127 ± 0.0092  0.0234 ± 0.0164  0.018 ± 0.0081  0.0199 ± 0.008
Histogram 0.4114 ± 0.0058  0.0651 ± 0.0232  0.0169 ± 0.0171  0.004 ± 0.0112  0.0102 ± 0.007
Fuzzy-histogram 0.3441 ± 0.0286  0.0442 ± 0.0171  0.0171 ± 0.0153  0.0041 ± 0.0115  0.0034 ± 0.0089

θ2 11 13 15 17 19
N = 100 KDE 0.0724 ± 0.0387  0.0647 ± 0.0167  0.0691 ± 0.0328  0.0613 ± 0.0304  0.0842 ± 0.0046
Histogram 0.1543 ± 0.0405  0.1529 ± 0.0482  0.1722 ± 0.052  0.1743 ± 0.042  0.1667 ± 0.0482
Fuzzy-histogram 0.006 ± 0.0264  0.0063 ± 0.0215  0.0161 ± 0.0288  0.0103 ± 0.0178  0.0121 ± 0.0244
N = 500 KDE 0.0365 ± 0.0068  0.04 ± 0.0061  0.0325 ± 0.0021  0.0388 ± 0.0026  0.0344 ± 0.0047
Histogram 0.0405 ± 0.0169  0.0437 ± 0.0141  0.0446 ± 0.0155  0.0454 ± 0.0177  0.0456 ± 0.0152
Fuzzy-histogram 0.012 ± 0.0113  0.0124 ± 0.0129  0.0144 ± 0.0121  0.0157 ± 0.013  0.0141 ± 0.0119
N = 1,000 KDE 0.0239 ± 0.0068  0.0251 ± 0.0061  0.0304 ± 0.0021  0.0287 ± 0.0026  0.0275 ± 0.0047
Histogram 0.017 ± 0.0084  0.0193 ± 0.0076  0.0211 ± 0.0085  0.0208 ± 0.0106  0.019 ± 0.0084
Fuzzy-histogram 0.0015 ± 0.0089  0.0007 ± 0.0073  0.0035 ± 0.0073  0.0045 ± 0.0079  0.0019 ± 0.0064
Table 4 The absolute errors of the estimation methods for the data with the bivariate ordered Weinman exponential distribution

θ1/θ0 1 0.9 0.8 0.7 0.6
N = 100 KDE 0.0797 ± 0.0596  0.037 ± 0.0867  0.0524 ± 0.0932  0.0543 ± 0.0464  0.0346 ± 0.068
Histogram 0.186 ± 0.1067  0.1981 ± 0.0923  0.1852 ± 0.0885  0.1881 ± 0.0902  0.1416 ± 0.1128
Fuzzy-histogram 0.0501 ± 0.0622  0.0481 ± 0.0543  0.0472 ± 0.057  0.0391 ± 0.0633  0.0234 ± 0.0839
N = 500 KDE 0.0803 ± 0.0138  0.0579 ± 0.0335  0.0617 ± 0.0273  0.0538 ± 0.0518  0.0705 ± 0.0242
Histogram 0.2358 ± 0.0596  0.246 ± 0.0539  0.2188 ± 0.0439  0.1982 ± 0.0666  0.1796 ± 0.0467
Fuzzy-histogram 0.1865 ± 0.0352  0.1819 ± 0.0333  0.1735 ± 0.0272  0.134 ± 0.0612  0.1227 ± 0.0354
N = 1,000 KDE 0.059 ± 0.0112  0.0686 ± 0.0163  0.0619 ± 0.0242  0.0765 ± 0.0207  0.0766 ± 0.02
Histogram 0.0917 ± 0.0359  0.088 ± 0.0269  0.0978 ± 0.0297  0.0729 ± 0.0387  0.0667 ± 0.0305
Fuzzy-histogram 0.048 ± 0.0245  0.0519 ± 0.0252  0.0443 ± 0.0291  0.0332 ± 0.0284  0.0132 ± 0.0349

θ1/θ0 0.5 0.4 0.3 0.2 0.1
N = 100 KDE 0.0191 ± 0.0815  0.0001 ± 0.0913  0.0266 ± 0.122  0.0291 ± 0.1433  0.0121 ± 0.0763
Histogram 0.1144 ± 0.132  0.0707 ± 0.1073  0.0457 ± 0.1464  0.276 ± 0.1051  0.6176 ± 0.1371
Fuzzy-histogram 0.0065 ± 0.1308  0.057 ± 0.1198  0.1149 ± 0.136  0.3344 ± 0.1842  0.656 ± 0.1666
N = 500 KDE 0.0803 ± 0.0333  0.0586 ± 0.0484  0.0905 ± 0.0296  0.0141 ± 0.064  0.0305 ± 0.0827
Histogram 0.1822 ± 0.0662  0.1185 ± 0.0552  0.0752 ± 0.0736  0.0514 ± 0.0833  0.2386 ± 0.0812
Fuzzy-histogram 0.1016 ± 0.0637  0.0267 ± 0.0662  0.0183 ± 0.0962  0.2265 ± 0.177  0.3868 ± 0.117
N = 1,000 KDE 0.0803 ± 0.0166  0.0594 ± 0.0291  0.0875 ± 0.0097  0.0969 ± 0.0455  0.0337 ± 0.0565
Histogram 0.0487 ± 0.0388  0.0112 ± 0.0386  0.0272 ± 0.0426  0.1329 ± 0.0628  0.3396 ± 0.0575
Fuzzy-histogram 0.0045 ± 0.0299  0.0516 ± 0.0526  0.0858 ± 0.032  0.3278 ± 0.143  0.4863 ± 0.0841
Table 5 The running times of the estimation methods (in seconds)
Distribution Method N = 100 N = 500 N = 1,000
Normal KDE 0.3452 ± 0.1986 8.210 ± 4.483 25.716 ± 7.437
Histogram 0.0014 ± 0.0002 0.0043 ± 0.0016 0.0014 ± 0.0002
Fuzzy-histogram 0.022 ± 0.0031 0.2402 ± 0.048 0.4720 ± 0.0501
Gamma-exponential KDE 0.6389 ± 0.1294 7.6489 ± 3.9953 28.1857 ± 11.6613
Histogram 0.0016 ± 0.001 0.005 ± 0.0013 0.009 ± 0.002
Fuzzy-histogram 0.01792 ± 0.0022 0.1165 ± 0.0119 0.2526 ± 0.0450
Ordered Weinman exponential KDE 0.6379 ± 0.0851 6.971 ± 1.345 25.623 ± 3.201
Histogram 0.0016 ± 0.001 0.005 ± 0.0013 0.009 ± 0.002
Fuzzy-histogram 0.01792 ± 0.0022 0.1165 ± 0.0119 0.2526 ± 0.04570
As can be seen in the tables, in many cases the fuzzy method outperforms both the histogram and the KDE. These tables show that the fuzzy-histogram method can be considered an appropriate estimation method for the MI.
Another important property of an estimation method is its running time. Table 5 reports the averages and standard deviations of the running times of the three estimation methods. Although the KDE is an accurate estimation method, it is very time-consuming and cannot be used in many applications. The average running time of the fuzzy-histogram method, however, is significantly less than that of the KDE. Thus, the fuzzy-histogram method is an efficient method for estimating the MI.
In the KDE, one first estimates the values of the probability density functions fX(x), fY(y), and fXY(x, y) at each data-point (xk, yk), and then estimates the MI by numerically evaluating (cf. Steuer et al. 2002, Equations 31–32)

I(X; Y) = ∫∫ fXY(x, y) log [ fXY(x, y) / (fX(x) fY(y)) ] dx dy.

In contrast, the fuzzy-histogram method estimates the probability that a random point lies in each bin, rather than the probability density function at the individual data-points. Therefore, all of our calculations are based on the pmf of the bins, while the calculations of the KDE are based on the pdf at the individual data-points. Since the number of bins is usually much smaller than the number of data-points, our method is significantly faster than the KDE. Moreover, as shown in the paper, its accuracy is comparable, and in some cases even superior, to that of the KDE.
Another difference between the KDE and the fuzzy-histogram method is that the former uses kernel functions, while the latter uses fuzzy membership functions. Note that membership functions are more general than kernel functions, since a kernel function K must satisfy the symmetry property K(−x) = K(x) for all x in its domain. Moreover, as described above, we use one membership function per bin, while the KDE requires one kernel per data-point. Thus, the number of kernels used in the KDE is significantly larger than the number of membership functions in our method. Again, the computational overhead of our method is significantly less than that of the KDE, and our results show that the accuracy of the fuzzy method is comparable to, and in some cases better than, that of the KDE.
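The pmf-of-bins computation described above can be sketched in a few lines. This is a simplified, hypothetical rendering rather than the paper's exact algorithm: each point's memberships in the M bins are computed with a generalized-normal-style function (a stand-in for the GNF), normalized so that each point contributes unit mass, and the MI is then read off the joint bin pmf.

```python
import numpy as np

def fuzzy_hist_mi(x, y, m_bins=10, alpha_coef=0.5, beta=2.0):
    """Sketch of a fuzzy-histogram MI estimate (in nats) from the bin pmf."""
    def memberships(v):
        lo, hi = v.min(), v.max()
        h = (hi - lo) / m_bins                       # bin width
        centers = lo + (np.arange(m_bins) + 0.5) * h
        # GNF-style membership of each point in each bin (alpha = alpha_coef * h)
        mu = np.exp(-np.abs((v[:, None] - centers) / (alpha_coef * h)) ** beta)
        return mu / mu.sum(axis=1, keepdims=True)    # each point contributes mass 1

    mx = memberships(np.asarray(x, dtype=float))
    my = memberships(np.asarray(y, dtype=float))
    pxy = mx.T @ my / len(x)                         # joint pmf over bin pairs
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(fuzzy_hist_mi(x, x))                        # strong dependence: large MI
print(fuzzy_hist_mi(x, rng.normal(size=2000)))    # independence: near zero
```

With M bins there are only M² joint cells to process, regardless of N, which is the source of the speed advantage over the KDE noted above.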
Fig. 12 The average of the estimated mutual information (over 50 trials) as a function of the exact MI for data with: a bivariate normal distribution, b bivariate gamma-exponential, and c bivariate ordered Weinman distribution. The ideal estimation occurs when Î = Iexact. "FH" and "H" denote the fuzzy histogram and the histogram, respectively
4.4 Bias of the fuzzy-histogram estimator
Ideally, the bias of an estimator is zero. To investigate the bias of the MI estimator, we plot the average of the MI estimate (over trials) as a function of Iexact. In the ideal case, E[Î] as a function of the true value of I is a straight line; the deviation from this line is interpreted as the bias of the estimator.
Here, for the three distributions mentioned in Sect. 4.2, the averages of the histogram and fuzzy-histogram MI estimates (over trials) as a function of Iexact are plotted in Fig. 12, for different values of N and the number of bins. The number of trials in all experiments is 50. For the fuzzy method, β was set to the appropriate values found in Sect. 4.2.1, and α was set to h/2.
When the exact MI is close to zero, both estimators overestimate it. However, the overestimation of the fuzzy-histogram estimator is smaller than that of the histogram estimator. As mentioned in Sect. 2.2, this overestimation is called the N-bias and is caused by the finite sample size. Hence, the N-bias of the fuzzy-histogram estimator is smaller. Moreover, increasing the sample size and decreasing the number of bins reduce the overestimation of both estimators.
When the exact MI becomes larger, underestimation (R-bias) appears. In both estimators, increasing the number of bins decreases the underestimation. Hence, there is a trade-off in the number of bins: increasing it reduces the underestimation (R-bias) but increases the N-bias.
For the gamma-exponential distribution, the R-bias of the fuzzy-histogram estimator is smaller than that of the histogram estimator. However, for the normal and ordered Weinman exponential distributions, the R-bias (underestimation) of the fuzzy-histogram estimator is larger than that of the histogram estimator.
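The N-bias at Iexact ≈ 0 is easy to reproduce numerically: the plug-in histogram estimate of the MI of two independent samples is strictly positive for any finite N and shrinks as N grows. The sketch below uses a plain crisp histogram for brevity; the fuzzy variant shows the same qualitative behavior with a smaller overestimation.

```python
import numpy as np

def crisp_hist_mi(x, y, m_bins=10):
    """Naive (crisp) histogram plug-in MI estimate in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=m_bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(1)
for n in (100, 1000, 10000):
    x, y = rng.normal(size=n), rng.normal(size=n)  # independent: exact MI = 0
    print(n, crisp_hist_mi(x, y))                  # positive, shrinking with n
```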
4.5 Variance of the fuzzy-histogram estimator
In this part, the variances of the fuzzy-histogram and histogram MI estimators are compared. The number of trials in all experiments is 50. For the fuzzy method, β was set to the appropriate values found in Sect. 4.2.1, and α was set to h/2.
Figure 13 shows the variances of the two estimators for different parameters. As the graphs show, in most cases the variance of the fuzzy-histogram estimator is less than or equal to that of the histogram estimator. Moreover, the variance of both estimators decreases as the number of samples increases. This is consistent with Eq. 4, which indicates that the variance of the histogram estimator is inversely proportional to N.
4.6 Mutual information and different degrees of dependencies
This experiment compares different dependency measures over data with different degrees of dependency. In other words, various dependency measures between X and Xd are compared, for d = 1, 2, ..., 10. These measures include the histogram MI estimator, the fuzzy-histogram MI estimator, the Pearson correlation coefficient, and the non-linear correlation coefficient (NCC) proposed by Wang et al. (2005).
Here, to compare the different dependency measures, the normalized version of each measure is used. The range of the Pearson correlation coefficient is [−1, 1], and the range of the NCC is [0, 1]; hence, these two measures do not require normalization. There are different ways to normalize the MI; here, Eq. 24 is used.
NI = I(X, Y) / H(X, Y)    (24)
Data X = {xi}_{i=1}^{N}, consisting of N = 500 points, is chosen uniformly at random from [−10, 10]. The datasets X1, X2, ..., X10 were obtained by raising the xi's to the power d (Xd = {xi^d}_{i=1}^{N}). The experiment was repeated 100 times with independent realizations of X.
Fig. 13 The standard deviations (over 50 trials) of the estimated mutual information for data with: a bivariate normal distribution, b bivariate gamma-exponential, and c bivariate ordered Weinman exponential distribution [panels for N = 100, 500, and 1,000 and several numbers of bins]
Fig. 14 The dependency measures for the data with different degrees of dependency. a Dependency between X and Xd, b dependency between |X| and |Xd|
The estimated dependency measures between X and Xd are reported in Fig. 14a. The values were averaged over 100 trials and the error bars show the standard deviations. The number of bins M = MX = MY was identical for the NCC, fuzzy-histogram, and histogram methods and was equal to 10. For the fuzzy-histogram approach, the GNF membership functions were used with β = 2 and α = h/2.
Figure 14a shows that the Pearson correlation coefficient is almost zero for the even degrees, and it does not reveal any dependency between X and Xd when d = 2k (k ∈ N). The NCC measure is identically equal to one for the even degrees and equals 0.62 for the odd degrees; thus, this measure cannot distinguish between the different degrees of dependency. However, the normalized MI (both the fuzzy-histogram and the histogram estimates) can indicate the dependencies between X and Xd for all d.
For further investigation, the experiment was repeated for |X| and |Xd|; the results are shown in Fig. 14b. Here, the Pearson correlation and the NCC of the even degrees are not zero. However, the NCC still cannot distinguish between different degrees of dependency, because it uses the rank orders of the variables instead of the original data. Moreover, in this case, the behavior of the correlation coefficient and the normalized MI are similar to each other.
This experiment demonstrates the advantage of MI as a measure of dependency between variables: MI can indicate dependency in cases where the NCC and the correlation coefficient cannot.
5 Application in gene expression
To test the fuzzy-histogram estimator of MI in a real-world application, gene-expression data were selected. These data were previously used to study different MI estimators by Steuer et al. (2002) and Kraskov et al. (2004). Details about the dataset are available in Hughes et al. (2000), and the data can be downloaded from Hughes (2012). The dataset includes ≈300 vectors in a high-dimensional space; however, because of missing values, the number of simultaneous pairs is less than 300. Each point corresponds to one genome and each dimension to one open reading frame (ORF). The MI estimation methods are investigated on four ORF pairs, A to D, which are shown in Fig. 15a.

Fig. 15 a Simultaneous measurement of gene-expression data. Each point denotes the values of two ORFs. b The rank representation of the datasets A to D. Each point is replaced by its rank-order
As can be seen in Fig. 15a, the four examples have various degrees of dependency. In example B, a strong linear correlation can be detected by eye; however, the relations in examples A, C, and D are not easily detected visually.
Since the data have large fluctuations and isolated data-points, their ranks, rather than the raw values, are used for estimating the MI (similar to Steuer et al. 2002; Kraskov et al. 2004). In other words, each point (xi, yi) is replaced by the rank order (rank(xi), rank(yi)). The data are then homogeneously distributed on the xy-plane, and the correlation between the variables of each dataset is preserved. Figure 15b shows the rank orders of examples A to D.
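The rank-order transform can be written in two lines (argsort of argsort); ties are broken arbitrarily here, which is adequate for continuous expression values. After the transform both marginals are uniform over the ranks, so the histogram bins are homogeneously populated.

```python
import numpy as np

def rank_order(v):
    """Replace each value by its rank (0 .. N-1), preserving the ordering."""
    return np.argsort(np.argsort(v))

x = np.array([0.31, -0.12, 0.05, 0.88])
print(rank_order(x))  # [2 0 1 3]
```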
In this experiment, we compare the fuzzy-histogram MI estimator, the histogram estimator, and the KDE on the four examples, and investigate which of them can indicate the dependency between the variables. To interpret the results, a significance test is required: a null-hypothesis is set, and we test whether it is consistent with the data. Here, the null-hypothesis is that X and Y are independent. If the observed MI is not consistent with the null-hypothesis, we can assert that X and Y are dependent and reject the null-hypothesis.
To test the null-hypothesis, an ensemble of surrogate datasets Xs, Ys consistent with the null-hypothesis is generated, using the constrained-realization technique: the surrogate datasets are created by randomly permuting the original data X and Y.
In this test, the mean and standard deviation of the MI estimates over the surrogate datasets are calculated. The surrogate significance S is obtained by

S = ( I(X, Y)data − ⟨I(X, Y)⟩surrogate ) / σsurrogate,    (25)

where I(X, Y)data is the estimated MI of the original data and ⟨I(X, Y)⟩surrogate is the average of the estimated MI over the ensemble of surrogate datasets. If |S| ≥ 2.6, the null-hypothesis is rejected at a significance level of 99 %.
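The surrogate test of Eq. 25 can be sketched as follows. Any MI estimator can be plugged in; here a plain crisp-histogram estimator stands in for the fuzzy-histogram one. Permuting one variable destroys the dependence while preserving both marginals, which is exactly the constrained-realization idea.

```python
import numpy as np

def hist_mi(x, y, m_bins=9):
    """Crisp-histogram plug-in MI estimate (stand-in for any MI estimator)."""
    pxy, _, _ = np.histogram2d(x, y, bins=m_bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def surrogate_significance(x, y, estimator=hist_mi, n_surrogates=300, seed=0):
    """S = (I_data - <I>_surrogate) / sigma_surrogate  (Eq. 25)."""
    rng = np.random.default_rng(seed)
    i_data = estimator(x, y)
    i_surr = np.array([estimator(x, rng.permutation(y))
                       for _ in range(n_surrogates)])
    return (i_data - i_surr.mean()) / i_surr.std()

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = x + 0.5 * rng.normal(size=300)      # dependent pair
print(surrogate_significance(x, y))     # |S| >> 2.6: reject independence
```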
Fig. 16 The estimated MI for the (rank-ordered) datasets A to D. The dashed lines denote the average MI obtained from an ensemble of 300 surrogates. The error bars indicate the standard deviation. The isolated dots denote the estimated mutual information of the original data, I(X, Y)data [panels: fuzzy-histogram and histogram estimates versus the number of bins (MX = MY), and KDE versus the smoothing parameter h]
Figure 16 shows the average MI of an ensemble of 300 surrogates for the fuzzy-histogram estimator, the histogram estimator, and the KDE. The error bars show the standard deviation σsurrogate, and the separate points denote the MI of the original data. For the fuzzy estimation method, the GNF membership functions were used with β = 1 and α = h/2 (see Eq. 23).
Furthermore, Table 6 reports the absolute values of the significance S for the histogram MI estimator, the fuzzy-histogram MI estimator with different β's, and the KDE. As can be seen in the table, increasing β decreases the significance of the fuzzy estimator, which tends to the significance of the histogram estimator. The reason is that as β increases, the fuzzy membership functions approach the crisp membership functions.
Based on the experiments, the fuzzy-histogram MI estimator with the GNF membership functions with β = 1, as well as the KDE, could reject the null-hypothesis at the 99 % significance level. In other words, the fuzzy-histogram MI estimator and the KDE showed that all four pairs are dependent, while the naïve histogram-based estimator could reveal the dependency only for the pairs B and D.
Additionally, Table 7 reports the averages and standard deviations of the running times of the three estimation methods for an ensemble of 300 surrogates. Not only is the fuzzy-histogram MI estimator able to indicate that all four pairs are dependent, but its computational load is also significantly less than that of the KDE.
Table 6 The absolute values of the significance S (see Eq. 25) of the histogram MI estimator, the fuzzy-histogram MI estimator with different β’s, and the KDE
Data Method Number of Bins
7 8 9 10 11 12
A Histogram 3.73 2.08 2.38 1.55 2.01 1.90
Fuzzy Histogram (β = 1) 5.68 5.62 5.17 5.85 5.69 5.01
Fuzzy Histogram (β = 2) 4.17 4.57 3.66 4.17 3.66 3.09
Fuzzy Histogram (β = 3) 3.58 3.41 2.93 3.57 2.91 2.59
Fuzzy Histogram (β = 4) 3.18 3.01 2.47 3.23 2.64 2.14
B Histogram 34.22 28.78 25.71 25.87 22.88 19.84
Fuzzy Histogram (β = 1) 67.71 67.42 58.12 67.87 65.53 56.72
Fuzzy Histogram (β = 2) 53.24 44.01 40.60 40.72 38.75 36.79
Fuzzy Histogram (β = 3) 45.90 40.34 35.11 33.36 32.73 28.93
Fuzzy Histogram (β = 4) 38.60 34.82 31.95 29.52 27.22 27.29
C Histogram 2.09 1.82 0.68 0.65 2.50 1.16
Fuzzy Histogram (β = 1) 3.87 3.21 3.24 2.65 2.70 3.15
Fuzzy Histogram (β = 2) 2.60 1.92 1.91 1.11 1.76 1.52
Fuzzy Histogram (β = 3) 1.59 1.78 2.00 1.02 1.33 1.07
Fuzzy Histogram (β = 4) 1.32 1.27 2.30 0.76 0.99 1.03
D Histogram 5.61 3.70 3.75 3.11 2.95 2.35
Fuzzy Histogram (β = 1) 10.95 10.69 10.60 10.25 8.51 10.10
Fuzzy Histogram (β = 2) 7.93 7.00 7.00 6.49 5.06 5.07
Fuzzy Histogram (β = 3) 7.26 6.40 5.64 5.28 4.28 4.49
Fuzzy Histogram (β = 4) 6.49 6.16 4.17 5.13 4.09 3.78
Data Method h
0.2 0.3 0.4 0.5 0.6 0.7
A Kernel density estimation 6.57 7.14 7.06 7.74 8.84 9.48
B Kernel density estimation 41.19 47.51 58.31 59.35 63.36 67.95
C Kernel density estimation 3.60 4.68 5.79 6.46 6.68 6.58
D Kernel density estimation 10.02 10.64 10.77 11.19 11.67 11.81
6 Conclusion
Mutual information (MI) is an ideal measure of statistical dependence. In many applications, we need to estimate the MI from samples. There are several ways to estimate the MI from sample data. Among these methods, histogram-based estimation is very popular, since it is simple and efficient.
In this paper, an MI histogram estimator with fuzzy partitioning was introduced. We utilized a general form of membership function that covers a wide range of membership functions. By changing the parameter β of this general function, the membership functions tend from fuzzy membership functions towards crisp membership functions. In the experiments, we showed that, on average, the fuzzy-histogram method provides a more accurate estimation of the MI than the naïve histogram method. The effects of the parameters of the fuzzy histogram were examined for datasets sampled from three distributions. The average absolute error is minimized for β equal to 1 or 2, depending on the distribution. In both cases (β = 1 or 2), the membership functions are far from crisp. Hence, the experiments showed that fuzzification improves the histogram method for estimating MI.

Table 7 The running times of the estimation methods (in seconds) for an ensemble of 300 surrogates

Kernel density estimation
Data h = 0.4 h = 0.5 h = 0.6
A 1.0912 ± 0.041 1.131 ± 0.009 1.152 ± 0.019
B 1.182 ± 0.014 1.218 ± 0.074 1.156 ± 0.023
C 1.180 ± 0.031 1.174 ± 0.029 1.164 ± 0.010
D 1.237 ± 0.040 1.261 ± 0.056 1.243 ± 0.031

Fuzzy histogram estimation
Data MX = MY = 9 MX = MY = 10 MX = MY = 11
A 0.00061 ± 0.00027 0.00063 ± 0.00016 0.00071 ± 0.00032
B 0.00062 ± 0.00031 0.00068 ± 0.00025 0.00073 ± 0.00027
C 0.00067 ± 0.00034 0.00070 ± 0.00041 0.00075 ± 0.00036
D 0.00064 ± 0.00019 0.00073 ± 0.00031 0.00081 ± 0.00025

Histogram estimation
Data MX = MY = 9 MX = MY = 10 MX = MY = 11
A 0.00024 ± 0.00012 0.00025 ± 0.00013 0.00027 ± 0.00019
B 0.00021 ± 0.00014 0.00025 ± 0.00033 0.00028 ± 0.00020
C 0.00020 ± 0.00009 0.00022 ± 0.00011 0.000023 ± 0.00015
D 0.00022 ± 0.00013 0.00024 ± 0.00012 0.00029 ± 0.00019
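As a concrete illustration of the construction, the sketch below builds the joint histogram from fuzzy memberships rather than crisp counts. It is a minimal sketch under stated assumptions: it uses a triangular fuzzy partition (a simple fuzzy partition chosen for brevity; the paper's GNF membership functions with parameter β are not reproduced here), and all names are hypothetical.

```python
import numpy as np

def triangular_memberships(x, centers):
    """Membership of each sample in each bin of a triangular fuzzy
    partition over the bin centers (memberships sum to 1 per sample)."""
    m = np.zeros((len(x), len(centers)))
    w = centers[1] - centers[0]          # uniform bin spacing assumed
    for j, c in enumerate(centers):
        m[:, j] = np.clip(1.0 - np.abs(x - c) / w, 0.0, 1.0)
    return m / m.sum(axis=1, keepdims=True)

def fuzzy_mi(x, y, n_bins=10):
    """Fuzzy-histogram MI estimate (in nats): the joint histogram is
    accumulated from products of fuzzy memberships, so each sample
    spreads fractionally over neighboring bins."""
    cx = np.linspace(x.min(), x.max(), n_bins)
    cy = np.linspace(y.min(), y.max(), n_bins)
    mx = triangular_memberships(x, cx)
    my = triangular_memberships(y, cy)
    pxy = mx.T @ my                      # fuzzy joint counts
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)  # rho = 0.8; true MI = -0.5 ln(0.36) ≈ 0.51 nats
mi_est = fuzzy_mi(x, y)
print(mi_est)
```

Binning and smoothing make the estimate land somewhat below the continuous-distribution value, but it remains close for moderate bin counts, while an independent pair yields an estimate near zero.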
Based on our experimental results, the accuracy of the fuzzy-histogram method is comparable to that of the KDE, and in many cases better. Moreover, its computational load is significantly less than that of the KDE.
Two important properties of an estimator are its bias and variance. As the experiments demonstrated, our method decreases the N-bias of the histogram MI estimator. Moreover, in most cases, the variance of the fuzzy-histogram estimator is less than the variance of the histogram-based estimator.
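For reference, the leading systematic (N-)bias term reported in the literature for the naïve histogram MI estimator (Moddemeijer 1989; Steuer et al. 2002) is approximately

E[Î] − I ≈ (MX − 1)(MY − 1) / (2N)

for an MX × MY partition of N samples: the bias vanishes as 1/N but grows with the number of bins, which is why bias reduction matters most for fine partitions and small samples.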
Another important advantage of the fuzzy-histogram MI estimator is that, compared with the histogram estimator, it provides a good estimation even when only a few samples are available. This is important because in many applications there are not enough samples for estimating the MI reliably.
To test the fuzzy-histogram-based MI estimator on real-world data, it was applied to a gene-expression application. The fuzzy-histogram-based estimator could reveal the dependency between two pairs of variables for which the histogram-based estimator could not.
References
Ang, K. K., Chin, Z. Y., Zhang, H., & Guan, C. (2012). Mutual information-based selection of optimal spatial-temporal patterns for single-trial EEG-based BCIs. Pattern Recognition, 45(6), 2137–2144.
Crouzet, J. F., & Strauss, O. (2011). Interval-valued probability density estimation based on quasi-continuous histograms: Proof of the conjecture. Fuzzy Sets and Systems, 183(1), 92–100.
Darbellay, G. (2000). Entropy expressions for multivariate continuous distributions. IEEE Transactions on Information Theory, 46(2), 709–712.
Darbellay, G. A., & Vajda, I. (1999). Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45(4), 1315–1321.
Hughes, T. R. (2012). Supplementary data file of gene expression. http://hugheslab.ccbr.utoronto.ca/supplementary-data/rii/. [Online; Accessed 20 Dec 2012].
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., et al. (2000). Functional discovery via a compendium of expression profiles. Cell, 102(1), 109–126.
Karasuyama, M., & Sugiyama, M. (2012). Canonical dependency analysis based on squared-loss mutual information. Neural Networks, 34, 46–55.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
Loquin, K., & Strauss, O. (2006). Fuzzy histograms and density estimation. In J. Lawry, E. Miranda, A. Bugarin, S. Li, M. A. Gil, P. Grzegorzewski, & O. Hryniewicz (Eds.), Soft methods for integrated uncertainty modelling, Advances in Soft Computing, vol. 37 (pp. 45–52). Berlin, Heidelberg: Springer.
Loquin, K., & Strauss, O. (2008). Histogram density estimators based upon a fuzzy partition. Statistics and Probability Letters, 78(13), 1863–1868.
Moddemeijer, R. (1989). On estimation of entropy and mutual information of continuous distributions. Signal Processing, 16(3), 233–248.
Moon, Y. I., Rajagopalan, B., & Lall, U. (1995). Estimation of mutual information using kernel density estimators. Physical Review E, 52(3), 2318–2321.
Schaffernicht, E., Kaltenhaeuser, R., Verma, S., & Gross, H. M. (2010). On estimating mutual information for feature selection. Artificial Neural Networks-ICANN 2010, 362–367.
Steuer, R., Kurths, J., Daub, C. O., Weise, J., & Selbig, J. (2002). The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl 2), S231–S240.
Tenekedjiev, K., & Nikolova, N. (2008). Justification and numerical realization of the uniform method for finding point estimates of interval elicited scaling constants. Fuzzy Optimization and Decision Making, 7(2), 119–145.
Wang, Q., Shen, Y., & Zhang, J. Q. (2005). A nonlinear correlation measure for multivariable data set. Physica D: Nonlinear Phenomena, 200(3–4), 287–295.
Zografos, K., & Nadarajah, S. (2005). Expressions for Rényi and Shannon entropies for multivariate distributions. Statistics and Probability Letters, 71(1), 71–84.