
Fuzzy Optim Decis Making, DOI 10.1007/s10700-014-9178-0

Estimation of mutual information by the fuzzy histogram

Maryam Amir Haeri · Mohammad Mehdi Ebadzadeh

© Springer Science+Business Media New York 2014

Abstract Mutual Information (MI) is an important dependency measure between random variables, due to its tight connection with information theory. It has numerous applications, both in theory and practice. However, when employed in practice, it is often necessary to estimate the MI from available data. There are several methods to approximate the MI, but arguably one of the simplest and most widespread techniques is the histogram-based approach. This paper suggests the use of fuzzy partitioning for histogram-based MI estimation. It uses a general form of fuzzy membership functions, which includes the class of crisp membership functions as a special case. It is accordingly shown that the average absolute error of the fuzzy-histogram method is less than that of the naïve histogram method. Moreover, the accuracy of our technique is comparable, and in some cases superior, to the accuracy of the kernel density estimation (KDE) method, which is one of the best MI estimation methods. Furthermore, the computational cost of our technique is significantly less than that of the KDE. The new estimation method is investigated from different aspects, such as average error, bias and variance. Moreover, we explore the usefulness of the fuzzy-histogram MI estimator in a real-world bioinformatics application. Our experiments show that, in contrast to the naïve histogram MI estimator, the fuzzy-histogram MI estimator is able to reveal all dependencies between the gene-expression data.

Keywords Mutual Information · Information Theory · Estimation · Fuzzy Mutual Information · Fuzzy Histogram

M. Amir Haeri · M. M. Ebadzadeh (B)
Department of Computer Engineering and Information Technology,
Amirkabir University of Technology, Tehran, Iran
e-mail: [email protected]

M. Amir Haeri
e-mail: [email protected]


1 Introduction

Finding dependencies between random variables is an important task in many problems (Karasuyama and Sugiyama 2012; Ang et al. 2012; Steuer et al. 2002; Tenekedjiev and Nikolova 2008), such as independent component analysis and feature selection. There are several measures that quantify the linear dependency between random variables, such as the Pearson correlation coefficient and the Spearman correlation coefficient. Such measures are not sufficient to quantify the general dependency between two random variables. On the other hand, mutual information (MI) provides a general dependency measure between two random variables: MI can measure any kind of relationship between the random variables.

The MI of two random variables depends on their distributions. However, most of the time, it is required to find the MI of two variables whose distributions are unknown, and only some samples from them are available. To estimate the MI, one has to estimate the entropies or probability density functions (pdf's) from the data samples.

There are several methods to estimate the MI from finite samples. The most popular method for MI estimation is the histogram-based method (Moddemeijer 1989), which partitions the space into several bins, and counts the number of elements in each bin. This method is very easy and efficient from the computational point of view. However, the approximation given by the counting process is discontinuous, and the estimation is very sensitive to the number of bins (Loquin and Strauss 2006; Schaffernicht et al. 2010).

Moon et al. (1995) presented another MI estimation approach called kernel density estimation (KDE). KDE utilizes kernels to approximate pdf's: probability density functions can be estimated by the superposition of a set of kernel functions. In general, KDE provides a high-quality estimation for MI. However, it is very time-consuming and computationally intensive (Steuer et al. 2002; Kraskov et al. 2004; Loquin and Strauss 2006).

Kraskov et al. (2004) suggested the K-nearest neighbors (KNN) method to estimate the MI. This method is based on estimating entropies from KNN distances.

Another method of estimating the MI is adaptive partitioning, which was introduced by Darbellay and Vajda (1999). This method is based on the histogram approach, but it is not parametric. In their approach, the partition is refined until conditional independence is achieved within the bins.

Wang et al. (2005) suggested a nonlinear correlation measure called the nonlinear correlation coefficient (NCC). Their measure is based on the MI carried by the rank sequences of the original data. Unlike the mutual information, NCC takes values from the closed interval [0,1].

The accumulation process in the histogram-based method depends on the answer to the question "whether a sample x belongs to the bin a_i or not." However, because of the vagueness in the boundaries of histogram bins, it is not possible to answer this question exactly. A reasonable solution to overcome this problem is to use fuzziness in the partitioning (Loquin and Strauss 2006; Crouzet and Strauss 2011).

Loquin and Strauss (2006, 2008) suggested a histogram density estimator based on a fuzzy partition. They proved the consistency of this estimator based on the mean square error (MSE) (Loquin and Strauss 2008). Moreover, they showed that the main advantage of this estimator is the enhancement of the robustness of the histogram density estimator with respect to arbitrary partitioning. Since the MI of two variables is a function of the density of the variables, using a histogram estimator based on fuzzy partitioning (fuzzy histogram) can improve the histogram MI estimator.

This paper introduces the fuzzy-histogram mutual-information estimator. The fuzzy-histogram MI estimator uses fuzzy partitioning. We consider a general form of fuzzy membership functions whose shapes are controlled by a parameter. By increasing this parameter, the membership functions tend from fuzzy towards crisp. Using these general membership functions, it is demonstrated that the histogram method with fuzzy partitioning outperforms the naïve histogram method (based on the average absolute error).

The rest of this paper is organized as follows: Section 2 is dedicated to the histogram MI estimator. In Sect. 3 the fuzzy-histogram method for estimating MI is introduced. Section 4 investigates different aspects of the fuzzy-histogram MI estimator. Section 5 compares the fuzzy-histogram MI estimator, the histogram MI estimator, and KDE in a bioinformatics application. Finally, Sect. 6 concludes the paper.

2 Preliminaries

In this section, the histogram MI estimation method is explained briefly. Moreover, the bias and variance of this method are studied.

2.1 The histogram MI estimator

The MI of two continuous random variables X and Y is defined as

I(X, Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy.

Here, p(x, y) is the joint pdf of X and Y, and p(x) and p(y) are the marginal pdf's of X and Y, respectively.

Suppose that we have N simultaneous samples of X and Y. To estimate the MI by the histogram method, the variable X is partitioned into M_X bins, and the variable Y is partitioned into M_Y bins. We call the i-th bin of X a_i (where 1 ≤ i ≤ M_X), and the j-th bin of Y b_j (where 1 ≤ j ≤ M_Y). Furthermore, p(a_i) is defined as the probability of observing a sample of X in the bin a_i. The probability p(a_i) is estimated by the relative frequency of samples of X observed in the bin a_i, and is equal to k_i / N, where k_i is the number of samples of X observed in the bin a_i. Moreover, p(a_i, b_j) is defined as the probability of observing a sample of (X, Y) in the bin (a_i, b_j) (i.e. x lies in the bin a_i and y in the bin b_j), and is approximated by k_ij / N, where k_ij is the number of samples observed in the bin (a_i, b_j). Hence, the MI of X and Y is estimated as follows:

\hat{I}(X, Y) = \sum_{i, j} \frac{k_{ij}}{N} \log \frac{k_{ij} N}{k_i k_j}. \quad (1)
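As a concrete illustration, the following is a minimal NumPy sketch of this crisp estimator; the function name, default bin counts, and the use of natural logarithms are our own illustrative choices, not code from the paper:

```python
import numpy as np

def histogram_mi(x, y, m_x=10, m_y=10):
    """Naive (crisp) histogram MI estimator of Eq. 1 (illustrative sketch)."""
    n = len(x)
    # k_ij: joint bin counts; k_i, k_j: marginal bin counts
    k_ij, _, _ = np.histogram2d(x, y, bins=(m_x, m_y))
    k_i = k_ij.sum(axis=1)
    k_j = k_ij.sum(axis=0)
    # sum only over occupied cells (k_ij > 0), as empty cells contribute nothing
    i, j = np.nonzero(k_ij)
    return np.sum(k_ij[i, j] / n * np.log(n * k_ij[i, j] / (k_i[i] * k_j[j])))
```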


2.2 Bias of the histogram MI estimator

Based on Moddemeijer (1989), the histogram-based estimator is a biased estimator. The total bias is the sum of the N-bias and the R-bias. Here, these two types of bias are explained briefly.

– N-bias: This bias is caused by the finite sample size and depends on the sample size N. When the MI I(X, Y) is estimated from a finite sample of size N by the histogram estimator, the N-bias is as follows (Moddemeijer 1989):

\Delta I(X, Y)_{N\text{-bias}} = \frac{M_X M_Y - M_X - M_Y + 1}{2N}, \quad (2)

where M_X and M_Y are the numbers of histogram bins. Note that the N-bias does not depend on the probability distribution of the variables; it depends only on the sample size and the number of bins. According to Eq. 2, when N tends to infinity, the N-bias tends to zero.

– R-bias: Insufficient representation of the probability density function (pdf) by the histogram method leads to the R-bias. The R-bias is specified by the estimation method and the pdf's of the variables, and it is caused by two separate sources: (1) the limited integration area, and (2) the finite resolution. Moddemeijer (1989) showed that the bias caused by the limited integration area is negligible in comparison with the bias caused by the finite resolution, and it can be ignored. For the histogram MI estimator, he demonstrated that the R-bias caused by the finite resolution is as follows:

\Delta I(X, Y)_{R\text{-bias}} = \int_{-\infty}^{+\infty} \frac{1}{24\, p(x)} \left( \frac{\partial p(x)}{\partial x} \right)^{2} (\Delta x)^{2} \, dx + \int_{-\infty}^{+\infty} \frac{1}{24\, p(y)} \left( \frac{\partial p(y)}{\partial y} \right)^{2} (\Delta y)^{2} \, dy - \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \frac{1}{24\, p(x, y)} \left[ \left( \frac{\partial p(x, y)}{\partial x} \right)^{2} (\Delta x)^{2} + \left( \frac{\partial p(x, y)}{\partial y} \right)^{2} (\Delta y)^{2} \right] dx \, dy. \quad (3)

The integrals of Eq. 3 measure the smoothness of the probability density functions. When the pdf's are smooth, the first derivatives are approximately zero; hence, the squared derivatives are almost zero and the R-bias is minimized.

The N-bias of the histogram MI estimator leads to overestimation, and the R-bias leads to underestimation. By increasing the number of bins (decreasing the bin length), the R-bias is decreased and the N-bias is increased. Hence, the number of bins makes a trade-off between the N-bias and the R-bias.


2.3 Variance of the histogram MI estimator

A good estimator must have a low variance. The variance of the histogram estimator of the MI can be written as follows (Moddemeijer 1989):

\mathrm{VAR}[\hat{I}(X, Y)] \approx \frac{1}{N} \, \mathrm{VAR}\!\left[ \log \frac{p(x, y)}{p(x)\, p(y)} \right], \quad (4)

where x and y are the vectors of N simultaneous samples of X and Y. The variance of the histogram MI estimator is approximately independent of the cell sizes, except in the following cases: (1) the number of bins is one, or (2) the number of bins tends to infinity. In these cases, the variance is equal to zero.

3 Fuzzy-histogram method for estimating mutual information

In this section, we present the fuzzy-histogram MI estimator. As mentioned in the introduction, Loquin and Strauss (2008) showed that the fuzzy-histogram density estimator can improve the robustness of the histogram density estimator. Since estimating the MI depends on the estimation of the probabilities p(x), p(y) and p(x, y), utilizing fuzzy partitioning can improve the histogram MI estimator. In this section, a general form for the membership functions is suggested. The shape of the membership functions is controlled by a parameter called β. As β tends to infinity, the membership functions tend to crisp functions. By using this general form, it is possible to test whether fuzzification can improve the histogram method, and which membership functions are better for estimating the MI by the fuzzy-histogram method.

3.1 Fuzzy partitioning

Let D = [a, b] ⊂ R be the domain of variable X. We want to partition D as follows. Let γ_1 < γ_2 < · · · < γ_n be n ≥ 3 points of D such that γ_1 = a and γ_n = b. Let the length of the bins be equal to h; therefore, γ_k = a + (k − 1)h. Now define two other points γ_0 = a − h and γ_{n+1} = b + h, and consider the extended domain D′ = [γ_0, γ_{n+1}] ⊂ R. Define n fuzzy subsets A_1, A_2, ..., A_n on the extended domain D′, with membership functions μ_{A_1}(x), μ_{A_2}(x), ..., μ_{A_n}(x). These fuzzy sets should satisfy the following properties:

1. μ_{A_k}(γ_k) = 1;
2. μ_{A_k}(x) monotonically increases on [γ_0, γ_k], and monotonically decreases on [γ_k, γ_{n+1}];
3. ∀x ∈ D′, ∃k such that μ_{A_k}(x) > 0.

Some examples of membership functions with the mentioned properties are listed below:

– the crisp partition: K_A(x) = 1_{[−1/2, 1/2]}(x),
– the cosine partition: (1/2)(cos(πx) + 1) · 1_{[−1,1]}(x),
– the triangular partition: (1 − |x|) · 1_{[−1,1]}(x),
– the generalized normal partition, as described next.


Fig. 1 Fuzzy partitioning with triangular membership functions. Here, the number of bins is equal to six

Fig. 2 a Generalized normal function (GNF) for β = 1, 2, 4, 8, 10; b normalized GNF, which can be used as a membership function. Here, α = 2


Figure 1 illustrates a fuzzy partitioning of the interval [−10, 10] with triangular membership functions.

A good choice for the membership functions is the generalized normal function (GNF), which provides a general form for the membership functions.

The GNF is a parametric continuous function: it is the pdf of the generalized normal distribution. This type of function adds a shape parameter to the normal function. The formula of the GNF is as follows:

f(x) = \frac{\beta}{2 \alpha \Gamma(1/\beta)} \, e^{-\left( |x - \mu| / \alpha \right)^{\beta}}, \quad (5)

where Γ is the gamma function (Γ(z) = \int_0^{\infty} e^{-t} t^{z-1} \, dt).

By changing the parameter β of this function, its shape is changed. When β = 2, the shape of the function is like the pdf of the normal distribution. Furthermore, when β = 1 it is like the pdf of the Laplace distribution, and its shape is similar to the triangular function. By increasing the parameter β, the shape of the function gradually becomes similar to the pulse function. Moreover, α is the scale parameter.

Figure 2a demonstrates the GNF. To use the generalized normal function as a membership function, its outputs must be normalized over the interval [0,1]. Figure 2b shows the normalized GNF. As shown in the figure, the GNF is capable of generating a wide range of membership functions (triangular, normal, ..., crisp). For example, when the membership functions are GNF's with β ≥ 10, they are similar to crisp membership functions.

Fig. 3 Fuzzy partitioning using the GNF membership functions; the number of bins is equal to six. a β = 10, b β = 2

Figure 3 demonstrates two fuzzy partitionings of the interval [−10, 10] by the GNF with β = 10 and β = 2. Here, α is equal to h/2 = 2.
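A sketch of how such a partition can be built in Python, assuming the normalized GNF of Fig. 2b (i.e. Eq. 5 rescaled so that its peak value equals 1) and the α = h/2 convention examined later in Sect. 4.2.3; all names are illustrative:

```python
import numpy as np

def gnf_membership(x, center, alpha, beta):
    """Normalized GNF membership function: Eq. 5 divided by its peak value
    beta / (2 * alpha * Gamma(1/beta)), which leaves only the exponential."""
    return np.exp(-(np.abs(x - center) / alpha) ** beta)

def fuzzy_partition(x, a, b, m, beta):
    """Membership matrix mu[k, i] of sample x[k] in bin i (Sect. 3.1):
    m bin centers gamma_1 = a, ..., gamma_m = b spaced h apart, alpha = h/2."""
    x = np.asarray(x, dtype=float)
    h = (b - a) / (m - 1)             # bin width, gamma_k = a + (k - 1) h
    centers = a + h * np.arange(m)
    return gnf_membership(x[:, None], centers[None, :], h / 2.0, beta)
```

For large β (say β = 10), each column of this matrix approaches the indicator of a crisp bin, recovering the ordinary histogram partition.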

3.2 Estimating the mutual information

In estimating the MI by the fuzzy-histogram method, the probabilities are calculated differently from the crisp histogram method. Suppose that we have N simultaneous measurements of two continuous variables X and Y. The measurements of X and Y are partitioned into M_X and M_Y bins respectively, as described in Sect. 3.1. For each bin a_i belonging to X, a fuzzy membership function μ_{A_i}(x) is defined. Similarly, for each bin b_j belonging to Y, a fuzzy membership function μ_{B_j}(y) is considered. The MI of X and Y is given by I(X, Y) = H(X) + H(Y) − H(X, Y), where the entropies are estimated as follows:

H(X) = -\sum_{i=1}^{M_X} p(a_i) \log p(a_i), \quad (6)

H(Y) = -\sum_{j=1}^{M_Y} p(b_j) \log p(b_j), \quad (7)

H(X, Y) = -\sum_{i=1}^{M_X} \sum_{j=1}^{M_Y} p(a_i, b_j) \log p(a_i, b_j). \quad (8)

In the fuzzy-histogram approach, the probability of state (bin) a_i of data X is calculated as follows:

p(a_i) = \frac{M_{a_i}}{\sum_{l=1}^{M_X} M_{a_l}}, \quad (9)

where M_{a_i} is the sum of the membership values of the samples of X belonging to the fuzzy set A_i:

M_{a_i} = \sum_{k=1}^{N} \mu_{A_i}(x_k). \quad (10)

In the crisp histogram MI estimation method, the probability of observing a sample of X in the bin a_i is estimated by the relative frequency of samples of X observed in the bin a_i, which equals k_i / N. In the fuzzy-histogram method, this probability is estimated by the fraction in Eq. 9: the numerator is the sum of the membership values of the samples of X belonging to the fuzzy set A_i, and the denominator is the sum of the membership values of the samples of X belonging to all fuzzy sets. Hence, the crisp histogram is a special case of the fuzzy histogram, where the membership value of each sample belonging to a bin is either one or zero. Thus, in this case, \sum_{k=1}^{N} \mu_{A_i}(x_k) is equal to the frequency of samples of X observed in the bin a_i, and \sum_{l=1}^{M_X} M_{a_l} is equal to N. Therefore, Eq. 9 is equivalent to the relative frequency of the crisp case.

Similarly, the probability of state (bin) b_j of data Y and M_{b_j} are as follows:

p(b_j) = \frac{M_{b_j}}{\sum_{s=1}^{M_Y} M_{b_s}}, \quad (11)

M_{b_j} = \sum_{k=1}^{N} \mu_{B_j}(y_k). \quad (12)

In the crisp histogram method, p(a_i, b_j) is defined as the probability of observing a sample (x, y) of (X, Y) in the bin (a_i, b_j) (where x lies in the bin a_i and y in the bin b_j). Let k_ij be the number of samples which lie in the bin (a_i, b_j). Based on the frequentist approach to probability, p(a_i, b_j) ≈ k_ij / N.

Now denote by (A_i, B_j) the fuzzy set associated with the bin (a_i, b_j), and let μ_{(A_i, B_j)} be the membership function of (A_i, B_j). We use the product for defining this membership function; that is, for any data-point (x_k, y_k), we have μ_{(A_i, B_j)}(x_k, y_k) = μ_{A_i}(x_k) · μ_{B_j}(y_k).

Following the analogy of the crisp case, the frequentist approach suggests that the probability p(a_i, b_j) can be estimated by the relative sum of the membership values of the samples of (X, Y) belonging to the fuzzy set (A_i, B_j). Therefore, the joint probability of the bin (a_i, b_j) is computed by:

p(a_i, b_j) = \frac{M_{a_i b_j}}{\sum_{l=1}^{M_X} \sum_{s=1}^{M_Y} M_{a_l b_s}}, \quad (13)

where M_{a_i b_j} is obtained by the following equation:

M_{a_i b_j} = \sum_{k=1}^{N} \mu_{(A_i, B_j)}(x_k, y_k) = \sum_{k=1}^{N} \mu_{A_i}(x_k) \cdot \mu_{B_j}(y_k). \quad (14)

Using the sum-product instead of max-min or max-product is natural, due to the analogy between the fuzzy and crisp cases and the way one counts the data-points lying in each bin. In other words, the summation of membership values in the fuzzy method plays a role similar to counting the number of samples observed in the bin (a_i, b_j) in the crisp case. Additionally, as with p(a_i), the crisp histogram is a special case of the fuzzy histogram, because in the crisp case, if a sample belongs to a bin (a_i, b_j), its membership value is 1, and it is 0 otherwise. Thus, in the crisp case, p(a_i, b_j) as computed by Eq. 13 is equal to the relative frequency of samples belonging to the bin (a_i, b_j).
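Putting Eqs. 6-14 together, a compact sketch of the whole fuzzy-histogram MI estimator might look as follows; normalized GNF memberships with α = h/2 are assumed, as elsewhere in the paper, and the function names are our own illustrative choices:

```python
import numpy as np

def fuzzy_histogram_mi(x, y, m_x=10, m_y=10, beta=2.0):
    """Fuzzy-histogram MI estimator (illustrative sketch): bin probabilities
    from Eqs. 9-14, then MI via I = H(X) + H(Y) - H(X, Y) (Eqs. 6-8)."""
    def memberships(v, m):
        v = np.asarray(v, dtype=float)
        h = (v.max() - v.min()) / (m - 1)       # bin width, cf. Eq. 23
        centers = v.min() + h * np.arange(m)
        alpha = h / 2.0
        return np.exp(-(np.abs(v[:, None] - centers[None, :]) / alpha) ** beta)

    mu_x = memberships(x, m_x)                  # N x M_X membership values
    mu_y = memberships(y, m_y)                  # N x M_Y membership values
    p_x = mu_x.sum(axis=0); p_x /= p_x.sum()    # Eq. 9
    p_y = mu_y.sum(axis=0); p_y /= p_y.sum()    # Eq. 11
    p_xy = mu_x.T @ mu_y                        # M_{a_i b_j} of Eq. 14
    p_xy /= p_xy.sum()                          # Eq. 13

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
```

Note that the matrix product mu_x.T @ mu_y accumulates the per-sample products μ_{A_i}(x_k) · μ_{B_j}(y_k) over all k at once, which is exactly the sum in Eq. 14.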

4 Experimental results

In this section, the fuzzy-histogram MI estimator is investigated from different aspects. Firstly, in a simple experiment, the fuzzy-histogram MI estimator is compared with the histogram estimator over independent variables. In the second part, the effects of the parameters of the fuzzy-histogram estimator, including the shape of the membership functions and the number of bins, are investigated for three different distributions. In the third part, the accuracy and the running time of the fuzzy-histogram estimator are compared with those of the histogram estimator and the KDE. The fourth experiment compares the bias of the fuzzy-histogram estimator with that of the histogram estimator. The fifth experiment investigates the variance of the fuzzy-histogram estimator. The sixth experiment compares different dependency measures on data with different degrees of dependency. The final experiment is devoted to comparing the histogram estimator, the fuzzy-histogram estimator, and the KDE in a real-world application.

4.1 Mutual information of independent variables

In this simple experiment, the difference between the accuracy of the fuzzy-histogram MI estimator and the histogram estimator for independent variables is shown. We have two independent and uniformly distributed variables X and Y. Since X and Y are independent, the true value of their MI is zero: I_exact(X, Y) = 0. In Fig. 4, the two estimators of MI are compared for different sample sizes N and different numbers of bins M (M = M_X = M_Y). The experiment was repeated 1,000 times with independent realizations of X and Y. The estimates of MI reported in the graphs are averaged over these 1,000 trials, and the error bars show the standard deviations. In this experiment, the triangular membership function is used for partitioning (similar to Fig. 1). Both estimators overestimate the MI. However, the overestimation of the fuzzy-histogram method is less than that of the histogram method. For smaller N and more bins, the difference between the two methods is more considerable. Thus, the fuzzy-histogram estimator provides a more accurate estimate for independent variables, especially when the number of samples is small.

Fig. 4 The fuzzy-histogram estimation and histogram estimation of the MI of two independent variables, for 5, 10, and 15 bins

Additionally, the experimental overestimation of the histogram estimator is consistent with the theoretical N-bias. For example, when N = 100 and the number of bins is equal to 10, the theoretical overestimation is (note that M_X M_Y − M_X − M_Y + 1 = (M_X − 1)(M_Y − 1)):

E[\hat{I} - I_{exact}] = \frac{(10 - 1)(10 - 1)}{2 \cdot 100} \approx 0.41,

and the experimental N-bias for the histogram estimator is equal to 0.47. However, the N-bias of the fuzzy-histogram estimator is less: in this case (N = 100), it is equal to 0.18. In Sect. 4.4, more investigations are conducted on the bias of the fuzzy-histogram estimator.
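This consistency can be reproduced with a quick Monte Carlo check of the crisp estimator under the same setting (independent uniform samples, N = 100, M = 10); the snippet below is an illustrative sketch with an arbitrary seed, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 100, 10, 1000
estimates = []
for _ in range(trials):
    x, y = rng.uniform(size=n), rng.uniform(size=n)   # independent: I_exact = 0
    k_ij, _, _ = np.histogram2d(x, y, bins=m)
    k_i, k_j = k_ij.sum(axis=1), k_ij.sum(axis=0)
    i, j = np.nonzero(k_ij)
    estimates.append(np.sum(k_ij[i, j] / n * np.log(n * k_ij[i, j] / (k_i[i] * k_j[j]))))

print(np.mean(estimates))           # empirical N-bias of the crisp estimator
print((m - 1) ** 2 / (2 * n))       # theoretical N-bias from Eq. 2: 0.405
```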

4.2 The effects of the parameters of the fuzzy-histogram method

In this experiment, the effects of the parameters of the fuzzy-histogram MI estimator are investigated by experimental analysis for three distributions. To find an appropriate membership function, the GNF (Eq. 5) is used. As mentioned in Sect. 3, the GNF is a parametric continuous function (see Fig. 2). By changing the parameter β of this function, its shape is changed. When β = 2, the shape of the function is like the pdf of the normal distribution. Furthermore, when β = 1, it is like the pdf of the Laplace distribution and its shape is similar to the triangular function. By increasing the parameter β, the shape of the function gradually becomes similar to the pulse function. Moreover, α is the scale parameter.

Hence, the fuzzy-histogram estimator with the generalized normal membership function has three important parameters: β and α, which indicate the shape and scale of the membership functions, and the number of bins.

Here, the impacts of the parameters of the fuzzy-histogram MI estimator are investigated for data with three distributions: the bivariate normal, the bivariate gamma-exponential, and the bivariate ordered Weinman exponential distribution. In the following, these distributions and the exact MI between their variates are presented.

1. Bivariate Normal Distribution: the pdf of this distribution is as follows:

p(x, y) = \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \, e^{-\frac{z}{2 (1 - \rho^2)}}, \quad (15)

where z = \frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2 \rho (x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2} and ρ = corr(X, Y).


The exact MI between the variates X and Y of the bivariate normal distribution with correlation coefficient ρ is (Moddemeijer 1989):

I(X, Y) = \frac{1}{2} \log \left( \frac{1}{1 - \rho^2} \right). \quad (16)

2. Bivariate Gamma-Exponential Distribution: the pdf of this distribution is (Darbellay 2000; Zografos and Nadarajah 2005):

p(x, y) = \frac{\theta_1^{\theta_2} \theta_3}{\Gamma(\theta_2)} \, x^{\theta_2} e^{-\theta_1 x - \theta_3 x y}. \quad (17)

The exact MI between the variates of this distribution is (Darbellay 2000):

I(X, Y) = \psi(\theta_2) - \ln(\theta_2) + \frac{1}{\theta_2}, \quad (18)

where ψ(z) is the digamma function, ψ(z) = \frac{d}{dz} \ln \Gamma(z) = \frac{\Gamma'(z)}{\Gamma(z)}.

3. Bivariate Ordered Weinman Exponential Distribution: the pdf of the two-dimensional ordered Weinman exponential distribution is as follows (Darbellay 2000; Zografos and Nadarajah 2005):

p(x, y) = \left( \frac{2}{\theta_0} e^{-\frac{2}{\theta_0}(x - x_0)} \right) \times \left( \frac{1}{\theta_1} e^{-\frac{1}{\theta_1}(y - x)} \right), \quad (19)

with x_0 ≤ x ≤ y, and θ_0, θ_1 > 0. The exact MI between the variates of the bivariate ordered Weinman exponential distribution is (Darbellay 2000):

I(X, Y) =
\begin{cases}
\ln \left( \frac{1}{\theta_1} \left( \frac{\theta_0}{2} - \theta_1 \right) \right) + \Psi \left( \frac{\theta_0}{\theta_0 - 2 \theta_1} \right) - \Psi(1) & \text{if } \theta_1 < \frac{\theta_0}{2} \\
-\Psi(1) & \text{if } \theta_1 = \frac{\theta_0}{2} \\
\ln \left( \frac{1}{\theta_1} \left( \theta_1 - \frac{\theta_0}{2} \right) \right) + \Psi \left( \frac{2 \theta_1}{2 \theta_1 - \theta_0} \right) - \Psi(1) & \text{if } \theta_1 > \frac{\theta_0}{2}
\end{cases} \quad (20)

In each of these distributions, the exact MI depends on some parameters of the distribution. For the bivariate normal distribution, the MI between its variates depends only on the correlation coefficient ρ. For the gamma-exponential distribution, the MI depends on the parameter θ_2, and for the ordered Weinman exponential distribution, it depends on θ_0 and θ_1, or more precisely on the ratio θ_1/θ_0.
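Since each family has a closed-form MI, an estimator can be validated directly against it. A minimal check for the bivariate normal case (Eq. 16), using the crisp estimator of Sect. 2.1 for brevity; the seed and ρ are arbitrary illustrative choices:

```python
import numpy as np

rho, n = 0.7, 500
exact_mi = 0.5 * np.log(1.0 / (1.0 - rho ** 2))      # Eq. 16, about 0.336 nats

rng = np.random.default_rng(1)
cov = [[1.0, rho], [rho, 1.0]]                       # unit variances, corr rho
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# crisp histogram estimate (Eq. 1) for comparison
k_ij, _, _ = np.histogram2d(x, y, bins=10)
k_i, k_j = k_ij.sum(axis=1), k_ij.sum(axis=0)
i, j = np.nonzero(k_ij)
mi_hat = np.sum(k_ij[i, j] / n * np.log(n * k_ij[i, j] / (k_i[i] * k_j[j])))

print(exact_mi, mi_hat)   # the estimate deviates by the N-bias and R-bias
```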

Here, we want to find appropriate parameters of the fuzzy-histogram MI estimator for each distribution, such that for different data drawn from that distribution with different exact MI, the average absolute error is minimized. For example, for the normal distribution, we want to find parameter values, among several candidates, such that for different bivariate normal data with different ρ the average absolute error is minimized.


Fig. 5 Data with the bivariate normal distribution with different correlation coefficients ρ

Thus, for this experiment, different realizations were generated from each distribution with different parameter settings. Bivariate normal samples were generated with 10 different correlation coefficients ρ = {0, 0.1, ..., 0.8, 0.9}. The mean vector and the covariance matrix are set to

\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},

respectively.

Various realizations of the bivariate normal distribution with different ρ are shown in Fig. 5, to graphically illustrate the relation between the variates of this distribution. Here, the sample size N is 500.

For the bivariate gamma-exponential distribution, samples were generated with θ_1 = θ_3 = 1 and θ_2 = {1, 2, ..., 19, 20}. Figure 6 illustrates several realizations of this distribution with different parameter values.

Finally, for the bivariate ordered Weinman exponential distribution, samples were generated with θ_0 = 100 and θ_1 = {10, 20, ..., 90, 100}. Figure 7 illustrates several realizations of this distribution with different parameter values.

As mentioned above, for finding appropriate parameters of the fuzzy histogram for each of these distributions, the average absolute error between the estimated MI and the exact MI is used. This average is computed over different realizations with different parameter settings. Formally:

1. P is the number of different parameter settings of the underlying probability distribution for which the sampling took place. For instance, consider the bivariate normal distribution in Table 1. Since the values of μ_1, μ_2, σ_1, and σ_2 are fixed, and only 10 different values of ρ are used, we have 10 different settings; therefore, P = 10.

2. T is the number of trials, that is, the number of realizations over which the error is computed.


Fig. 6 Data with the gamma-exponential distribution with different θ_2's

Fig. 7 Data with the bivariate ordered Weinman exponential distribution with different θ_1/θ_0. The parameter θ_0 is equal to 100 and the parameter θ_1 is changed from 10 to 100 in steps of 10


Table 1 The parameter settings of the distributions used in the experiments

Distribution                              Parameters
Bivariate normal                          μ_1 = μ_2 = 0, σ_1 = σ_2 = 1, ρ = {0.0, 0.1, ..., 0.8, 0.9}
Bivariate gamma-exponential               θ_1 = θ_3 = 1, θ_2 = {1, 2, ..., 19, 20}
Bivariate ordered Weinman exponential     θ_0 = 100, θ_1/θ_0 = {0.1, 0.2, ..., 0.9, 1}

Define AvgError_i^T as the average error in the i-th setting, where the number of trials is T:

AvgError_i^T = \frac{1}{T} \sum_{j=1}^{T} \left| \hat{I}_{ij} - I_i^{exact} \right|. \quad (21)

Moreover, define AvgError^P as the average error over the P different parameter settings of the underlying distribution:

AvgError^P = \frac{1}{P} \sum_{i=1}^{P} AvgError_i^T. \quad (22)
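A sketch of how this error measure can be computed, assuming generic `estimator`, `sampler`, and `exact_mi` callables; these names are our own, since the paper defines only the formulas:

```python
import numpy as np

def avg_error(estimator, sampler, params, exact_mi, trials=50):
    """AvgError^P of Eq. 22: the mean absolute error over `trials`
    realizations (Eq. 21), averaged over the P parameter settings in
    `params`. `sampler(p)` draws one (x, y) realization for setting p,
    and `exact_mi(p)` returns the closed-form I_exact for that setting."""
    per_setting = []
    for p in params:
        errs = [abs(estimator(*sampler(p)) - exact_mi(p)) for _ in range(trials)]
        per_setting.append(np.mean(errs))      # AvgError_i^T (Eq. 21)
    return np.mean(per_setting)                # AvgError^P  (Eq. 22)
```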

The effects of the parameter values of the fuzzy histogram are evaluated based on AvgError^P, as explored next. In the experiments, the distributions and settings mentioned in Table 1 are used.

4.2.1 Parameter β

First, we examine the effect of the parameter β, which indicates the shape of the membership function. For this purpose, the parameter α and the number of bins are fixed. α is fixed to the following values:

\alpha_X = \frac{h_X}{2}, \quad h_X = \frac{\max(X) - \min(X)}{M_X - 1}, \qquad \alpha_Y = \frac{h_Y}{2}, \quad h_Y = \frac{\max(Y) - \min(Y)}{M_Y - 1}. \quad (23)

In summary, α is fixed to h/2. Moreover, the number of bins M is identical for X and Y (M = M_X = M_Y). For the bivariate normal and the bivariate gamma-exponential distributions, the number of bins is fixed to 10 (M = M_X = M_Y = 10), and for the bivariate ordered Weinman exponential distribution, the number of bins is fixed to M = M_X = M_Y = 30. In this experiment, N = 500.

Figure 8 shows the average error AvgError^P (see Eq. 22) versus β for each of the three distributions. The number of trials T is 50. For the normal distribution, the average error is minimized when β = 2. Moreover, for the gamma-exponential distribution and the ordered Weinman exponential distribution, the average error takes its minimum when β = 1 and β = 2, respectively.

Fig. 8 The average error AvgError^P (see Eq. 22) versus β, for the normal, gamma-exponential, and ordered Weinman distributions

This experiment indicates that fuzzification can improve the average error. Based on the results of Fig. 8, the average error is minimal when β is small, and it grows as β increases. Moreover, by increasing β, the membership functions tend to the crisp membership functions. Thus, on average, the fuzzy-histogram MI estimator provides a more accurate estimate than the histogram estimator, for data sampled from these three distributions.

4.2.2 Number of bins

To investigate the impact of the number of bins M (M = M_X = M_Y), α and β are fixed. α is fixed to h/2 (see Eq. 23), and β is set to the specific values which led to the minimum average error in the previous experiment.

Figure 9 illustrates the average absolute error (AvgError^P) over 50 trials versus the number of bins.

As can be seen in the graphs, when M is greater or less than a certain value, the error increases. The reason is that when the number of bins is increased, the N-bias is increased and the R-bias is decreased; when the number of bins is decreased, the R-bias is increased and the N-bias is decreased. Hence, the number of bins makes a trade-off between the N-bias and the R-bias.

Furthermore, as shown in the graphs, by changing the number of bins M, the variation of the average error of the fuzzy-histogram method is less than that of the histogram method. Thus, the sensitivity of the fuzzy histogram to M is somewhat less than the sensitivity of the histogram method.

4.2.3 Parameter α

Here, the impact of α, and the relations among α, β, and the number of bins, are studied. In the previous subsections, α was fixed to h/2 (see Eq. 23). Here, we search for an appropriate scale parameter α among different multiples of h.

Figure 10 demonstrates the average absolute error AvgError^P of the fuzzy-histogram estimator with GNF membership functions, for different values of the parameters α and β and of the number of bins M (M = M_X = M_Y), for each distribution. Here, the sample size N is equal to 500.


Fig. 9 The average absolute error (AvgError^P) over 50 trials versus the number of bins, for data with: a the bivariate normal distribution, b the bivariate gamma-exponential distribution, and c the bivariate ordered Weinman distribution

As can be seen in the graphs, the results for the three distributions are similar. For the appropriate values of β and M found previously, appropriate values of α are around h/2. Thus, h/2 is a proper value for α. Additionally, these results are consistent with the results of Sects. 4.2.1 and 4.2.2, and demonstrate the relations between α, β and M.

4.3 The accuracy and the running time of the fuzzy-histogram MI estimator

After studying the impacts of the parameters of the fuzzy-histogram method and finding appropriate values for them, the accuracy of the fuzzy-histogram estimator is compared with that of the histogram estimator, for the three mentioned distributions (see Table 1).

Figure 11 shows the histogram and fuzzy-histogram estimates together with the exact MI for the three distributions. The values were averaged over 50 trials, and the error bars denote the standard deviation. The parameters of the methods were set to the specific values which led to the minimum average absolute errors in Sect. 4.2.


Fig. 10 The average absolute error (AvgError^P) of the fuzzy-histogram estimator, with GNF membership functions with different parameters α, β, and numbers of bins. The data is distributed according to: a the bivariate normal distribution, b the bivariate gamma-exponential distribution, and c the bivariate ordered Weinman distribution

As shown in the graphs, most of the time the fuzzy-histogram estimate is closer to the exact MI. Based on this experiment, the accuracy of both estimators improves as the sample size increases. However, when the number of samples is small, the estimate of the fuzzy histogram is considerably better than that of the histogram method. Hence, this experiment indicates that the fuzzy-histogram method provides a good estimate even when the available sample is not large enough.

Kernel density estimation (KDE) is known as an effective algorithm for estimating MI (Schaffernicht et al. 2010). KDE outperforms the naïve histogram estimator in terms of accuracy and provides a high-quality estimation. However, it is a very time-consuming method. Here, we also compare the fuzzy-histogram method with the KDE.


Fig. 11 The fuzzy-histogram estimation and histogram estimation of the MI between the variates of: a the bivariate normal data, b the bivariate gamma-exponential data, and c the bivariate ordered Weinman data. The values were averaged over 50 trials and the error bars denote the standard deviation. The black solid lines indicate the exact MI obtained from Eqs. 16, 18, and 20

The average and standard deviation (over 50 trials) of the absolute error |I_exact − Î| of the fuzzy-histogram method, the histogram method, and the KDE are compared in Tables 2, 3 and 4. Note that the parameters of the fuzzy and crisp histogram methods were set to the appropriate values obtained in the previous section. For the KDE, the optimum value of the smoothing parameter (Moon et al. 1995) was used. In Tables 2, 3 and 4, in the cases where the error of the fuzzy method is better than the errors of both the histogram and the KDE, the error of the fuzzy method is written in boldface. Moreover, when the error of the fuzzy-histogram method is only better than that of the histogram method, the error of the fuzzy method is underlined. In the cases where the error of the fuzzy-histogram method is only better than the error of the KDE, the error of the fuzzy method is italicized.


Table 2 The absolute errors of the estimation methods for the data with the bivariate normal distribution (mean ± standard deviation over 50 trials, for the KDE, histogram, and fuzzy-histogram estimators, with N = 100, 500, 1,000 and ρ = 0, 0.1, ..., 0.9)


Table 3 The absolute errors of the estimation methods for the data with the bivariate gamma-exponential distribution (mean ± standard deviation over 50 trials, for the KDE, histogram, and fuzzy-histogram estimators, with N = 100, 500, 1,000 and θ_2 = 1, 3, 5, ..., 19)


Table 4 The absolute errors of the estimation methods for the data with the bivariate ordered Weinman exponential distribution (mean ± standard deviation over 50 trials, for the KDE, histogram, and fuzzy-histogram estimators, with N = 100, 500, 1,000 and θ_1/θ_0 = 1, 0.9, ..., 0.1)


Table 5 The running times of the estimation methods (in seconds)

Distribution          Method           N = 100            N = 500            N = 1,000
Normal                KDE              0.3452 ± 0.1986    8.210 ± 4.483      25.716 ± 7.437
                      Histogram        0.0014 ± 0.0002    0.0043 ± 0.0016    0.0014 ± 0.0002
                      Fuzzy-histogram  0.022 ± 0.0031     0.2402 ± 0.048     0.4720 ± 0.0501
Gamma-exponential     KDE              0.6389 ± 0.1294    7.6489 ± 3.9953    28.1857 ± 11.6613
                      Histogram        0.0016 ± 0.001     0.005 ± 0.0013     0.009 ± 0.002
                      Fuzzy-histogram  0.01792 ± 0.0022   0.1165 ± 0.0119    0.2526 ± 0.0450
Ordered Weinman       KDE              0.6379 ± 0.0851    6.971 ± 1.345      25.623 ± 3.201
exponential           Histogram        0.0016 ± 0.001     0.005 ± 0.0013     0.009 ± 0.002
                      Fuzzy-histogram  0.01792 ± 0.0022   0.1165 ± 0.0119    0.2526 ± 0.04570

As can be seen in the tables, in many cases the fuzzy method outperforms both the histogram and KDE methods. These tables show that the fuzzy-histogram method can be considered an appropriate estimation method for the MI.

Another important property of an estimation method is its running time. Table 5 reports the average and standard deviation of the running times of the three estimation methods. Although the KDE is an accurate estimation method, it is very time-consuming and cannot be used in many applications. In contrast, the average running time of the fuzzy-histogram method is significantly less than that of the KDE. Thus, the fuzzy-histogram method is an efficient method for estimating the MI.

In the KDE, one first estimates the value of the probability density functions f_X(x), f_Y(y), and f_XY(x, y) at each data-point (x_k, y_k), and then estimates the MI by numerical evaluation (cf. Steuer et al. 2002, Equations 31-32) of

I(X; Y) = \int_y \int_x f_{XY}(x, y) \log \frac{f_{XY}(x, y)}{f_X(x)\, f_Y(y)} \, dx \, dy.

On the other hand, the fuzzy-histogram method estimates the probability that a random point lies in each bin, rather than the probability density function at the individual data-points. Therefore, all of our calculations are based on the pmf of the bins, while the calculations of the KDE are based on the pdf at the individual data-points. Since the number of bins is usually much smaller than the number of data-points, our method significantly outperforms the KDE computationally. Moreover, as shown in the paper, its accuracy is comparable, and in some cases even superior, to the accuracy of the KDE.

Another difference between the KDE and the fuzzy-histogram method is that the former uses kernel functions, while the latter uses fuzzy membership functions. Notice that membership functions are more general than kernel functions, since a kernel function K must satisfy the symmetry property K(−x) = K(x) for all x in its domain. Moreover, as described above, we use one membership function per bin, while the KDE requires one kernel per data-point. Thus, the number of kernels used in the KDE is significantly larger than the number of membership functions in our method. Again, the computational overhead of our method is significantly less than that of the KDE, and our results show that the accuracy of the fuzzy method is comparable to, and in some cases better than, the accuracy of the KDE.


Fig. 12 The average of the estimated mutual information (over 50 trials) as a function of the exact MI, for data with: a the bivariate normal distribution, b the bivariate gamma-exponential distribution, and c the bivariate ordered Weinman distribution. The ideal estimation occurs when Î = I_exact. "FH" and "H" denote the fuzzy histogram and the histogram, respectively

4.4 Bias of the fuzzy-histogram estimator

In the ideal case, the bias of an estimator must be zero. To investigate the bias of the MI estimator, we sketch the average of the MI estimate (over different trials) as a function of I_exact. In the ideal case, E[Î] as a function of the true value of I should be a straight line; the deviation from this line is interpreted as the bias of the estimator.

Here, for the three distributions mentioned in Sect. 4.2, the averages of the histogram and fuzzy-histogram MI estimates (over different trials) as functions of I_exact are sketched in Fig. 12, for different values of N and of the number of bins. The number of trials in all experiments is 50. For the fuzzy method, β was set to the appropriate values found in Sect. 4.2.1, and α was set to h/2.

When the exact MI is close to zero, both estimators overestimate. However, the overestimation of the fuzzy-histogram estimator is less than that of the histogram estimator. As mentioned in Sect. 2.2, this overestimation is called the N-bias and is caused by the finite sample size. Hence, the N-bias of the fuzzy-histogram estimator is smaller. Moreover, increasing the sample size and decreasing the number of bins can decrease the overestimation of both estimators.

When the exact MI becomes greater, the underestimation, or R-bias, appears. In both estimators, increasing the number of bins decreases the underestimation. Here, there is a trade-off for the number of bins: increasing the number of bins reduces the underestimation (R-bias), but it also increases the N-bias.

For the gamma-exponential distribution, the R-bias of the fuzzy-histogram estimator is less than that of the histogram estimator. However, for the normal and ordered Weinman exponential distributions, the R-bias (underestimation) of the fuzzy-histogram estimator is greater than that of the histogram estimator.

4.5 Variance of the fuzzy-histogram estimator

In this part, the variances of the fuzzy-histogram and histogram MI estimators are compared. The number of trials in all experiments is 50. As before, for the fuzzy method, β was set to the appropriate values found in Sect. 4.2.1, and α was set to h/2.

Figure 13 shows the variances of the two estimators for different parameters. As the graphs demonstrate, in most cases the variance of the fuzzy-histogram estimator is less than or equal to that of the histogram estimator. Moreover, the variance of both estimators decreases as the number of samples increases. This is consistent with Eq. 4, which indicates that the variance of the histogram estimator is inversely proportional to N.
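A quick empirical check of this 1/N behavior, reusing the hypothetical estimate_mi sketch from the previous subsection:

```python
import numpy as np

rng = np.random.default_rng(2)
cov = [[1, 0.6], [0.6, 1]]           # bivariate normal, rho = 0.6
for N in (100, 500, 1000):
    est = [estimate_mi(*rng.multivariate_normal([0, 0], cov, N).T)
           for _ in range(50)]        # 50 trials, as in the experiments
    print(f"N={N:4d}  var={np.var(est):.5f}")  # shrinks roughly like 1/N
```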

4.6 Mutual information and different degrees of dependencies

This experiment compares different dependency measures on data with different degrees of dependency. In other words, various dependency measures between X and X^d are compared, for d = 1, 2, ..., 10. These measures include the histogram MI estimator, the fuzzy-histogram MI estimator, the Pearson correlation coefficient, and the non-linear correlation coefficient (NCC) proposed by Wang et al. (2005).

To compare the different dependency measures, the normalized version of each measure is used. The range of the Pearson correlation coefficient is [−1, 1], and the range of the NCC is [0, 1]; hence, these two measures do not require normalization. There are different ways to normalize the MI; here, Eq. 24 is utilized:

NI = I(X, Y) / H(X, Y)    (24)
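A minimal sketch of this normalization (our own illustration with a crisp 2-D histogram; normalized_mi is a hypothetical helper name):

```python
import numpy as np

def normalized_mi(x, y, m=10):
    # NI = I(X,Y) / H(X,Y), as in Eq. 24, from a crisp 2-D histogram.
    pxy, _, _ = np.histogram2d(x, y, bins=m)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    i_xy = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    h_xy = -np.sum(pxy[nz] * np.log(pxy[nz]))   # joint entropy H(X,Y)
    return i_xy / h_xy
```

Since I(X, Y) ≤ H(X, Y), the ratio lies in [0, 1], which makes it directly comparable to the NCC.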

The data X = {x_i}, i = 1, ..., N, consisting of N = 500 points, are chosen uniformly at random from [−10, 10]. The datasets X^1, X^2, ..., X^10 were obtained by raising the x_i's to the power d (X^d = {x_i^d}). The experiment was repeated 100 times with independent realizations of X.
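Under these settings, the experiment can be sketched as follows (reusing the hypothetical normalized_mi helper above; the Pearson column is included only for contrast):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, 500)
for d in range(1, 11):
    xd = x**d
    # For even d the symmetric distribution of x drives Pearson toward 0,
    # while the normalized MI still registers the dependency.
    pearson = np.corrcoef(x, xd)[0, 1]
    print(f"d={d:2d}  NI={normalized_mi(x, xd):.3f}  Pearson={pearson:+.3f}")
```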

Fig. 13 The standard deviations (over 50 trials) of the estimated mutual information for data with: a bivariate normal distribution, b bivariate gamma-exponential, and c bivariate ordered Weinman exponential distribution. (Panels for N = 100, 500, 1000 and different numbers of bins; the vertical axis in each panel is the standard deviation σ.)

Fig. 14 The dependency measures for data with different degrees of dependency. a Dependency between X and X^d, b dependency between |X| and |X^d|

The estimated dependency measures between X and X^d are reported in Fig. 14a. The values were averaged over 100 trials, and the error bars show the standard deviations. The number of bins M = MX = MY is identical for the NCC, fuzzy-histogram, and histogram methods, and was set to 10. For the fuzzy-histogram approach, the GNF membership functions were used with β = 2 and α = h/2.

Figure 14a demonstrates that the Pearson correlation coefficient for even degrees is almost zero, and does not reveal any dependency between X and X^d when d = 2k (k ∈ N). The NCC for even degrees is identically equal to one, and for odd degrees it equals 0.62. Thus, this measure cannot distinguish between the different degrees of dependency. However, the normalized MI (both the fuzzy-histogram and the histogram estimates) can indicate the dependencies between X and X^d for different d's.

For further investigation, the experiment was repeated for |X| and |X^d|, and the results are shown in Fig. 14b. Here, the Pearson correlation and the NCC of the even degrees are not zero. However, the NCC still cannot distinguish between different degrees of dependency, because it uses the rank orders of the variables instead of the original data. Moreover, in this case, the behavior of the correlation coefficient and of the normalized MI are similar to each other.

This experiment demonstrates the advantage of MI as a measure of dependency between variables: MI can indicate dependency in cases where the NCC and the correlation coefficient cannot.

5 Application in gene expression

To test the fuzzy-histogram MI estimator in a real-world application, gene-expression data are selected. These data were previously used to study different MI estimators by Steuer et al. (2002) and Kraskov et al. (2004). Details about this dataset are available in Hughes et al. (2000), and the data can be downloaded from Hughes (2012). The dataset includes ≈ 300 vectors in a high-dimensional space; however, because of missing values, the number of simultaneous pairs is less than 300. Each point corresponds to one genome, and each dimension corresponds to one open reading frame (ORF).

Fig. 15 a Simultaneous measurements of gene-expression data. Each point denotes the values of two ORFs. b The rank representation of datasets A to D. Each point is replaced by its rank order

The MI estimation methods are investigated on four ORF pairs, A to D, which are shown in Fig. 15a.

As can be seen in Fig. 15a, the four examples have various degrees of dependency. In example B, a strong linear correlation can be detected by eye. However, the relations in examples A, C, and D are not easily detected by eye.

Since the data have large fluctuations and isolated data points, their ranks, rather than the data values themselves, are used for estimating the MI (similar to Steuer et al. 2002; Kraskov et al. 2004). In other words, each point (x_i, y_i) is replaced by the rank order (rank(x_i), rank(y_i)). The data are then homogeneously distributed on the xy-plane, and the dependency between the variables of each dataset is preserved. Figure 15b shows the rank orders of examples A to D.
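The rank transform itself is straightforward; a minimal sketch (our own illustration, using scipy.stats.rankdata):

```python
import numpy as np
from scipy.stats import rankdata

def rank_transform(x, y):
    # Replace each point (x_i, y_i) by (rank(x_i), rank(y_i)); ties get
    # averaged ranks. The result is uniform along each axis, so an
    # equal-width binning effectively becomes an equal-frequency one.
    return rankdata(x), rankdata(y)

x = np.array([0.3, -2.0, 5.1, 0.7])
y = np.array([1.0, 0.2, 0.9, 3.3])
print(rank_transform(x, y))  # (array([2., 1., 4., 3.]), array([3., 1., 2., 4.]))
```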

In this experiment, we compare the fuzzy-histogram MI estimator, the histogram estimator, and the KDE on the four examples, and investigate which of them can indicate the dependency between the variables. To interpret the results of the experiment, a significance test is required: a null hypothesis is set up and tested for consistency with the data. Here, the null hypothesis is that X and Y are independent. If the observed MI is not consistent with the null hypothesis, it is possible to assert that X and Y are dependent and the null hypothesis is wrong.

To test the null hypothesis, an ensemble of surrogate datasets X^s, Y^s, consistent with the null hypothesis, is generated. The surrogate datasets are produced by the constrained-realization technique, i.e., by creating random permutations of the original data X and Y.

In this test, the mean and standard deviation of the MI estimates over the surrogate datasets are calculated, and the surrogate significance S is obtained by

S = (I(X, Y)data − 〈I(X, Y)〉surrogate) / σsurrogate,    (25)

where I(X, Y)data is the estimated MI of the original data, and 〈I(X, Y)〉surrogate is the average estimated MI over the ensemble of surrogate datasets. If |S| ≥ 2.6, the null hypothesis is rejected at the 99 % significance level.
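A compact sketch of the whole test (our own reconstruction of the procedure described above; mi_estimator can be any of the three estimators, e.g., the hypothetical estimate_mi helper from Sect. 4.4):

```python
import numpy as np

def surrogate_significance(x, y, mi_estimator, n_surrogates=300, seed=0):
    # Permuting y destroys any dependence while preserving both marginals,
    # so each surrogate realizes the null hypothesis of independence.
    rng = np.random.default_rng(seed)
    mi_data = mi_estimator(x, y)
    mi_surr = np.array([mi_estimator(x, rng.permutation(y))
                        for _ in range(n_surrogates)])
    return (mi_data - mi_surr.mean()) / mi_surr.std()   # Eq. 25

# |S| >= 2.6 rejects independence at the 99 % level, e.g.:
# S = surrogate_significance(rank_x, rank_y, estimate_mi)
```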

Fig. 16 The estimated MI for the (rank-ordered) datasets A to D. The dashed lines denote the average MI obtained from an ensemble of 300 surrogates. The error bars indicate the standard deviation. The isolated dots show the estimated mutual information of the original data, I(X, Y)data.

Figure 16 shows the average MI over an ensemble of 300 surrogates for the fuzzy-histogram estimator, the histogram estimator, and the KDE. The error bars show the standard deviation σsurrogate, and the separate points denote the MI of the original data. In this experiment, the fuzzy method uses the GNF membership functions with β = 1 and α = h/2 (see Eq. 23).

Furthermore, Table 6 lists the absolute values of the significance S for the histogram MI estimator, the fuzzy-histogram MI estimator with different β's, and the KDE. As can be seen in the table, increasing β decreases the significance of the fuzzy estimator, which tends towards the significance of the histogram estimator. The reason is that, as β increases, the fuzzy membership functions approach the crisp membership functions.

Based on the experiments, the fuzzy-histogram MI estimator with the GNF membership functions (β = 1) and the KDE could reject the null hypothesis at the 99 % significance level. In other words, the fuzzy-histogram MI estimator and the KDE showed that all four pairs are dependent, while the naïve histogram-based estimator could reveal the dependency for only two pairs, B and D.

Additionally, Table 7 reports the average and standard deviation of the running times of the three estimation methods for an ensemble of 300 surrogates. Not only is the fuzzy-histogram MI estimator able to indicate that all four pairs are dependent, but its computational load is also significantly less than that of the KDE.


Table 6 The absolute values of the significance S (see Eq. 25) of the histogram MI estimator, the fuzzy-histogram MI estimator with different β's, and the KDE

Data  Method                     Number of bins
                                 7      8      9      10     11     12
A     Histogram                  3.73   2.08   2.38   1.55   2.01   1.90
      Fuzzy histogram (β = 1)    5.68   5.62   5.17   5.85   5.69   5.01
      Fuzzy histogram (β = 2)    4.17   4.57   3.66   4.17   3.66   3.09
      Fuzzy histogram (β = 3)    3.58   3.41   2.93   3.57   2.91   2.59
      Fuzzy histogram (β = 4)    3.18   3.01   2.47   3.23   2.64   2.14
B     Histogram                  34.22  28.78  25.71  25.87  22.88  19.84
      Fuzzy histogram (β = 1)    67.71  67.42  58.12  67.87  65.53  56.72
      Fuzzy histogram (β = 2)    53.24  44.01  40.60  40.72  38.75  36.79
      Fuzzy histogram (β = 3)    45.90  40.34  35.11  33.36  32.73  28.93
      Fuzzy histogram (β = 4)    38.60  34.82  31.95  29.52  27.22  27.29
C     Histogram                  2.09   1.82   0.68   0.65   2.50   1.16
      Fuzzy histogram (β = 1)    3.87   3.21   3.24   2.65   2.70   3.15
      Fuzzy histogram (β = 2)    2.60   1.92   1.91   1.11   1.76   1.52
      Fuzzy histogram (β = 3)    1.59   1.78   2.00   1.02   1.33   1.07
      Fuzzy histogram (β = 4)    1.32   1.27   2.30   0.76   0.99   1.03
D     Histogram                  5.61   3.70   3.75   3.11   2.95   2.35
      Fuzzy histogram (β = 1)    10.95  10.69  10.60  10.25  8.51   10.10
      Fuzzy histogram (β = 2)    7.93   7.00   7.00   6.49   5.06   5.07
      Fuzzy histogram (β = 3)    7.26   6.40   5.64   5.28   4.28   4.49
      Fuzzy histogram (β = 4)    6.49   6.16   4.17   5.13   4.09   3.78

Data  Method                     h
                                 0.2    0.3    0.4    0.5    0.6    0.7
A     Kernel density estimation  6.57   7.14   7.06   7.74   8.84   9.48
B     Kernel density estimation  41.19  47.51  58.31  59.35  63.36  67.95
C     Kernel density estimation  3.60   4.68   5.79   6.46   6.68   6.58
D     Kernel density estimation  10.02  10.64  10.77  11.19  11.67  11.81

Table 7 The running times of the estimation methods (in seconds) for an ensemble of 300 surrogates, on datasets A to D: kernel density estimation (h = 0.4, 0.5, 0.6), fuzzy-histogram estimation (MX = MY = 9, 10, 11), and histogram estimation (MX = MY = 9, 10, 11). Across all four datasets, the KDE requires roughly 1.1–1.3 s, the fuzzy-histogram estimator roughly 0.0006–0.0008 s, and the histogram estimator roughly 0.0002–0.0003 s.

6 Conclusion

Mutual information (MI) is an ideal measure of dependence, since it vanishes if and only if the variables are independent. In many applications, the MI must be estimated from samples. There are several ways to estimate the MI from sample data; among them, histogram-based estimation is very popular, since it is simple and efficient.

In this paper, an MI histogram estimator with fuzzy partitioning was introduced. We utilized a general form of membership function that covers a wide range of membership functions. By changing the parameter β of this general function, the membership functions tend from fuzzy membership functions towards crisp membership functions.


In the experiments, we showed that, on average, the fuzzy-histogram method provides a more accurate estimation of the MI than the naïve histogram method. The effects of the parameters of the fuzzy histogram were examined on datasets sampled from three distributions. The average absolute error is minimized for β equal to 1 or 2, depending on the distribution. In both cases (β = 1 or 2), the membership functions are far from the crisp membership functions. Hence, the experiments showed that fuzzification improves the histogram method for estimating the MI.

Based on our experimental results, the accuracy of the fuzzy-histogram method is comparable to that of the KDE, and in many cases it outperforms the KDE. Moreover, its computational load is significantly less than that of the KDE.

Two important features of an estimator are its bias and variance. As the experiments demonstrated, our method decreases the N-bias of the histogram MI estimator. Moreover, in most cases, the variance of the fuzzy-histogram estimator is less than that of the histogram-based estimator.

Another important advantage of the fuzzy-histogram MI estimator is that, compared with the histogram estimator, it provides a good estimate for data with few samples. This matters because in many applications there are not enough samples for estimating the MI.

To test the fuzzy-histogram MI estimator on real-world data, it was applied to the gene-expression application. The fuzzy-histogram estimator could reveal the dependency between two pairs of variables that the histogram-based estimator could not.

References

Ang, K. K., Chin, Z. Y., Zhang, H., & Guan, C. (2012). Mutual information-based selection of optimal spatial-temporal patterns for single-trial EEG-based BCIs. Pattern Recognition, 45(6), 2137–2144.

Crouzet, J. F., & Strauss, O. (2011). Interval-valued probability density estimation based on quasi-continuous histograms: Proof of the conjecture. Fuzzy Sets and Systems, 183(1), 92–100.

Darbellay, G. (2000). Entropy expressions for multivariate continuous distributions. IEEE Transactions on Information Theory, 46(2), 709–712.

Darbellay, G. A., & Vajda, I. (1999). Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45(4), 1315–1321.

Hughes, T. R. (2012). Supplementary data file of gene expression. http://hugheslab.ccbr.utoronto.ca/supplementary-data/rii/. [Online; Accessed 20 Dec 2012].

Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., et al. (2000). Functional discovery via a compendium of expression profiles. Cell, 102(1), 109–126.

Karasuyama, M., & Sugiyama, M. (2012). Canonical dependency analysis based on squared-loss mutual information. Neural Networks, 34, 46–55.

Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.

Loquin, K., & Strauss, O. (2006). Fuzzy histograms and density estimation. In J. Lawry, E. Miranda, A. Bugarin, S. Li, M. A. Gil, P. Grzegorzewski, & O. Hryniewicz (Eds.), Soft methods for integrated uncertainty modelling, volume 37 of Advances in Soft Computing (pp. 45–52). Berlin Heidelberg: Springer.

Loquin, K., & Strauss, O. (2008). Histogram density estimators based upon a fuzzy partition. Statistics and Probability Letters, 78(13), 1863–1868.

Moddemeijer, R. (1989). On estimation of entropy and mutual information of continuous distributions. Signal Processing, 16(3), 233–248.

Moon, Y. I., Rajagopalan, B., & Lall, U. (1995). Estimation of mutual information using kernel density estimators. Physical Review E, 52(3), 2318–2321.

Schaffernicht, E., Kaltenhaeuser, R., Verma, S., & Gross, H. M. (2010). On estimating mutual information for feature selection. In Artificial Neural Networks: ICANN 2010 (pp. 362–367).

Steuer, R., Kurths, J., Daub, C. O., Weise, J., & Selbig, J. (2002). The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl 2), S231–S240.

Tenekedjiev, K., & Nikolova, N. (2008). Justification and numerical realization of the uniform method for finding point estimates of interval elicited scaling constants. Fuzzy Optimization and Decision Making, 7(2), 119–145.

Wang, Q., Shen, Y., & Zhang, J. Q. (2005). A nonlinear correlation measure for multivariable data set. Physica D: Nonlinear Phenomena, 200(3–4), 287–295.

Zografos, K., & Nadarajah, S. (2005). Expressions for Rényi and Shannon entropies for multivariate distributions. Statistics and Probability Letters, 71(1), 71–84.
