BIOINFORMATICS    ORIGINAL PAPER   Vol.21 no.10 2005, pages 2417–2423

doi:10.1093/bioinformatics/bti345

Gene expression 

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data

Muhammad Shoaib B. Sehgal∗, Iqbal Gondal and Laurence S. Dooley

Gippsland School of Computing and Information Technology, Monash University, VIC 3842, Australia

Received on November 21, 2004; revised on January 30, 2005; accepted on February 18, 2005

 Advance Access publication February 24, 2005

ABSTRACT

Motivation: Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so

there is a strong motivation to estimate these values as accurately

as possible before using these algorithms. While many imputation

algorithms have been proposed, more robust techniques need to be

developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation

algorithm called collateral missing value estimation (CMVE) is presented, which uses multiple covariance-based imputation matrices for

the final prediction of missing values. The matrices are computed

and optimized using least square regression and linear programming

methods.

Results: The new CMVE algorithm has been compared with existing

estimation techniques including Bayesian principal component

analysis imputation (BPCA), least square impute (LSImpute) and

K-nearest neighbour (KNN). All these methods were rigorously tested

to estimate missing values in three separate non-time series (ovarian

cancer based) and one time series (yeast sporulation) dataset. Each

method was quantitatively analyzed using the normalized root mean

square (NRMS) error measure, covering a wide range of randomly

introduced missing value probabilities from 0.01 to 0.2. Experiments

were also undertaken on the yeast dataset, which comprised 1.7%

actual missing values, to test the hypothesis that CMVE performed

better not only for randomly occurring but also for a real distribution of

missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values

compared with other methods for both series types of data, for the

same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance

of the CMVE algorithm.

Availability:  The CMVE software is available upon request from the

authors.

Contact:  [email protected]

1 INTRODUCTION

DNA microarrays are extensively used to probe the genetic

expression of tens of thousands of genes under a variety of conditions,

as well as in the study of many biological processes varying

∗To whom correspondence should be addressed.

from human tumors (Sehgal   et al., 2004a) to yeast sporulation

(Troyanskaya et al., 2001). There are several statistical, mathematical and machine learning algorithms (Gustavo et al., 2003; Ramaswamy et al., 2001; Shipp et al., 2002) that exploit these data for diagnosis (Furey et al., 2000; Brown et al., 1997), drug discovery and protein

sequencing for instance. The most commonly used methods include

data dimension reduction techniques (Sehgal   et al., 2004e), class

prediction techniques (Sehgal et al., 2004bc; Golub et al., 1999) and

clustering methods (Munagala  et al., 2004).

Despite the wide usage of microarray data, they frequently contain

missing values with up to 90% of genes affected (Ouyang   et al.,

2004). Missing values can occur for various reasons, such as spotting

problems, slide scratches, blemishes on the chip, hybridization error,

and image corruption or simply dust on the slide (Oba  et al., 2003).

It has been proven (Sehgal  et al., 2004d), (Acuna  et al., 2004) that

missing values affect class prediction and data dimension reduction techniques, such as support vector machines (SVMs), neural

networks (NNs), principal component analysis (PCA) and singular

value decomposition (SVD). The problem can be managed in many

different ways from repeating the experiment, although this is often

not feasible for economic reasons, to simply ignoring the samples

containing missing values, although this is inappropriate because

usually there are only a very limited number of samples available.

The best solution is to attempt to accurately estimate the missing val-

ues, but unfortunately most approaches use zero impute (replace the

missing values by zero) or row average/median (replacement by the

corresponding row average/median), neither of which takes advantage of data correlations, thereby leading to high estimation errors

(Troyanskaya  et al., 2001). Current research demonstrates that if 

the correlation between data is exploited then missing value predic-

tion error can be reduced significantly (Sehgal et al., 2004d; Hellem et al., 2004). Several methods including K-nearest neighbour (KNN) impute, least square imputation (LSImpute) (Hellem et al., 2004) and Bayesian PCA (BPCA) (Oba et al., 2003) have been used; however, the prediction error generated using these methods still impacts on the

performance of statistical and machine learning algorithms including

class prediction, class discovery and differential gene identification

algorithms (Sehgal et al., 2004e). There is, thus, considerable potential to develop new techniques that will provide minimal prediction

errors for different types of microarray data including both time and

non-time series sequences.

This paper presents a collateral missing value estimation (CMVE)

algorithm which combines multiple value matrices for particular

missing data and optimizes its parameters using linear programming

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]



and least square (LS) regression. CMVE is compared with other

well-established techniques, including KNN, LSImpute and BPCA,

with their performance rigorously tested for the prediction of 

randomly introduced missing values, with probabilities ranging from

0.01 to 0.2 for the BRCA1, BRCA2 and sporadic mutation microarray data (mutations present in ovarian cancer), which is non-time series

data (Amir et al., 2001). The reason for introducing missing values

is that the number of actual missing values in the BRCA1, BRCA2

and sporadic mutation data are negligibly small compared with the

size of the dataset—only 0.01, 0.003 and 0.01% values, respectively.

Since randomly introduced missing values may not be distributed

in the same way as actual missing values (Oba   et al., 2003), a

separate experiment was performed, with CMVE and the other three

estimation algorithms being applied to the yeast sporulation time

series dataset (Spellman et al., 1998), which contains 1.7% missing

values.

The normalized root mean square (NRMS) error (Ouyang et al.,

2004) metric was used to quantitatively evaluate the estimation

performance of each technique, with results demonstrating the improved accuracy and robustness of CMVE over a wide range of randomly introduced missing values. In addition, while computational complexity is not as critical a factor as accuracy for missing value imputation, because estimation is performed only once during the data collection (Troyanskaya et al., 2001; Hellem et al., 2004), the order of computational complexity for CMVE proved to be exactly the same as for the LSImpute and KNN algorithms.

The remainder of the paper is organized as follows: Section 2

presents a brief overview of existing estimation techniques, with

their respective advantages and disadvantages, while the new CMVE

algorithm and methodology is detailed in Section 3. Section 4

provides the theoretical framework for the improved performance of 

CMVE compared with the KNN, LSImpute and BPCA algorithms,

while Section 5 fully analyses the respective estimation performance of all four imputation methods. Section 6 provides some

conclusions.

2 OVERVIEW OF EXISTING MISSING VALUE ESTIMATION TECHNIQUES

The following convention is adopted for all the imputation algorithms

described in this paper. The microarray data have the form of an m × n matrix Y, where m is the number of genes and n is the number of samples. The Y_IJ component of Y represents the expression level of gene I for sample J.
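With this convention, the data in the sketches that follow can be held in a NumPy array, with NaN marking the missing entries. A minimal setup, in which the matrix sizes and the missing-value rate are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 100, 18                      # m genes x n samples (illustrative)
Y = rng.normal(size=(m, n))

# Following the paper's convention, Y[I, J] is the expression level of
# gene I in sample J; NaN marks a missing measurement.
mask = rng.random((m, n)) < 0.05    # ~5% missing, chosen arbitrarily
Y[mask] = np.nan

print(np.isnan(Y).mean())           # fraction of missing entries
```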

An overview is now presented of the strengths and limitations of 

the three estimation techniques used for comparative purposes in assessing the performance of CMVE.

2.1 KNN estimation

The KNN method imputes missing values by selecting genes with expression values similar to the gene of interest (Troyanskaya et al., 2001). In order to estimate the missing value Y_IJ of gene I in sample J, k genes are selected whose expression vectors are similar to the genetic expression of I in samples other than J. The similarity measure between two expression vectors Y_1 and Y_2 is determined by the Euclidean distance ψ over the observed components in sample J:

$$\psi = \lVert Y_1 - Y_2 \rVert \quad (1)$$

The missing value is then estimated as the weighted average of the

corresponding entries in the selected  k  expression vectors:

$$Y_{IJ} = \sum_{i=1}^{k} W_i \cdot X_i \quad (2)$$

$$W_i = \frac{1}{\psi_i \times \Delta} \quad (3)$$

where $\Delta = \sum_{i=1}^{k} \psi_i$ and X is the input matrix containing gene expressions. Equations (2) and (3) show that each gene's contribution is weighted by the similarity of its expression to gene I.
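The scheme above can be sketched as follows: the Euclidean distance of Equation (1) is taken over the observed components of the target gene, and the k nearest rows are combined by inverse-distance weights. The weights here are normalized to sum to one, a common variant of the weighting in Equations (2) and (3); the function name and details are illustrative:

```python
import numpy as np

def knn_impute(Y, k=10):
    """Impute NaNs in Y (genes x samples) by a distance-weighted
    average over the k most similar genes (rows)."""
    Y = Y.astype(float)
    filled = Y.copy()
    for I, J in zip(*np.where(np.isnan(Y))):
        obs = ~np.isnan(Y[I])                 # observed samples of gene I
        # Candidate genes: observed at sample J and on gene I's samples.
        cand = [g for g in range(Y.shape[0])
                if g != I and not np.isnan(Y[g, J])
                and not np.isnan(Y[g][obs]).any()]
        if not cand:
            continue
        # Euclidean distance over the observed components, Eq. (1).
        d = np.array([np.linalg.norm(Y[I][obs] - Y[g][obs]) for g in cand])
        idx = np.argsort(d)[:k]
        w = 1.0 / (d[idx] + 1e-12)            # inverse-distance weights
        w /= w.sum()                          # normalize to sum to one
        filled[I, J] = w @ Y[np.array(cand)[idx], J]
    return filled

Y = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [10.0, 10.0, 10.0]])
print(knn_impute(Y, k=2)[1, 2])   # close to 3, pulled slightly toward 10
```

The example at the end illustrates the outlier sensitivity discussed next: the distant third gene still contributes a small amount to the estimate.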

The Euclidean distance measure used by KNN is sensitive to outlier values which may be present in microarray data, although log-transforming the data significantly reduces their effect on gene similarity determination (Troyanskaya et al., 2001). The choice of a small k degrades the performance of the classifier, as the imputation process overemphasizes a few dominant genes in estimating the missing values. Conversely, a large neighbourhood may include genes that are significantly different from those containing missing values, thereby degrading the estimation process and, commensurately, the classifier's performance. Empirical results have demonstrated that for small datasets k = 10 is the best choice (Acuna et al., 2004), while Troyanskaya et al. (2001) observed that KNN is insensitive to values of k in the range 10–20.

The computational complexity of KNN is O(m²n), where m and n are the number of genes and samples, respectively; while this is the same order as the LSImpute algorithm, Section 2.3 will show it is higher than BPCA. A notable limitation of KNN is that it does not consider negative correlations between data, which can lead to estimation errors.

2.2 Least square impute estimation

LSImpute is a regression-based estimation method that exploits the

correlation between genes. To estimate the missing value Y_IJ of gene I from the gene expression matrix Y, the k most correlated genes are first selected, whose expression vectors are similar to that of gene I in all samples except J and which contain non-missing values for gene I. The LS regression method then estimates the missing value Y_IJ. By possessing the flexibility to adjust the number of predictor genes k in the regression, LSImpute performs best when data have a strong local correlation structure, for the same order of computational complexity, O(m²n), as KNN.
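The regression step can be sketched as below. This simplified, hypothetical `ls_impute` uses only the single most correlated predictor gene; the full LSImpute combines k such regressions:

```python
import numpy as np

def ls_impute(Y):
    """Simplified LSImpute sketch: regress the target gene on its single
    most correlated gene over jointly observed samples, then predict."""
    Y = Y.astype(float)
    out = Y.copy()
    for I, J in zip(*np.where(np.isnan(Y))):
        best, best_r, obs = None, 0.0, None
        for g in range(Y.shape[0]):
            if g == I or np.isnan(Y[g, J]):
                continue
            ok = ~np.isnan(Y[I]) & ~np.isnan(Y[g])
            ok[J] = False
            if ok.sum() < 3:
                continue
            r = np.corrcoef(Y[I][ok], Y[g][ok])[0, 1]
            if abs(r) > abs(best_r):
                best, best_r, obs = g, r, ok
        if best is None:
            continue
        x, y = Y[best][obs], Y[I][obs]
        beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # LS slope
        alpha = y.mean() - beta * x.mean()              # LS intercept
        out[I, J] = alpha + beta * Y[best, J]           # prediction
    return out

Y = np.array([[1.0, 2.0, 3.0, 4.0, np.nan],
              [2.0, 4.0, 6.0, 8.0, 10.0],
              [5.0, 1.0, 4.0, 2.0, 3.0]])
print(ls_impute(Y)[0, 4])   # 5.0
```

The perfectly correlated second gene is selected as predictor, so the missing value is recovered exactly in this toy case.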

2.3 Bayesian PCA based estimation

BPCA estimates missing values Y_miss in the data matrix Y using those genes Y_obs having no missing values. The probabilistic PCA (PPCA) is calculated using Bayes' theorem, and the Bayesian estimation calculates the posterior distribution of the model parameter θ and the input matrix X containing gene expression samples using:

$$p(\theta, X \mid Y) \propto p(Y, X \mid \theta)\, p(\theta) \quad (4)$$

where p(θ) is known as the prior distribution, which contributes an a priori preference to θ and X.

Missing values are estimated using a Bayesian estimation algorithm, which is executed for both θ and Y_miss (similar to the iterative Expectation–Maximization algorithm) and calculates the posterior distributions q(θ) and q(Y_miss) for θ and Y_miss (Oba et al.,


Pre-condition: Gene expression matrix Y(m, n), where m and n are the number of genes and samples, respectively.
Post-condition: Y with no missing values.

Algorithm:

STEP 1 Locate missing value Y_IJ in gene I and sample J.
STEP 2 Compute the absolute covariance CoV of the expression vector ν of gene I using (7).
STEP 3 Rank genes (rows) based on CoV.
STEP 4 Select the k most effective rows R_k.
STEP 5 Use these values of R_k to estimate Φ₁ by (8).
STEP 6 Calculate Φ₂ and Φ₃ using (9) and (10).
STEP 7 Calculate missing value Y_IJ using (14) and impute estimate χ in all future predictions.
STEP 8 Seek the next missing value Y_IJ and repeat STEPS 2–7 until all missing Y values are estimated.
STEP 9 END

Fig. 1. The CMVE algorithm.

2003). Finally, the missing values in the gene expression matrix are imputed using:

$$Y = \int Y_{\mathrm{miss}}\, q(Y_{\mathrm{miss}})\, \mathrm{d}Y_{\mathrm{miss}} \quad (5)$$

$$q(Y_{\mathrm{miss}}) = p(Y_{\mathrm{miss}} \mid Y_{\mathrm{obs}}, \theta_{\mathrm{true}}) \quad (6)$$

where θ_true is the posterior of the missing value.

By exploiting only the global correlation in the datasets, BPCA has the advantage of prediction speed, incurring a computational complexity O(mn), which is one degree less than for both KNN and LSImpute. For imputation purposes, however, improved estimation accuracy is always a greater priority than speed.
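The full variational treatment of Oba et al. is involved, but the underlying alternation (fit a low-rank PCA model to the current fill, refill the missing cells from the reconstruction, repeat) can be sketched as below. This is a simplified, non-Bayesian stand-in with no priors on the model parameters, intended only to illustrate the EM-like iteration:

```python
import numpy as np

def pca_impute(Y, rank=2, iters=100):
    """EM-like low-rank imputation: alternate an SVD truncation of the
    mean-centred matrix with refilling the missing cells."""
    Y = Y.astype(float)
    miss = np.isnan(Y)
    # Initial fill: each gene's (row's) mean over its observed samples.
    X = np.where(miss, np.nanmean(Y, axis=1, keepdims=True), Y)
    for _ in range(iters):
        mu = X.mean(axis=1, keepdims=True)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        s[rank:] = 0.0                    # keep a rank-`rank` model
        recon = (U * s) @ Vt + mu
        X[miss] = recon[miss]             # refill only the missing cells
    return X

# A rank-1 matrix with one entry removed is recovered almost exactly.
base = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 2.0])
Y = base.copy()
Y[0, 0] = np.nan
print(pca_impute(Y, rank=1)[0, 0])   # close to the true value 1.0
```

Because the whole matrix is modelled at once, this style of method captures only the global correlation structure, which is the behaviour Section 4 contrasts with CMVE.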

3 THE CMVE ALGORITHM

The complete CMVE algorithm, which is detailed in Figure 1, introduces the concept of multiple parallel estimations of missing values. For instance, if value Y_IJ of gene I and sample J is missing, multiple estimates (Φ₁, Φ₂ and Φ₃) are generated and the final estimate χ distilled from these estimates. The covariance function is employed since, unlike KNN, it is unbiased in considering both positive and negative correlation values. The covariance function CoV is formally defined as:

$$\mathrm{CoV} = \frac{1}{n-1}\sum_{i=1}^{n}(\nu_i - \bar{\nu})(\omega_i - \bar{\omega}) \quad (7)$$

where ω is the predictor gene vector and ν the expression vector of gene I, which has the missing values. The absolute diagonal covariance CoV is first computed for a gene vector ν, where every gene except I is iteratively considered as ω (Step 2 in Fig. 1). The genes are then ordered with respect to their CoV values and the first k-ranked covariate genes R_k selected, whose expression vectors have the most similarity to gene I from Y in all samples except J (Step 4). The LS regression method (Harvey and Arthur, 2004) is then applied to estimate value Φ₁ for Y_IJ (Step 5) as:

$$\Phi_1 = \alpha + \beta X + \xi \quad (8)$$

where ξ is the error term that minimizes the variance in the LS model (parameters α and β). For a single regression, the estimates of α and β are, respectively,

$$\alpha = \bar{Y} - \beta\bar{X} \quad \text{and} \quad \beta = \frac{\sigma_{xy}}{\sigma_{xx}},$$

where $\sigma_{xy} = \frac{1}{n-1}\sum_{J=1}^{n}(X_J - \bar{X})(Y_J - \bar{Y})$ is the empirical covariance between X and Y, Y_J is the gene with the missing value and X_J is the predictor gene in R_k; $\sigma_{xx} = \frac{1}{n-1}\sum_{J=1}^{n}(X_J - \bar{X})^{2}$ is the empirical variance of X, with $\bar{X}$ and $\bar{Y}$ being the respective means over X_1, ..., X_n and Y_1, ..., Y_n. The LS estimate of Y given X is then expressed as:

$$\hat{Y} = \bar{Y} + \frac{\sigma_{xy}}{\sigma_{xx}}(X - \bar{X}).$$

The two other missing value estimates Φ₂ and Φ₃ (Step 6) are, respectively, given by:

$$\Phi_2 = \sum_{i=1}^{k}\phi_i + \eta - \sum_{i=1}^{k}\xi_i^{2} \quad (9)$$

$$\Phi_3 = \frac{\sum_{i=1}^{k}(\phi^{\mathrm{T}} \times I)}{k} + \eta \quad (10)$$

where φ is the vector that minimizes ξ₀ in Equation (12), η is the normal residual and ξ is the actual residual. These three parameters are obtained from the non-negative least squares (NNLS) algorithm (Charles et al., 1974). The objective is now to find a linear combination of models that best fits R_k and I. The objective function in NNLS minimizes, using linear programming techniques, the prediction error ξ₀ so that:

$$\xi, \phi, \eta = \min(\xi_0) \quad (11)$$

i.e. min(ξ₀) is a function that locates the normal vector φ with minimum prediction error ξ₀ and residual η. The value of ξ₀ in Equation (11) is obtained from

$$\xi_0 = \max(\mathrm{SV}(R_k \cdot \phi - I)) \quad (12)$$

where SV are the singular values of the difference vector between the dot product of R_k with the prediction coefficients φ and the gene expression row I. The tolerance used in the linear programming to compute vector φ is given by

$$\mathrm{Tol} = k \times n \times \max(\mathrm{SV}(R_k)) \times C \quad (13)$$

where k is the number of predictor genes, n the number of samples in the dataset and C is the normalization factor. The final estimate χ for Y_IJ is formed using

$$\chi = \rho \cdot \Phi_1 + \delta \cdot \Phi_2 + \lambda \cdot \Phi_3 \quad (14)$$

where ρ = δ = λ = 0.33 ensures an equal weighting of the respective estimates Φ₁, Φ₂ and Φ₃. The rationale for this choice is that, as each estimate is highly data-dependent, it avoids any bias toward one particular estimate.

The reason (14) has a lower NRMS error is that the first imputation matrix uses LS regression, while the second and third use NNLS, which is superior for estimating positively correlated values. NNLS is unable, however, to estimate negative values and, since microarray data possess both negative and positive values, this was the motivation to embed the LSImpute-based matrix into the gene expression prediction, thus combining the advantages of both algorithms to more accurately estimate the missing values.
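Putting the section together, a loose sketch of the CMVE recipe follows: rank genes by absolute covariance, form one LS-regression estimate, derive two further estimates from an NNLS fit, fuse them with equal weights, and write the result back so later imputations can reuse it (the collateral aspect). The second and third estimates here are simplified stand-ins for Equations (9) and (10), and SciPy's `nnls` performs the non-negative fit:

```python
import numpy as np
from scipy.optimize import nnls

def cmve_impute(Y, k=10):
    """Loose CMVE sketch: covariance ranking, LS + NNLS estimates,
    equal-weight fusion, and reuse of imputed values."""
    Y = Y.astype(float)
    for I, J in zip(*np.where(np.isnan(Y))):
        obs = ~np.isnan(Y[I])
        obs[J] = False
        cand = [g for g in range(Y.shape[0]) if g != I
                and not np.isnan(Y[g, J]) and not np.isnan(Y[g][obs]).any()]
        if not cand or obs.sum() < 3:
            continue
        # Steps 2-4: rank candidates by absolute covariance with gene I.
        cov = [abs(np.cov(Y[g][obs], Y[I][obs])[0, 1]) for g in cand]
        Rk = np.array(cand)[np.argsort(cov)[::-1][:k]]
        # Step 5: LS regression on the best covariate, Eq. (8).
        x, y = Y[Rk[0]][obs], Y[I][obs]
        beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        alpha = y.mean() - beta * x.mean()
        phi1 = alpha + beta * Y[Rk[0], J]
        # Step 6: non-negative combination of the k covariate rows.
        A = Y[Rk][:, obs].T                  # samples x k design matrix
        coef, _ = nnls(A, y)
        phi2 = coef @ Y[Rk, J]               # NNLS prediction at sample J
        phi3 = Y[Rk, J].mean() + (y - A @ coef).mean()
        # Step 7: equal-weight fusion, Eq. (14); reuse in later passes.
        Y[I, J] = (phi1 + phi2 + phi3) / 3.0
    return Y
```

Because estimates are written back into Y, a gene imputed early becomes a usable predictor for later missing values, which is the behaviour the theoretical arguments of Section 4 rely on.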


4 THEORETICAL FOUNDATIONS OF CMVE

This section explores the theoretical principles underpinning why the CMVE algorithm performs better than the KNN, LSImpute and BPCA techniques in estimating missing values. For completeness, a computational complexity analysis of CMVE is also provided, showing its order to be exactly the same as that of both LSImpute and KNN.

Proposition 1.   KNN only considers positive correlations.

If there are two sets α and β which are inversely proportional to each other, then the distance d between α and β will be larger than between sets which are directly proportional to each other. Several distance functions can be used for KNN, the most common being the Euclidean distance, given by

$$d = \lVert \alpha - \beta \rVert \quad (15)$$

so d is always higher when α is inversely proportional to β than when they are directly proportional to each other.

Proposition 2. The CMVE algorithm considers both positive and negative correlation values.

Assume two sets ν and ω that are inversely proportional, such that CoV < 0 ∀ ν, ω. From (7) it is clear that if a high correlation exists between the gene values (either directly proportional, giving positive correlation, or inversely proportional, giving negative correlation) then a higher absolute CoV value will exist.
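Propositions 1 and 2 can be checked numerically: a perfectly anti-correlated gene is "far" under the Euclidean distance of Equation (15), yet is ranked as highly informative by the absolute covariance of Equation (7). The values below are illustrative:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = -a + 5.0                               # inversely proportional to a
c = a + np.array([0.5, -0.5, 0.5, -0.5])   # positively correlated, noisy

dist_b = np.linalg.norm(a - b)     # Euclidean distance, Eq. (15)
dist_c = np.linalg.norm(a - c)
cov_b = abs(np.cov(a, b)[0, 1])    # |CoV|, Eq. (7)
cov_c = abs(np.cov(a, c)[0, 1])

print(dist_b > dist_c)   # True: KNN treats the anti-correlated gene as far
print(cov_b > cov_c)     # True: CMVE still ranks it as informative
```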

Proposition 3. The probability P(ε) of the normalized imputation error ε of missing values using CMVE is always less than that for BPCA, LSImpute and KNN.

The probability P(ε) of the normalized imputation error ε of the missing value for correlated data is directly proportional to the number of missing values M (Mclean, 2000). Assume P_1 and P_2 are the probabilities of normalized imputation errors, ε_1 for the three comparative algorithms and ε_2 for CMVE, such that:

$$P_1 = \sum_{i=0}^{M} P(\epsilon_1)P(M) = M \times P(\epsilon_1)P(M) \quad (16)$$

$$P_2 = \sum_{i=0}^{M} P(\epsilon_2)P(i) \quad (17)$$

Since the comparative methods do not feed estimated values back into future predictions, such algorithms consider M missing values, each with the same probability, for each prediction. In contrast, CMVE uses estimated values for the future prediction of missing values, so each estimate increases the number of predictor genes to be considered while concomitantly decreasing the prediction probabilities P(i) in Equation (17). Hence

$$P_2 < P_1, \quad \text{with } P_2 \to 0 \text{ as } i \to 0, \text{ since } P(i) = 0 \text{ for } i = 0 \quad (18)$$

Proposition 4. CMVE always has a lower estimation error of missing values in the case of transitive gene dependency (Gene A → B → C) than BPCA, LSImpute and KNN.

Assume that gene G_a1 is correlated with set S_1 such that

$$G_{a1} \to S_1, \quad S_1 = \{G_{b1}, G_{b2}, \ldots, G_{bn}\} \quad (19)$$

Similarly, gene G_b1 is correlated with S_2:

$$G_{b1} \to S_2, \quad S_2 = \{G_{c1}, G_{c2}, \ldots, G_{cn}\} \quad (20)$$

If the values of both G_a1 and G_b1 are missing, then G_b1 can be predicted using set S_2 and subsequently used to predict G_a1 more accurately using S_1, by including G_b1 rather than ignoring it.
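The transitive case can be illustrated with a toy chain in which gene C predicts gene B, whose estimate in turn predicts gene A; the genes, values and the `regress_predict` helper are all illustrative:

```python
import numpy as np

def regress_predict(x, y, x_new):
    """Least-squares prediction of y at x_new (illustrative helper)."""
    beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    alpha = y.mean() - beta * x.mean()
    return alpha + beta * x_new

s = np.arange(5.0)          # five samples; sample 4 is missing below
ga1 = 2.0 * s + 1.0         # gene A, correlated with gene B
gb1 = s + 0.5               # gene B, correlated with gene C
gc1 = 2.0 * s - 1.0         # gene C, fully observed

# Sample 4 of both Ga1 and Gb1 is missing.
# First predict Gb1[4] from its correlate Gc1 (the set S2) ...
gb1_hat = regress_predict(gc1[:4], gb1[:4], gc1[4])
# ... then reuse that estimate to predict Ga1[4] via Gb1 (the set S1),
# rather than discarding Gb1 because it contains a missing value.
ga1_hat = regress_predict(gb1[:4], ga1[:4], gb1_hat)

print(gb1_hat, ga1_hat)   # recovers the true values 4.5 and 9.0
```

Ignoring Gb1 entirely, as a method restricted to fully observed predictors must, would discard the strongest available correlate of Ga1.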

CMVE, unlike the other imputation techniques, considers estimated values in predicting future missing values. LSImpute replaces the missing value of a gene with an average value to compute the CoV matrix (Hellem et al., 2004). The NRMS error using this approach is always higher than for CMVE, since each CMVE iteration (Fig. 1, Steps 1–7) lowers this error. KNN and BPCA assume that genes with missing entries have no correlation with the missing value gene, as they ignore these missing values while searching the estimation space. In contrast, CMVE includes these genes when searching for the most correlated gene. This may incur a small accumulative error in future predictions, but it will always be less than when either the average value of the gene is used or the gene is totally ignored.

Proposition 5.   CMVE generates a lower estimation error than

 BPCA when genes have dominant local correlation.

BPCA assumes only a global correlation structure, which has a similar effect to selecting a high value of k for CMVE. Owing to this assumption, BPCA does not provide accurate estimates when genes have dominant local correlation (Oba et al., 2003), because in predicting missing values, information from all genes is considered, many of which have little or no correlation with the gene with the missing value. In contrast, the CMVE variable k can be adjusted depending upon the type of the data, ensuring that only those genes with strong correlations are considered, which concomitantly reduces the estimation error. The empirical results presented in the next section demonstrate that a value of k = 10 is suitable for locally correlated data.

Computational complexity analysis.   The order of computational

complexity for CMVE is exactly the same as for the KNN and

LSImpute algorithms.

The critical operation for the CMVE, KNN and LSImpute algorithms is the search for the genes most correlated with the gene that has missing values. Each estimation takes linear time O(n); therefore, for m genes and n samples the complexity order is O(m²n) for all

algorithms. KNN uses a weighted average of   k  correlated genes

to estimate the missing values, while CMVE and LSImpute use

regression and linear programming for estimation, although these

additional overheads are negligible compared with the time taken to

search for the most correlated genes. Similar to KNN and LSImpute,

CMVE also only searches once per estimation for correlated genes.

As discussed in Section 2.3, BPCA has a computational complexity of O(mn), as it only considers the global correlation structure of the data. This advantage is pyrrhic, however, because the corresponding estimation accuracy is significantly inferior whenever data have a localized correlation structure.


Fig. 2. NRMS error over a wide range of k in the CMVE algorithm for 5% missing values (series: B1, B2, Sp and Yeast).

5 RESULTS ANALYSIS

To test the different imputation algorithms, four different types of

microarray data were used including both time series and non-time

series data. The dataset contained 18, 16, 27 and 77 samples of 

BRCA1, BRCA2, sporadic mutations (neither BRCA1 nor BRCA2)

of ovarian cancer data (non-time series) and yeast sporulation data

(time series), respectively. Each ovarian cancer data sample con-

tained logarithmic microarray data of 6445 genes while there were

6179 genetic expressions per sample for yeast dataset. The rationale

for selecting cancer data is that in such data some of the genes are

up/downregulated hence it is very difficult to determine their expres-

sion levels from non-regulated genes. The missing value estimation

techniques were tested by randomly removing data values and then

computing the estimation error. In the experiments, between 1 and

5% of the values were removed from each dataset samples and the

NRMS error θ  was computed by

θ = RMS(M − M_est) / RMS(M)    (21)

where M is the original data matrix and M_est is the estimated matrix produced by KNN, LSImpute, BPCA or CMVE. This particular metric was used for error estimation because θ = 1 for zero imputation (Ouyang et al., 2004).
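Equation (21) and the random-removal protocol can be expressed directly. The helpers below are our own sketch (names are ours, not from the paper), with the error evaluated over the imputed positions so that zero imputation gives θ = 1 as stated:

```python
import numpy as np

def nrms_error(M, M_est, mask):
    """NRMS error of Eq. (21), evaluated over the imputed positions
    given by the boolean matrix `mask` (True where values were removed)."""
    diff = (M - M_est)[mask]
    return np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(M[mask] ** 2))

def mask_random(M, p, seed=0):
    """Randomly remove a fraction p of the entries (set them to NaN),
    mirroring the random-removal protocol described above."""
    rng = np.random.default_rng(seed)
    masked = M.copy()
    masked[rng.random(M.shape) < p] = np.nan
    return masked
```

With zero imputation (every removed entry estimated as 0) the numerator and denominator coincide, so θ = 1 by construction.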

To compare the performance of the CMVE, KNN and LSImpute imputation algorithms, k = 10 was used throughout the experiments. The rationale for this was that Troyanskaya et al. (2001) observed that KNN was insensitive to values of k in the range 10–20, with the best estimation results observed in this range, and Hellem et al. (2004) also suggested using k = 10 for LSImpute. Figure 2 plots the minimum overall prediction error rates for CMVE over a range of k values for the different test datasets, with results showing that k in the range 10–15 (highlighted) is the most appropriate. Lower k values include only a small set of correlated genes for prediction, leading to prediction errors as other correlated genes are ignored. Conversely, when k is high, genes that have little or no correlation with the gene having missing values are included in the prediction, again leading to erroneous results (Troyanskaya et al., 2001).

To fully test the robustness of the new CMVE algorithm, experiments were performed for missing values of up to 20% (Figs 3–9). Figure 8a and b show the error values for 10% missing values, which

[Figures 3–5: bar charts of normalized RMS error for BPCA, KNN, LSImpute and CMVE on the Brca1, Brca2 and Sporadic data.]

Fig. 3. NRMS error for 1% missing values for ovarian cancer data.

Fig. 4. NRMS error for 2% missing values for ovarian cancer data.

Fig. 5. NRMS error for 3% missing values for ovarian cancer data.

especially reveal (Fig. 8a) the significant deterioration in the results of KNN for the sporadic dataset.

To clarify the performance of CMVE, Figure 8b plots the error results without KNN, which consistently confirm the lower error values compared with LSImpute and BPCA. Figure 9a and b show the corresponding results for 20% missing values, which again reveal the superiority and greater robustness of the CMVE algorithm for missing value imputation. Note that, for the sake of clarity, a logarithmic scale is used in Figure 9b.

Whenever there is a high number of missing values in a gene, sparse covariance matrices will ensue and, with them, an increased likelihood of ill-conditioning. The CMVE algorithm avoids ill-conditioning by ensuring the removal of all genes with >20% missing values before imputation.
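This pre-filtering step can be sketched as follows (a minimal illustration; the helper name is ours):

```python
import numpy as np

def drop_sparse_genes(data, max_missing=0.2):
    """Remove genes (rows) whose fraction of missing values exceeds
    max_missing, guarding the covariance matrices against the
    ill-conditioning described above."""
    frac_missing = np.isnan(data).mean(axis=1)
    return data[frac_missing <= max_missing]
```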


[Figures 6 and 7: bar charts of normalized RMS error for BPCA, KNN, LSImpute and CMVE on the Brca1, Brca2 and Sporadic data.]

Fig. 6. NRMS error for 4% missing values for ovarian cancer data.

Fig. 7. NRMS error for 5% missing values for ovarian cancer data.

Fig. 8. (a) NRMS error for 10% missing values for ovarian cancer data. (b) NRMS error of CMVE, BPCA and LSImpute for 10% missing values for ovarian cancer data.

As highlighted in Section 1, experiments performed on datasets

with randomly introduced missing values may not truly reflect the

nature of actual microarray data missing values. All four imputation

algorithms were, therefore, tested on the yeast time series data

[Figure 9: bar charts of NRMS error for BPCA, KNN, LSImpute and CMVE on the Brca1, Brca2 and Sporadic data; panel (b) uses a logarithmic error scale.]

Fig. 9. (a) NRMS error for 20% missing values for ovarian cancer data. (b) NRMS error (log scale) for 20% missing values for ovarian cancer data.

Table 1. NRMS errors for the actual missing value distribution (AMVD) of 1.7% missing values and additional 1–10% randomly introduced values in the yeast dataset

Missing values   BPCAImpute   KNNImpute   LSImpute   CMVE

AMVD             0.1485       0.0654      0.0130     0.0064
1%               0.0319       0.8930      0.0849     0.0030
2%               0.0569       0.5284      0.1555     0.0843
3%               0.0674       0.6232      0.1612     0.0547
4%               0.0846       0.9307      0.2003     0.0090
5%               0.0927       0.5821      0.2071     0.0091
10%              0.1756       0.8763      0.0638     0.0130

containing 1.7% missing values. Since NRMS errors could not be calculated for these actual missing values, each missing value was replaced by the value of the adjacent gene before applying the imputation algorithms. This had the effect of a delay function, while retaining the same distribution of missing values. The results in Table 1 again confirm the superior performance of CMVE, particularly when an additional 4, 5 and 10% of missing values are introduced into the data, with the corresponding average improvements being 60, 72 and 64%, respectively. The imputation results also reveal some other broader noteworthy issues. KNN, for instance, performed better when missing values were randomly introduced, because KNN only considers positive correlations and certain randomly introduced missing values will inevitably have negative correlations with other genetic data. Similarly, LSImpute exhibited improved performance compared with BPCA (Oba et al., 2003), confirming the discussion underpinning Proposition 5.
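The adjacent-gene substitution used above to obtain a ground truth for the actual missing values might look as follows. This is our own reading of the description (the paper gives no explicit algorithm), replacing each missing entry with the next gene's value in the same sample so that the missing-value distribution is preserved:

```python
import numpy as np

def substitute_adjacent(data):
    """Replace each missing entry with the adjacent gene's value in the
    same sample, keeping the original missing-value distribution while
    providing a known 'true' value for error computation (a delay effect).
    Assumes the adjacent gene is observed at that sample."""
    filled = data.copy()
    m = data.shape[0]
    for g, s in zip(*np.where(np.isnan(data))):
        neighbour = (g + 1) % m        # next gene, wrapping at the end
        filled[g, s] = data[neighbour, s]
    return filled
```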


6 CONCLUSIONS

This paper has presented a new CMVE algorithm based on the novel concept of multiple imputations. Experimental results confirmed that CMVE consistently provided superior estimation accuracy compared with existing missing value imputation algorithms, including KNN, LSImpute and BPCA. This performance improvement was especially evident when estimating higher numbers of missing values in both time series and non-time series data. The algorithm's theoretical basis, namely the exploitation of a combination of global and local correlations in a given dataset, repeatedly proved to be a more effective and robust strategy than the distance function used by KNN, with no increase in the order of computational complexity for any value of k. The results corroborate the fact that CMVE can be successfully applied to accurately impute missing values before any microarray data experiment, crucially without any bias being introduced into the estimation process.

REFERENCES

Acuna,E. and Rodriguez,C. (2004) The treatment of missing values and its effect in the classifier accuracy. In Banks,D. et al. (eds) Classification, Clustering and Data Mining Applications. Springer-Verlag, Berlin, Heidelberg, pp. 639–648.
Amir,A.J. et al. (2002) Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. J. Natl Cancer Inst., 94, 981–990.
Brown,W.N. et al. (1997) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.
Golub,T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Gustavo,B. and Monard,C.M. (2003) An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell., 17, 519–533.
Furey,T.S. et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
Harvey,M. and Arthur,C. (2004) Fitting Models to Biological Data Using Linear and Nonlinear Regression. Oxford University Press, Oxford.
Hellem,B.T. et al. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res., 32, e34.
Lawson,C.L. and Hanson,R.J. (1974) Solving Least Squares Problems. Prentice-Hall, Inc., Englewood Cliffs, NJ.
Munagala,K. et al. (2004) Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics, 5, 21.
McLean,A. (2000) The predictive approach to teaching statistics. J. Stat. Education, 8.
Oba,S. et al. (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19, 2088–2096.
Ouyang,M. et al. (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics, 20, 917–923.
Ramaswamy,S. et al. (2001) Multiclass cancer diagnosis using tumour gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154.
Sehgal,M.S.B. et al. (2004a) Support vector machine and generalized regression neural network based classification fusion models for cancer diagnosis. In HIS'04, Japan.
Sehgal,M.S.B. et al. (2004b) A collimator neural network model for the classification of genetic data. ICBA 04, USA.
Sehgal,M.S.B. et al. (2004c) Communal neural network for ovarian cancer mutation classification. Complex 04, Australia.
Sehgal,M.S.B. et al. (2004d) K-ranked covariance based missing values estimation for microarray data classification. HIS'04, Japan.
Sehgal,M.S.B. et al. (2004e) Statistical neural networks and support vector machine for the classification of genetic mutations in ovarian cancer. IEEE CIBCB 04, USA.
Shipp,M.A. et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat. Med., 8, 68–74.
Spellman,P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
Troyanskaya,O. et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
