BIOINFORMATICS    ORIGINAL PAPER   Vol.21 no.10 2005, pages 2417–2423

doi:10.1093/bioinformatics/bti345

Gene expression 

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data

Muhammad Shoaib B. Sehgal∗, Iqbal Gondal and Laurence S. Dooley

Gippsland School of Computing and Information Technology, Monash University, VIC 3842, Australia

Received on November 21, 2004; revised on January 30, 2005; accepted on February 18, 2005

 Advance Access publication February 24, 2005

ABSTRACT

Motivation: Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so

there is a strong motivation to estimate these values as accurately

as possible before using these algorithms. While many imputation

algorithms have been proposed, more robust techniques need to be

developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation

algorithm called collateral missing value estimation (CMVE) is presented, which uses multiple covariance-based imputation matrices for

the final prediction of missing values. The matrices are computed

and optimized using least square regression and linear programming

methods.

Results: The new CMVE algorithm has been compared with existing

estimation techniques including Bayesian principal component

analysis imputation (BPCA), least square impute (LSImpute) and

K-nearest neighbour (KNN). All these methods were rigorously tested

to estimate missing values in three separate non-time series (ovarian

cancer based) and one time series (yeast sporulation) dataset. Each

method was quantitatively analyzed using the normalized root mean

square (NRMS) error measure, covering a wide range of randomly

introduced missing value probabilities from 0.01 to 0.2. Experiments

were also undertaken on the yeast dataset, which comprised 1.7%

actual missing values, to test the hypothesis that CMVE performed

better not only for randomly occurring but also for a real distribution of

missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values

compared with other methods for both series types of data, for the

same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance

of the CMVE algorithm.

Availability:  The CMVE software is available upon request from the

authors.

Contact:  [email protected]

1 INTRODUCTION

DNA microarrays are extensively used to probe the genetic

expression of tens of thousands of genes under a variety of conditions,

as well as in the study of many biological processes varying

∗To whom correspondence should be addressed.

from human tumors (Sehgal   et al., 2004a) to yeast sporulation

(Troyanskaya et al., 2001). There are several statistical, mathematical and machine learning algorithms (Gustavo et al., 2003; Ramaswamy et al., 2001; Shipp et al., 2002) that exploit these data for diagnosis (Furey et al., 2000; Brown et al., 1997), drug discovery and protein

sequencing for instance. The most commonly used methods include

data dimension reduction techniques (Sehgal   et al., 2004e), class

prediction techniques (Sehgal et al., 2004bc; Golub et al., 1999) and

clustering methods (Munagala  et al., 2004).

Despite the wide usage of microarray data, they frequently contain

missing values with up to 90% of genes affected (Ouyang   et al.,

2004). Missing values can occur for various reasons, such as spotting

problems, slide scratches, blemishes on the chip, hybridization error,

and image corruption or simply dust on the slide (Oba  et al., 2003).

It has been proven (Sehgal  et al., 2004d), (Acuna  et al., 2004) that

missing values affect class prediction and data dimension reduction techniques, such as support vector machines (SVMs), neural

networks (NNs), principal component analysis (PCA) and singular

value decomposition (SVD). The problem can be managed in many

different ways from repeating the experiment, although this is often

not feasible for economic reasons, to simply ignoring the samples

containing missing values, although this is inappropriate because

usually there are only a very limited number of samples available.

The best solution is to attempt to accurately estimate the missing val-

ues, but unfortunately most approaches use zero impute (replace the

missing values by zero) or row average/median (replacement by the

corresponding row average/median), neither of which takes advantage of data correlations, thereby leading to high estimation errors

(Troyanskaya  et al., 2001). Current research demonstrates that if 

the correlation between data is exploited then missing value predic-

tion error can be reduced significantly (Sehgal et al., 2004d; Hellem et al., 2004). Several methods including K-nearest neighbour (KNN) impute, least square imputation (LSImpute) (Hellem et al., 2004) and Bayesian PCA (BPCA) (Oba et al., 2003) have been used; however, the prediction error generated using these methods still impacts on the

performance of statistical and machine learning algorithms including

class prediction, class discovery and differential gene identification

algorithms (Sehgal et al., 2004e). There is, thus, considerable potential to develop new techniques that will provide minimal prediction

errors for different types of microarray data including both time and

non-time series sequences.

This paper presents a collateral missing value estimation (CMVE)

algorithm which combines multiple value matrices for particular

missing data and optimizes its parameters using linear programming

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]



and least square (LS) regression. CMVE is compared with other

well-established techniques, including KNN, LSImpute and BPCA,

with their performance rigorously tested for the prediction of 

randomly introduced missing values, with probabilities ranging from

0.01 to 0.2 for the BRCA1, BRCA2 and sporadic mutation microarray data (mutations present in ovarian cancer), which is non-time series

data (Amir et al., 2001). The reason for introducing missing values

is that the number of actual missing values in the BRCA1, BRCA2

and sporadic mutation data are negligibly small compared with the

size of the dataset—only 0.01, 0.003 and 0.01% values, respectively.

Since randomly introduced missing values may not be distributed

in the same way as actual missing values (Oba   et al., 2003), a

separate experiment was performed, with CMVE and the other three

estimation algorithms being applied to the yeast sporulation time

series dataset (Spellman et al., 1998), which contains 1.7% missing

values.

The normalized root mean square (NRMS) error (Ouyang et al.,

2004) metric was used to quantitatively evaluate the estimation

performance of each technique, with results demonstrating the improved accuracy and robustness of CMVE over a wide range of randomly introduced missing values. In addition, while computational complexity is not as critical a factor as accuracy for missing value imputation, because estimation is performed only once during the data collection (Troyanskaya et al., 2001; Hellem et al., 2004), the order of computational complexity for CMVE proved to be exactly the same as for the LSImpute and KNN algorithms.

The remainder of the paper is organized as follows: Section 2

presents a brief overview of existing estimation techniques, with

their respective advantages and disadvantages, while the new CMVE

algorithm and methodology is detailed in Section 3. Section 4

provides the theoretical framework for the improved performance of 

CMVE compared with the KNN, LSImpute and BPCA algorithms,

while Section 5 fully analyses the respective estimation performance of all four imputation methods. Section 6 provides some

conclusions.

2 OVERVIEW OF EXISTING MISSING VALUE ESTIMATION TECHNIQUES

The following convention is adopted for all the imputation algorithms

described in this paper. The microarray data have the form of an m × n matrix Y, where m is the number of genes and n is the number of samples. The Y_IJ component of Y represents the expression level of gene I for sample J.
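With this convention, the data in the sketches that follow can be held in a NumPy array, with NaN marking the missing entries. A minimal setup, in which the matrix sizes and the missing-value rate are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 100, 18                      # m genes x n samples (illustrative)
Y = rng.normal(size=(m, n))

# Following the paper's convention, Y[I, J] is the expression level of
# gene I in sample J; NaN marks a missing measurement.
mask = rng.random((m, n)) < 0.05    # ~5% missing, chosen arbitrarily
Y[mask] = np.nan

print(np.isnan(Y).mean())           # fraction of missing entries
```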

An overview is now presented of the strengths and limitations of 

the three estimation techniques used for comparative purposes in assessing the performance of CMVE.

2.1 KNN estimation

The KNN method imputes missing values by selecting genes with expression values similar to the gene of interest (Troyanskaya et al., 2001). In order to estimate the missing value Y_IJ of gene I in sample J, k genes are selected whose expression vectors are similar to the genetic expression of I in samples other than J. The similarity measure between two expression vectors Y_1 and Y_2 is determined by the Euclidean distance ψ over the observed components in sample J:

$$\psi = \lVert Y_1 - Y_2 \rVert \quad (1)$$

The missing value is then estimated as the weighted average of the

corresponding entries in the selected  k  expression vectors:

$$Y_{IJ} = \sum_{i=1}^{k} W_i \cdot X_i \quad (2)$$

$$W_i = \frac{1}{\psi_i \times \Delta} \quad (3)$$

where $\Delta = \sum_{i=1}^{k} \psi_i$ and X is the input matrix containing gene expressions. Equations (2) and (3) show that each gene's contribution is weighted by the similarity of its expression to gene I.
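The scheme above can be sketched as follows: the Euclidean distance of Equation (1) is taken over the observed components of the target gene, and the k nearest rows are combined by inverse-distance weights. The weights here are normalized to sum to one, a common variant of the weighting in Equations (2) and (3); the function name and details are illustrative:

```python
import numpy as np

def knn_impute(Y, k=10):
    """Impute NaNs in Y (genes x samples) by a distance-weighted
    average over the k most similar genes (rows)."""
    Y = Y.astype(float)
    filled = Y.copy()
    for I, J in zip(*np.where(np.isnan(Y))):
        obs = ~np.isnan(Y[I])                 # observed samples of gene I
        # Candidate genes: observed at sample J and on gene I's samples.
        cand = [g for g in range(Y.shape[0])
                if g != I and not np.isnan(Y[g, J])
                and not np.isnan(Y[g][obs]).any()]
        if not cand:
            continue
        # Euclidean distance over the observed components, Eq. (1).
        d = np.array([np.linalg.norm(Y[I][obs] - Y[g][obs]) for g in cand])
        idx = np.argsort(d)[:k]
        w = 1.0 / (d[idx] + 1e-12)            # inverse-distance weights
        w /= w.sum()                          # normalize to sum to one
        filled[I, J] = w @ Y[np.array(cand)[idx], J]
    return filled

Y = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [10.0, 10.0, 10.0]])
print(knn_impute(Y, k=2)[1, 2])   # close to 3, pulled slightly toward 10
```

The example at the end illustrates the outlier sensitivity discussed next: the distant third gene still contributes a small amount to the estimate.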

The Euclidean distance measure used by KNN is sensitive to outlier values which may be present in microarray data, although log-transforming the data significantly reduces their effect on gene similarity determination (Troyanskaya et al., 2001). The choice of a small k degrades the performance of the classifier, as the imputation process overemphasizes a few dominant genes in estimating the missing values. Conversely, a large neighbourhood may include genes that are significantly different from those containing missing values, thereby degrading the estimation process and, commensurately, the classifier's performance. Empirical results have demonstrated that for small datasets k = 10 is the best choice (Acuna et al., 2004), while Troyanskaya et al. (2001) observed that KNN is insensitive to values of k in the range 10–20.

The computational complexity of KNN is O(m²n), where m and n are the number of genes and samples, respectively; while this is the same order as the LSImpute algorithm, Section 2.3 will show it is higher than BPCA. A notable limitation of KNN is that it does not consider negative correlations between data, which can lead to estimation errors.

2.2 Least square impute estimation

LSImpute is a regression-based estimation method that exploits the

correlation between genes. To estimate the missing value Y_IJ of gene I from the gene expression matrix Y, the k most correlated genes are first selected, whose expression vectors are similar to that of gene I in all samples except J and which contain non-missing values for gene I. The LS regression method then estimates the missing value Y_IJ. By possessing the flexibility to adjust the number of predictor genes k in the regression, LSImpute performs best when data have a strong local correlation structure, for the same order of computational complexity, O(m²n), as KNN.
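The regression step can be sketched as below. This simplified, hypothetical `ls_impute` uses only the single most correlated predictor gene; the full LSImpute combines k such regressions:

```python
import numpy as np

def ls_impute(Y):
    """Simplified LSImpute sketch: regress the target gene on its single
    most correlated gene over jointly observed samples, then predict."""
    Y = Y.astype(float)
    out = Y.copy()
    for I, J in zip(*np.where(np.isnan(Y))):
        best, best_r, obs = None, 0.0, None
        for g in range(Y.shape[0]):
            if g == I or np.isnan(Y[g, J]):
                continue
            ok = ~np.isnan(Y[I]) & ~np.isnan(Y[g])
            ok[J] = False
            if ok.sum() < 3:
                continue
            r = np.corrcoef(Y[I][ok], Y[g][ok])[0, 1]
            if abs(r) > abs(best_r):
                best, best_r, obs = g, r, ok
        if best is None:
            continue
        x, y = Y[best][obs], Y[I][obs]
        beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # LS slope
        alpha = y.mean() - beta * x.mean()              # LS intercept
        out[I, J] = alpha + beta * Y[best, J]           # prediction
    return out

Y = np.array([[1.0, 2.0, 3.0, 4.0, np.nan],
              [2.0, 4.0, 6.0, 8.0, 10.0],
              [5.0, 1.0, 4.0, 2.0, 3.0]])
print(ls_impute(Y)[0, 4])   # 5.0
```

The perfectly correlated second gene is selected as predictor, so the missing value is recovered exactly in this toy case.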

2.3 Bayesian PCA based estimation

BPCA estimates missing values Y_miss in the data matrix Y using those genes Y_obs having no missing values. The probabilistic PCA (PPCA) is calculated using Bayes' theorem, and the Bayesian estimation calculates the posterior distribution of the model parameter θ and the input matrix X containing gene expression samples using:

$$p(\theta, X \mid Y) \propto p(Y, X \mid \theta)\, p(\theta) \quad (4)$$

where p(θ) is known as the prior distribution, which contributes an a priori preference to θ and X.

Missing values are estimated using a Bayesian estimation algorithm, which is executed for both θ and Y_miss (similar to the iterative Expectation–Maximization algorithm) and calculates the posterior distributions q(θ) and q(Y_miss) for θ and Y_miss (Oba et al.,


Pre-condition: Gene expression matrix Y(m, n), where m and n are the number of genes and samples, respectively.
Post-condition: Y with no missing values.

Algorithm:

STEP 1 Locate missing value Y_IJ in gene I and sample J.
STEP 2 Compute the absolute covariance CoV of the expression vector ν of gene I using (7).
STEP 3 Rank genes (rows) based on CoV.
STEP 4 Select the k most effective rows R_k.
STEP 5 Use these values of R_k to estimate Φ₁ by (8).
STEP 6 Calculate Φ₂ and Φ₃ using (9) and (10).
STEP 7 Calculate missing value Y_IJ using (14) and impute estimate χ in all future predictions.
STEP 8 Seek the next missing value Y_IJ and repeat STEPS 2–7 until all missing Y values are estimated.
STEP 9 END

Fig. 1. The CMVE algorithm.

2003). Finally, the missing values in the gene expression matrix are imputed using:

$$Y = \int Y_{\mathrm{miss}}\, q(Y_{\mathrm{miss}})\, \mathrm{d}Y_{\mathrm{miss}} \quad (5)$$

$$q(Y_{\mathrm{miss}}) = p(Y_{\mathrm{miss}} \mid Y_{\mathrm{obs}}, \theta_{\mathrm{true}}) \quad (6)$$

where θ_true is the posterior of the missing value.

By exploiting only the global correlation in the datasets, BPCA has the advantage of prediction speed, incurring a computational complexity O(mn), which is one degree less than for both KNN and LSImpute. For imputation purposes, however, improved estimation accuracy is always a greater priority than speed.
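The full variational treatment of Oba et al. is involved, but the underlying alternation (fit a low-rank PCA model to the current fill, refill the missing cells from the reconstruction, repeat) can be sketched as below. This is a simplified, non-Bayesian stand-in with no priors on the model parameters, intended only to illustrate the EM-like iteration:

```python
import numpy as np

def pca_impute(Y, rank=2, iters=100):
    """EM-like low-rank imputation: alternate an SVD truncation of the
    mean-centred matrix with refilling the missing cells."""
    Y = Y.astype(float)
    miss = np.isnan(Y)
    # Initial fill: each gene's (row's) mean over its observed samples.
    X = np.where(miss, np.nanmean(Y, axis=1, keepdims=True), Y)
    for _ in range(iters):
        mu = X.mean(axis=1, keepdims=True)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        s[rank:] = 0.0                    # keep a rank-`rank` model
        recon = (U * s) @ Vt + mu
        X[miss] = recon[miss]             # refill only the missing cells
    return X

# A rank-1 matrix with one entry removed is recovered almost exactly.
base = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 2.0])
Y = base.copy()
Y[0, 0] = np.nan
print(pca_impute(Y, rank=1)[0, 0])   # close to the true value 1.0
```

Because the whole matrix is modelled at once, this style of method captures only the global correlation structure, which is the behaviour Section 4 contrasts with CMVE.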

3 THE CMVE ALGORITHM

The complete CMVE algorithm, which is detailed in Figure 1, introduces the concept of multiple parallel estimations of missing values. For instance, if value Y_IJ of gene I and sample J is missing, multiple estimates (Φ₁, Φ₂ and Φ₃) are generated and the final estimate χ distilled from these estimates. The covariance function is employed since, unlike KNN, it is unbiased in considering both positive and negative correlation values. The covariance function CoV is formally defined as:

$$\mathrm{CoV} = \frac{1}{n-1}\sum_{i=1}^{n}(\nu_i - \bar{\nu})(\omega_i - \bar{\omega}) \quad (7)$$

where ω is the predictor gene vector and ν the expression vector of gene I, which has the missing values. The absolute diagonal covariance CoV is first computed for a gene vector ν, where every gene except I is iteratively considered as ω (Step 2 in Fig. 1). The genes are then ordered with respect to their CoV values and the first k-ranked covariate genes R_k selected, whose expression vectors have the most similarity to gene I from Y in all samples except J (Step 4). The LS regression method (Harvey and Arthur, 2004) is then applied to estimate value Φ₁ for Y_IJ (Step 5) as:

$$\Phi_1 = \alpha + \beta X + \xi \quad (8)$$

where ξ is the error term that minimizes the variance in the LS model (parameters α and β). For a single regression, the estimates of α and β are, respectively,

$$\alpha = \bar{Y} - \beta\bar{X} \quad \text{and} \quad \beta = \frac{\sigma_{xy}}{\sigma_{xx}},$$

where $\sigma_{xy} = \frac{1}{n-1}\sum_{J=1}^{n}(X_J - \bar{X})(Y_J - \bar{Y})$ is the empirical covariance between X and Y, Y_J is the gene with the missing value and X_J is the predictor gene in R_k; $\sigma_{xx} = \frac{1}{n-1}\sum_{J=1}^{n}(X_J - \bar{X})^{2}$ is the empirical variance of X, with $\bar{X}$ and $\bar{Y}$ being the respective means over X_1, ..., X_n and Y_1, ..., Y_n. The LS estimate of Y given X is then expressed as:

$$\hat{Y} = \bar{Y} + \frac{\sigma_{xy}}{\sigma_{xx}}(X - \bar{X}).$$

The two other missing value estimates Φ₂ and Φ₃ (Step 6) are, respectively, given by:

$$\Phi_2 = \sum_{i=1}^{k}\phi_i + \eta - \sum_{i=1}^{k}\xi_i^{2} \quad (9)$$

$$\Phi_3 = \frac{\sum_{i=1}^{k}(\phi^{\mathrm{T}} \times I)}{k} + \eta \quad (10)$$

where φ is the vector that minimizes ξ₀ in Equation (12), η is the normal residual and ξ is the actual residual. These three parameters are obtained from the non-negative least squares (NNLS) algorithm (Charles et al., 1974). The objective is now to find a linear combination of models that best fits R_k and I. The objective function in NNLS minimizes, using linear programming techniques, the prediction error ξ₀ so that:

$$\xi, \phi, \eta = \min(\xi_0) \quad (11)$$

i.e. min(ξ₀) is a function that locates the normal vector φ with minimum prediction error ξ₀ and residual η. The value of ξ₀ in Equation (11) is obtained from

$$\xi_0 = \max(\mathrm{SV}(R_k \cdot \phi - I)) \quad (12)$$

where SV are the singular values of the difference vector between the dot product of R_k with the prediction coefficients φ and the gene expression row I. The tolerance used in the linear programming to compute vector φ is given by

$$\mathrm{Tol} = k \times n \times \max(\mathrm{SV}(R_k)) \times C \quad (13)$$

where k is the number of predictor genes, n the number of samples in the dataset and C is the normalization factor. The final estimate χ for Y_IJ is formed using

$$\chi = \rho \cdot \Phi_1 + \delta \cdot \Phi_2 + \lambda \cdot \Phi_3 \quad (14)$$

where ρ = δ = λ = 0.33 ensures an equal weighting of the respective estimates Φ₁, Φ₂ and Φ₃. The rationale for this choice is that, as each estimate is highly data-dependent, it avoids any bias toward one particular estimate.

The reason (14) has a lower NRMS error is that the first imputation matrix uses LS regression, while the second and third use NNLS, which is superior for estimating positively correlated values. NNLS is unable, however, to estimate negative values and, since microarray data possess both negative and positive values, this was the motivation to embed the LSImpute-based matrix into the gene expression prediction, thus combining the advantages of both algorithms to more accurately estimate the missing values.
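Putting the section together, a loose sketch of the CMVE recipe follows: rank genes by absolute covariance, form one LS-regression estimate, derive two further estimates from an NNLS fit, fuse them with equal weights, and write the result back so later imputations can reuse it (the collateral aspect). The second and third estimates here are simplified stand-ins for Equations (9) and (10), and SciPy's `nnls` performs the non-negative fit:

```python
import numpy as np
from scipy.optimize import nnls

def cmve_impute(Y, k=10):
    """Loose CMVE sketch: covariance ranking, LS + NNLS estimates,
    equal-weight fusion, and reuse of imputed values."""
    Y = Y.astype(float)
    for I, J in zip(*np.where(np.isnan(Y))):
        obs = ~np.isnan(Y[I])
        obs[J] = False
        cand = [g for g in range(Y.shape[0]) if g != I
                and not np.isnan(Y[g, J]) and not np.isnan(Y[g][obs]).any()]
        if not cand or obs.sum() < 3:
            continue
        # Steps 2-4: rank candidates by absolute covariance with gene I.
        cov = [abs(np.cov(Y[g][obs], Y[I][obs])[0, 1]) for g in cand]
        Rk = np.array(cand)[np.argsort(cov)[::-1][:k]]
        # Step 5: LS regression on the best covariate, Eq. (8).
        x, y = Y[Rk[0]][obs], Y[I][obs]
        beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        alpha = y.mean() - beta * x.mean()
        phi1 = alpha + beta * Y[Rk[0], J]
        # Step 6: non-negative combination of the k covariate rows.
        A = Y[Rk][:, obs].T                  # samples x k design matrix
        coef, _ = nnls(A, y)
        phi2 = coef @ Y[Rk, J]               # NNLS prediction at sample J
        phi3 = Y[Rk, J].mean() + (y - A @ coef).mean()
        # Step 7: equal-weight fusion, Eq. (14); reuse in later passes.
        Y[I, J] = (phi1 + phi2 + phi3) / 3.0
    return Y
```

Because estimates are written back into Y, a gene imputed early becomes a usable predictor for later missing values, which is the behaviour the theoretical arguments of Section 4 rely on.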


4 THEORETICAL FOUNDATIONS OF CMVE

This section explores the theoretical principles underpinning why the CMVE algorithm performs better than the KNN, LSImpute and BPCA techniques in estimating missing values. For completeness, a computational complexity analysis of CMVE is also provided, showing its order to be exactly the same as that of both LSImpute and KNN.

Proposition 1.   KNN only considers positive correlations.

If there are two sets α and β which are inversely proportional to each other, then the distance d between α and β will be larger than between sets which are directly proportional to each other. Several distance functions can be used for KNN, the most common being the Euclidean distance, given by

$$d = \lVert \alpha - \beta \rVert \quad (15)$$

so d is always higher when α is inversely proportional to β than when they are directly proportional to each other.

Proposition 2. The CMVE algorithm considers both positive and negative correlation values.

Assume two sets ν and ω that are inversely proportional, such that CoV < 0 ∀ ν, ω. From (7) it is clear that if a high correlation exists between the gene values (either directly proportional, giving positive correlation, or inversely proportional, giving negative correlation) then a higher absolute CoV value will exist.
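Propositions 1 and 2 can be checked numerically: a perfectly anti-correlated gene is "far" under the Euclidean distance of Equation (15), yet is ranked as highly informative by the absolute covariance of Equation (7). The values below are illustrative:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = -a + 5.0                               # inversely proportional to a
c = a + np.array([0.5, -0.5, 0.5, -0.5])   # positively correlated, noisy

dist_b = np.linalg.norm(a - b)     # Euclidean distance, Eq. (15)
dist_c = np.linalg.norm(a - c)
cov_b = abs(np.cov(a, b)[0, 1])    # |CoV|, Eq. (7)
cov_c = abs(np.cov(a, c)[0, 1])

print(dist_b > dist_c)   # True: KNN treats the anti-correlated gene as far
print(cov_b > cov_c)     # True: CMVE still ranks it as informative
```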

Proposition 3. The probability P(ε) of the normalized imputation error ε of missing values using CMVE is always less than that for BPCA, LSImpute and KNN.

The probability P(ε) of the normalized imputation error ε of the missing value for correlated data is directly proportional to the number of missing values M (Mclean, 2000). Assume P_1 and P_2 are the probabilities of normalized imputation errors, ε_1 for the three comparative algorithms and ε_2 for CMVE, such that:

$$P_1 = \sum_{i=0}^{M} P(\epsilon_1)P(M) = M \times P(\epsilon_1)P(M) \quad (16)$$

$$P_2 = \sum_{i=0}^{M} P(\epsilon_2)P(i) \quad (17)$$

Since the comparative methods do not feed estimated values back into future predictions, such algorithms consider M missing values, each with the same probability, for each prediction. In contrast, CMVE uses estimated values for the future prediction of missing values, so each estimate increases the number of predictor genes to be considered while concomitantly decreasing the prediction probabilities P(i) in Equation (17). Hence

$$P_2 < P_1, \quad \text{with } P_2 \to 0 \text{ as } i \to 0, \text{ since } P(i) = 0 \text{ for } i = 0 \quad (18)$$

Proposition 4. CMVE always has a lower estimation error of missing values in the case of transitive gene dependency (Gene A → B → C) than BPCA, LSImpute and KNN.

Assume that gene G_a1 is correlated with set S_1 such that

$$G_{a1} \to S_1, \quad S_1 = \{G_{b1}, G_{b2}, \ldots, G_{bn}\} \quad (19)$$

Similarly, gene G_b1 is correlated with S_2:

$$G_{b1} \to S_2, \quad S_2 = \{G_{c1}, G_{c2}, \ldots, G_{cn}\} \quad (20)$$

If the values of both G_a1 and G_b1 are missing, then G_b1 can be predicted using set S_2 and subsequently used to predict G_a1 more accurately using S_1, by including G_b1 rather than ignoring it.
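The transitive case can be illustrated with a toy chain in which gene C predicts gene B, whose estimate in turn predicts gene A; the genes, values and the `regress_predict` helper are all illustrative:

```python
import numpy as np

def regress_predict(x, y, x_new):
    """Least-squares prediction of y at x_new (illustrative helper)."""
    beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    alpha = y.mean() - beta * x.mean()
    return alpha + beta * x_new

s = np.arange(5.0)          # five samples; sample 4 is missing below
ga1 = 2.0 * s + 1.0         # gene A, correlated with gene B
gb1 = s + 0.5               # gene B, correlated with gene C
gc1 = 2.0 * s - 1.0         # gene C, fully observed

# Sample 4 of both Ga1 and Gb1 is missing.
# First predict Gb1[4] from its correlate Gc1 (the set S2) ...
gb1_hat = regress_predict(gc1[:4], gb1[:4], gc1[4])
# ... then reuse that estimate to predict Ga1[4] via Gb1 (the set S1),
# rather than discarding Gb1 because it contains a missing value.
ga1_hat = regress_predict(gb1[:4], ga1[:4], gb1_hat)

print(gb1_hat, ga1_hat)   # recovers the true values 4.5 and 9.0
```

Ignoring Gb1 entirely, as a method restricted to fully observed predictors must, would discard the strongest available correlate of Ga1.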

CMVE, unlike the other imputation techniques, considers estimated values in predicting future missing values. LSImpute replaces the missing value of a gene with an average value to compute the CoV matrix (Hellem et al., 2004). The NRMS error using this approach is always higher than for CMVE, since each CMVE iteration (Fig. 1, Steps 1–7) lowers this error. KNN and BPCA assume that genes with missing entries have no correlation with the missing value gene, as they ignore these missing values while searching the estimation space. In contrast, CMVE includes these genes when searching for the most correlated gene. This may incur a small accumulative error in future predictions, but it will always be less than when either the average value of the gene is used or the gene is totally ignored.

Proposition 5.   CMVE generates a lower estimation error than

 BPCA when genes have dominant local correlation.

BPCA assumes only a global correlation structure, which has a similar effect to selecting a high value of k for CMVE. Owing to this assumption, BPCA does not provide accurate estimates when genes have dominant local correlation (Oba et al., 2003), because in predicting missing values, information from all genes is considered, many of which have little or no correlation with the gene with the missing value. In contrast, the CMVE variable k can be adjusted depending upon the type of the data, ensuring that only those genes with strong correlations are considered, which concomitantly reduces the estimation error. The empirical results presented in the next section demonstrate that a value of k = 10 is suitable for locally correlated data.

Computational complexity analysis.   The order of computational

complexity for CMVE is exactly the same as for the KNN and

LSImpute algorithms.

The critical operation for the CMVE, KNN and LSImpute algorithms is the search for the genes most correlated with the gene that has missing values. Each estimation takes linear time O(n); therefore, for m genes and n samples the complexity order is O(m²n) for all

algorithms. KNN uses a weighted average of   k  correlated genes

to estimate the missing values, while CMVE and LSImpute use

regression and linear programming for estimation, although these

additional overheads are negligible compared with the time taken to

search for the most correlated genes. Similar to KNN and LSImpute,

CMVE also only searches once per estimation for correlated genes.

As discussed in Section 2.3, BPCA has a computational complexity of O(mn), as it only considers the global correlation structure of the data. This advantage is pyrrhic, however, because the corresponding estimation accuracy is significantly inferior whenever data have a localized correlation structure.


Fig. 2. NRMS error over a wide range of k in the CMVE algorithm for 5% missing values (series: B1, B2, Sp and Yeast).

5 RESULTS ANALYSIS

To test the different imputation algorithms, four different types of

microarray data were used including both time series and non-time

series data. The dataset contained 18, 16, 27 and 77 samples of 

BRCA1, BRCA2, sporadic mutations (neither BRCA1 nor BRCA2)

of ovarian cancer data (non-time series) and yeast sporulation data

(time series), respectively. Each ovarian cancer data sample con-

tained logarithmic microarray data of 6445 genes while there were

6179 genetic expressions per sample for yeast dataset. The rationale

for selecting cancer data is that in such data some of the genes are

up/downregulated hence it is very difficult to determine their expres-

sion levels from non-regulated genes. The missing value estimation

techniques were tested by randomly removing data values and then

computing the estimation error. In the experiments, between 1 and

5% of the values were removed from each dataset samples and the

NRMS error θ  was computed by

θ = RMS(M − M_est) / RMS(M)    (21)

where M is the original data matrix and M_est is the estimated matrix produced by KNN, LSImpute, BPCA or CMVE. This particular metric was used for error estimation because θ = 1 for zero imputation (Ouyang et al., 2004).
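Equation (21) and the random-removal protocol can be expressed directly. The helpers below are our own sketch (names are ours, not from the paper), with the error evaluated over the imputed positions so that zero imputation gives θ = 1 as stated:

```python
import numpy as np

def nrms_error(M, M_est, mask):
    """NRMS error of Eq. (21), evaluated over the imputed positions
    given by the boolean matrix `mask` (True where values were removed)."""
    diff = (M - M_est)[mask]
    return np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(M[mask] ** 2))

def mask_random(M, p, seed=0):
    """Randomly remove a fraction p of the entries (set them to NaN),
    mirroring the random-removal protocol described above."""
    rng = np.random.default_rng(seed)
    masked = M.copy()
    masked[rng.random(M.shape) < p] = np.nan
    return masked
```

With zero imputation (every removed entry estimated as 0) the numerator and denominator coincide, so θ = 1 by construction.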

To compare the performance of the CMVE, KNN and LSImpute imputation algorithms, k = 10 was used throughout the experiments. The rationale for this was that Troyanskaya et al. (2001) observed that KNN was insensitive to values of k in the range 10–20, with the best estimation results observed in this range, and Hellem et al. (2004) also suggested using k = 10 for LSImpute. Figure 2 plots the minimum overall prediction error rates for CMVE over a range of k values for the different test datasets, with results showing that k in the range 10–15 (highlighted) is the most appropriate. Lower k values include only a small set of correlated genes for prediction, leading to prediction errors as other correlated genes are ignored. Conversely, when k is high, genes that have little or no correlation with the gene having missing values are included in the prediction, again leading to erroneous results (Troyanskaya et al., 2001).

To fully test the robustness of the new CMVE algorithm, experiments were performed for missing values of up to 20% (Figs 3–9). Figure 8a and b show the error values for 10% missing values, which

[Figures 3–5: bar charts of normalized RMS error for BPCA, KNN, LSImpute and CMVE on the Brca1, Brca2 and Sporadic data.]

Fig. 3. NRMS error for 1% missing values for ovarian cancer data.

Fig. 4. NRMS error for 2% missing values for ovarian cancer data.

Fig. 5. NRMS error for 3% missing values for ovarian cancer data.

especially reveal (Fig. 8a) the significant deterioration in the results of KNN for the sporadic dataset.

To clarify the performance of CMVE, Figure 8b plots the error results without KNN, which consistently confirm the lower error values compared with LSImpute and BPCA. Figure 9a and b show the corresponding results for 20% missing values, which again reveal the superiority and greater robustness of the CMVE algorithm for missing value imputation. Note that, for the sake of clarity, a logarithmic scale is used in Figure 9b.

Whenever there is a high number of missing values in a gene, sparse covariance matrices will ensue and, with them, an increased likelihood of ill-conditioning. The CMVE algorithm avoids ill-conditioning by ensuring the removal of all genes with >20% missing values before imputation.
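This pre-filtering step can be sketched as follows (a minimal illustration; the helper name is ours):

```python
import numpy as np

def drop_sparse_genes(data, max_missing=0.2):
    """Remove genes (rows) whose fraction of missing values exceeds
    max_missing, guarding the covariance matrices against the
    ill-conditioning described above."""
    frac_missing = np.isnan(data).mean(axis=1)
    return data[frac_missing <= max_missing]
```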


[Figures 6 and 7: bar charts of normalized RMS error for BPCA, KNN, LSImpute and CMVE on the Brca1, Brca2 and Sporadic data.]

Fig. 6. NRMS error for 4% missing values for ovarian cancer data.

Fig. 7. NRMS error for 5% missing values for ovarian cancer data.

Fig. 8. (a) NRMS error for 10% missing values for ovarian cancer data. (b) NRMS error of CMVE, BPCA and LSImpute for 10% missing values for ovarian cancer data.

As highlighted in Section 1, experiments performed on datasets

with randomly introduced missing values may not truly reflect the

nature of actual microarray data missing values. All four imputation

algorithms were, therefore, tested on the yeast time series data

[Figure 9: bar charts of NRMS error for BPCA, KNN, LSImpute and CMVE on the Brca1, Brca2 and Sporadic data; panel (b) uses a logarithmic error scale.]

Fig. 9. (a) NRMS error for 20% missing values for ovarian cancer data. (b) NRMS error (log scale) for 20% missing values for ovarian cancer data.

Table 1. NRMS errors for the actual missing value distribution (AMVD) of 1.7% missing values and additional 1–10% randomly introduced values in the yeast dataset

Missing values   BPCAImpute   KNNImpute   LSImpute   CMVE

AMVD             0.1485       0.0654      0.0130     0.0064
1%               0.0319       0.8930      0.0849     0.0030
2%               0.0569       0.5284      0.1555     0.0843
3%               0.0674       0.6232      0.1612     0.0547
4%               0.0846       0.9307      0.2003     0.0090
5%               0.0927       0.5821      0.2071     0.0091
10%              0.1756       0.8763      0.0638     0.0130

containing 1.7% missing values. Since NRMS errors could not be calculated for these actual missing values, each missing value was replaced by the value of the adjacent gene before applying the imputation algorithms. This had the effect of a delay function, while retaining the same distribution of missing values. The results in Table 1 again confirm the superior performance of CMVE, particularly when an additional 4, 5 and 10% of missing values are introduced into the data, with the corresponding average improvements being 60, 72 and 64%, respectively. The imputation results also reveal some other broader noteworthy issues. KNN, for instance, performed better when missing values were randomly introduced, because KNN only considers positive correlations and certain randomly introduced missing values will inevitably have negative correlations with other genetic data. Similarly, LSImpute exhibited improved performance compared with BPCA (Oba et al., 2003), confirming the discussion underpinning Proposition 5.
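The adjacent-gene substitution used above to obtain a ground truth for the actual missing values might look as follows. This is our own reading of the description (the paper gives no explicit algorithm), replacing each missing entry with the next gene's value in the same sample so that the missing-value distribution is preserved:

```python
import numpy as np

def substitute_adjacent(data):
    """Replace each missing entry with the adjacent gene's value in the
    same sample, keeping the original missing-value distribution while
    providing a known 'true' value for error computation (a delay effect).
    Assumes the adjacent gene is observed at that sample."""
    filled = data.copy()
    m = data.shape[0]
    for g, s in zip(*np.where(np.isnan(data))):
        neighbour = (g + 1) % m        # next gene, wrapping at the end
        filled[g, s] = data[neighbour, s]
    return filled
```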


6 CONCLUSIONS

This paper has presented a new CMVE algorithm based on the novel concept of multiple imputations. Experimental results confirmed that CMVE consistently provided superior estimation accuracy compared with existing missing value imputation algorithms, including KNN, LSImpute and BPCA. This performance improvement was especially evident when estimating higher numbers of missing values in both time series and non-time series data. The algorithm's theoretical basis, namely the exploitation of a combination of global and local correlations in a given dataset, repeatedly proved to be a more effective and robust strategy than the distance function used by KNN, with no increase in the order of computational complexity for any value of k. The results corroborate the fact that CMVE can be successfully applied to accurately impute missing values before any microarray data experiment, crucially without any bias being introduced into the estimation process.

REFERENCES

Acuna,E. and Rodriguez,C. (2004) The treatment of missing values and its effect in the classifier accuracy. In Banks,D. et al. (eds) Classification, Clustering and Data Mining Applications. Springer-Verlag, Berlin, Heidelberg, pp. 639–648.
Amir,A.J. et al. (2002) Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. J. Natl Cancer Inst., 94, 981–990.
Brown,W.N. et al. (1997) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.
Golub,T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Gustavo,B. and Monard,C.M. (2003) An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell., 17, 519–533.
Furey,T.S. et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
Harvey,M. and Arthur,C. (2004) Fitting Models to Biological Data Using Linear and Nonlinear Regression. Oxford University Press, Oxford.
Hellem,B.T. et al. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res., 32, e34.
Lawson,C.L. and Hanson,R.J. (1974) Solving Least Squares Problems. Prentice-Hall, Inc., Englewood Cliffs, NJ.
Munagala,K. et al. (2004) Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics, 5, 21.
McLean,A. (2000) The predictive approach to teaching statistics. J. Stat. Education, 8.
Oba,S. et al. (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19, 2088–2096.
Ouyang,M. et al. (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics, 20, 917–923.
Ramaswamy,S. et al. (2001) Multiclass cancer diagnosis using tumour gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154.
Sehgal,M.S.B. et al. (2004a) Support vector machine and generalized regression neural network based classification fusion models for cancer diagnosis. In HIS'04, Japan.
Sehgal,M.S.B. et al. (2004b) A collimator neural network model for the classification of genetic data. ICBA 04, USA.
Sehgal,M.S.B. et al. (2004c) Communal neural network for ovarian cancer mutation classification. Complex 04, Australia.
Sehgal,M.S.B. et al. (2004d) K-ranked covariance based missing values estimation for microarray data classification. HIS'04, Japan.
Sehgal,M.S.B. et al. (2004e) Statistical neural networks and support vector machine for the classification of genetic mutations in ovarian cancer. IEEE CIBCB 04, USA.
Shipp,M.A. et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat. Med., 8, 68–74.
Spellman,P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
Troyanskaya,O. et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
