BIOINFORMATICS ORIGINAL PAPER Vol.21 no.10 2005, pages 2417–2423
doi:10.1093/bioinformatics/bti345
Gene expression
Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data
Muhammad Shoaib B. Sehgal∗, Iqbal Gondal and Laurence S. Dooley
Gippsland School of Computing and Information Technology, Monash University, VIC 3842, Australia
Received on November 21, 2004; revised on January 30, 2005; accepted on February 18, 2005
Advance Access publication February 24, 2005
ABSTRACT
Motivation: Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation
algorithms have been proposed, more robust techniques need to be
developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented, which uses multiple covariance-based imputation matrices for
the final prediction of missing values. The matrices are computed
and optimized using least square regression and linear programming
methods.
Results: The new CMVE algorithm has been compared with existing
estimation techniques including Bayesian principal component
analysis imputation (BPCA), least square impute (LSImpute) and
K-nearest neighbour (KNN). All these methods were rigorously tested
to estimate missing values in three separate non-time series (ovarian
cancer based) and one time series (yeast sporulation) dataset. Each
method was quantitatively analyzed using the normalized root mean
square (NRMS) error measure, covering a wide range of randomly
introduced missing value probabilities from 0.01 to 0.2. Experiments
were also undertaken on the yeast dataset, which comprised 1.7%
actual missing values, to test the hypothesis that CMVE performed
better not only for randomly occurring but also for a real distribution of
missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability for missing values compared with the other methods for both types of data, for the
same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm.
Availability: The CMVE software is available upon request from the
authors.
Contact: [email protected]
1 INTRODUCTION
∗To whom correspondence should be addressed.

DNA microarrays are extensively used to probe the genetic expression of tens of thousands of genes under a variety of conditions, as well as in the study of many biological processes varying from human tumors (Sehgal et al., 2004a) to yeast sporulation (Troyanskaya et al., 2001). There are several statistical, mathematical and machine learning algorithms (Gustavo et al., 2003; Ramaswamy et al., 2001; Shipp et al., 2002) that exploit these data for diagnosis (Furey et al., 2000; Brown et al., 1997), drug discovery and protein
sequencing for instance. The most commonly used methods include
data dimension reduction techniques (Sehgal et al., 2004e), class
prediction techniques (Sehgal et al., 2004b,c; Golub et al., 1999) and
clustering methods (Munagala et al., 2004).
Despite the wide usage of microarray data, they frequently contain
missing values with up to 90% of genes affected (Ouyang et al.,
2004). Missing values can occur for various reasons, such as spotting
problems, slide scratches, blemishes on the chip, hybridization error,
and image corruption or simply dust on the slide (Oba et al., 2003).
It has been proven (Sehgal et al., 2004d; Acuna et al., 2004) that
missing values affect class prediction and data dimension reduction techniques, such as support vector machines (SVMs), neural
networks (NNs), principal component analysis (PCA) and singular
value decomposition (SVD). The problem can be managed in many
different ways from repeating the experiment, although this is often
not feasible for economic reasons, to simply ignoring the samples
containing missing values, although this is inappropriate because
usually there are only a very limited number of samples available.
The best solution is to attempt to accurately estimate the missing val-
ues, but unfortunately most approaches use zero impute (replace the
missing values by zero) or row average/median (replacement by the
corresponding row average/median), neither of which takes advantage of data correlations, thereby leading to high estimation errors
(Troyanskaya et al., 2001). Current research demonstrates that if
the correlation between data is exploited then missing value prediction error can be reduced significantly (Sehgal et al., 2004d; Hellem et al., 2004). Several methods, including K-nearest neighbour (KNN) impute, least square imputation (LSImpute) (Hellem et al., 2004) and Bayesian PCA (BPCA) (Oba et al., 2003), have been used; however, the prediction error generated using these methods still impacts on the
performance of statistical and machine learning algorithms including
class prediction, class discovery and differential gene identification
algorithms (Sehgal et al., 2004e). There is, thus, considerable potential to develop new techniques that will provide minimal prediction
errors for different types of microarray data including both time and
non-time series sequences.
This paper presents a collateral missing value estimation (CMVE)
algorithm which combines multiple value matrices for particular
missing data and optimizes its parameters using linear programming
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
and least square (LS) regression. CMVE is compared with other
well-established techniques, including KNN, LSImpute and BPCA,
with their performance rigorously tested for the prediction of
randomly introduced missing values, with probabilities ranging from
0.01 to 0.2, for the BRCA1, BRCA2 and sporadic mutation microarray data (mutations present in ovarian cancer), which is non-time series
data (Amir et al., 2001). The reason for introducing missing values
is that the number of actual missing values in the BRCA1, BRCA2
and sporadic mutation data are negligibly small compared with the
size of the dataset: only 0.01, 0.003 and 0.01% of values, respectively.
Since randomly introduced missing values may not be distributed
in the same way as actual missing values (Oba et al., 2003), a
separate experiment was performed, with CMVE and the other three
estimation algorithms being applied to the yeast sporulation time
series dataset (Spellman et al., 1998), which contains 1.7% missing
values.
The normalized root mean square (NRMS) error (Ouyang et al.,
2004) metric was used to quantitatively evaluate the estimation
performance of each technique, with results demonstrating the improved accuracy and robustness of CMVE over a wide range of
randomly introduced missing values. In addition, while computational complexity is not as critical a factor as accuracy for missing
value imputation because estimation is performed only once during
the data collection (Troyanskaya et al., 2001; Hellem et al., 2004), the
order of computational complexity for CMVE proved to be exactly
the same as the LSImpute and KNN algorithms.
The remainder of the paper is organized as follows: Section 2
presents a brief overview of existing estimation techniques, with
their respective advantages and disadvantages, while the new CMVE
algorithm and methodology is detailed in Section 3. Section 4
provides the theoretical framework for the improved performance of
CMVE compared with the KNN, LSImpute and BPCA algorithms,
while Section 5 fully analyses the respective estimation performance of all four imputation methods. Section 6 provides some
conclusions.
2 OVERVIEW OF EXISTING MISSING VALUE ESTIMATION TECHNIQUES
The following convention is adopted for all the imputation algorithms
described in this paper. The microarray data have the form of an m × n
matrix Y , where m is the number of genes and n is the number of
samples. The Y IJ component of Y represents the expression level of
gene I for sample J .
An overview is now presented of the strengths and limitations of
the three estimation techniques used for comparative purposes in assessing the performance of CMVE.
2.1 KNN estimation
The KNN method imputes missing values by selecting genes with
expression values similar to the gene of interest (Troyanskaya et al.,
2001). In order to estimate the missing value Y IJ of gene I in sample
J , k genes are selected whose expression vectors are similar to
genetic expression of I in samples other than J . The similarity
measure between two expression vectors Y_1 and Y_2 is determined by the Euclidean distance ψ over the observed components in sample J:
ψ = ||Y_1 − Y_2||    (1)

The missing value is then estimated as the weighted average of the corresponding entries in the selected k expression vectors:

Y_IJ = Σ_{i=1}^{k} W_i · X_i    (2)

W_i = 1 / (ψ_i × Ψ)    (3)

where Ψ = Σ_{i=1}^{k} ψ_i and X is the input matrix containing gene expressions. Equations (2) and (3) show that each gene's contribution is weighted by the similarity of its expression to gene I.
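For illustration, the scheme of Equations (1)–(3) can be sketched in a few lines of NumPy. This is a simplified reading, not the authors' code: `knn_impute` is a hypothetical name, the inverse-distance weights are explicitly normalized so they sum to one, and a small constant guards against zero distances.

```python
import numpy as np

def knn_impute(Y, I, J, k=10):
    """Estimate the missing value Y[I, J] as a weighted average of the k
    genes (rows) nearest to gene I in Euclidean distance (Eqs 1-3).
    Assumes gene I has no other missing entries."""
    obs = np.ones(Y.shape[1], dtype=bool)
    obs[J] = False                                # compare on the other samples
    target = Y[I, obs]
    candidates = []
    for g in range(Y.shape[0]):
        if g == I or np.isnan(Y[g, J]):
            continue                              # a neighbour must observe sample J
        psi = np.linalg.norm(Y[g, obs] - target)  # Eq. (1): Euclidean distance
        candidates.append((psi, g))
    candidates.sort()
    nearest = candidates[:k]
    w = np.array([1.0 / (psi + 1e-12) for psi, _ in nearest])
    w /= w.sum()                                  # normalized inverse-distance weights
    x = np.array([Y[g, J] for _, g in nearest])
    return float(w @ x)                           # Eq. (2): weighted average
```

With k identical neighbour genes, the estimate reduces to their common value at sample J, as expected.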
The Euclidean distance measure used by KNN is sensitive to outlier values which may be present in microarray data, although log-transforming the data significantly reduces their effects on gene similarity determination (Troyanskaya et al., 2001). The choice of a small k degrades the performance of the classifier as the imputation process overemphasizes a few dominant genes in estimating the missing values. Conversely, a large neighbourhood may include genes that are significantly different from those containing missing values, thereby degrading the estimation process and commensurately the classifier's performance. Empirical results have demonstrated that for small datasets k = 10 is the best choice (Acuna et al., 2004), while Troyanskaya et al. (2001) observed that KNN is insensitive to values of k in the range 10–20.
The computational complexity of KNN is O(m²n), where m and n are the number of genes and samples, respectively, and while this is the same order as the LSImpute algorithm, Section 2.3 will show it is higher than BPCA. A notable limitation of KNN is that it does not consider negative correlations between data, which can lead to estimation errors.
2.2 Least square impute estimation
LSImpute is a regression-based estimation method that exploits the correlation between genes. To estimate the missing value Y_IJ of gene I from gene expression matrix Y, the k most correlated genes are first selected whose expression vectors are similar to gene I from Y in all samples except J, containing non-missing values for gene I. The LS regression method then estimates the missing value Y_IJ. By possessing the flexibility to adjust the number of predictor genes k in the regression, LSImpute performs best when data have a strong local correlation structure, for the same order of computational complexity, O(m²n), as KNN.
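The regression step can be sketched as follows. This is a sketch in the spirit of LSImpute rather than the published method: `ls_impute` is a hypothetical name, genes are ranked by absolute Pearson correlation, one single-predictor regression is fitted per selected gene, and the predictions are simply averaged (LSImpute proper also weights the individual regressions by their fit).

```python
import numpy as np

def ls_impute(Y, I, J, k=5):
    """Regression-based estimate of Y[I, J]: select the k genes most
    correlated with gene I over the samples other than J, fit one
    least-squares regression per predictor, and average the predictions."""
    obs = np.arange(Y.shape[1]) != J
    y = Y[I, obs]                                  # gene I on the observed samples
    scored = []
    for g in range(Y.shape[0]):
        if g == I or np.isnan(Y[g, J]):
            continue
        r = np.corrcoef(Y[g, obs], y)[0, 1]
        scored.append((abs(r), g))
    scored.sort(reverse=True)                      # most correlated genes first
    preds = []
    for _, g in scored[:k]:
        x = Y[g, obs]
        beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
        alpha = y.mean() - beta * x.mean()         # least-squares intercept
        preds.append(alpha + beta * Y[g, J])       # predict from sample J
    return float(np.mean(preds))
```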
2.3 Bayesian PCA based estimation
BPCA estimates missing values Y_miss in data matrix Y using those genes Y_obs having no missing values. The probabilistic PCA (PPCA) is calculated using Bayes' theorem, and the Bayesian estimation calculates the posterior distribution of model parameter θ and input matrix X containing gene expression samples using:

p(θ, X | Y) ∝ p(Y, X | θ)p(θ)    (4)

where p(θ) is the prior distribution, which contributes a priori preference to θ and X.
Missing values are estimated using a Bayesian estimation algorithm, which is executed for both θ and Y_miss (similar to the iterative Expectation–Maximization algorithm) and calculates the posterior distributions for θ and Y_miss, q(θ) and q(Y_miss) (Oba et al.,
Pre-condition: Gene expression matrix Y(m, n), where m and n are the number of genes and samples, respectively.
Post-condition: Y with no missing values.
Algorithm:
STEP 1 Locate missing value Y_IJ in gene I and sample J.
STEP 2 Compute the absolute covariance CoV of expression vector ν of gene I using (7).
STEP 3 Rank genes (rows) based on CoV.
STEP 4 Select the k most effective rows R_k.
STEP 5 Use these values of R_k to estimate Φ_1 by (8).
STEP 6 Calculate Φ_2 and Φ_3 using (9) and (10).
STEP 7 Calculate missing value Y_IJ using (14) and impute estimate χ in all future predictions.
STEP 8 Seek the next missing value Y_IJ and repeat STEPS 2–7 until all missing Y values are estimated.
STEP 9 END

Fig. 1. The CMVE algorithm.
2003). Finally, the missing values in the gene expression matrix are
imputed using:
Y = ∫ Y_miss q(Y_miss) dY_miss    (5)

q(Y_miss) = p(Y_miss | Y_obs, θ_true)    (6)

where θ_true is the posterior of the missing value.
By exploiting only the global correlation in the datasets, BPCA has the advantage of prediction speed, incurring a computational complexity of O(mn), which is one degree less than for both KNN and
LSImpute. For imputation purposes, however, improved estimation
accuracy is always a greater priority than speed.
3 THE CMVE ALGORITHM
The complete CMVE algorithm, which is detailed in Figure 1, introduces the concept of multiple parallel estimations of missing values. For instance, if value Y_IJ of gene I and sample J is missing, multiple estimates (Φ_1, Φ_2 and Φ_3) are generated and the final estimate χ distilled from these estimates. The covariance function is employed since, unlike KNN, it is unbiased in considering both positive and negative correlation values. The covariance function CoV is formally defined as:
CoV = 1/(n − 1) Σ_{i=1}^{n} (ν_i − ν̄)(ω_i − ω̄)    (7)
where ω is the predictor gene vector and ν the expression vector of gene I which has the missing values. The absolute diagonal covariance CoV is first computed for a gene vector ν, where every gene except I is iteratively considered as ω (Step 2 in Fig. 1). The genes are then ordered with respect to their CoV values and the first k-ranked covariate genes R_k selected, whose expression vectors have the most similarity to gene I from Y in all samples except J (Step 4). The LS regression method (Harvey and Arthur, 2004) is then applied to estimate value Φ_1 for Y_IJ (Step 5) as:
Φ_1 = α + βX + ξ    (8)

where ξ is the error term that minimizes the variance in the LS model (parameters α and β). For a single regression, the estimates of α and β are, respectively,

α = Ȳ − βX̄  and  β = S_xy / S_xx,

where S_xy = 1/(n − 1) Σ_{J=1}^{n} (X_J − X̄)(Y_J − Ȳ) is the empirical covariance between X and Y, Y_J is the gene with the missing value and X_J is the predictor gene in R_k. S_xx = 1/(n − 1) Σ_{J=1}^{n} (X_J − X̄)² is the empirical variance of X, with X̄ and Ȳ being the respective means over X_1, ..., X_n and Y_1, ..., Y_n, so the LS estimate of Y given X is expressed as:

Ŷ = Ȳ + (S_xy / S_xx)(X − X̄).
The two other missing value estimates Φ_2 and Φ_3 (Step 6) are, respectively, given by:

Φ_2 = Σ_{i=1}^{k} φ + η − Σ_{i=1}^{k} ξ²    (9)

Φ_3 = (Σ_{i=1}^{k} (φ^T × I)) / k + η    (10)
where φ is the vector that minimizes ξ_0 in Equation (12), η is the normal residual and ξ is the actual residual. These three parameters are obtained from the non-negative least square (NNLS) algorithm (Lawson and Hanson, 1974). The objective is now to find a linear combination of models that best fits R_k and I. The objective function in NNLS minimizes, using linear programming techniques, the prediction error ξ_0 so that:
[ξ, φ, η] = min(ξ_0)    (11)

i.e. min(ξ_0) is a function that locates the normal vector φ with minimum prediction error ξ_0 and residual η. The value of ξ_0 in Equation (11) is obtained from

ξ_0 = max(SV(R_k · φ − I))    (12)
where SV are the singular values of the difference vector between the dot product of R_k and the prediction coefficients φ with the gene expression row I. The tolerance used in the linear programming to compute vector φ is given by

Tol = k × n × max(SV(R_k)) × C    (13)

where k is the number of predictor genes, n the number of samples in the dataset and C is the normalization factor. The final estimate χ for Y_IJ is formed using
χ = ρ · Φ_1 + σ · Φ_2 + τ · Φ_3    (14)

where ρ = σ = τ = 0.33 ensures an equal weighting of the respective estimates Φ_1, Φ_2 and Φ_3. The rationale for this choice is that, as each estimate is highly data-dependent, equal weighting avoids any bias toward one particular estimate.
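The estimation steps can be sketched end to end as below. This is a simplified reading, not the published implementation: `cmve_estimate` is a hypothetical name; Φ_1 follows Equation (8) as a single regression on the top-ranked covariate; and, because the printed forms of Equations (9) and (10) are ambiguous, Φ_2 and Φ_3 are stand-ins built from a least-squares fit over R_k, with Φ_2 clipping negative coefficients to zero as a crude surrogate for the NNLS step. Only the fusion rule of Equation (14) is reproduced literally.

```python
import numpy as np

def cmve_estimate(Y, I, J, k=5):
    """Collateral estimate of Y[I, J]: rank genes by absolute covariance
    with gene I (Eq. 7), keep the top k rows R_k, form three parallel
    estimates and fuse them with equal weights (Eq. 14)."""
    obs = np.arange(Y.shape[1]) != J
    y = Y[I, obs]
    ranked = []
    for g in range(Y.shape[0]):
        if g == I or np.isnan(Y[g, J]):
            continue
        c = abs(np.cov(Y[g, obs], y, ddof=1)[0, 1])   # Eq. (7): |CoV|
        ranked.append((c, g))
    ranked.sort(reverse=True)
    rows = [g for _, g in ranked[:k]]                 # R_k (Steps 2-4)
    # Phi_1: single LS regression on the top-ranked covariate (Eq. 8)
    x = Y[rows[0], obs]
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    phi1 = (y.mean() - beta * x.mean()) + beta * Y[rows[0], J]
    # Phi_2 / Phi_3: linear combinations of all k rows fitted by least
    # squares; Phi_2 clips coefficients at zero (non-negativity surrogate)
    A = Y[rows][:, obs].T                             # (n-1) x k design matrix
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    at_J = Y[rows, J]                                 # predictor values at sample J
    phi3 = float(coef @ at_J)
    phi2 = float(np.clip(coef, 0.0, None) @ at_J)
    return 0.33 * (phi1 + phi2 + phi3)                # Eq. (14), rho = sigma = tau = 0.33
```

On perfectly correlated synthetic data all three estimates agree, and the fused value differs from the truth only by the 0.99 total weight that the paper's 0.33 coefficients imply.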
The reason (14) has a lower NRMS error is that the imputation matrix Φ_1 uses LS regression, while matrices Φ_2 and Φ_3 use NNLS, which is superior for estimating positively correlated values. NNLS is unable, however, to estimate negative values and, given that microarray data possess both negative and positive values, this was the motivation to embed the LSImpute-based matrix into the gene expression prediction, thus combining the advantages of both algorithms to more accurately estimate the missing values.
4 THEORETICAL FOUNDATIONS OF CMVE
This section explores the theoretical principles underpinning why the CMVE algorithm performs better than the KNN, LSImpute and BPCA techniques in estimating missing values. For completeness, a computational complexity analysis of CMVE is provided, and it is shown to be of exactly the same order as both LSImpute and KNN.
Proposition 1. KNN only considers positive correlations.
If there are two sets α and β which are inversely proportional to each other, then the distance d between α and β will be larger than between sets which are directly proportional to each other. Several distance functions can be used for KNN, the most common being the Euclidean distance, which is given by

d = ||α − β||    (15)

so d is always higher when α is inversely proportional to β than when they are directly proportional to each other.
Proposition 2. The CMVE algorithm considers both positive and negative correlation values.
Assume two sets ν and ω that are inversely proportional, such that CoV < 0 ∀ν, ω. From (7) it is clear that if a high correlation exists between the gene values (either directly proportional, i.e. positive correlation, or inversely proportional, i.e. negative correlation) then a higher absolute CoV value will exist.
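The contrast between the two propositions can be checked numerically: for a perfectly anti-correlated gene, the Euclidean distance of Equation (15) is large, so KNN ranks it a poor neighbour, while the absolute covariance of Equation (7) rates it exactly as informative as a perfectly positively correlated gene (the vectors below are illustrative only):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # expression vector of the target gene
pos = v.copy()                                  # perfectly positively correlated gene
neg = 6.0 - v                                   # perfectly negatively correlated gene

d_pos = np.linalg.norm(v - pos)                 # Eq. (15): distance to pos is 0
d_neg = np.linalg.norm(v - neg)                 # distance to neg is large
cov_pos = abs(np.cov(v, pos, ddof=1)[0, 1])     # |CoV| of Eq. (7)
cov_neg = abs(np.cov(v, neg, ddof=1)[0, 1])

assert d_neg > d_pos                  # KNN's distance penalizes the anti-correlated gene
assert np.isclose(cov_pos, cov_neg)   # |CoV| treats both genes as equally predictive
```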
Proposition 3. The probability P(ε) of the normalized imputation error of missing values using CMVE is always less than that for BPCA, LSImpute and KNN.
The probability P(ε) of the normalized imputation error of the missing value for correlated data is directly proportional to the number of missing values M (McLean, 2000). Assume P_1 and P_2 are the probabilities of the normalized imputation errors of CMVE (ε_1) and of the three comparative algorithms (ε_2), such that:

P_2 = Σ_{i=0}^{M} P(ε_2)P(M) = M × P(ε_2)P(M)    (16)

P_1 = Σ_{i=0}^{M} P(ε_1)P(i)    (17)

Since the comparative methods do not reuse estimates in subsequent predictions, such algorithms consider all M missing values for each prediction, as in Equation (16). In contrast, CMVE uses estimated values in the future prediction of missing values, so each estimate increases the number of predictor genes available while concomitantly decreasing the number of outstanding missing values i in Equation (17). Hence

P_1 < P_2, with P_1 → 0 as i → 0, since P(i) = 0 for i = 0    (18)
Proposition 4. CMVE always has a lower estimation error of missing values in the case of transitive gene dependency (gene A → B → C) than BPCA, LSImpute and KNN.
Assume that gene G_a1 is correlated with set S_1:

G_a1 → S_1, where S_1 = {G_b1, G_b2, ..., G_bn}    (19)

Similarly, gene G_b1 is correlated with S_2:

G_b1 → S_2, where S_2 = {G_c1, G_c2, ..., G_cn}    (20)

If the values of both G_a1 and G_b1 are missing, then G_b1 can be predicted using set S_2 and subsequently used to predict G_a1 more accurately using S_1, by including G_b1 rather than ignoring it.
CMVE, unlike the other imputation techniques, considers estimated values in predicting future missing values. LSImpute replaces the gene's missing value with an average value to compute the CoV matrix (Hellem et al., 2004). The NRMS error using this approach is always higher than CMVE's, since each iteration (Fig. 1, Steps 1–7) lowers this error. KNN and BPCA assume that missing genes have no correlation with the missing value gene, as they ignore these missing values while searching the estimation space. In contrast, CMVE includes these genes when searching for the most correlated gene. This may incur a small accumulative error in future predictions, but it will always be less than when either the average value of the gene is used or the gene is totally ignored.
Proposition 5. CMVE generates a lower estimation error than
BPCA when genes have dominant local correlation.
BPCA assumes only a global correlation structure, which has a similar effect to selecting a high value of k for CMVE. Owing to this assumption, BPCA does not provide accurate estimates when genes have dominant local correlation (Oba et al., 2003), because in predicting missing values, information from all genes is considered, many of which have little or no correlation with the gene with the missing value. In contrast, the CMVE variable k can be adjusted depending upon the type of the data, ensuring that only those genes with strong correlations are considered, which concomitantly reduces the estimation error. The empirical results presented in the next section demonstrate that a value of k = 10 is suitable for locally correlated data.
Computational complexity analysis. The order of computational
complexity for CMVE is exactly the same as for the KNN and
LSImpute algorithms.
The critical operation for the CMVE, KNN and LSImpute algorithms is the search for the most correlated genes. These algorithms search for genes correlated with the gene that has missing values. Each estimation takes linear time O(n); therefore, for m genes and n samples the complexity order is O(m²n) for all three algorithms. KNN uses a weighted average of k correlated genes to estimate the missing values, while CMVE and LSImpute use regression and linear programming for estimation, although these additional overheads are negligible compared with the time taken to search for the most correlated genes. Like KNN and LSImpute, CMVE also searches only once per estimation for correlated genes.
As discussed in Section 2.3, BPCA has a computational complex-
ity of O(mn) as it only considers the global correlation structure of
the data. This is a pyrrhic saving, however, because the corresponding estimation accuracy is significantly inferior whenever the data have a localized correlation structure.
Fig. 2. NRMS error over a wide range of k in the CMVE algorithm for 5% missing values (curves for the BRCA1, BRCA2, sporadic and yeast datasets).
5 RESULTS ANALYSIS
To test the different imputation algorithms, four different types of
microarray data were used, including both time series and non-time series data. The datasets contained 18, 16, 27 and 77 samples of BRCA1, BRCA2, sporadic mutations (neither BRCA1 nor BRCA2) of ovarian cancer data (non-time series) and yeast sporulation data (time series), respectively. Each ovarian cancer data sample contained logarithmic microarray data of 6445 genes, while there were 6179 genetic expressions per sample for the yeast dataset. The rationale for selecting cancer data is that some of the genes in such data are up/downregulated, making it very difficult to distinguish their expression levels from those of non-regulated genes. The missing value estimation
techniques were tested by randomly removing data values and then
computing the estimation error. In the experiments, between 1 and
5% of the values were removed from each dataset's samples and the NRMS error ε was computed by

ε = RMS(M − M_est) / RMS(M)    (21)

where M is the original data matrix and M_est is the estimated matrix using KNN, LSImpute, BPCA and CMVE. This particular metric was used for error estimation because ε = 1 for zero imputation (Ouyang et al., 2004).
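Equation (21) transcribes directly into code (the RMS here is taken over all matrix entries; `nrms_error` is a hypothetical name):

```python
import numpy as np

def nrms_error(M, M_est):
    """Normalized RMS error of Eq. (21): RMS(M - M_est) / RMS(M)."""
    rms = lambda A: np.sqrt(np.mean(np.square(A)))
    return float(rms(M - M_est) / rms(M))

# Sanity check quoted in the text: zero imputation yields an error of 1
M = np.array([[1.0, -2.0], [3.0, 4.0]])
assert np.isclose(nrms_error(M, np.zeros_like(M)), 1.0)
```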
To compare the performance of the CMVE, KNN and LSImpute imputation algorithms, k = 10 was used throughout the experiments. The rationale for this was that Troyanskaya et al. (2001) observed that KNN was insensitive to values of k in the range 10–20, with the best estimation results observed in this range, while Hellem et al. (2004) also suggested using k = 10 for LSImpute. Figure 2 plots the minimum overall prediction error rates for CMVE over a range of k values for the different test datasets, with results showing that k in the range 10–15 is the most appropriate. Lower k values include only a small set of correlated genes for prediction, leading to prediction errors as other correlated genes are ignored. Conversely, when k is high, genes which have either very little or no correlation with the gene having missing values will be included in the prediction, again leading to erroneous results (Troyanskaya et al., 2001).
To fully test the robustness of the new CMVE algorithm, experiments were performed for missing values up to 20% (Figs 3–9). Figure 8a and b show the error values for 10% missing values, which
Fig. 3. NRMS error for 1% missing values for ovarian cancer data.

Fig. 4. NRMS error for 2% missing values for ovarian cancer data.

Fig. 5. NRMS error for 3% missing values for ovarian cancer data.
especially reveal (Fig. 8a) the significant deterioration in the results of KNN for the sporadic dataset.
To clarify the performance of CMVE, Figure 8b plots the error results without KNN, which consistently confirm the lower error values compared with LSImpute and BPCA. Figure 9a and b show the corresponding results for 20% missing values, which again reveal the superiority and greater robustness of the CMVE algorithm for missing value imputation. Note that, for the sake of clarity, a logarithmic scale is used in Figure 9b.
Whenever there is a high number of missing values in a gene, sparse covariance matrices will ensue and, with them, an increased likelihood of ill-conditioning. The CMVE algorithm avoids ill-conditioning by ensuring the removal of all genes with >20% missing values before imputation.
Fig. 6. NRMS error for 4% missing values for ovarian cancer data.
Fig. 7. NRMS error for 5% missing values for ovarian cancer data.
Fig. 8. (a) NRMS error for 10% missing values for ovarian cancer data. (b) NRMS error of CMVE, BPCA and LSImpute for 10% missing values for ovarian cancer data.
As highlighted in Section 1, experiments performed on datasets
with randomly introduced missing values may not truly reflect the
nature of actual microarray data missing values. All four imputation
algorithms were, therefore, tested on the yeast time series data
Fig. 9. (a) NRMS error for 20% missing values for ovarian cancer data. (b) NRMS error (log scale) for 20% missing values for ovarian cancer data.
Table 1. NRMS errors for the actual missing value distribution (AMVD) of 1.7% missing values and additional 1–10% randomly introduced values in the yeast dataset

Missing values   BPCAImpute   KNNImpute   LSImpute   CMVE
AMVD             0.1485       0.0654      0.0130     0.0064
1%               0.0319       0.8930      0.0849     0.0030
2%               0.0569       0.5284      0.1555     0.0843
3%               0.0674       0.6232      0.1612     0.0547
4%               0.0846       0.9307      0.2003     0.0090
5%               0.0927       0.5821      0.2071     0.0091
10%              0.1756       0.8763      0.0638     0.0130
containing 1.7% missing values. Since NRMS errors could not be calculated for these actual missing values, each missing entry was replaced by the adjacent gene's value before applying the imputation algorithms. This had the effect of a delay function, while retaining the same distribution of missing values. The results in Table 1 again confirm the superior performance of CMVE, particularly when an additional 4, 5 and 10% of missing values are introduced into the data, with the corresponding average improvements being 60, 72 and 64%, respectively. The imputation results also reveal some other broader noteworthy issues. KNN, for instance, performed noticeably worse when missing values were randomly introduced, because KNN only considers positive correlations and certain randomly introduced missing values will inevitably have negative correlations with other genetic data. Similarly, LSImpute exhibited an improved performance compared with BPCA (Oba et al., 2003), confirming the discussion underpinning Proposition 5.
6 CONCLUSIONS
This paper has presented a new CMVE algorithm based on the novel concept of multiple imputations. Experimental results confirmed that CMVE consistently provided superior estimation accuracy compared with existing missing value imputation algorithms, including KNN, LSImpute and BPCA. This performance improvement was especially evident when estimating higher numbers of missing values in both time series and non-time series data. The algorithm's theoretical basis, namely the exploitation of a combination of global and local correlations in a given dataset, repeatedly proved to be a more effective and robust strategy than the distance function used by KNN, with no increase in the order of computational complexity for all values of k. The results corroborate the fact that CMVE can be successfully applied to accurately impute missing values before any microarray data experiment, crucially without any bias being introduced into the estimation process.
REFERENCES
Acuna,E. and Rodriguez,C. (2004) The treatment of missing values and its effect in
the classifier accuracy. In Banks,D. et al. (eds) Classification, Clustering and Data
Mining Applications. Springer-Verlag, Berlin, Heidelberg, pp. 639–648.
Amir,A.J. et al. (2002) Gene expression profiles of BRCA1-linked, BRCA2-linked, and
sporadic ovarian cancers. J. Natl Cancer Inst., 94, 981–990.
Brown,W.N. et al. (1997) Knowledge-based analysis of microarray gene expression data
using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.
Golub,T.R. et al. (1999) Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science, 286, 531–537.
Gustavo,B. and Monard,C.M. (2003) An analysis of four missing data treatment methods
for supervised learning. Appl. Artif. Intell., 17, 519–533.
Furey,T.S. et al. (2000) Support vector machine classification and validation of cancer
tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
Harvey,M. and Arthur,C. (2004) Fitting Models to Biological Data Using Linear and
Nonlinear Regression. Oxford University Press, Oxford.
Hellem,B.T. et al. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res., 32, e34.
Lawson,C.L. and Hanson,R.J. (1974) Solving Least Squares Problems. Prentice-Hall,
Inc., Englewood Cliffs, NJ.
Munagala,K. et al. (2004) Cancer characterization and feature set extraction by
discriminative margin clustering. BMC Bioinformatics, 5, 21.
McLean,A. (2000) The predictive approach to teaching statistics. J. Stat. Education, 8.
Oba,S. et al. (2003) A Bayesian missing value estimation method for gene expression
profile data. Bioinformatics, 19, 2088–2096.
Ouyang,M. et al. (2004) Gaussian mixture clustering and imputation of microarray data.
Bioinformatics, 20, 917–923.
Ramaswamy,S. et al. (2001) Multiclass cancer diagnosis using tumour gene expression
signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154.
Sehgal,M.S.B. et al. (2004a) Support vector machine and generalized regression neural
network based classification fusion models for cancer diagnosis. In HIS’04, Japan.
Sehgal,M.S.B. et al. (2004b) A collimator neural network model for the classification
of genetic data. ICBA 04, USA.
Sehgal,M.S.B. et al. (2004c) Communal neural network for ovarian cancer mutation
classification. Complex 04, Australia.
Sehgal,M.S.B. et al. (2004d) K-Ranked covariance based missing values estimation for
microarray data classification. HIS’04, Japan.
Sehgal,M.S.B. et al. (2004e) Statistical neural networks and support vector machine for
the classification of genetic mutations in ovarian cancer. IEEE CIBCB 04, USA.
Shipp,M.A. et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene
expression profiling and supervised machine learning. Nat. Med., 8, 68–74.
Spellman,P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes
of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell,
9, 3273–3297.
Troyanskaya,O. et al. (2001) Missing value estimation methods for DNA microarrays.
Bioinformatics, 17, 520–525.