Application of Singular Value Decomposition
to DNA Microarray
Amir Niknejad
M.S. Claremont Graduate School, 1997
M.S. Northeastern University, 1983
Thesis
Submitted in partial fulfilment of the requirements
for the degree of Doctor of Philosophy
in the Graduate College of the
University of Illinois at Chicago
Chicago, Illinois
August, 2005
Dedication: Dedicated to Dr. Iraj Broomand, Founder of the National Iranian
Organization for Gifted and Talented Education, and to the memory of my father.
Acknowledgement: I am grateful to my advisor Professor Shmuel Friedland
for his kind supervision, patience and support. I would like to thank my committee
members Professor Calixto Calderon, Professor Louis Kauffman, Professor Stanley
Sclove and Professor Jan Verschelde, for kindly agreeing to be on my committee
despite their busy schedules. I would like to thank Kari Dueball and Darlette Willis
for their assistance during my stay at UIC, the UIC mathematics department and
the Institute of Mathematics and its Applications (IMA) for their generous support
and fellowship, Ali Shaker and Marcus Bishop for kindly helping me with La-
TeX, and Dr. Laura Chihara and Hossein Zare for coding and simulation of FRAA
and IFRAA. Finally I would like to acknowledge the generous support of Hossein
Andikfar, Hossein Basiri, Vali Siadat and Mahyad Zaerpour.
Summary
In chapter 1 we introduce the singular value decomposition (SVD) of matrices and
its extensions. We then mention some applications of SVD in analyzing gene ex-
pression data, image processing and information retrieval. We also introduce the
low rank approximation of matrices and present our Monte Carlo algorithm to
achieve this approximation along with other algorithms.
Chapter 2 deals with clustering methods and their application in analyzing gene
expression data. We introduce the most common clustering methods, such as the
KNN, SVD, and weighted least squares methods.
Chapter 3 deals with imputing missing data in gene expression microarrays.
We use some of the methods mentioned in Chapter 2 for this purpose. Then we
introduce our fixed rank approximation algorithm (FRAA) for imputing missing
data in the DNA gene expression array. Finally, we use simulation to compare
FRAA with other methods, indicate its advantages and shortcomings, and show
how the shortcomings can be overcome.
Contents
1 Introduction and Background 8
1.1 Missing gene imputation in DNA microarrays . . . . . . . . . . . . 8
1.2 Some biological background . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Genes and Gene Expression . . . . . . . . . . . . . . . . . 10
1.2.2 Transcription and Translation . . . . . . . . . . . . . . . . 12
1.2.3 DNA microarrays (chips) . . . . . . . . . . . . . . . . . . . 13
1.3 Some linear algebra background . . . . . . . . . . . . . . . . . . . 14
1.3.1 Inner Product Spaces (IPS) . . . . . . . . . . . . . . . . . . 14
1.3.2 Definition and Properties of Positive Definite and Semidef-
inite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 Various Matrix Factorizations . . . . . . . . . . . . . . . . 16
1.3.4 Gram-Schmidt Process . . . . . . . . . . . . . . . . . . . . 16
1.3.5 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . 17
1.3.6 The QR Decomposition . . . . . . . . . . . . . . . . . . . 18
1.3.7 An Introduction to the Singular Value Decomposition . . . . 18
1.3.8 SVD on inner product spaces . . . . . . . . . . . . . . . . . 22
1.4 Extended Singular Value decomposition . . . . . . . . . . . . . . . 24
1.5 Some Applications of SVD . . . . . . . . . . . . . . . . . . . . . . 28
1.5.1 Analysing DNA gene expression data via SVD . . . . . . . 28
1.5.2 Low rank approximation of matrices . . . . . . . . . . . . 29
2 Randomized Low Rank Approximation of a Matrix 30
2.1 Random Projection Method . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Fast Monte-Carlo Rank Approximation . . . . . . . . . . . . . . . 35
3 Cluster Analysis 43
3.1 Clusters and Clustering . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Various Gene Expression Clustering Procedures . . . . . . . . . . . 43
3.2.1 Proximity (Similarity) Measurement . . . . . . . . . . . . . 44
3.2.2 Hierarchical Cluster Analysis . . . . . . . . . . . . . . . . 45
3.2.3 Clustering Using K-means Algorithm . . . . . . . . . . . . 46
3.2.4 Mixture Models and EM Algorithm . . . . . . . . . . . . . 47
4 Various Methods of Imputation of Missing Values in DNA Microarrays 50
4.1 The Gene Expression Matrix . . . . . . . . . . . . . . . . . . . . . 50
4.2 SVD and gene clusters . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Missing Data in the Gene Expression Matrix . . . . . . . . . . . . . 53
4.3.1 Reconsideration of 4.2 . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Bayesian Principal Component (BPCA): . . . . . . . . . . 54
4.3.3 The least square imputation method . . . . . . . . . . . . . 57
4.3.4 Iterative method using SVD . . . . . . . . . . . . . . . . . 59
4.3.5 Local least squares imputation (LLSimpute) . . . . . . . . . 59
4.4 Motivation for FRAA . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Additional matrix theory background for FRAA . . . . . . . 61
4.4.2 The Optimization Problem . . . . . . . . . . . . . . . . . . 63
4.5 Fixed Rank Approximation Algorithm . . . . . . . . . . . . . . . . 65
4.5.1 Description of FRAA . . . . . . . . . . . . . . . . . . . . . 65
4.5.2 Explanation and justification of FRAA . . . . . . . . . . . 66
4.5.3 Algorithm for (4.14) . . . . . . . . . . . . . . . . . . . . . 70
4.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Discussion of FRAA . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.8 IFRAA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8.2 Computational comparisons of BPCA, FRAA, IFRAA and LLSimpute . . . . . . . 77
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.10 Matlab code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
1 Introduction and Background
1.1 Missing gene imputation in DNA microarrays
In the last decade, molecular biologists have been using DNA microarrays as a
tool for analyzing information in gene expression data. During the laboratory pro-
cess, some spots on the array may be missing due to various factors (for example,
machine error). Because it is often very costly or time consuming to repeat the
experiment, molecular biologists, statisticians, and computer scientists have made
attempts to recover the missing gene expressions by some ad-hoc and systematic
methods.
In this thesis we introduce various imputation methods for missing data in
gene expression matrices. In particular, we review two recent imputation methods,
namely Bayesian Principal Component Analysis (BPCA) and Local Least Square
Impute (LLSImpute). Then we introduce our fixed rank approximation algorithm
(FRAA). Finally, in the last part of this thesis we present the improved fixed rank
approximation algorithm (IFRAA) and use simulation to compare IFRAA with BPCA
and LLSimpute.
More recently, microarray gene expression data have been formulated as a gene
expression matrix E with n rows, which correspond to genes, and m columns,
which correspond to experiments. Typically n is much larger than m. In this setting,
the analysis of missing gene expressions on the array translates to recovering
the missing entries of the gene expression matrix.
The most common methods for recovery are [35]:
(a) Zero replacement method;
(b) Row sum mean;
(c) Cluster analysis methods, such as K-nearest neighbor clustering and hierarchical
clustering;
(d) SVD - Singular Value Decomposition (which is closely related to Principal
Component Analysis).
In these methods, the recovery of missing data is done independently, i.e., the
estimation of each missing entry does not influence the estimation of the other miss-
ing entries. The iterative method using SVD suggested in [35] takes into account
implicitly the influence of the estimation of one entry on the other ones. See also
[9].
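As a concrete illustration, the simplest of these approaches, methods (a) and (b), can be written in a few lines. The sketch below assumes numpy, marks a missing entry by NaN, and uses function names chosen here purely for illustration:

```python
import numpy as np

def impute_zero(E):
    """Method (a): replace every missing entry by zero."""
    F = E.copy()
    F[np.isnan(F)] = 0.0
    return F

def impute_row_mean(E):
    """Method (b): replace each missing entry by the mean of the
    observed entries in its row (gene)."""
    F = E.copy()
    for i in range(F.shape[0]):
        row = F[i]                      # view into F, so edits persist
        miss = np.isnan(row)
        if miss.any():
            row[miss] = row[~miss].mean()
    return F

# Tiny 2-gene, 3-experiment example with two missing spots.
E = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
```

Both estimates are computed independently for each missing entry, which is exactly the limitation discussed above.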
In this thesis we suggest a new method in which the estimation of missing en-
tries is done simultaneously, i.e., the estimation of one missing entry influences the
estimation of the other missing entries. If the gene expression matrix E has missing
data, we want to complete its entries to obtain a matrix Ẽ, such that the rank of Ẽ is
equal to (or does not exceed) d, where d is taken to be the number of significant singular
values of E. The completion of the entries of E to a matrix with a prescribed
rank is a variation of the problem of communality (see [16, p. 637]). We give an
optimization algorithm for finding Ẽ using the techniques for inverse eigenvalue
problems discussed in [14].
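The phrase "significant singular values" can be made concrete in several ways; one common, hedged convention is to take d as the number of singular values needed to capture a fixed fraction of the squared spectral energy. A minimal sketch, assuming numpy, with a 90% threshold chosen purely for illustration:

```python
import numpy as np

def significant_rank(E, energy=0.90):
    """Smallest d such that the first d singular values carry at least
    the given fraction of the total squared spectral energy of E."""
    s = np.linalg.svd(E, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# A matrix with singular values 3, 2 and a negligible 0.01:
# two values already carry almost all the energy.
A = np.diag([3.0, 2.0, 0.01])
d = significant_rank(A)
```

The threshold (and the function name) are assumptions for this sketch, not part of the FRAA definition.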
1.2 Some biological background
1.2.1 Genes and Gene Expression
Cells and organisms are divided into two classes: procaryotic (such as bacteria) and
eucaryotic [10]. The latter have a nucleus. The cell is enclosed by its membrane;
embedded in the cell’s cytoplasm is its nucleus, surrounded and protected by its
own membrane. The nucleus contains DNA, a one dimensional molecule, made of
two complementary strands, coiled around each other as a double helix. Each strand
consists of a backbone to which a linear sequence of bases is attached. There are
four kinds of bases, denoted by C, G, A, T. The two strands contain complementary
base sequences and are held together by hydrogen bonds that connect matching
pairs of bases: G-C (three hydrogen bonds) and A-T (two).
A gene is a segment of DNA, which contains the formula for the chemical com-
position of one particular protein. Proteins are the working molecules of life; most
biological processes that take place in a cell are carried out by proteins. Topolog-
ically, a protein is also a chain; each link is an amino acid, with neighbors along
the chain connected by covalent bonds. All proteins are made of 20 different amino
acids - hence the chemical formula of a protein of length N is an N-letter word,
whose letters are taken from a 20-letter alphabet. A gene is nothing but an alpha-
betic cookbook recipe, listing the order in which the amino acids are to be strung
when the corresponding protein is synthesized. Genetic information is encoded in
the linear sequence in which the bases on the two strands are ordered along the DNA
molecule. The genetic code is a universal translation table, with specific triplets of
consecutive bases coding for every amino acid.
The genome contains the collection of all the genes that code the chemical for-
mulae of all the proteins (and RNA) that an organism needs and produces. The
genome of a simple organism such as yeast contains about 6000 genes; the hu-
man genome has between 30,000 and 40,000. An overwhelming majority (98%)
of human DNA consists of non-coding regions, i.e., strands that do not code for any
particular protein (but play a role in regulating the level of synthesis of the different
proteins).
Here is an amazing fact: every cell of a multicellular organism contains its entire
genome! That is, every cell has the entire set of recipes the organism may ever need;
the nucleus of each of the reader's cells contains every piece of information needed
to make a copy ("clone") of him or her! Even though each cell contains the same set
of genes, there is differentiation: cells of a complex organism, taken from different
organs, have entirely different functions and the proteins that perform these func-
tions are very different. Cells in our retina need photosensitive molecules, whereas
our livers do not make much use of these. A gene is expressed in a cell when the
protein it codes for is actually synthesized. In an average human cell about 10,000
genes are expressed. The set of (say 10,000) numbers that indicate the expression
level of each of these genes is called the expression profile of the cell.
The large majority of abundantly expressed genes are associated with common
functions, such as metabolism, and hence are expressed in all cells. However there
will be differences between the expression profiles of different cells, and even
in a single cell, expression will vary with time, in a manner dictated by external and
internal signals that reflect the state of the organism and the cell itself.
Synthesis of proteins takes place at the ribosomes. These are enormous ma-
chines (also made of proteins) that read the chemical formulae written on the DNA
and synthesize the protein according to the instructions. The ribosomes are in the
cytoplasm, whereas the DNA is in the protected environment of the nucleus. This
poses an immediate logistic problem - how does the information get transferred
from the nucleus to the ribosome?
1.2.2 Transcription and Translation
The obvious solution of information transfer would be to rip out the piece of DNA
that contains the gene that is to be expressed, and transport it to the cytoplasm.
When a gene receives a command to be expressed, the corresponding segment of
the double helix of DNA opens, and a precise copy of the information, as written
on one of the strands, is prepared, which is called the messenger RNA, and the pro-
cess of its production is called transcription. The subsequent reading of mRNA,
deciphering the message (written using base triplets) into amino acids and synthe-
sis of the corresponding protein at the ribosomes is called translation [32]. In fact,
when many molecules of a certain protein are needed, the cell produces many cor-
responding mRNAs, which are transferred through the nucleus’ membrane to the
cytoplasm, and are ”read” by several ribosomes. Thus the single master copy of
the instructions, contained in the DNA, generates many copies of the protein. This
transcription strategy is prudent and safe, preserving the precious master copy; at
the same time it also serves as a remarkable amplifier of the genetic information.
1.2.3 DNA microarrays (chips)
A DNA Chip is the instrument that measures simultaneously the concentration of
thousands of different mRNA molecules. It is also referred to as a DNA microarray
[32]. DNA microarrays produced by Affymetrix can measure simultaneously the
expression levels of up to 20,000 genes; the less expensive spotted arrays do the
same for several thousand. Schematically, the Affymetrix arrays are produced as
follows. Divide a chip (a glass plate about 1 cm across) into "pixels", each
dedicated to one gene g. Millions of 25 base-pair long pieces (oligonucleotides) of
single strand DNA, copied from a particular segment of gene g, are photolithographically
synthesized on the dedicated pixel (these are referred to as "probes").
The mRNA molecules are extracted from cells taken from the tissue of interest
(such as a tumor tissue) and their concentration is enhanced. Next, the resulting
DNA is transcribed back into fluorescently marked single strand RNA, and the
labeled RNA diffuses over the dense forest of single strand DNA probes. When
such an mRNA encounters a bit of the probe, of which the RNA is a perfect copy,
it hybridizes to this strand, i.e. attaches to it with a high affinity (considerably
higher than to a probe of which the target is not a perfect copy). When the mRNA
solution is washed off, only those molecules that found their perfect match remain
stuck to the chip. Now the chip is illuminated by a laser, and these stuck "targets"
fluoresce; by measuring the light intensity emanating from each pixel, one obtains
a measure of the number of targets that stuck, for each of the Ng genes placed on
the chip. These Ng numbers represent
the expression levels of these genes in that tissue. A typical experiment provides the
expression profiles of several tens of samples (say Ns ≈ 100), over several thou-
sand (Ng) genes. These results are summarized in an Ng × Ns expression table;
each row corresponds to one particular gene and each column to a sample. Entry
Egs of such an expression table stands for the expression level of gene g in sample
s. For example, the experiment on colon cancer, first reported by Alon et al. [3]
contains Ng = 2000 genes whose expression levels passed some threshold, over
Ns = 62 samples, 40 of which were taken from tumor and 22 from normal colon
tissue.
1.3 Some linear algebra background
Since DNA microarray data involves real values only, we will be restricting our
treatment of relevant facts from linear algebra to matrices and vector spaces over R.
1.3.1 Inner Product Spaces (IPS)
Let V be a vector space over R. Then 〈•, •〉 : V × V → R is a real inner product if
1. 〈x,x〉 ≥ 0 for all x ∈ V and 〈x,x〉 = 0 iff x = 0.
2. 〈x,y〉 = 〈y,x〉 for all x,y ∈ V .
3. 〈x, αy_1 + βy_2〉 = α〈x, y_1〉 + β〈x, y_2〉 for all x, y_1, y_2 ∈ V and for all α, β ∈ R.
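These axioms are satisfied not only by the standard dot product: for any symmetric positive definite matrix P, 〈x, y〉 := y^T P x defines an inner product on R^n, a fact used later for the extended SVD. A quick numerical check, with P chosen arbitrarily for illustration:

```python
import numpy as np

P = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric positive definite (illustrative choice)

def ip(x, y):
    """Inner product <x, y> = y^T P x induced by P."""
    return y @ P @ x

x = np.array([1.0, -2.0])
y = np.array([0.5, 4.0])
a, b = 2.0, -3.0

sym = np.isclose(ip(x, y), ip(y, x))                          # axiom 2
lin = np.isclose(ip(x, a*y + b*x), a*ip(x, y) + b*ip(x, x))   # axiom 3
pos = ip(x, x) > 0                                            # axiom 1 (x != 0)
```

Symmetry relies on P being symmetric, and positivity on P being positive definite; for a general P the bilinear form above would fail axioms 1 or 2.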
1.3.2 Definition and Properties of Positive Definite and Semidefinite Matrices
In the following, x will denote a vector in R^n. A symmetric matrix A ∈ R^{n×n} is
1. positive definite if and only if x^TAx > 0 for all nonzero x. We write A > 0.
2. nonnegative definite (or positive semi-definite) if and only if x^TAx ≥ 0 for all x. We write A ≥ 0.
3. negative definite if −A is positive definite.
4. nonpositive definite (or negative semidefinite) if −A is nonnegative definite. We write A ≤ 0.
We have the following facts.
• Any principal submatrix of a positive (nonnegative) definite matrix is positive (nonnegative) definite.
• The diagonal elements of a positive definite matrix are positive.
• A is positive definite if and only if its eigenvalues are positive. (It is positive
semidefinite if its eigenvalues are nonnegative.)
• If A is positive definite, its determinant is positive. If A is positive semidefi-
nite, its determinant is nonnegative.
• If A is positive definite, then A is nonsingular and A−1 is positive definite.
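These facts are easy to check numerically; the sketch below (numpy assumed, sample matrix chosen for illustration) verifies the eigenvalue criterion, the determinant and diagonal facts, and positive definiteness of the inverse:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])  # symmetric, chosen to be positive definite

eigvals = np.linalg.eigvalsh(A)                    # eigenvalues of a symmetric matrix
is_pd = bool(np.all(eigvals > 0))                  # A > 0 iff all eigenvalues positive
det_pos = np.linalg.det(A) > 0                     # hence det(A) > 0
diag_pos = bool(np.all(np.diag(A) > 0))            # diagonal entries are positive
inv_pd = bool(np.all(np.linalg.eigvalsh(np.linalg.inv(A)) > 0))  # A^{-1} > 0
```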
1.3.3 Various Matrix Factorizations
In the following section, we introduce several matrix factorizations useful through-
out this thesis.
Theorem 1.1 (Spectral Decomposition) Let A = AT ∈ Rn×n. Then there
exists an orthogonal matrix Q ∈ Rn×n (whose columns are orthogonal eigenvectors
of A) such that
A = QDQ^T = ∑_{i=1}^n λ_i q_i q_i^T,

where the λ_i are the eigenvalues of A, the q_i are the columns of Q, and D = diag(λ_1, . . . , λ_n).
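Theorem 1.1 can be verified directly with a symmetric eigensolver; the rank-one sum reassembles A:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric

lam, Q = np.linalg.eigh(A)  # columns of Q are orthonormal eigenvectors of A
# Reassemble A as the sum of rank-one terms lambda_i q_i q_i^T.
A_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(3))
ok = np.allclose(A, A_rebuilt) and np.allclose(Q.T @ Q, np.eye(3))
```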
1.3.4 Gram-Schmidt Process
The Gram-Schmidt process is a recursive procedure for generating an orthonor-
mal set of vectors from a given set of linearly independent vectors. Let x1, . . .xn
be linearly independent vectors in an IPS V . Then generate an orthonormal set
u1, . . .un ∈ V from x1, . . .xn such that:
span(x1, . . . ,xk) = span(u1, . . . ,uk), k = 1, . . . , n,
by the following procedure known as the Gram-Schmidt algorithm:
• u_1 := x_1/‖x_1‖
• p_1 := 〈x_2, u_1〉u_1,  u_2 := (x_2 − p_1)/‖x_2 − p_1‖
• p_2 := 〈x_3, u_1〉u_1 + 〈x_3, u_2〉u_2,  u_3 := (x_3 − p_2)/‖x_3 − p_2‖
...
• p_k := 〈x_{k+1}, u_1〉u_1 + 〈x_{k+1}, u_2〉u_2 + · · · + 〈x_{k+1}, u_k〉u_k, and u_{k+1} := (x_{k+1} − p_k)/‖x_{k+1} − p_k‖
In practice the Gram-Schmidt Process (GSP) is numerically unstable: there is a
severe loss of orthogonality among the computed u_1, u_2, . . . as we proceed. In
computations one uses either a Modified GSP or Householder orthogonalization
[21].
The Modified Gram-Schmidt Process (MGSP) consists of the following steps:
• Initialize j = 1.
• u_j := x_j/‖x_j‖
• p_i := 〈x_i, u_j〉u_j and replace x_i by x_i := x_i − p_i for i = j + 1, . . . , n.
• Let j := j + 1 and repeat the process.
MGSP is numerically stable; it needs mn^2 flops, which is more time consuming than GSP.
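Both procedures are short to implement; the sketch below (numpy assumed; the helper names gsp and mgsp are illustrative) contrasts the loss of orthogonality of classical GSP with MGSP on a set of nearly parallel columns:

```python
import numpy as np

def gsp(X):
    """Classical Gram-Schmidt: orthonormalize the columns of X."""
    n, k = X.shape
    U = np.zeros((n, k))
    for j in range(k):
        # Subtract all projections at once, then normalize.
        w = X[:, j] - sum((X[:, j] @ U[:, i]) * U[:, i] for i in range(j))
        U[:, j] = w / np.linalg.norm(w)
    return U

def mgsp(X):
    """Modified Gram-Schmidt: subtract each projection as soon as
    the new unit vector is available."""
    X = X.astype(float).copy()
    n, k = X.shape
    U = np.zeros((n, k))
    for j in range(k):
        U[:, j] = X[:, j] / np.linalg.norm(X[:, j])
        for i in range(j + 1, k):
            X[:, i] -= (X[:, i] @ U[:, j]) * U[:, j]
    return U

# Nearly parallel columns: a classically ill conditioned example.
X = np.array([[1.0, 1.0, 1.0],
              [1e-4, 0.0, 0.0],
              [0.0, 1e-4, 0.0],
              [0.0, 0.0, 1e-4]])
loss_gsp = np.abs(gsp(X).T @ gsp(X) - np.eye(3)).max()
loss_mgsp = np.abs(mgsp(X).T @ mgsp(X) - np.eye(3)).max()
```

On this example MGSP retains orthogonality to roughly machine precision while classical GSP loses several digits, illustrating the instability noted above.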
1.3.5 Cholesky Decomposition
Let A be a positive definite matrix. Then A = U^T U, where U is an upper triangular
matrix with positive diagonal elements. The above factorization is called the
Cholesky decomposition. The matrix U is called the Cholesky factor of A; with the
requirement that its diagonal elements be positive, U is unique.
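Numerical libraries typically return the lower triangular factor L with A = LL^T; transposing gives the upper triangular U with A = U^TU as above. A minimal check (numpy assumed):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 5.0]])   # symmetric positive definite

L = np.linalg.cholesky(A)    # lower triangular, A = L L^T
U = L.T                      # upper triangular with positive diagonal, A = U^T U
ok = np.allclose(A, U.T @ U) and bool(np.all(np.diag(U) > 0))
```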
1.3.6 The QR Decomposition
Let X be an n × m matrix with n ≥ m. Then there is an orthogonal n × n matrix Q
and an upper triangular m × m matrix R with nonnegative diagonal elements such that

X = Q [ R ; 0 ],

where [ R ; 0 ] denotes R stacked on top of an (n − m) × m zero block.
This factorization is called the QR decomposition of X. The matrix Q is called the
Q-factor of X and the matrix R is the R-factor of X.
If we partition Q = (Q_X, Q_⊥), where Q_X has m columns, then X = Q_X R,
which is called the QR factorization of X.
If X is of rank m, then the QR factorization is unique up to a factor of ±1. The
columns of Q_X form an orthonormal basis for the column space of X, and those of
Q_⊥ form an orthonormal basis for the orthogonal complement of the column space
of X; both can be obtained by the Gram-Schmidt process. If X is of rank m, then
A = X^TX is positive definite and R is a Cholesky factor of A. For more details see [34].
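A library QR routine returns the factors directly; the sketch below (numpy assumed) checks X = Q_X R, the orthonormality of the columns of Q_X, and the Cholesky-type relation R^TR = X^TX:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])   # n = 3, m = 2, full column rank

QX, R = np.linalg.qr(X)      # "reduced" QR: Q_X is n x m, R is m x m upper triangular
ok = (np.allclose(X, QX @ R)
      and np.allclose(QX.T @ QX, np.eye(2))
      and np.allclose(R.T @ R, X.T @ X))   # R^T R = X^T X = A
```

Note that numpy does not normalize the signs of the diagonal of R, which is the ±1 ambiguity mentioned above.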
1.3.7 An Introduction to the Singular Value Decomposition
In many science and engineering problems, one would like to approximate a given
matrix by a lower rank matrix and calculate the distance between them.
Without loss of generality, let E be an n × m matrix where n ≥ m (the following
procedure is also valid when n < m). Denote by R(E) and N(E) the range and
the null space of E.
The idea of the singular value decomposition is to factor the matrix E into the
product of matrices U, Σ, V such that

E = UΣV^T,  (1.1)

where U is n × n, V is m × m, both orthogonal, and Σ is an n × m diagonal matrix
with diagonal entries

σ_1 ≥ σ_2 ≥ · · · ≥ σ_{min(m,n)} ≥ 0.

In this factorization the σ_i's are unique but U and V are not. The σ_i's are called
the singular values of E and are the square roots of the eigenvalues of EE^T or,
equivalently, of E^TE. Also, in this setting the columns of U are eigenvectors of EE^T
and the columns of V are eigenvectors of E^TE. The rank of the matrix E
is the number of nonzero singular values, and the magnitudes of the nonzero singular
values indicate the proximity of E to the closest lower rank matrix.
Computationally, one brings E to upper bidiagonal form A using Householder
matrices. Then one applies implicitly the QR algorithm to A^TA to find the positive
eigenvalues σ_1^2, . . . , σ_r^2 and the corresponding orthonormal eigenvectors v_1, . . . , v_r of
the matrix E^TE [21]. Next we set

u_q := (1/σ_q) E v_q.  (1.2)
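In practice one calls a library SVD rather than coding the bidiagonalization; the relation u_q = (1/σ_q)Ev_q of (1.2), and the link between the singular values and the eigenvalues of E^TE, can then be checked directly:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.standard_normal((6, 4))            # n x m with n >= m

U, s, Vt = np.linalg.svd(E, full_matrices=False)
# Each left singular vector satisfies u_q = (1/sigma_q) E v_q,
# where v_q is the q-th row of Vt.
ok = all(np.allclose(U[:, q], E @ Vt[q] / s[q]) for q in range(4))
# Singular values are the square roots of the eigenvalues of E^T E.
lam = np.sort(np.linalg.eigvalsh(E.T @ E))[::-1]
ok = ok and np.allclose(s**2, lam)
```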
Theorem 1.2 If E is an m × n matrix, then E has a singular value decompo-
sition.
The proof can be found in [21].
We summarize the facts about the singular value decomposition of a matrix. Let
E be an m × n matrix with singular value decomposition UΣV^T.
1. The singular values σ1, σ2, . . . , σn of E are unique, however the matrices U
and V are not unique. If

σ_1 > σ_2 > · · · > σ_r > 0,

then the columns of V are determined up to ±1.
2. Since V diagonalizes ET E, it follows that vj’s are eigenvectors of ET E.
3. Since EET = UΣΣT UT , it follows that U diagonalizes EET and that the
uj’s are the eigenvectors of EET .
4. Comparing the jth columns of each side of the equation

EV = UΣ

we get

Ev_j = σ_j u_j, j = 1, . . . , n.

Similarly,

E^TU = VΣ^T,

and hence

E^Tu_j = σ_j v_j, j = 1, . . . , n,
E^Tu_j = 0, j = n + 1, . . . , m.
The v_j's are called the right singular vectors of E and the u_j's are called the
left singular vectors of E.

5. If E has rank r, then

(a) v_1, v_2, . . . , v_r form an orthonormal basis for R(E^T),
(b) v_{r+1}, v_{r+2}, . . . , v_n form an orthonormal basis for N(E),
(c) u_1, u_2, . . . , u_r form an orthonormal basis for R(E),
(d) u_{r+1}, u_{r+2}, . . . , u_m form an orthonormal basis for N(E^T).
6. The rank of E is equal to the number of its nonzero singular values (where
singular values are counted according to their multiplicities). We can group the
σ_i's by their distinct values:

σ_1 = · · · = σ_{i_1} > σ_{i_1+1} = · · · = σ_{i_2} > · · · > σ_{i_{p−1}+1} = · · · = σ_{i_p} = σ_r,

and for every 1 ≤ j ≤ p the vectors v_{i_{j−1}+1}, . . . , v_{i_j} (with i_0 := 0) form an
orthonormal basis for the eigenspace of E^TE corresponding to the eigenvalue σ_{i_j}^2.
Let U_r, Σ_r, V_r be matrices obtained from U, Σ, V, respectively, as follows: U_r is
the n × r matrix obtained by deleting the last n − r columns of U, V_r is the m × r
matrix obtained by deleting the last m − r columns of V, and Σ_r is obtained by
deleting the last m − r columns and n − r rows of Σ. Then

E = U_r Σ_r V_r^T,  U_r^T U_r = V_r^T V_r = I_r,  Σ_r = diag(σ_1, . . . , σ_r),  σ_1 ≥ · · · ≥ σ_r > 0.  (1.3)

In this setting U_r, Σ_r, V_r are all rank r matrices; the last n − r columns of U and
the last m − r rows of V^T are arbitrary, up to the condition that they form
orthonormal bases of the orthogonal complements of the column space and the row
space of E, respectively. Hence (1.3) is sometimes called a reduced SVD of E.
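The reduced SVD (1.3) is what one forms in practice: keeping only the r terms with nonzero singular values reproduces a rank r matrix exactly. A sketch (numpy assumed; the rank tolerance is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((8, 3))
E = B @ rng.standard_normal((3, 5))        # 8 x 5 matrix of rank 3

U, s, Vt = np.linalg.svd(E, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))          # numerical rank = # nonzero singular values
Ur, Sr, Vr_t = U[:, :r], np.diag(s[:r]), Vt[:r]
ok = (r == 3) and np.allclose(E, Ur @ Sr @ Vr_t)
```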
1.3.8 SVD on inner product spaces
In this section we discuss briefly the standard notion of the SVD decomposition of a
linear operator that maps one finite dimensional inner product space to another finite
dimensional inner product space. For the utmost generality we consider here the
inner products over the complex numbers C. In this section we denote by Mnm(C)
the space of n × m matrices with complex entries.
Let U_i be an m_i-dimensional inner product space over C, equipped with 〈·, ·〉_i,
for i = 1, 2. Let T : U_1 → U_2 be a linear operator, and let T* : U_2 → U_1 be the
adjoint operator of T, i.e. 〈Tx, y〉_2 = 〈x, T*y〉_1 for all x ∈ U_1, y ∈ U_2.
Equivalently, let [a_1, . . . , a_{m_1}] and [b_1, . . . , b_{m_2}] be orthonormal bases of U_1
and U_2 respectively, and let A ∈ M_{m_2 m_1}(C) be the representation matrix of T in
these bases: [Ta_1, . . . , Ta_{m_1}] = [b_1, . . . , b_{m_2}]A. Then the matrix A* := Ā^T
represents T* in these bases. The SVD decomposition A = UΣV*, where U, V are
unitary matrices of dimensions m_2, m_1 respectively, and Σ ∈ M_{m_2 m_1}(R) is a
diagonal matrix with diagonal entries σ_1 ≥ · · · ≥ σ_{min(m_2, m_1)} ≥ 0, corresponds
to the following base free concepts of T.
Consider the operators S1 := T ∗T : U1 → U1 and S2 := TT ∗ : U2 → U2.
Then S_1, S_2 are self-adjoint, i.e. S_1* = S_1, S_2* = S_2, and nonnegative definite:
〈S_i x_i, x_i〉_i ≥ 0 for all x_i ∈ U_i, i = 1, 2. The positive eigenvalues of S_1 and S_2,
counted with their multiplicities and arranged in decreasing order, are σ_1^2 ≥ · · · ≥
σ_r^2 > 0, where r = rank T = rank T*. Let

S_1 v_i = σ_i^2 v_i, i = 1, . . . , r,  〈v_i, v_j〉_1 = δ_{ij}, i, j = 1, . . . , r.

Then u_i := σ_i^{−1} Tv_i, i = 1, . . . , r, is an orthonormal set of eigenvectors of S_2,
with u_i corresponding to the eigenvalue σ_i^2. Complete the orthonormal systems
v_1, . . . , v_r and u_1, . . . , u_r to orthonormal bases [v_1, . . . , v_{m_1}] and
[u_1, . . . , u_{m_2}] in U_1 and U_2 respectively. Then the unitary matrices U, V are the
transition matrices from the basis [b_1, . . . , b_{m_2}] to [u_1, . . . , u_{m_2}] and from the
basis [a_1, . . . , a_{m_1}] to [v_1, . . . , v_{m_1}]:

[u_1, . . . , u_{m_2}] = [b_1, . . . , b_{m_2}]U,  [v_1, . . . , v_{m_1}] = [a_1, . . . , a_{m_1}]V.
Let A ∈ M_{m_2 m_1}(C). Then A can be viewed as a linear operator A : C^{m_1} →
C^{m_2}, x ↦ Ax. Let P_i be an m_i × m_i Hermitian positive definite matrix for
i = 1, 2. We define the following inner products on C^{m_1} and C^{m_2}:

〈x, y〉_i := y*P_i x,  x, y ∈ C^{m_i}, i = 1, 2.  (1.4)

It is straightforward to show that the SVD decomposition of A, viewed as the above
operator, with respect to the inner products given by (1.4) is

A = UΣV*,  U*P_2 U = I_{m_2},  V*P_1^{−1} V = I_{m_1},  (1.5)
where Σ is an m_2 × m_1 diagonal matrix with nonnegative diagonal entries in
decreasing order. A simple way to deduce this decomposition is to observe that
〈x, y〉_i = (P_i^{1/2} y)*(P_i^{1/2} x), where P_i^{1/2} is the unique Hermitian positive definite square
root of P_i. The decomposition (1.5) is called the extended singular value decom-
position of A, ESVD for short. The diagonal entries of Σ are called the extended
singular values of A. The ESVD of A corresponds to the standard SVD decomposition
of P_2^{1/2} A P_1^{−1/2}.
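The last remark yields a practical recipe for computing the ESVD: take the ordinary SVD of P_2^{1/2} A P_1^{−1/2} and transform the factors back. A sketch for the real case (numpy assumed; the helper names are illustrative):

```python
import numpy as np

def psd_sqrt(P):
    """Unique symmetric positive definite square root of P."""
    lam, Q = np.linalg.eigh(P)
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

def esvd(A, P1, P2):
    """Extended SVD of A w.r.t. the inner products induced by P1, P2:
    A = U S V^T with U^T P2 U = I and V^T P1^{-1} V = I."""
    S1, S2 = psd_sqrt(P1), psd_sqrt(P2)
    Ut, s, Vt_t = np.linalg.svd(S2 @ A @ np.linalg.inv(S1))
    U = np.linalg.inv(S2) @ Ut     # undo the change of basis on the left
    V = S1 @ Vt_t.T                # and on the right
    return U, s, V

A = np.array([[1.0, 2.0], [0.0, 1.0]])
P1 = np.array([[2.0, 0.0], [0.0, 1.0]])
P2 = np.array([[1.0, 0.5], [0.5, 2.0]])
U, s, V = esvd(A, P1, P2)
m = len(s)
ok = (np.allclose(U @ np.diag(s) @ V.T, A)
      and np.allclose(U.T @ P2 @ U, np.eye(m))
      and np.allclose(V.T @ np.linalg.inv(P1) @ V, np.eye(m)))
```

The checks confirm exactly the three conditions of (1.5) in the real case.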
1.4 Extended Singular Value decomposition
In its utmost generality, the SVD deals with a linear operator T : Vs → Vt, where
Vs (the source space) and Vt (the target space) are finite dimensional vector spaces
over the real numbers R (or more generally over the complex numbers C), which
are endowed with the inner products 〈·, ·〉s and 〈·, ·〉t respectively. Choose bases
e_1, . . . , e_M of V_s and f_1, . . . , f_N of V_t. Then V_s and V_t can be identified with
the standard vector spaces R^M and R^N respectively, and T is represented by the
N × M matrix A = (a_{ij}), i = 1, . . . , N, j = 1, . . . , M, as explained later on.
Usually one chooses the inner products 〈·, ·〉_s and 〈·, ·〉_t to be given by the standard
inner products on R^M and R^N. That is, one assumes

〈e_j, e_j〉_s = 1, 〈e_j, e_k〉_s = 0 for j ≠ k, j, k = 1, . . . , M,  (1.6)
〈f_i, f_i〉_t = 1, 〈f_i, f_l〉_t = 0 for i ≠ l, i, l = 1, . . . , N.  (1.7)

Then the SVD of T is the singular value decomposition A = UΣV^T: U, Σ, V
are N × d, d × d, M × d matrices respectively, where U^TU = V^TV are equal to
the d × d identity matrix I_d, and Σ is a diagonal matrix with diagonal entries
σ_1 ≥ σ_2 ≥ · · · ≥ σ_d ≥ 0 called the singular values of A. Furthermore

rank A ≤ d ≤ min(M, N),  0 < σ_{rank A} and σ_m = 0 for m > rank A.  (1.8)
In general, one does not have to assume that e_1, . . . , e_M and f_1, . . . , f_N are orthonormal
bases. By letting P_s := (〈e_j, e_k〉_s)_{j,k=1}^M and P_t := (〈f_i, f_l〉_t)_{i,l=1}^N be any positive
definite matrices, one obtains the corresponding SVD of T:

A = UΣV^T,  U^TP_tU = V^TP_s^{−1}V = I_d,  Σ = diag(σ_1, . . . , σ_d),  (1.9)

where U and V are N × d and M × d matrices whose columns form orthonormal
sets of d vectors with respect to P_t and P_s^{−1} respectively, and Σ is a diagonal matrix
with the singular values σ_1 ≥ · · · ≥ σ_d ≥ 0 of T satisfying (1.8). We call (1.9)
the singular value decomposition of A with respect to P_s, P_t, or the Extended Singular
Value Decomposition (ESVD) of A. We call σ_1, . . . , σ_d the extended singular values
of A, or simply singular values when no ambiguity arises. Let a_1, . . . , a_M and b_1, . . . , b_N
be orthonormal bases in V_s and V_t respectively. Let T : V_s → V_t. Then the
conjugate operator T* : V_t → V_s is defined by the equalities

〈Ta_j, b_i〉_t = 〈a_j, T*b_i〉_s, for i = 1, . . . , N, j = 1, . . . , M.  (1.10)
Assume that

Ta_j = ∑_{i=1}^N c_{ij} b_i,  j = 1, . . . , M.  (1.11)

Then

T*b_i = ∑_{j=1}^M c_{ij} a_j,  i = 1, . . . , N.  (1.12)

That is, the N × M matrix C := (c_{ij}), i = 1, . . . , N, j = 1, . . . , M, represents T in
the orthonormal bases a_1, . . . , a_M and b_1, . . . , b_N, and C^T represents T* in these
bases. In particular

T*T : V_s → V_s,  TT* : V_t → V_t  (1.13)

are represented by C^TC and CC^T in the bases a_1, . . . , a_M and b_1, . . . , b_N respectively.
Clearly

rank T = rank C = rank C^T = rank T* = rank C^TC = rank CC^T = rank T*T = rank TT*.
Since T*T is self-adjoint ((T*T)* = T*T) and nonnegative definite (〈T*Tx, x〉_s ≥
0 for any x ∈ V_s), it follows that there exists an orthonormal basis c_1, . . . , c_M in V_s
such that

T*Tc_i = σ_i^2 c_i, σ_i ≥ 0,  〈c_i, c_k〉_s = δ_{ik}, i, k = 1, . . . , M.  (1.14)

Let d_i := (1/σ_i) Tc_i, i = 1, . . . , rank T. Then d_1, . . . , d_{rank T} is an orthonormal system
of vectors. Extend this system to an orthonormal basis d_1, . . . , d_N of V_t. Then the SVD
of T is given by

Tc_i = σ_i d_i, i = 1, . . . , M,  〈d_j, d_l〉_t = δ_{jl}, j, l = 1, . . . , N,  where d_i := 0 if i > N.  (1.15)
Lemma 1.3 Let V_s, V_t be finite dimensional vector spaces over R of dimensions
M, N, with the inner products 〈·, ·〉_s, 〈·, ·〉_t respectively. Let e_1, . . . , e_M and
f_1, . . . , f_N be bases in V_s and V_t with the corresponding positive definite matrices
P_s = (〈e_j, e_k〉_s)_{j,k=1}^M and P_t = (〈f_i, f_l〉_t)_{i,l=1}^N respectively. Let T : V_s → V_t be a linear
transformation given by (1.11). Then the extended singular value decomposition of
A = (a_{ij}), i = 1, . . . , N, j = 1, . . . , M, with respect to the positive definite matrices P_s, P_t, corresponding
to the singular value decomposition of T in the bases e_1, . . . , e_M and f_1, . . . , f_N, is given
by (1.9). Furthermore, the operator T* : V_t → V_s is represented by the matrix
A† = (a†_{ji}), j = 1, . . . , M, i = 1, . . . , N, in the bases e_1, . . . , e_M and f_1, . . . , f_N:

T*f_i = ∑_{j=1}^M a†_{ji} e_j, i = 1, . . . , N,  A† = P_s^{−1} A^T P_t.  (1.16)
Proof. Let

e_j = Σ_{k=1}^{M} v_jk c_k,   d_l = Σ_{i=1}^{N} u_il f_i,   j = 1, ..., M,  l = 1, ..., N.   (1.17)

Hence

T e_j = Σ_{k=1}^{M} v_jk T c_k = Σ_{k=1}^{d} v_jk σ_k d_k = Σ_{k=1}^{d} Σ_{i=1}^{N} v_jk σ_k u_ik f_i.

Compare the above equalities to (1.11) to deduce the first part of the equality (1.9), A = U Σ V^T, where
U = (u_ik)_{i,k=1}^{N,d},   Ū = (u_il)_{i,l=1}^{N,N} = (U, U_r),
Û = (û_il)_{i,l=1}^{N,N} := (Ū)^{-1},   U_0 := (û_il)_{i,l=1}^{d,N},   Û^T = (U_0^T, U_d^T),   (1.18)

V = (v_jk)_{j,k=1}^{M,d},   V̄ = (v_jl)_{j,l=1}^{M,M} = (V, V_r),
V̂ = (v̂_jl)_{j,l=1}^{M,M} := (V̄)^{-1},   V_0 := (v̂_jl)_{j,l=1}^{d,M},   V̂^T = (V_0^T, V_d^T).

(Here Ū, V̄ denote the extensions of U, V to square invertible matrices, and Û, V̂ their inverses.)
The second part of (1.9), U^T P_t U = I_d, is equivalent to the orthonormality of the system d_1, ..., d_d. We now show the third part of (1.9), V^T P_s^{-1} V = I_d. Since c_1, ..., c_M is an orthonormal basis in V_s it follows that P_s = V̄ V̄^T. Hence P_s^{-1} = (V̄^T)^{-1} (V̄)^{-1} = V̂^T V̂. Note that since V̂ V̄ = I_M it follows that V̂ V = Q is the M × d matrix whose d columns are the first d columns of the identity matrix I_M. Hence V^T P_s^{-1} V = Q^T Q = I_d.
From the definition of T* (1.12) it follows that

T* d_i = σ_i c_i,   i = 1, ..., N,   σ_i = 0 and c_i = 0 for i > M.   (1.19)
As f_i = Σ_{l=1}^{N} û_li d_l we obtain

T* f_i = Σ_{l=1}^{N} û_li T* d_l = Σ_{l=1}^{d} û_li σ_l c_l = Σ_{l=1}^{d} Σ_{j=1}^{M} û_li σ_l v̂_lj e_j.
Compare the above equalities with the definition of A† in (1.16) to deduce that

A† = V_0^T Σ U_0.   (1.20)

Since d_1, ..., d_N is an orthonormal basis it follows that P_t = Û^T Û. As P_s^{-1} = V̂^T V̂ and A = U Σ V^T, a straightforward calculation shows that A† = P_s^{-1} A^T P_t.
1.5 Some Applications of SVD
1.5.1 Analysing DNA gene expression data via SVD
We now give another form of (1.3) which has a significant interpretation in microarray data. Let u_1, ..., u_m denote the columns of U and v_1, ..., v_m denote the columns of V. Then (1.1) and (1.3) can be written as

E = Σ_{q=1}^{m} σ_q u_q v_q^T = Σ_{q=1}^{r} σ_q u_q v_q^T.   (1.21)
If σ_1 > ... > σ_r then u_q and v_q are determined up to the sign ±1 for q = 1, ..., r. Namely, u_q and v_q are length 1 eigenvectors of E E^T and E^T E, respectively, corresponding to the common eigenvalue σ_q^2. (Note that the choice of a sign in v_q forces the unique choice of the sign in u_q.) The vectors u_1, ..., u_r are called eigengenes, the vectors v_1, ..., v_r are called eigenarrays and σ_1, ..., σ_r are called eigenexpressions. The rank r can be viewed as the number of different biological functions of the n genes observed in the m experiments. The eigenarrays v_1, ..., v_r give the principal r orthogonal directions in R^m corresponding to σ_1, ..., σ_r. The eigengenes u_1, ..., u_r give the principal r orthogonal directions in R^n corresponding to σ_1, ..., σ_r. The eigenexpressions describe the relative significance of each biological function. From the data given in [4], it seems that the number of significant singular values never exceeds m/2. See the discussion on the number of significant singular values in the beginning of §3. The essence of the FRAA algorithm, suggested in this thesis, is based on this observation.
1.5.2 Low rank approximation of matrices
A well-known technique for dimension reduction is the low rank approximation by
the singular value decomposition [21].
Let E be the data matrix, and let E = U Σ V^T be the SVD of E, where U and V are orthogonal and Σ is diagonal. Then, for a given κ, the optimal rank κ approximation of E is given by

E_κ = U_κ Σ_κ V_κ^T,

where U_κ and V_κ are the matrices formed by the first κ columns of the matrices U and V respectively, and Σ_κ is the κ-th principal submatrix of Σ. A key property of this rank κ approximation is that it achieves the best possible approximation, with respect to the Frobenius norm (and more generally with respect to any unitarily invariant norm), among all matrices of rank at most κ.
Denote by ||E||_F the Frobenius (ℓ_2) norm of E. It is the Euclidean norm of E viewed as a vector with nm coordinates. Each term u_q v_q^T in (1.21) is a rank one matrix with ||u_q v_q^T||_F = 1. Let R(n, m, k) denote the set of n × m matrices of rank at most k (m ≥ k). Then for each k ≤ r, the SVD of E gives the solution to the following approximation problem:

min_{F ∈ R(n,m,k)} ||E − F||_F = ||E − Σ_{q=1}^{k} σ_q u_q v_q^T||_F = √( Σ_{q=k+1}^{r} σ_q^2 ).   (1.22)
If σ_k > σ_{k+1} then Σ_{q=1}^{k} σ_q u_q v_q^T is the unique solution to the above minimization problem. For our purposes, it will be convenient to assume that σ_q = 0 for any q > m.
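Equation (1.22) can be checked directly: truncating the SVD gives the best Frobenius-norm rank-k approximation, with error equal to the square root of the sum of the discarded σ_q^2. A small numerical sketch (the matrix values are made up):

```python
import numpy as np

# Sketch of the best rank-k approximation: truncate the SVD and verify that the
# Frobenius error equals the right-hand side of (1.22).
E = np.array([[4.0, 0.0, 1.0],
              [2.0, 3.0, 0.0],
              [0.0, 1.0, 5.0],
              [1.0, 0.0, 2.0]])
U, s, Vt = np.linalg.svd(E, full_matrices=False)

k = 2
E_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # E_k = U_k Sigma_k V_k^T
err = np.linalg.norm(E - E_k, 'fro')             # should equal sqrt(sum of s[k:]^2)
```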
2 Randomized Low Rank Approximation
of a Matrix
In many applied settings where one is dealing with a very large data set, it is important to reduce the size of the data in order to make inferences about the features of the data set in a timely manner. Assume that the data set is represented by an m × n matrix A ∈ R^{m×n}. It is important to find an approximation B ∈ R^{m×n} of a specified rank k to A, where k is much smaller than m and n.
Here are several motivations for obtaining such a B. First, the storage space needed for B is k(m + n), which is much smaller than the storage space mn needed for A. Indeed, B can be represented as

B = x_1 y_1^T + x_2 y_2^T + ... + x_k y_k^T,   x_1, ..., x_k ∈ R^m,   y_1, ..., y_k ∈ R^n,   (2.23)

where x_1, ..., x_k and y_1, ..., y_k are k column vectors with m and n coordinates respectively. We store the vectors x_1, ..., x_k, y_1, ..., y_k, which need the storage k(m + n), and compute the entries of B, when needed, using the above expression for B.
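The storage argument can be sketched as follows; the class and method names here are ours, purely for illustration.

```python
import numpy as np

# Sketch of the storage argument around (2.23): keep only the factors x_i, y_i
# (k(m+n) numbers) and reconstruct individual entries of B only when needed.
class FactoredMatrix:
    """Rank-k matrix B = sum_i x_i y_i^T stored in factored form."""
    def __init__(self, xs, ys):
        self.xs = xs  # list of k vectors in R^m
        self.ys = ys  # list of k vectors in R^n

    def entry(self, i, j):
        # B[i, j] = sum over the k rank-one terms, computed on demand.
        return sum(x[i] * y[j] for x, y in zip(self.xs, self.ys))

x1, y1 = np.array([1.0, 2.0]), np.array([3.0, 0.0, 1.0])
x2, y2 = np.array([0.0, 1.0]), np.array([1.0, 1.0, 0.0])
B = FactoredMatrix([x1, x2], [y1, y2])
B_dense = np.outer(x1, y1) + np.outer(x2, y2)    # the same matrix, stored densely
```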
The second most common application is to clustering algorithms, as in [1], [2], [11], [13] and [30]. Assume that our data represents n points and each point has m coordinates; that is, each point is represented by a column of the matrix A. We want to cluster the points in such a way that the distance between points in the same cluster is much smaller than the distance between any two points from different clusters. One way to do this is to project all the points on the k main orthonormal directions encoded by the first k left singular vectors of A, and then cluster using either k one dimensional subspaces or the whole k dimensional subspace. The k-rank approximation B gives an approximation to this k dimensional subspace and to the first k singular vectors. Another way to cluster is to use projective clustering. It is known that a fast SVD, i.e. a fast k-rank approximation, is a main tool in this area [2], [11].
The third application is in DNA microarrays, in particular in data imputation [18]. Let A be a gene expression matrix. The m rows of the matrix A are indexed by the genes, while the n columns are indexed by the experiments. It is known that the effective rank k of A is usually much less than n. The SVD decomposition is the main ingredient of the FRAA (Fixed Rank Approximation Algorithm), which is successfully implemented in [18] to impute the corrupted entries of A. The fast k-rank approximation algorithm suggested in this thesis, combined with the clustering of similar genes, can be implemented to improve the FRAA algorithm.
One way to find a fast k-rank approximation is to choose at random l ≥ k
columns or rows of A and obtain from them k-rank approximations of A. This is
basically the spirit of the algorithm suggested by Frieze, Kannan and Vempala in
[19]. We call this algorithm the FKV algorithm. Assuming a statistical model for
the distribution of the entries of A, the authors give some error bounds on their
k-rank approximation.
In order to grasp the nature of the FKV algorithm, we first need to discuss the random projection method; Santosh S. Vempala gives a thorough discussion of this method in [36].
2.1 Random Projection Method
Let u = (u_1, ..., u_n)^T ∈ R^n. In order to obtain the projection of u, we consider a random n × k matrix R with orthonormal columns, chosen uniformly at random. We then scale R^T u by √(n/k) to obtain the projection v = √(n/k) R^T u. The rationale for choosing the scaling factor √(n/k) is to have the expected value E(‖v‖²) = ‖u‖². There are various strategies for coming up with the random matrix R, which are data and application dependent.
One of the interesting properties of random projection is that while it reduces dimensionality, it preserves pairwise distances with high probability. This property is made precise by the following lemma.
Lemma 2.1 (Johnson and Lindenstrauss) For any 0 < ε < 1/2 and any set of points S in R^n with |S| = m, upon projection to a uniformly random κ-dimensional subspace with

κ ≥ 1 + (9 ln m)/(ε² − (2/3)ε³),

the following property holds: with probability at least 1/2, for every pair u, u′ in S,

(1 − ε)‖u − u′‖² ≤ ‖f(u) − f(u′)‖² ≤ (1 + ε)‖u − u′‖²

where f(u) and f(u′) are the projections of u and u′.
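The distance-preservation property of Lemma 2.1 is easy to observe empirically. The sketch below projects a few points from R^500 onto a random 100-dimensional subspace with the scaling √(n/κ); the sizes and the tolerance used are our choices for illustration.

```python
import numpy as np

# Empirical sketch of Lemma 2.1: project points onto a uniformly random
# kappa-dimensional subspace, rescale by sqrt(n/kappa), and compare pairwise
# squared distances before and after the projection.
rng = np.random.default_rng(1)
n, kappa, n_points = 500, 100, 10
S = rng.standard_normal((n_points, n))          # the point set S, one point per row

# A uniformly random subspace: orthonormalize a Gaussian matrix via QR.
Q, _ = np.linalg.qr(rng.standard_normal((n, kappa)))
f = np.sqrt(n / kappa) * (S @ Q)                # f(u) = sqrt(n/kappa) R^T u for each u

ratios = []
for a in range(n_points):
    for b in range(a + 1, n_points):
        d_orig = np.sum((S[a] - S[b]) ** 2)     # ||u - u'||^2
        d_proj = np.sum((f[a] - f[b]) ** 2)     # ||f(u) - f(u')||^2
        ratios.append(d_proj / d_orig)
```

All pairwise squared-distance ratios concentrate around 1, as the lemma predicts.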
Random Projection Algorithm. Random projection preserves distances approximately while reducing dimensionality. We can use the following steps to achieve these goals.

1. Project each data point to a lower dimensional space using a random matrix.

2. Find the k largest singular values to obtain the low-rank approximation of the original data matrix.

Suppose we project an m × n matrix A using a random m × l matrix R with orthonormal columns, setting B = √(n/l) R^T A. Then the SVDs of A and B can be written as

A = Σ_{i=1}^{r} σ_i u_i v_i^T,   B = Σ_{i=1}^{t} λ_i u′_i v′_i^T,

where u_i and u′_i are the left singular vectors and v_i and v′_i are the right singular vectors of A and B respectively. The following lemma indicates that the singular values are approximately preserved.
Lemma 2.2 Let ε > 0. If l ≥ C log n / ε² for a sufficiently large constant C, then

Σ_{i=1}^{κ} λ_i² ≥ (1 − ε) Σ_{i=1}^{κ} σ_i².
We close this discussion of random projection with the following theorem, which emphasizes that the matrix obtained by random projection is nearly as good as the best κ-rank approximation.

Theorem 2.3 Let A_κ be the best κ-rank approximation of A, and let Ã_κ be the κ-rank approximation of A obtained by random projection. If l > C log n/ε² for a large enough constant C, then

‖A − Ã_κ‖²_F ≤ ‖A − A_κ‖²_F + 2ε‖A_κ‖²_F.
Low rank approximation (FKV method).

Given an m × n matrix A, we would like to approximate a few of the top singular values and their associated singular vectors. Here we present the FKV algorithm [12].

The FKV algorithm starts by choosing l rows of A and forming an l × n matrix L. Let p_1, ..., p_m be real numbers such that 0 ≤ p_i ≤ 1 and Σ_{i=1}^{m} p_i = 1. The algorithm takes as input a matrix A ∈ R^{m×n} and integers l ≤ m and k ≤ l.

We then construct the matrix L in the following way: choose i ∈ {1, ..., m}, where the probability of choosing i is equal to p_i; the scaled row A_(i)/√(l p_i) then constitutes one row of L. Repeating this l times gives the l rows of L.
In the next step we compute L L^T and its spectral decomposition

L L^T = Σ_{t=1}^{l} σ_t² u_(t) u_(t)^T.

Let H be the matrix whose rows are

h_(t)^T,   h_(t) := L^T u_(t)/‖L^T u_(t)‖,   t = 1, ..., k.

Notice that the h_(t) are right singular vectors of L, corresponding to the singular values σ_t, and the SVD of L is

L = Σ_{t=1}^{l} σ_t u_(t) h_(t)^T.
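The row-sampling step just described can be sketched as follows. This is a simplified illustration, not the authors' implementation: we take uniform sampling probabilities p_i (our choice), and use the top-k right singular vectors of L to build a k-rank approximation of A.

```python
import numpy as np

# Simplified FKV-style sketch: sample l rows of A with probabilities p_i, rescale
# each chosen row by 1/sqrt(l * p_i), and take the top-k right singular vectors of
# the small matrix L as approximate right singular vectors of A.
def fkv_sketch(A, l, k, rng):
    m, n = A.shape
    p = np.full(m, 1.0 / m)                       # p_i >= 0, sum p_i = 1 (uniform here)
    idx = rng.choice(m, size=l, p=p)              # draw l row indices, i.i.d. with prob. p_i
    L = A[idx, :] / np.sqrt(l * p[idx, None])     # rows A_(i) / sqrt(l p_i)
    _, _, Ht = np.linalg.svd(L, full_matrices=False)
    H = Ht[:k, :]                                 # rows approximate the h_(t)
    return A @ H.T @ H                            # project A onto span of the h_(t)

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 30))  # a rank-8 matrix
B = fkv_sketch(A, l=50, k=8, rng=rng)
rel_err = np.linalg.norm(A - B, 'fro') / np.linalg.norm(A, 'fro')
```

Since A has rank 8 here and 50 sampled rows span its row space, the sketch recovers A essentially exactly.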
The weak point of the FKV algorithm is its inability to improve the FKV approximation iteratively by incorporating additional parts of A. In the following section, we present our algorithm [17], which overcomes this shortcoming.
2.2 Fast Monte-Carlo Rank Approximation
As noted above, the weak point of the FKV algorithm is its inability to improve the FKV approximation iteratively by incorporating additional parts of A. In fact, in the recent paper [11], which uses the FKV randomized algorithm, this point is mentioned specifically at the end of the paper: “..., it would be interesting to design an algorithm that improves this approximation by accessing A (or parts of A) again.”
We provide here a sampling framework for iterative updates of k-rank approximations of A, reading additional columns (rows) of A at each iteration, which is guaranteed to improve the approximation B each time it is updated. The quality of the approximation B is measured by the Frobenius norm ||B||_F, whose successive values form a nondecreasing sequence under these iterations. The rate of increase of the norms ||B||_F can be used to give a stopping rule for terminating the algorithm. Also, the updating algorithm for the k-rank approximation gives approximations to the first k singular values of A, and to the first k left and right singular vectors of A. We believe that this algorithm will have many applications in data mining, data storage and data analysis.
The following set of theorems ([17]) will be useful in the construction of the algorithm.

Theorem 2.4 Let A ∈ R^{m×n} and k ∈ [1, min(m, n)]. Let x_1, ..., x_k and y_1, ..., y_k be two sets of orthonormal vectors in R^m and R^n respectively. Then

Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i) ≤ Σ_{i=1}^{k} σ_i²,   Σ_{i=1}^{k} (A y_i)^T (A y_i) ≤ Σ_{i=1}^{k} σ_i².   (2.24)

Equality in the first or second inequality occurs if span(x_1, ..., x_k) or span(y_1, ..., y_k) contains k linearly independent left or right singular vectors of A, corresponding to the k maximal singular values of A, respectively.
The above characterization follows from the maximal (Ky Fan) characterization of the sum of the k largest eigenvalues of a real symmetric matrix:

Theorem 2.5 Let S ∈ S_p(R) be a real p × p symmetric matrix. Let λ_1 ≥ ... ≥ λ_p be the eigenvalues of S arranged in decreasing order and listed with their multiplicities. Let w_1, ..., w_p ∈ R^p be an orthonormal set of corresponding eigenvectors of S: S w_i = λ_i w_i, i = 1, ..., p. Let k ∈ [1, p] be an integer. Then for any orthonormal set x_1, ..., x_k ∈ R^p

Σ_{i=1}^{k} x_i^T S x_i ≤ Σ_{i=1}^{k} λ_i = Σ_{i=1}^{k} w_i^T S w_i.   (2.25)

Equality holds if and only if the subspace span(x_1, ..., x_k) contains k linearly independent eigenvectors of S corresponding to the eigenvalues λ_1, ..., λ_k.
See for example [14] for proofs and references. Note that x_1, ..., x_k ∈ R^m is a system of orthonormal vectors in R^m if and only if the m × k matrix (x_1, ..., x_k) is in O_mk(R).

To obtain Theorem 2.4 from Theorem 2.5 we let p = m (or n) and let S be equal to A A^T (or A^T A). In (2.24) we emphasized the computational complexity of the left-hand sides of the inequalities. See also [18] for applications of Theorem 2.5 to data imputation in DNA microarrays.
Corollary 2.6 Let A ∈ R^{m×n} and let k ∈ [1, min(m, n)] be an integer. Then for any two orthonormal systems x_1, ..., x_k ∈ R^m and y_1, ..., y_k ∈ R^n the following equalities hold:

||A − Σ_{i=1}^{k} x_i (x_i^T A)||²_F = ||A||²_F − Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i),   (2.26)

||A − Σ_{i=1}^{k} (A y_i) y_i^T||²_F = ||A||²_F − Σ_{i=1}^{k} (A y_i)^T (A y_i).   (2.27)

In particular the best k-rank approximation E_k of A is given by Σ_{i=1}^{k} u_i (u_i^T A) and by Σ_{i=1}^{k} (A v_i) v_i^T, where u_1, ..., u_k ∈ R^m and v_1, ..., v_k ∈ R^n are orthonormal sets of left and right singular vectors of A corresponding to σ_1, ..., σ_k.
The next theorem is the key theorem for updating the k-rank approximation Σ_{i=1}^{k} x_i (x_i^T A), (or Σ_{i=1}^{k} (A y_i) y_i^T), for some (x_1, ..., x_k) ∈ O_mk(R), (or some (y_1, ..., y_k) ∈ O_nk(R)).
Theorem 2.7 Let x_1, ..., x_k ∈ R^m, (or y_1, ..., y_k ∈ R^n), be an orthonormal system in R^m, (or in R^n). Let w_1, ..., w_l ∈ R^m, (or z_1, ..., z_l ∈ R^n), be a given set of vectors in R^m, (or in R^n). Perform the Modified Gram-Schmidt process on x_1, ..., x_k, w_1, ..., w_l, (or on y_1, ..., y_k, z_1, ..., z_l), to obtain an orthonormal set x_1, ..., x_p ∈ R^m, (or y_1, ..., y_p ∈ R^n), where k ≤ p ≤ k + l. Assume that k < p, i.e. span(w_1, ..., w_l) ⊄ span(x_1, ..., x_k), (or span(z_1, ..., z_l) ⊄ span(y_1, ..., y_k)). Form the p × p real symmetric matrix S := ((A^T x_i)^T (A^T x_j))_{i,j=1}^{p}, (or S := ((A y_i)^T (A y_j))_{i,j=1}^{p}), and assume that λ_1 ≥ ... ≥ λ_k are the k largest eigenvalues of S, with corresponding k orthonormal eigenvectors o_1, ..., o_k ∈ R^p. Let O := (o_1, ..., o_k) ∈ O_pk(R) and define the k orthonormal vectors x′_1, ..., x′_k ∈ R^m, (or y′_1, ..., y′_k ∈ R^n), as follows:

(x′_1, ..., x′_k) = (x_1, ..., x_p) O,   (or (y′_1, ..., y′_k) = (y_1, ..., y_p) O).   (2.28)

Then

Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i) ≤ Σ_{i=1}^{k} (A^T x′_i)^T (A^T x′_i),   (2.29)

(or Σ_{i=1}^{k} (A y_i)^T (A y_i) ≤ Σ_{i=1}^{k} (A y′_i)^T (A y′_i)).

Furthermore,

λ_i = (A^T x′_i)^T (A^T x′_i), i = 1, ..., k,   (A^T x′_i)^T (A^T x′_j) = 0 for i ≠ j,   (2.30)

(or λ_i = (A y′_i)^T (A y′_i), i = 1, ..., k,   (A y′_i)^T (A y′_j) = 0 for i ≠ j).
We now explain the essence of Theorem 2.7. View x_1, ..., x_k as an approximation to the first k left singular vectors of A, and Σ_{i=1}^{k} x_i (A^T x_i)^T as the k-rank approximation to A. Hence (A^T x_i)^T (A^T x_i) is an approximation to σ_i² of A for i = 1, ..., k. Read additional vectors w_1, ..., w_l ∈ R^m such that at least one of these vectors is not in the subspace spanned by x_1, ..., x_k. Let X be the subspace spanned by x_1, ..., x_k and w_1, ..., w_l. Hence x_1, ..., x_k, ..., x_p is the orthonormal basis of X obtained from the vectors x_1, ..., x_k, w_1, ..., w_l using the Modified Gram-Schmidt process. Note that k < p ≤ k + l, and in general p = k + l. Consider the p × p nonnegative definite matrix S = ((A^T x_i)^T (A^T x_j))_{i,j=1}^{p}. Find its first k eigenvectors to obtain x′_1, ..., x′_k ∈ R^m. Then C := Σ_{i=1}^{k} x′_i (A^T x′_i)^T is the best approximation of A by a matrix B of rank at most k whose columns are in the subspace X. In particular, the approximation C is better than the previous approximation Σ_{i=1}^{k} x_i (A^T x_i)^T, which is equivalent to (2.29). A similar situation holds in the second case for y_1, ..., y_k, z_1, ..., z_l ∈ R^n.
Outline of the Proof of Theorem 2.7. Let S = (s_ij)_{i,j=1}^{p}, and let e_i = (δ_i1, ..., δ_ip)^T, i = 1, ..., p, be the standard orthonormal basis in R^p. Assume that we are dealing with the orthonormal set x_1, ..., x_p ∈ R^m. Use the definition of S and the Ky Fan characterization of the sum of the maximal k eigenvalues of the symmetric matrix S to deduce

Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i) = Σ_{i=1}^{k} e_i^T S e_i ≤ Σ_{i=1}^{k} λ_i = Σ_{i=1}^{k} o_i^T S o_i.

Let C := A^T (x_1, ..., x_p). Then S = C^T C. Hence √λ_1 ≥ ... ≥ √λ_p are the singular values of C and o_1, ..., o_p are the right singular vectors of C. Thus C o_i = √λ_i t_i ∈ R^n, where t_i is the left singular vector of C corresponding to the singular value √λ_i, for i = 1, ..., p. A straightforward calculation shows that λ_i = o_i^T S o_i = (A^T x′_i)^T (A^T x′_i) for i = 1, ..., k. Hence the first inequality in (2.29) holds. Furthermore, we deduce the first set of equalities in (2.30) for i = 1, ..., k. The second set of equalities in (2.30) follows from the orthonormality of t_1, ..., t_k. Similar arguments apply when dealing with the orthonormal system y_1, ..., y_p.
Algorithm [17]

One starts the algorithm by choosing the first k-rank approximation to A as follows. Let c_1, ..., c_n ∈ R^m and r_1^T, ..., r_m^T ∈ R^n be the n columns and the m rows of A. (We view the rows of A as row vectors.) Choose k integers 1 ≤ n_1 < ... < n_k ≤ n, (or 1 ≤ m_1 < ... < m_k ≤ m). Let x_1, ..., x_q ∈ R^m, (or y_1, ..., y_q ∈ R^n), be the orthonormal set obtained from c_{n_1}, ..., c_{n_k}, (or r_{m_1}^T, ..., r_{m_k}^T), using the Modified Gram-Schmidt process. Then let

B_0 := Σ_{i=1}^{q} x_i (A^T x_i)^T,   (or B_0 := Σ_{i=1}^{q} (A y_i) y_i^T).   (2.31)

That is, the first k-rank approximation B_0 to A is obtained as follows. Choose at random k columns, (or rows), of the data matrix A. Apply the Modified Gram-Schmidt process to them to obtain the orthonormal column vectors x_1, ..., x_q with m coordinates, (or the orthonormal column vectors y_1, ..., y_q with n coordinates). In general q = k, but if the chosen columns, (or rows), are linearly dependent, then q < k. Assume for simplicity of the exposition that q = k. Then B_0 is of the form (2.23) with y_i = A^T x_i, i = 1, ..., k, (or x_i = A y_i, i = 1, ..., k).
Now update iteratively the k-rank approximation B_{t−1} of A to B_t, using Theorem 2.7, by letting w_1 := c_{j_1}, ..., w_l := c_{j_l} ∈ R^m, (or z_1 := r_{j_1}^T, ..., z_l := r_{j_l}^T ∈ R^n), for some l integers 1 ≤ j_1 < ... < j_l ≤ n, (or 1 ≤ j_1 < ... < j_l ≤ m). That is, we choose another set of l columns of A, (or rows of A), preferably ones that were not chosen before, and update the k-rank approximation using the algorithm suggested by Theorem 2.7 to obtain an improved k-rank approximation B_t of A.
Furthermore, one can use the k-rank approximation B_t from the above algorithm to approximate the first k singular values σ_1, ..., σ_k, and the left and right singular vectors u_1, ..., u_k and v_1, ..., v_k, as follows. First, the square roots √λ_1(S), ..., √λ_k(S) of the eigenvalues of the matrix S are approximations of σ_1, ..., σ_k. If S was obtained using x_1, ..., x_p ∈ R^m, then the vectors x′_1, ..., x′_k approximate u_1, ..., u_k. Let v′_i := A^T x′_i, i = 1, ..., k. Then ||v′_i|| = √λ_i(S) is an approximation to σ_i of A for i = 1, ..., k, and the renormalized vectors (1/||v′_i||) v′_i approximate the right singular vectors v_i for i = 1, ..., k.

If S was obtained using y_1, ..., y_p ∈ R^n, then the vectors y′_1, ..., y′_k approximate v_1, ..., v_k, and the vectors A y′_1/||A y′_1||, ..., A y′_k/||A y′_k|| approximate u_1, ..., u_k.
Fast k-rank approximation and SVD algorithm

Input: positive integers m, n, k, l, N, an m × n matrix A, and ε > 0.

Output: an m × n k-rank approximation B_f of A, with the ratios ||B_0||/||B_t|| and ||B_{t−1}||/||B_t||, approximations to the k singular values, and approximations to the k left and right singular vectors of A.

1. Choose a k-rank approximation B_0 using k columns (rows) of A.

2. For t = 1 to N:

- Choose l columns (rows) from A at random and update B_{t−1} to B_t.

- Compute the approximations to the k singular values, and the k left and right singular vectors of A.

- If ||B_{t−1}||/||B_t|| > 1 − ε, let f = t and finish.
We now explain briefly the main steps of our algorithm. We read the dimensions m, n of the data matrix A. We set N as the maximal number of iterations we are going to execute to find the k-rank approximation of A. We read the entries of the data matrix A, and finally the small parameter ε > 0. We choose the k-rank approximation B_0 using (2.31) as explained above. Assume that B_{t−1} is the current k-rank approximation to A. Then we pick an additional l columns, (or rows), of A and update B_{t−1} to B_t using Theorem 2.7 as explained. Recall that ||B_{t−1}|| ≤ ||B_t||. If the relative improvement in ||B_t|| is less than ε, i.e. ||B_{t−1}||/||B_t|| > 1 − ε, we are satisfied with the approximation B_t and finish the algorithm. If this does not happen, then our algorithm stops after the N-th iteration.
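A simplified sketch of the column version of this algorithm is given below. QR factorization stands in for the Modified Gram-Schmidt step, the function names are ours, and the monotonicity of ||B_t||_F follows from Theorem 2.7.

```python
import numpy as np

# Simplified sketch of the fast k-rank approximation algorithm (column version).
# update() follows Theorem 2.7: form S = ((A^T x_i)^T (A^T x_j)) over the enlarged
# orthonormal set and rotate onto the top-k eigenvectors of S.
def update(A, X, new_cols):
    Xp, _ = np.linalg.qr(np.hstack([X, new_cols]))  # orthonormal basis of span(X, new cols)
    C = A.T @ Xp                                    # columns are A^T x_i, so S = C^T C
    S = C.T @ C
    _, O = np.linalg.eigh(S)                        # eigenvalues in ascending order
    k = X.shape[1]
    return Xp @ O[:, ::-1][:, :k]                   # top-k eigenvectors give x'_1, ..., x'_k

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 40))
k, l, steps = 5, 5, 6
X, _ = np.linalg.qr(A[:, rng.choice(40, size=k, replace=False)])  # B_0 from k random columns
norms = []
for _ in range(steps):
    B = X @ (X.T @ A)                               # current k-rank approximation B_t
    norms.append(np.linalg.norm(B, 'fro'))
    X = update(A, X, A[:, rng.choice(40, size=l, replace=False)])
```

As the theory guarantees, the recorded values ||B_t||_F never decrease.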
3 Cluster Analysis
3.1 Clusters and Clustering
Clustering is the process of grouping data objects into a set of disjoint classes called clusters, so that the objects within the same class have high similarity to each other, while objects in separate clusters are more dissimilar [25]. Clustering is an example of
unsupervised classification. Classification refers to a procedure that assigns data
objects to a set of classes. Unsupervised means that clustering does not rely on
predefined classes and training examples while classifying the data objects. Thus,
clustering is distinguished from pattern recognition or the areas of statistics known
as discriminant analysis and decision analysis, which seek to find rules for classify-
ing objects from a given set of pre-classified objects.
3.2 Various Gene Expression Clustering Procedures
Usually a microarray experiment involves on the order of 10^6 genes. One of the characteristics of gene expression data is that it is meaningful to cluster both genes and samples. We can group co-expressed genes into a cluster according to their expression patterns. In such gene based clustering, the genes are treated as objects, while the samples are the features. On the other hand, the samples can be partitioned into homogeneous groups. Each group may correspond to some particular cancer type.
Such sample based clustering regards the samples as the objects and the genes as
the features.
3.2.1 Proximity (Similarity) Measurement
Proximity measurement for gene expression data computes the distance between two data points. As mentioned before, a gene profile in a gene expression matrix can be formalized as a vector x_i = (x_ij : 1 ≤ j ≤ m), where x_ij is the value of the j-th feature of the i-th data object, and m is the number of features or experiments. The similarity between two expression profiles x_i and x_j is measured by a distance function of the corresponding vectors x_i and x_j.
Euclidean distance is one of the most commonly used measures of the distance between two gene expression profiles. The distance between two gene expression profiles x_i and x_j in m-dimensional space is defined as

d_Euclidean(x_i, x_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² ).

However, for gene expression data, the overall Euclidean distance does not perform well for shifted or scaled patterns (profiles). To circumvent this, each gene profile is standardized to zero mean and unit variance before calculating the distance.
An alternative measure of similarity is the Pearson correlation coefficient, which measures the similarity between the shapes of two expression profiles:

Pearson(x_i, x_j) = Σ_{k=1}^{m} (x_ik − μ_{x_i})(x_jk − μ_{x_j}) / ( √(Σ_{k=1}^{m} (x_ik − μ_{x_i})²) · √(Σ_{k=1}^{m} (x_jk − μ_{x_j})²) ),

where μ_{x_i} and μ_{x_j} are the means of the gene profiles x_i and x_j respectively. Pearson's correlation coefficient views each gene profile as a random variable, and measures the similarity between two gene profiles by calculating the linear relationship between their distributions. The shortcoming of the Pearson correlation is that it is not robust to outliers.
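Both proximity measures, together with the standardization step mentioned above, can be written out directly; the function names are ours.

```python
import numpy as np

# The two proximity measures above, plus zero-mean unit-variance standardization.
def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

def standardize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# A profile and a shifted, scaled copy: the raw Euclidean distance is large, while
# Pearson correlation (and Euclidean distance on standardized profiles) sees the
# same shape in both.
a = np.array([1.0, 3.0, 2.0, 5.0])
b = 2.0 * a + 10.0
```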
3.2.2 Hierarchical Cluster Analysis
One possibility for clustering objects is their hierarchical aggregation. Here the objects are combined according to their distances from, or similarities to, each other. Cluster formation may start by splitting the whole set of objects into individual clusters; with the more frequently used agglomerative clustering, one starts with single objects and successively merges them into larger groups.

To decide the number of clusters, different criteria can be used. Very often, the number of clusters is unknown. For example, in given clinical data, the number of clusters might be predefined by a given number of diseases. In some cases, the number of clusters can be obtained from a predetermined distance measure or difference between clusters. In general, when clusters A and B are merged into a new cluster K, the distance from K to an object or cluster i is computed from the distances of A and B to i; for the weighted average linkage, d_Ki = (d_Ai + d_Bi)/2, where the sizes of the clusters and their weights are assumed to be equal. Several other formulas exist for the construction of the distance matrix. The most important ones are considered now for the case of aggregating two clusters.
Single Linkage

Here the shortest distance between the opposite clusters is calculated, i.e.

d_Ki = (d_Ai + d_Bi)/2 − |d_Ai − d_Bi|/2 = min(d_Ai, d_Bi).

As a result, the clusters that are formed are loosely bound. The clusters are often linearly elongated, in contrast to the usual spherical clusters. This chaining is caused by the fusion of single objects to a cluster.
3.2.3 Clustering Using K-means Algorithm
K-means algorithm. Of all clustering procedures for microarray data, K-means is among the simplest and most widely used, and has probably the cleanest probabilistic interpretation as a form of expectation maximization (EM) on the underlying mixture model. In a typical implementation of the K-means algorithm, the number of clusters is fixed at some value K based, for instance, on the expected number of regularity patterns. K representative points or centers, also called centroids or prototypes, are initially chosen for each cluster more or less at random. Then at each step, we have the following:
• Each point in the data is assigned to the cluster associated with the closest
representative.
• After the assignment, new representative points are computed, for instance
by computing the center of gravity of each computed cluster.
• The two procedures above are repeated until the system converges or the fluctuations remain small.
We notice that using the K-means clustering method requires choosing the number of clusters, being able to compute a distance or similarity between points, and being able to compute a representative for each cluster given its members. The general idea behind K-means clustering can lead to different software implementations depending on how the initial centroids are chosen, how symmetries are broken, and so forth. A good implementation has to run the algorithm multiple times with different initial conditions, and possibly also try different values of K automatically. When the cost function corresponds to an underlying probabilistic mixture model, K-means is an online approximation to the classical EM algorithm, and as such it is in general bound to converge towards a solution that is at least a local maximum of the likelihood or of the posterior. A classical case is when Euclidean distances are used in conjunction with a Gaussian mixture model.
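The two alternating steps above can be sketched as a minimal K-means implementation. The fixed initial centroids and toy points are our choices, to keep the run deterministic; as noted, a real implementation would rerun with several initializations.

```python
import numpy as np

# A minimal K-means sketch following the two alternating steps above.
def kmeans(points, centroids, n_iter=20):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iter):
        # Step 1: assign each point to the cluster of the closest representative.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: recompute each representative as the center of gravity of its cluster.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = points[labels == k].mean(axis=0)
    return labels, centroids

pts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(pts, centroids=[[0.0, 0.0], [10.0, 10.0]])
```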
3.2.4 Mixture Models and EM Algorithm
Let us consider a data set D = (d_1, ..., d_N) and an underlying mixture model with K components of the form

P(d) = Σ_{k=1}^{K} P(M_k) P(d|M_k) = Σ_{k=1}^{K} λ_k P(d|M_k),

where λ_k ≥ 0, Σ_k λ_k = 1, and M_k is the model for cluster k [7]. Mixture distributions provide a flexible way of modelling complex distributions, combining together simple building blocks such as Gaussian distributions. The Lagrangian associated with the log-likelihood and the normalization constraint on the mixing coefficients is given by

L = Σ_{i=1}^{N} log( Σ_{k=1}^{K} λ_k P(d_i|M_k) ) − μ( Σ_{k=1}^{K} λ_k − 1 ),

with the corresponding critical equation

∂L/∂λ_k = Σ_{i=1}^{N} P(d_i|M_k)/P(d_i) − μ = 0.
Multiplying each critical equation by λ_k and summing over k gives the value of the Lagrange multiplier μ = N.

Multiplying the critical equation again by P(M_k) = λ_k and using Bayes' theorem in the form

P(M_k|d_i) = P(d_i|M_k) P(M_k) / P(d_i)

gives

λ*_k = (1/N) Σ_{i=1}^{N} P(M_k|d_i).

Therefore the maximum likelihood estimate of the mixing coefficient for class k is the sample mean of the conditional probabilities that d_i comes from the model M_k. Consider now that each model M_k has its own vector of parameters (w_kj).
Differentiating the Lagrangian with respect to w_kj gives

∂L/∂w_kj = Σ_{i=1}^{N} (λ_k / P(d_i)) · ∂P(d_i|M_k)/∂w_kj.

Combining the above equations we get

Σ_{i=1}^{N} P(M_k|d_i) · ∂ log P(d_i|M_k)/∂w_kj = 0

for each k and j. The maximum likelihood equations for estimating the parameters are thus weighted averages of the single-model maximum likelihood equations ∂ log P(d_i|M_k)/∂w_kj = 0. Notice that the weights are the probabilities of membership of d_i in each class.
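The updates derived above translate directly into an EM iteration. Below is a sketch for a two-component one-dimensional Gaussian mixture with a fixed common variance; the data and initialization are made up for illustration.

```python
import numpy as np

# Sketch of the EM updates derived above for a two-component 1-D Gaussian mixture
# with fixed common standard deviation sigma. The E-step computes P(M_k | d_i) via
# Bayes' theorem; the M-step sets lambda_k to the sample mean of the memberships
# (the estimate lambda*_k above) and re-estimates the means as weighted averages.
def em_step(d, lam, mu, sigma):
    # E-step: responsibilities P(M_k | d_i), one row per data point d_i.
    dens = np.exp(-0.5 * ((d[:, None] - mu[None, :]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    num = lam[None, :] * dens                      # lambda_k P(d_i | M_k)
    resp = num / num.sum(axis=1, keepdims=True)    # normalize by P(d_i)
    # M-step: mixing coefficients and means.
    lam_new = resp.mean(axis=0)                    # lambda*_k = (1/N) sum_i P(M_k | d_i)
    mu_new = (resp * d[:, None]).sum(axis=0) / resp.sum(axis=0)
    return lam_new, mu_new, resp

d = np.array([-2.1, -1.9, -2.0, 3.9, 4.1, 4.0, 4.2])   # made-up 1-D data, two clumps
lam = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
for _ in range(30):
    lam, mu, resp = em_step(d, lam, mu, sigma=1.0)
```

After a few iterations, the means settle near the two clumps and λ_k near the cluster proportions 3/7 and 4/7.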
4 Various Methods of Imputation of Missing Values
in DNA Microarrays
4.1 The Gene Expression Matrix
In this section we will view E ∈ R^{n×m}, with n ≥ m, as the gene expression matrix

E =
[ g_11 g_12 . . . g_1m ]
[ g_21 g_22 . . . g_2m ]
[   .    .   . . .  .  ]
[ g_n1 g_n2 . . . g_nm ]
 = [c_1 c_2 . . . c_m],   (4.1)

with rows and columns

g_j^T := (g_j1, g_j2, ..., g_jm), j = 1, ..., n,   c_i := (g_1i, g_2i, ..., g_ni)^T, i = 1, ..., m.

The row vector g_j^T corresponds to the (relative) expression levels of the j-th gene in the m experiments. The column vector c_i corresponds to the (relative) expression levels of the n genes in the i-th experiment.
Consider the SVD of the gene expression matrix E = U Σ V^T. In the terminology of [4], the columns of U are eigenarrays, the columns of V are eigengenes, and the singular values of E are eigenexpression levels.

In many microarray data sets, researchers have found that only a few eigengenes are needed to capture the overall gene expression pattern. (Here, by a “few” we mean less than half of the number of experiments m.) The number of these significant eigengenes is a fundamental problem in principal component analysis [24]. Let us mention explicitly three methods to estimate the number of significant eigengenes. The fraction criterion can be stated simply as follows. Let

p_q := σ_q² / Σ_{t=1}^{m} σ_t²,   q = 1, ..., m,   p := (p_1, ..., p_m)^T.   (4.2)
Thus p_q represents the fraction of the expression level contributed by the q-th eigengene. Then we choose the l eigengenes that contribute about 70%–90% of the total expression level. Another method is to use scree plots for the σ_q². (In principal component analysis, the p_q are proportional to the variances of the principal components, so we choose the principal components of maximum variability [26].) According to [24], the most consistent estimates of the number of significant eigengenes are achieved by the broken-stick model.
If E has l significant singular values, we view σ_q as effectively equal to zero for q > l. We define the matrix

E_l := Σ_{q=1}^{l} σ_q u_q v_q^T   (4.3)

as the filtered part of E and consider E − E_l as the noise part of E.
Let
1 ≥ h(p) := −(1/log m) ∑_{q=1}^m p_q log p_q ≥ 0.   (4.4)
Then h(p) is the rescaled entropy of the probability vector p. We have h(p) = 1 if and only if p_q = 1/m for q = 1, ..., m; in other words, all the eigengenes are equally expressed. On the other hand, h(p) = 0 if and only if p_q(1 − p_q) = 0 for q = 1, ..., m, which corresponds to rank r = 1; in other words, the gene expression is captured by a single eigengene (and eigenarray).
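The rescaled entropy (4.4) can be sketched as follows (Python/NumPy; the test matrices are hypothetical). A rank-one matrix gives h(p) near 0, while the identity, whose eigengenes are equally expressed, gives h(p) = 1.

```python
import numpy as np

def eigen_entropy(E):
    """Rescaled entropy h(p) of eq. (4.4); lies in [0, 1]."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s**2 / np.sum(s**2)
    m = len(p)
    p = p[p > 1e-15]                 # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p)) / np.log(m)

rng = np.random.default_rng(2)
rank_one = np.outer(rng.standard_normal(30), rng.standard_normal(5))

print(eigen_entropy(rank_one))       # near 0: one eigengene captures everything
print(eigen_entropy(np.eye(5)))      # 1: all eigengenes equally expressed
```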
The following example points out a potential weakness of SVD theory in trying
to detect groups of genes with similar properties.
4.2 SVD and gene clusters
Suppose the set of genes gTj , j ∈ [n], can be grouped into k + 1 disjoint subsets, [n] = G_1 ∪ · · · ∪ G_{k+1}, with G1, ..., Gk nonempty and m ≥ k (usually m > k). In particular,
consider the genes in each group Gq ( q = 1, ..., k) to have similar characteristics (in
other words, Gq is a cluster). Genes that have no similar characteristics are placed
in Gk+1. Denote by #Gq the cardinality of the set Gq for q = 1, ..., k + 1. Suppose
that our m experiments do not distinguish between any two genes belonging to the
same group Gq for q = 1, ..., k + 1. More precisely we assume:
g_{ji} = a_{qi} for each j ∈ G_q, q = 1, ..., k, i = 1, ..., m,   (4.5)
g_{ji} = 0 for each j ∈ G_{k+1}, i = 1, ..., m.
Let A = (a_{qi}) ∈ R^{k×m} be the corresponding k × m matrix with rows r_1^T, ..., r_k^T.
Then the row rq appears exactly #Gq times in E for q = 1, ..., k. In addition
E has #Gk+1 zero rows. Clearly the row space of E is the row space of A. So
k ≥ rank E = rank A. Hence if rank A = k then
σ1(E) ≥ ... ≥ σk(E) > σk+1(E) = ... = σm(E) = 0.
However, there is no simple formula relating the singular values of E and A. It
may happen that the rows of A are linearly dependent which indicates that several
groups out of G1, ..., Gk are somehow related, and the number of the significant
singular values of E is less than k.
Conclusion: The number of gene clusters is no less than the number of significant
singular values of the gene expression matrix.
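This conclusion admits a quick numerical check (Python/NumPy; the cluster sizes and dimensions are arbitrary choices): a matrix built from k = 3 repeated row profiles plus zero rows has exactly 3 nonzero singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
k, m = 3, 6
A = rng.standard_normal((k, m))      # one distinct expression profile per cluster

# E repeats profile q exactly #G_q times (sizes 40, 30, 20), plus 10 zero rows.
E = np.vstack([np.repeat(A, [40, 30, 20], axis=0), np.zeros((10, m))])

s = np.linalg.svd(E, compute_uv=False)
print("nonzero singular values:", np.sum(s > 1e-10))
```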
4.3 Missing Data in the Gene Expression Matrix
We now consider the problem of missing data in the gene expression matrix E.
(Our analysis can be applied to any matrix E.) Let N ⊂ [n] denote the set of rows
of E that contain at least one missing entry. Thus for each j ∈ N c := [n]\N , the
gene gTj has all of its entries. Let n′ denote the size of N c so that the size of N is
n − n′. We want to complete the missing entries of each gTj , j ∈ N , under some assumptions.
We first describe the reconstruction of the missing data in E using the SVD as
given in [4].
4.3.1 Reconsideration of §4.2
Let us reconsider the example of §4.2. Assume that rank A = k. Let j ∈ N and assume that the gene j is in the cluster Gq. Then we can reconstruct all missing entries of gTj if Gq \ N ≠ ∅. Indeed, if for some gene p ∈ Gq we have the results of all m experiments, then gj = gp and we have reconstructed the missing entries of gj. In this example we can reconstruct all the missing entries in E if E′ has the same rank as E; equivalently, if the equality l = l′ holds, where l and l′ are the ranks of E and E′ respectively.
4.3.2 Bayesian Principal Component Analysis (BPCA)
BPCA consists of three processes [28]:
• Principal Component (PC) regression
• Bayesian estimation
• Iterative expectation-maximization (EM) algorithm.
Here we give a summary of each process.
PC regression
In order to explain PC regression, we first recall principal component analysis (PCA).
Consider a D×N gene matrix Y where D represents the number of arrays (samples)
and N the number of genes. Let µ be the mean vector of the columns y_i:

µ := (1/N) ∑_{i=1}^N y_i.

Then construct the covariance matrix

S := (1/N) ∑_{i=1}^N (y_i − µ)(y_i − µ)^T.
Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_D be the eigenvalues of S and u_1, u_2, . . . , u_D the corresponding eigenvectors. Each eigenvector is scaled by the square root of its eigenvalue: the l-th principal axis vector w_l is defined by w_l = √λ_l u_l, and the l-th factor score by x_l = (w_l/λ_l)^T y.

Then the variation of the gene expression y can be represented as a linear combination of the principal axis vectors w_l (1 ≤ l ≤ K), K < D (only a few w_l are needed):

y = ∑_{l=1}^K x_l w_l + ε.
This is the spirit of PC regression. We now consider the missing values in the gene expression matrix. We can estimate the missing part y^miss of a gene expression y from its observed part y^obs, using PCA. Let w_l^obs and w_l^miss denote the parts of the principal axis w_l corresponding to the observed and missing coordinates respectively, and construct the matrices

W^obs = (w_1^obs, . . . , w_K^obs) and W^miss = (w_1^miss, . . . , w_K^miss).

We obtain the factor scores x = (x_1, . . . , x_K)^T by minimizing the residual error err = ‖y^obs − W^obs x‖^2. By the least squares method,

x = [(W^obs)^T W^obs]^{-1} (W^obs)^T y^obs.

Then the missing part of the gene expression y is estimated by y^miss = W^miss x.
Bayesian estimation

The probabilistic extension of principal component analysis (PPCA) has found useful applications in supervised learning (Tipping). In this setting the factor scores and the residual error are normally distributed:

x ∼ N(0, I_K)   and   ε ∼ N(0, (1/τ) I_D).

An interesting fact is that the maximum likelihood estimates of PPCA and PCA are identical. The log-likelihood is

ln p(y, x|θ) = ln p(y, x|W, µ, τ) = −(τ/2)‖y − Wx − µ‖^2 − (1/2)‖x‖^2 + (D/2) ln τ − ((K + D)/2) ln 2π,

where θ ≡ {W, µ, τ} is the parameter set. Using Bayes' theorem, we can obtain the posterior probability distribution of θ and X:

p(θ, X|Y) ∝ p(Y, X|θ) p(θ).
Iterative EM algorithm

If the true parameter θ_true were known, the posterior probability of the missing values would be given by

q(Y^miss) = p(Y^miss | Y^obs, θ_true).

Given instead a parameter posterior q(θ),

q(Y^miss) = ∫ q(θ) p(Y^miss | Y^obs, θ) dθ.

The variational Bayes (VB) algorithm can be used to estimate the model parameter θ and the missing values Y^miss. The VB algorithm is similar to EM: it obtains the posterior distributions q(θ) and q(Y^miss) by iteration.

Summary of the VB algorithm

1. Initialize q(Y^miss) by replacing missing values by gene-wise averages.
2. Estimate q(θ) using Y^obs and the current q(Y^miss).
3. Estimate q(Y^miss) using the current q(θ).
4. Update the hyperparameter α using both the current q(θ) and the current q(Y^miss).
5. Repeat (2)-(4) until convergence.

Then we impute the missing values in the gene expression matrix by

Ŷ^miss = ∫ Y^miss q(Y^miss) dY^miss.
4.3.3 The least square imputation method
Let E′ be the n′ × m matrix containing the rows gTj , j ∈ N c, of E which do not have any missing entries, and let l′ be the number of significant singular values of E′. Let X ⊂ R^m be the invariant subspace of the symmetric matrix (E′)^T E′ corresponding to the eigenvalues σ_1(E′)^2, ..., σ_{l′}(E′)^2, and let x_1, ..., x_{l′} be the corresponding orthonormal eigenvectors of (E′)^T E′. Then x_1, ..., x_{l′} is a basis of X.
Let M ⊂ [m] be a subset of cardinality m − m′. Consider the projection π_M : R^m → R^{m′} obtained by deleting the coordinates i ∈ M of a vector x = (x_1, ..., x_m)^T ∈ R^m. Then π_M(X) is spanned by π_M(x_1), ..., π_M(x_{l′}).

Fix j ∈ N and let M ⊂ [m] be the set of experiments (columns) where the gene gTj has missing entries. Let y ∈ π_M(X) be the least squares approximation to π_M(g_j). Then any g_j ∈ π_M^{-1}(y) is a completion of g_j. If π_M|X is 1-1 then g_j is unique; otherwise one can choose the g_j ∈ π_M^{-1}(y) of least norm. Note that to find y ∈ π_M(X) one needs to solve a least squares problem for the subspace π_M(X); in principle, for each j ∈ N one solves a different least squares problem.
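The completion of one gene can be sketched as follows (Python/NumPy; dimensions are hypothetical). The subspace X is taken from the right singular vectors of the complete rows E′, and the missing coordinates are filled from the least squares fit of the observed ones.

```python
import numpy as np

rng = np.random.default_rng(5)
m, l_prime = 8, 2

# Complete rows E' with an l'-dimensional row space.
E_prime = rng.standard_normal((40, l_prime)) @ rng.standard_normal((l_prime, m))

# x_1, ..., x_l': orthonormal eigenvectors of (E')^T E' spanning X.
basis = np.linalg.svd(E_prime, full_matrices=False)[2][:l_prime].T   # m x l'

g = basis @ rng.standard_normal(l_prime)   # an incomplete gene lying in X ...
M = np.array([1, 4])                       # ... with entries missing here
keep = np.setdiff1d(np.arange(m), M)

# Least squares fit of pi_M(g) within the projected subspace pi_M(X), then lift.
coeff = np.linalg.lstsq(basis[keep], g[keep], rcond=None)[0]
completed = basis @ coeff

print(np.allclose(completed[M], g[M]))     # the missing entries are recovered
```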
The crucial assumption of this method is

l = l′,   (4.1)

that is, the completed matrix E and its submatrix E′ have the same number of significant singular values. This follows from the observation that the completion of each row g_j, j ∈ N, lies in the subspace X. (Note that the inequalities (4.5) imply that (4.1) can be a very restrictive assumption.)

The significant singular values of E′ and of the reconstructed E are joint functions of all the rows (genes). By reconstructing the missing data in each gene gTj , j ∈ N, separately, we ignore any correlation between gTj and the other genes gTq , q ∈ N; consequently, this will have an impact on the singular values of E.
4.3.4 Iterative method using SVD
In the recent papers [35] and [9], the following iterative method using SVD to
impute missing values in a gene expression matrix is suggested. First, replace the
missing values with 0 or with values computed from another method. Call the
estimated matrix Ep, where p = 0. Find the lp significant singular values of Ep,
and let Ep,lp be the filtered part of Ep (4.3). Replace the missing values in E by the
corresponding values in Ep,lp to obtain the matrix Ep+1. Continue this process until
Ep converges to a fixed matrix (within a given precision). This algorithm takes into
account implicitly the influence of the estimation of one entry on the other ones.
But it is not clear if the algorithm converges, nor what are the features of any fixed
point(s) of this algorithm.
4.3.5 Local least squares imputation (LLSimpute)
Kim et al. [27] introduced an algorithm based on a least squares method. The method considers the k genes which are most similar to the gene with missing data. In the following, we review the essence of their method. We are concerned with two matrices A and B and a vector w. The matrix A is constructed by eliminating the entries of the k nearest-neighbor genes at the q missing locations of the given gene; the matrix B collects the eliminated entries; and the row vector w consists of the known values of the original gene. The recovery of the missing values can then be formulated as the least squares problem

min_x ‖A^T x − w‖_2.

Denote by u = (α_1, α_2, . . . , α_q)^T the vector of q missing values. Then u can be estimated as

u = B^T x = B^T (A^T)† w,

where (A^T)† is the pseudoinverse of A^T.
Now we use an example, borrowed from [27], to explain the model. Suppose the target gene g has two missing values, in the first and tenth positions out of ten experiments, and let gTs1 , . . . , gTsk be the k most similar genes. The target gene and the k similar genes can be arranged as

g^T      = (α_1, w_1, w_2, · · · , w_8, α_2),
g_{s1}^T = (B_{1,1}, A_{1,1}, A_{1,2}, · · · , A_{1,8}, B_{1,2}),
  ...
g_{sk}^T = (B_{k,1}, A_{k,1}, A_{k,2}, · · · , A_{k,8}, B_{k,2}).

The vector w of known values can be written as

w ≈ x_1 a_1 + x_2 a_2 + · · · + x_k a_k,

where a_i^T denotes the ith row of A. Then the missing values are estimated by

α_1 = B_{1,1} x_1 + B_{2,1} x_2 + · · · + B_{k,1} x_k,
α_2 = B_{1,2} x_1 + B_{2,2} x_2 + · · · + B_{k,2} x_k,

where α_1 and α_2 are the two missing values in the original gene.
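The recovery formula u = B^T (A^T)† w can be sketched as follows (Python/NumPy; the neighbor genes and missing positions are hypothetical, and the target gene is placed exactly in the neighbors' span so that the recovery can be verified):

```python
import numpy as np

rng = np.random.default_rng(7)
k, m = 5, 10                          # k similar genes, m experiments

G = rng.standard_normal((k, m))       # the k nearest-neighbor genes (rows)
target = rng.standard_normal(k) @ G   # target gene: exact combination of neighbors

miss = np.array([0, 9])               # q = 2 missing positions in the target gene
obs = np.setdiff1d(np.arange(m), miss)

A = G[:, obs]                         # neighbors at the known positions
B = G[:, miss]                        # neighbors at the missing positions
w = target[obs]                       # known values of the target gene

u = B.T @ np.linalg.pinv(A.T) @ w     # u = B^T (A^T)^dagger w

print(np.allclose(u, target[miss]))   # exact for this consistent example
```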
4.4 Motivation for FRAA
We suggest a new method in which the estimation of missing entries is done simul-
taneously, i.e., the estimation of one missing entry influences the estimation of the
other missing entries. If the gene expression matrix E has missing data, we want to complete its entries to obtain a matrix Ē such that the rank of Ē is equal to (or does not exceed) d, where d is taken to be the number of significant singular values of E. The completion of the entries of E to a matrix with a prescribed rank is a variation of the problem of communality (see [16, p. 637]). We give an optimization algorithm for finding Ē using the techniques for inverse eigenvalue problems discussed in [14].
It is likely that our algorithm can be used to estimate missing entries in data sets
other than gene expression data. Such a data set should be represented by an n×m
matrix whose rank is smaller than min(m,n).
4.4.1 Additional matrix theory background for FRAA
To compute the decomposition (1.21), it is enough to know vq and σquq. If σq
repeats k > 1 times in the sequence σ1 ≥ ... ≥ σr > 0, then the choice of the
corresponding k eigenvectors vj is not unique: any choice of the orthonormal basis
in the eigenspace of ETE corresponding to the eigenvalue σ2q is a legitimate choice.
In what follows we will use yet another equivalent definition of the singular
values of E. Let Rn×m denote the space of all real n × m matrices and let Sm(R)
denote the space of all real m × m symmetric matrices. For A ∈ Sm(R), we let
λ1(A) = λ1 ≥ ... ≥ λm(A) = λm, Azq = λqzq, zTq zt = δqt, q, t = 1, ...,m,
(4.2)
be the eigenvalues and corresponding eigenvectors of A, where the eigenvalues are
counted with their multiplicities, and the eigenvectors form an orthonormal basis in
Rm.
Consider the following (n + m) × (n + m) real symmetric matrix:

E^s := [ 0    E
         E^T  0 ].   (4.3)

It is known [22, §7.3.7] that

σ_q(E) := σ_q = λ_q(E^s) = −λ_{n+m+1−q}(E^s)  for q = 1, ..., m,   (4.4)
λ_q(E^s) = 0  for q = m + 1, ..., n.
The Cauchy interlacing property for E^s implies the following proposition [22, §7.3.9]. Let [n] := {1, 2, . . . , n}, and let N ⊂ [n], M ⊂ [m] denote sets of cardinalities n − n′ and m − m′ ≥ 0 respectively.

Proposition 4.1 Let E ∈ R^{n×m} and denote by E′ ∈ R^{n′×m′} the matrix obtained from E by deleting all rows i ∈ N and all columns j ∈ M. Then

σ_q(E) ≥ σ_q(E′)  for q = 1, ..., m,   (4.5)
σ_q(E′) ≥ σ_{q+n−n′+m−m′}(E)  for q = 1, ..., m′ + n′ − n.
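Proposition 4.1 can be checked numerically (Python/NumPy; the sizes and the deleted rows and columns are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n, m = 12, 6
E = rng.standard_normal((n, m))

# Delete rows N = {3, 7} and column M = {2}: n' = 10, m' = 5.
E_sub = np.delete(np.delete(E, [3, 7], axis=0), [2], axis=1)

s = np.linalg.svd(E, compute_uv=False)          # sigma_q(E)
s_sub = np.linalg.svd(E_sub, compute_uv=False)  # sigma_q(E')

# First inequality: sigma_q(E) >= sigma_q(E') for all defined q.
assert np.all(s[:len(s_sub)] + 1e-12 >= s_sub)

# Second inequality: sigma_q(E') >= sigma_{q+(n-n')+(m-m')}(E), a shift of 3 here.
shift = (12 - 10) + (6 - 5)
for q in range(len(s) - shift):
    assert s_sub[q] + 1e-12 >= s[q + shift]

print("interlacing inequalities hold")
```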
The significance of this proposition is explained in §4 and §5.
4.4.2 The Optimization Problem
We now show that the estimation problem discussed in the previous section can be cast as the following optimization problem:
Problem 4.2 Let S be a given subset of [n] × [m]. (S is the set of uncorrupted entries of the gene expression matrix E given by (4.1).) Let e(S) := {e_{ji} : (j, i) ∈ S} be a given set of real numbers. (e(S) is the set of uncorrupted (known) values of the entries of E.) Let M(e(S)) ⊂ R^{n×m} be the affine subset of all matrices A = (a_{ji}) ∈ R^{n×m} such that a_{ji} = e_{ji} for all (j, i) ∈ S. (M(e(S)) is the set of all possible choices for the completed E.) Let ℓ be a positive integer not exceeding m. Find E ∈ M(e(S)) with minimal σ_ℓ.
Let E = (g_{ji}) denote the gene expression matrix with missing values. We choose the S in Problem 4.2 to be the set of coordinates (j, i) for which the entry g_{ji} is not missing. Recall that N ⊂ [n] denotes the set of rows of E such that each row j ∈ N contains at least one missing entry; the cardinality of N is n − n′. Thus the set S contains all the elements (j, 1), ..., (j, m) for each j ∈ N c. The complement of S is the set of coordinates S^c = {(j, i) | g_{ji} is missing} ⊂ N × [m]. Let o denote the total number of missing entries in E. Then o ≥ n − n′.
Let E ′ be the matrix as in §4.3.3 with l′ significant singular values. Note that
(4.5) yields σq(E) ≥ σq(E′) for q = 1, ...,m. Thus if we want to complete E such
that the resulting matrix still has exactly l′ significant singular values, we should
consider Problem 4.2 with ` = l′ + 1.
A more general possibility is to assume that the number of significant singular
values of a possible estimation of E is l = l′ + k where k is a small integer, e.g.
k = 1 or 2. That is, the group of genes gTj for j ∈ N contributes to l′ + 1, ..., l′ + k
significant eigengenes of E. Then one considers Problem 4.2 with ` = l′ + k + 1.
We now consider a modification of Problem 4.2 which has a nice numerical
algorithm.
Problem 4.3 Let S ⊂ [n]× [m] and denote by e(S) a given set of real numbers
eji for (j, i) ∈ S . Let M(e(S)) ⊂ Rn×m be the affine subset of all matrices A =
(a_{ji}) ∈ R^{n×m} such that a_{ji} = e_{ji} for all (j, i) ∈ S. Let ℓ be a positive integer not exceeding m. Find E ∈ M(e(S)) such that ∑_{q=ℓ}^m σ_q^2 is minimal.

Clearly, we can find E ∈ M(e(S)) with a “small” σ_ℓ^2(E) if and only if we can find E ∈ M(e(S)) with a “small” ∑_{q=ℓ}^m σ_q^2(E).
4.5 Fixed Rank Approximation Algorithm
4.5.1 Description of FRAA
We now describe one of the standard algorithms to solve Problem 4.3. Mathemati-
cally it is stated as follows:
Algorithm 4.4 Fixed Rank Approximation Algorithm (FRAA)
Let E_p ∈ M(e(S)) be the pth approximation to a solution of Problem 4.3. Let A_p := E_p^T E_p and find an orthonormal set of eigenvectors v_{p,1}, ..., v_{p,m} for A_p as in (4.2). Then E_{p+1} is a solution to the following minimization of a convex nonnegative quadratic function:

min_{E ∈ M(e(S))} ∑_{q=ℓ}^m (E v_{p,q})^T (E v_{p,q}).   (4.1)
The flow chart of this algorithm can be given as:
Fixed Rank Approximation Algorithm (FRAA)
Input: integers m,n, L, iter, the locations of non-missing entries S , initial approx-
imation E0 of n × m matrix E.
Output: an approximation Eiter of E.
for p = 0 to iter − 1
- Compute Ap := ETp Ep and find an orthonormal set of eigenvectors for Ap,
vp,1, ...,vp,m.
- Ep+1 is a solution to the minimum problem (4.1) with ` = L.
4.5.2 Explanation and justification of FRAA
We now explain the algorithm and show that in each step we decrease the value of the function we minimize:

∑_{q=ℓ}^m σ_q^2(E_p) ≥ ∑_{q=ℓ}^m σ_q^2(E_{p+1}).   (4.2)
For any integer k ∈ [m], let Ω_k denote the set of all k-tuples of orthonormal vectors y_1, ..., y_k in R^m. Let A be an m × m real symmetric matrix and assume (4.2). Then the minimal principle (the Ky Fan characterization for −A) is:

∑_{q=ℓ}^m λ_q(A) = ∑_{q=ℓ}^m z_q^T A z_q = min_{(y_ℓ,...,y_m) ∈ Ω_{m−ℓ+1}} ∑_{q=ℓ}^m y_q^T A y_q.   (4.3)

See for example [14].
Let E = E_p + X ∈ M(e(S)). Then X = (x_{ji})_{j,i=1}^{n,m}, where x_{ji} = 0 if (j, i) ∈ S and x_{ji} is a free variable if (j, i) ∉ S.

Let x = (x_{j_1 i_1}, x_{j_2 i_2}, . . . , x_{j_o i_o})^T denote the o × 1 vector whose entries are indexed by S^c, the coordinates of the missing values in E. Then there exists a unique o × o real symmetric nonnegative definite matrix B_p which satisfies the equality

x^T B_p x = ∑_{q=ℓ}^m v_{p,q}^T X^T X v_{p,q}.   (4.4)
Let F(j, i) be the n × m matrix with 1 in the (j, i) entry and 0 elsewhere. Then the (s, t) entry of B_p is given by

b_p(s, t) = (1/2) ∑_{q=ℓ}^m v_{p,q}^T (F(j_s, i_s)^T F(j_t, i_t) + F(j_t, i_t)^T F(j_s, i_s)) v_{p,q},  s, t = 1, . . . , o.   (4.5)
Let N ⊂ [n]. Let S(j) denote the set of coordinates in row j with known values in E, so that S(j)^c denotes the set of coordinates of the missing values in row j:

S^c = ∪_{j∈N} S(j)^c,   S(j)^c = {(j, i(j, 1)), ..., (j, i(j, o_j))},   (4.6)
m ≥ i(j, o_j) > ... > i(j, 1) ≥ 1  for j ∈ N,
o := ∑_{j∈N} o_j.   (4.7)

Note that the set O_j described just after (4.13) is given by O_j := {i(j, 1), ..., i(j, o_j)}.
Theorem 4.5 The o × o symmetric nonnegative definite matrix B_p given by (4.4) decomposes into a direct sum of #N = n − n′ symmetric nonnegative definite matrices indexed by the set N:

B_p = ⊕_{j∈N} B_{p,j},  where B_{p,j} = (b_{p,j}(q, r))_{q,r=1}^{o_j} is o_j × o_j for j ∈ N,   (4.8)

and

x^T B_p x = ∑_{j∈N} x_j^T B_{p,j} x_j.   (4.9)

More precisely, let v_{p,k} = (v_{p,k,1}, ..., v_{p,k,m})^T, k = 1, ..., m, be given as in Algorithm 4.4. Then

b_{p,j}(q, r) = ∑_{k=ℓ}^m v_{p,k,i(j,q)} v_{p,k,i(j,r)},  q, r = 1, ..., o_j.   (4.10)

Equivalently, let W_p be the following m × m idempotent symmetric matrix (W_p^2 = W_p) of rank m − ℓ + 1:

W_p = ∑_{k=ℓ}^m v_{p,k} v_{p,k}^T = T_p T_p^T,   T_p = [v_{p,ℓ}, ..., v_{p,m}] ∈ R^{m×(m−ℓ+1)}.   (4.11)

Then B_{p,j} is the submatrix of W_p of order o_j with respect to the rows and columns in the set O_j, for j ∈ N. In particular, if each row of E has at most one missing entry, then B_p is a diagonal matrix.
Proof. View the rows and the columns of B_p as indexed by (s, i(s, q)) and (t, i(t, r)) respectively, where s, t ∈ N and q = 1, ..., o_s, r = 1, ..., o_t. (For the purposes of this proof, the notation here is slightly different from that in the body of the paper.) So B_p = (b_p((s, i(s, q)), (t, i(t, r)))). Let F(j, i) be the n × m matrix which has 1 in the (j, i) place and all other entries equal to zero. Then

b_p((s, i(s, q)), (t, i(t, r))) = (1/2) ∑_{k=ℓ}^m v_{p,k}^T (F(s, i(s, q))^T F(t, i(t, r)) + F(t, i(t, r))^T F(s, i(s, q))) v_{p,k},   (4.12)
s, t ∈ N, q = 1, ..., o_s, r = 1, ..., o_t.

It is straightforward to show that F(s, i(s, q))^T F(t, i(t, r)) = 0 if s ≠ t. Furthermore, for s = t the matrix F(s, i(s, q))^T F(t, i(t, r)) + F(t, i(t, r))^T F(s, i(s, q)) has 1 in the places (i(s, q), i(t, r)) and (i(t, r), i(s, q)) for r ≠ q, has 2 in the place (i(s, q), i(s, q)) if r = q, and is zero in all other positions. Hence b_p((s, i(s, q)), (t, i(t, r))) = 0 unless s = t. If s = t then a straightforward calculation yields (4.10). The other claims of the theorem follow straightforwardly from the equality (4.10).
Observe that B_p decomposes into the direct sum of n − n′ symmetric nonnegative definite matrices indexed by N. Hence the function minimized in (4.1) is
given by

∑_{q=ℓ}^m v_{p,q}^T E^T E v_{p,q} = ∑_{q=ℓ}^m v_{p,q}^T (A_p + E_p^T X + X^T E_p + X^T X) v_{p,q}
  = x^T B_p x + 2 w_p^T x + ∑_{q=ℓ}^m λ_q(A_p)
  = ∑_{j∈N} (x_j^T B_{p,j} x_j + 2 w_{p,j}^T x_j) + ∑_{q=ℓ}^m λ_q(A_p),   (4.13)

where w_p := (w_{p,1}, . . . , w_{p,o})^T and

w_{p,t} = ∑_{q=ℓ}^m v_{p,q}^T E_p^T F(j_t, i_t) v_{p,q},  t = 1, ..., o.
For j ∈ N the vector x_j ∈ R^{o_j} contains all o_j missing entries of E in row j, of the form x_{j i_t}, i_t ∈ O_j, for the corresponding set O_j ⊂ [m] of cardinality o_j. (See Appendix.) Since the expression in (4.1), and hence in (4.13), is always nonnegative, it follows that w_p is in the column space of B_p. Hence the minimum of the function given in (4.13) is achieved at the critical point

B_p x_{p+1} = −w_p,   (4.14)

and this system of equations is always solvable. (If B_p is not invertible, we find the least squares solution.)
We now show (4.2). The vector x_{p+1} contains the entries of the matrix X_{p+1}, and E_{p+1} := E_p + X_{p+1}. From the definition of A_{p+1} := E_{p+1}^T E_{p+1} and the minimality of x_{p+1} we obtain

∑_{q=ℓ}^m σ_q(E_p)^2 = ∑_{q=ℓ}^m v_{p,q}^T (E_p + 0)^T (E_p + 0) v_{p,q}
  ≥ ∑_{q=ℓ}^m v_{p,q}^T (E_p + X_{p+1})^T (E_p + X_{p+1}) v_{p,q} = ∑_{q=ℓ}^m v_{p,q}^T A_{p+1} v_{p,q}
  ≥ ∑_{q=ℓ}^m λ_q(A_{p+1}) = ∑_{q=ℓ}^m σ_q(E_{p+1})^2.
We conclude this section by remarking that to solve Problem 4.2, one may use
the methods of [16].
4.5.3 Algorithm for (4.14)
From Theorem 4.5, the system of equations B_p x = −w_p in o unknowns is equivalent to n − n′ smaller systems

B_{p,j} x_{p+1,j} = −w_{p,j},  j ∈ N.   (4.1)

Thus the big system (4.14) in o unknowns, the coordinates of x_{p+1}, splits into n − n′ independent systems given by (4.1). That is, in the iterative update of the unknown entries of E given by the matrix E_{p+1}, the values in row j ∈ N in the places S(j)^c are determined by the values of the entries of E_p in the places S(j)^c and the eigenvectors v_{p,ℓ}, ..., v_{p,m} of E_p^T E_p.

We now show how to solve the system (4.14) efficiently.

Algorithm 4.6 For j ∈ N let T_{p,j} be the o_j × (m − ℓ + 1) matrix obtained from T_p, given by (4.11), by deleting all rows except the rows i(j, 1), ..., i(j, o_j). Then (4.1) is equivalent to

T_{p,j} T_{p,j}^T x_{p+1,j} = −w_{p,j},  j ∈ N,   (4.2)

which can be solved efficiently by the QR algorithm as follows. Write T_{p,j} as Q_{p,j} R_{p,j} P_{p,j}, where Q_{p,j} is an o_j × d_{p,j} matrix with d_{p,j} orthonormal columns, R_{p,j} is an upper triangular d_{p,j} × (m − ℓ + 1) matrix with d_{p,j} nonzero rows, d_{p,j} = rank T_{p,j}, and P_{p,j} is a permutation matrix. (The columns of Q_{p,j} are obtained from the columns of T_{p,j} using the modified Gram-Schmidt process.) Then

Q_{p,j}^T x_{p+1,j} = −(R_{p,j} R_{p,j}^T)^{−1} Q_{p,j}^T w_{p,j}

and

x_{p+1,j} = −Q_{p,j} (R_{p,j} R_{p,j}^T)^{−1} Q_{p,j}^T w_{p,j},  j ∈ N,   (4.3)

is the least squares solution for x_{p+1,j}.
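Putting the pieces together, one FRAA sweep per iteration can be sketched as follows (Python/NumPy). This is a hedged reading of Algorithm 4.4 combined with the row-wise splitting of Theorem 4.5: B_{p,j} is the submatrix of W_p = T_p T_p^T on the missing columns O_j, w_{p,j} is read off from (4.13), and a generic least squares solve stands in for the QR step of Algorithm 4.6. The test matrix and missing pattern are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m, L = 40, 6, 3                # L plays the role of ell in Problem 4.3
C = rng.standard_normal((n, 2)) @ rng.standard_normal((2, m))   # low-rank "truth"

mask = np.zeros((n, m), dtype=bool)
mask[rng.choice(n, 8, replace=False), rng.integers(0, m, 8)] = True  # missing slots
E = np.where(mask, 0.0, C)        # initial approximation E_0

for _ in range(50):
    lam, V = np.linalg.eigh(E.T @ E)          # eigenvectors of A_p = E_p^T E_p
    T = V[:, ::-1][:, L - 1:]                 # T_p = [v_L, ..., v_m], eq. (4.11)
    for j in np.flatnonzero(mask.any(axis=1)):
        O = np.flatnonzero(mask[j])           # missing columns O_j of row j
        B = T[O] @ T[O].T                     # B_{p,j}: submatrix of W_p
        w = T[O] @ (T.T @ E[j])               # w_{p,j}, read off from (4.13)
        x = np.linalg.lstsq(B, -w, rcond=None)[0]
        E[j, O] += x                          # update the missing entries of row j

err0 = np.linalg.norm(C[mask])                # error of the all-zeros initial guess
print("error on missing entries:", np.linalg.norm((E - C)[mask]), "vs initial", err0)
```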
4.6 Simulation
We implemented the Fixed Rank Approximation Algorithm (FRAA) in Matlab and
tested it on the microarray data Saccharomyces cerevisiae [33] as provided at
http://genome-www.stanford.edu (the elutriation data set). The dimension of the
complete gene expression matrix is 5986×14. We randomly deleted a set of entries
and ran FRAA on this “corrupted” matrix to obtain estimates for the deleted entries.
The FRAA requires four inputs: the matrix E with N rows and M columns with missing entries, an initial guess for the missing entries, a parameter L (the number of significant singular values), and the number of iterations. We set the initial guess to the missing-data matrix with 0's replacing the missing values, set the number of significant singular values to L = 2, and ran the algorithm through 5 iterations. (There was no significant change in the estimates when we replaced L = 2 with L = 3.)
We compared our estimates to estimates obtained by three other methods: re-
placing missing values with 0’s (zeros method), row means (row means method),
or the values obtained by the KNNimpute program [35]. We used a normalized
root mean square as the metric for comparison: if C represents the complete matrix and E_p represents an estimate of the corrupted matrix E, then the root mean square (RMS) of the difference D = C − E_p is ‖D‖_F / √N. We normalized the root mean square by dividing the RMS by the average value of the entries in C.
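The comparison metric can be sketched as follows (Python/NumPy; the matrices are synthetic placeholders, not the elutriation data):

```python
import numpy as np

def normalized_rms(C, E_p):
    """Normalized RMS: (||C - E_p||_F / sqrt(N)) divided by the mean entry of C."""
    N = C.shape[0]
    rms = np.linalg.norm(C - E_p) / np.sqrt(N)
    return rms / np.mean(C)

rng = np.random.default_rng(10)
C = rng.random((100, 14)) + 1.0          # positive entries, as for intensity data
E_p = C + 0.01 * rng.standard_normal(C.shape)
print(normalized_rms(C, E_p))
```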
In simulations where 1% − 20% of the entries were randomly deleted from
the complete matrix C, the FRAA performed slightly better than the row means
method, and significantly better than the zeros method. However, the KNNimpute
algorithm (with parameters k= 15, d= 0) produced the most accurate estimates,
with normalized RMS errors that were smaller than the normalized RMS errors
from the other three methods. Figure 7.1 displays the results of one set of experiments estimating the elutriation matrix when each of 1, 5, 10, 15, 20% of the entries was removed: the normalized RMS errors are plotted against the percent missing. When 25 simulations of deleting and then estimating 5% of the entries were conducted, we found the average normalized RMS to be approximately 0.19 for KNNimpute and 0.24 for FRAA, with standard deviations of approximately 0.02 for both methods. Not surprisingly, normalized RMS errors increase with increasing percentage of missing values.
[Plot: Elutriation data; normalized RMS error vs. percent missing.]

Fig. 7.1 Comparison of normalized RMS against percent missing for three methods: FRAA, KNNimpute, and row means. The normalized RMS for the zeros method is not displayed, but the values are 0.397, 0.870, 1.24, 1.52, 1.76 for 1, 5, 10, 15, 20 percent missing, respectively.
In [35], the authors caution against using KNNimpute for matrices with fewer than 6 columns. We randomly selected four columns from the elutriation data set to form a truncated data set, then randomly deleted 1%-20% of the entries from this newly formed matrix. Figure 7.2 gives a comparison of the normalized RMS errors against percent missing, in one run of the simulation at each percentage. When 25 simulations at 10% missing were run, we found the average normalized RMS to be approximately 0.143 for FRAA and 0.166 for KNNimpute, with standard deviations of approximately 0.001 and 0.003, respectively.
[Plot: Elutriation, 4 columns; normalized RMS error vs. percent missing for FRAA, KNNimpute, and row means.]

Fig. 7.2 Four columns of the full elutriation matrix were randomly selected. Entries were then randomly deleted from this truncated matrix. Plot of normalized RMS against percent missing.
For one simulation in which we randomly deleted and then estimated 10% (4200) of the entries from the full elutriation matrix, we compared the raw errors (true value minus estimated value) for each of the 4200 imputed entries obtained using either KNNimpute or FRAA. Figure 7.3 shows a scatter plot of the raw errors from the KNNimpute estimate against the raw errors from the FRAA estimate. This plot suggests that KNNimpute and FRAA are rather consistent in how they estimate the missing values.
74
FRAA: raw errors
KN
Nim
pute
: raw
err
ors
-1 0 1 2 3 4 5
01
23
45
Scatter plot of raw errors 10% missing
Fig. 7.3 Scatter plot of the raw errors (true - estimate) of each of the 4200 imputed
entries in one simulation using KNNimpute and FRAA. The correlation between
the two sets of raw errors is .84.
We ran similar simulations on the Cdc15 data set available on the web (http://genome-www.stanford.edu/SVD/htmls/spie.html) and on subsets of this data set (using 4 columns). We also ran a couple of simulations on one of the data sets included in [28]. The outcomes were similar to those for the elutriation data set, with the FRAA algorithm outperforming KNN on the matrices with a small number of columns.
4.7 Discussion of FRAA
The Fixed Rank Approximation Algorithm uses the singular value decomposition
to obtain estimates of missing values in a gene expression matrix. It uses all the
known information in the matrix to simultaneously estimate all missing entries.
Preliminary tests indicate that, under a normalized root mean square metric, FRAA
is more accurate than replacing missing values with 0’s or with row means. The
KNNimpute algorithm was more accurate when estimating missing entries deleted
from the full elutriation matrix, but FRAA might be a feasible alternative in cases
when the number of columns is small.
FRAA is another option, in addition to KNN, Bayesian estimations or local least
squares imputations, for estimating missing values in gene expression data. FRAA
by itself is a very useful tool for gene data analysis without using clustering meth-
ods. Experimental results on various data sets show that FRAA is robust. FRAA
has been used by several computational biologists, who confirmed the accessibility
of the algorithm.
To improve the results given by FRAA one needs to combine it with an algo-
rithm for gene clustering. A possible implementation is as follows: First, apply
FRAA to the corrupted data set; next, using this estimated data set, subdivide the
genes into clusters of genes with similar traits; now apply FRAA again to the miss-
ing entries of genes in each cluster. We intend to apply these steps in a future paper.
Our final remark is that the biology of the data should guide the researcher in
determining the best method to use for imputing missing values in these data sets.
4.8 IFRAA
4.8.1 Introduction
The aim of this section is to introduce IFRAA, an improved version of FRAA which significantly improves its performance. IFRAA is a successful combination of FRAA and a good clustering algorithm. IFRAA works as follows. First we
use FRAA to find a completion G. Then we use a clustering algorithm (we used
K-means here) to find a reasonable number of clusters of similar genes. For each
cluster of genes we apply FRAA separately to recover the missing entries in this
cluster. It turns out that this modification results in a very efficient algorithm for
reconstructing the missing values of the gene expression matrix. We also note that
FRAA and IFRAA are very effective in reconstructing missing values of n × m
matrices, which are expected to have low effective ranks.
4.8.2 Computational comparisons of
BPCA, FRAA, IFRAA and LLSimpute
For the comparison of different imputation algorithms, five different types of data sets were used, including two microarray gene expression data sets and three randomly generated synthetic data sets [15]. The microarray data sets were obtained from studies for the identification of cell-cycle regulated genes in yeast (Saccharomyces cerevisiae) [33]. The first gene expression data set is a complete matrix of 5986 genes and 14 experiments based on the Elutriation data set in [33]. This data set does not have any missing values, since all the genes that originally had missing values were deleted, in order to assess the performance of the imputation algorithms. The second microarray data set is based on the Cdc15 data set in [33], and contains 5611 genes and 24 experiments. We obtained the complete data set in the same way as for the first data set. The three synthetic data sets were randomly generated matrices of size 2000 × 20 and ranks 2, 4, 8.
To assess the performance of missing value estimation methods, we performed77
simulations where 1%, 5%, 10% and 20% of the entries were randomly deleted from
the complete matrix C. Then we estimated the various completions of the missing
values by BPCA, FRAA, IFRAA and LLSimpute. We used the L2-norm version of
LLSimpute. We set the K parameter (the number of similar genes) for LLSimpute large enough that increasing it further did not improve the accuracy of LLSimpute.
We used a normalized root mean square error (NRMSE) as a metric for compar-
ison. If C denotes the complete matrix and C̃ the completed matrix obtained by estimating the corrupted entries of C, then the root mean square error (RMSE) is ‖D‖_F / √N, where D = C − C̃ and N is the number of entries of C. We normalized the root mean square error by dividing the RMSE by the average value of the entries in C.
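A short Python function (an illustrative helper, not part of our simulation code) makes this metric concrete:

```python
import numpy as np

def nrmse(C, C_est):
    """NRMSE of a completed matrix C_est against the true matrix C.

    RMSE = ||D||_F / sqrt(N) with D = C - C_est and N the number of
    entries of C; the RMSE is then divided by the mean entry of C.
    """
    D = C - C_est
    rmse = np.linalg.norm(D, "fro") / np.sqrt(C.size)
    return rmse / C.mean()
```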
The random matrices of order 2000 × 20 and of rank k = 2, 4, 8 appearing in
Figures 1,2 and 3 were generated as follows. One generates 2k random column
vectors x1, . . . ,xk ∈ R2000,y1, . . . ,yk ∈ R20, where the entries of these vectors
are chosen according to a uniform distribution on the interval [0, 1]. Then C = x_1 y_1^T + · · · + x_k y_k^T.
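This construction is easy to reproduce; a NumPy sketch (the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 2000, 20, 4

# 2k random vectors with entries uniform on [0, 1]:
# the columns of X are x_1, ..., x_k, the columns of Y are y_1, ..., y_k.
X = rng.uniform(0.0, 1.0, size=(n, k))
Y = rng.uniform(0.0, 1.0, size=(m, k))

# C = sum_i x_i y_i^T, an n x m matrix of rank k (with probability 1)
C = X @ Y.T
```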
Figure 1 represents the comparisons of BPCA, FRAA and LLSimpute for a 2000 × 20 random matrix of rank 2. In this case FRAA completely reconstructed C. The performance of IFRAA was identical to that of FRAA, so we did not plot the performance
of IFRAA. BPCA performed excellently for 1% missing data, and its performance somewhat deteriorated as the percentage of missing data increased. The performance of LLSimpute was also excellent for 1% missing data, but it deteriorated significantly as the percentage of missing data increased.
Figure 2 represents the comparisons of BPCA, FRAA, IFRAA and LLSimpute
for a 2000 × 20 random matrix of rank 4. Here BPCA and IFRAA performed extremely well. IFRAA slightly outperformed BPCA, in particular in the case with 20% of missing data. FRAA performed reasonably well, but not as well as IFRAA or BPCA. The behavior of LLSimpute was similar to its performance in Figure 1.
Figure 3 represents the comparisons of BPCA, FRAA, IFRAA and LLSimpute
for a 2000 × 20 random matrix of rank 8. The behavior of BPCA, IFRAA and LLSimpute is similar to their behavior in Figure 2. In this case, however, FRAA performs well when the percentage of missing entries is less than 20%, but it does not perform well when 20% of the entries are missing.
Figures 4 and 5 compare BPCA, IFRAA and LLSimpute on the gene expression data matrices, the Elutriation data set and the Cdc15 data set, respectively. Since some of the methods are local and others global, for a fair comparison we first clustered each data set using the K-means clustering algorithm, and then performed the missing value estimation on the clustered data. In our simulations we randomly deleted 1%, 5%, 10%, 15%, and 20% of the entries of the data matrix. In IFRAA we chose the parameter L, the number of significant singular values plus one, to be 2 for the Elutriation data set and 3 for the Cdc15 data set. The initial guess for the missing entries of each gene was chosen to be the average of its corresponding row. FRAA did not perform as well as the other three methods, so we do not include FRAA graphs in Figures 4 and 5.
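This row-average initialization can be sketched in NumPy (an illustrative helper, not the thesis code; as in the Matlab code of Section 4.10, rows with all values missing are assumed to have been removed):

```python
import numpy as np

def row_average_init(E):
    # Replace each missing (NaN) entry by the mean of the observed
    # entries in its row; every row must have at least one observed value.
    G = E.copy()
    row_means = np.nanmean(G, axis=1)
    r, c = np.where(np.isnan(G))
    G[r, c] = row_means[r]
    return G
```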
Figure 4 depicts the comparison of BPCA, IFRAA and LLSimpute on the Elutriation data set in [33]. The corresponding gene expression data set is a complete matrix of n = 5986 genes and m = 14 experiments. In this case BPCA performed the best,
Figure 1: Comparison of NRMSE against percent of missing entries for three meth-
ods: FRAA, BPCA and LLS. Data set was a 2000 × 20 randomly generated matrix
of rank 2.
IFRAA was slightly inferior to BPCA, and LLSimpute was weaker than IFRAA.
Figure 5 depicts the comparison of BPCA, IFRAA and LLSimpute on the Cdc15 data set in [33], which contains 5611 genes and 24 experiments. In this case BPCA performed the best, and again IFRAA performed slightly better than LLSimpute.
Figure 2: Comparison of NRMSE against percent of missing entries for four methods: FRAA, IFRAA, BPCA and LLS. Data set was a 2000 × 20 randomly generated matrix of rank 4.
Figure 3: Comparison of NRMSE against percent of missing entries for four meth-
ods: FRAA, IFRAA, BPCA and LLS. Data set was a 2000×20 randomly generated
matrix of rank 8.
Figure 4: Comparison of NRMSE against percent of missing entries for three methods: IFRAA, BPCA and LLS. Data set was clustered data from the Elutriation data set in [33] with 14 samples.
Figure 5: Comparison of NRMSE against percent of missing entries for three methods: IFRAA, BPCA and LLS. Data set was clustered data from the Cdc15 data set in [33] with 24 samples.
4.9 Conclusions
We applied the four algorithms to several data matrices, including two microarray data sets. We corrupted, at random, certain percentages of these data sets and let the four algorithms recover them. We found BPCA and IFRAA to be the most reliable
approaches. LLSimpute also performed quite well. FRAA by itself was inferior to the other three methods, unless the gene expression matrix had small rank. We also
applied the four algorithms to synthetic data sets, which were random 2000 × 20
matrices of small ranks k = 2, 4, 8, where we again corrupted at random certain
percentages of these data sets. Not surprisingly, IFRAA and BPCA were able to recover the data quite well. When the percentage of corrupted data was not too large, or the matrix had rank 2, FRAA was able to reconstruct the data almost perfectly. LLSimpute performed well when the data set had 1% of missing entries; its performance deteriorated gradually with increasing percentage of missing entries. In conclusion, BPCA and IFRAA, combined with a good
clustering algorithm such as K-means used here, appear to be reliable methods for
recovering DNA microarray gene expression data, or any other noisy data matrix
which is effectively low-rank.
These results and the results in [27] and [28] show that the performances of BPCA, IFRAA and LLSimpute depend on the structure and also on the size of the data matrices, and that none of the three algorithms outperforms the others in all cases.
Our studies also show that the performance of FRAA, and hence of IFRAA, depends on the distribution of the missing entries. Since the performance of FRAA depends on the effective rank of the submatrix formed by the rows that have no missing entries, it is important to have a sufficient number of rows with this property. We can
argue that the reason BPCA in some cases works slightly better than IFRAA is this dependency of IFRAA on the distribution of the missing entries. However, in a real application, when the distribution of the missing entries is not uniform, contrary to what we had in our simulations, and not many rows have missing entries, we expect IFRAA to perform better than BPCA. For local methods like KNN and LLSimpute, where similar genes should not have missing entries in the same column, a uniform distribution of missing entries has no degrading effect, because it is then unlikely that similar genes have missing entries in the same column.
4.10 Matlab code
function Ep1 = fraa(E,Ep,L,iter)
%Fixed rank algorithm -- estimate missing values
%Usage: fraa(E,Ep,L,iter)
%E: matrix with missing values
%Ep: initial solution
%L: parameter (number of significant singular values + 1)
%iter: number of iterations to perform
%Note: Any rows with all missing values must be removed
%%%%%%%%%% THIS IS THE SET-UP
%Get size of E
[N,M]=size(E);
if (L > M)
error('need L<=#columns of E')
end;
%get index of missing values
missing=find(isnan(E));
%Number of missing values
m=length(missing);
m2=m*m;
%%%%%%%%%%% NOW WE WORK WITH THE ALGORITHM
Xp1=zeros(N,M);
track=iter;
while(iter > 0)
A=Ep’*Ep;
%Find singular value decomposition of A
[U,S,V]=svd(A);
%Squared singular values of Ep
sigma2=diag(S);
singular=sqrt(sigma2);
partial_sig2=sum(sigma2(L:M));
total_sig2=sum(sigma2(1:M));
fprintf('\n iteration %3.0f \n', track-iter+1)
fraction=partial_sig2/total_sig2;
fprintf(' partial sum/total sum of sq. singular values \n %1.8f', fraction)
fprintf(’\n’)
%Construct B=Bp
B=sparse(m,m); %pre-allocate space
[is,js]=ind2sub([N,M],missing(1:m));
for s=1:m
for t=s:m
if (is(s)==is(t))
B(s,t)=sum(U(js(s),L:M)*U(js(t),L:M)’);
B(t,s)=B(s,t); %B is symmetric
end %end if
end %end For t
end %end for s
%%%NOW CONSTRUCT THE VECTOR Wp
W=sparse(m,1); %pre-allocate space
for t=1:m
K=sparse(N,M);
K(missing(t))=1;
W(t)=sum(diag(U(:,L:M)’*Ep’*K*U(:,L:M)));
end %end for
%Solve Bx_(p+1) = -W
xp1=-B\W;
%Create the update matrix X_(p+1)
Xp1(missing)=xp1;
%Update solution
Ep=Ep+Xp1;
%set counter
iter=iter-1;
end %End while
fprintf('\n')
fprintf(' singular values (final iteration):\n')
fprintf('%16.6f',singular)
Ep1=Ep;
For the Matlab m file or a version of this algorithm for R, see
http://people.carleton.edu/~lchihara/LMCProf.html
References
[1] C.C. Aggarwal, C.M. Procopiuc, J.L. Wolf, P.S. Yu and J.S. Park, Fast algorithms for projected clustering, Proc. of ACM SIGMOD Intl. Conf. Management of Data, 1999, 61-72.
[2] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proc. ACM SIGMOD Conf. on Management of Data, 1998, 94-105.
[3] U. Alon et al., Broad patterns of gene expression revealed by clustering analy-
sis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc.
Natl. Acad. Sci. USA 96 (1999), 6745-6750.
[4] O. Alter, P.O. Brown and D. Botstein, Processing and modelling gene expression data using the singular value decomposition, Proceedings SPIE, vol. 4266 (2001), 171-186.
[5] O. Alter, P.O. Brown and D. Botstein, Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms, Proc. Nat. Acad. Sci. USA 100 (2003), 3351-3356.
[6] O. Alter, G.H. Golub, P.O. Brown and D. Botstein, Novel genome-scale cor-
relation between DNA replication and RNA transcription during the cell cycle
in yeast is predicted by data-driven models, 2004 Miami Nature Winter Sym-
posium, Jan. 31 - Feb. 4, 2004.
[7] P. Baldi and G. Wesley Hatfield, DNA Microarrays and Gene Expression,
Cambridge University Press, 2002.
[8] T.H. Bo, B. Dysvik and I. Jonassen, LSimpute, Accurate estimation of missing
values in microarray data with least squares methods, Nucleic Acids Research,
32 (2004), e34.
[9] H. Chipman, T.J. Hastie and R. Tibshirani, Clustering microarray data, in: T. Speed (Ed.), Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall/CRC, 2003, pp. 159-200.
[10] E. Domany, Cluster Analysis of Gene Expression Data, Journal of Statistical
Physics 110 (2003), 1117-1139.
[11] P. Drineas, A. Frieze, R. Kannan, S. Vempala and V. Vinay, Clustering large
graphs via the singular value decomposition, Journal of Machine Learning,
56 (2004), 9-33.
[12] P. Drineas, E. Drinea and P.S. Huggins, An experimental evaluation of a Monte-Carlo algorithm for singular value decomposition, Panhellenic Conference on Informatics, 2001, 279-296.
[13] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. 2nd Intl. Conf. Knowledge Discovery and Data Mining, 1996, 226-231.
[14] S. Friedland, Inverse eigenvalue problems, Linear Algebra Appl., 17 (1977),
15-51.
[15] S. Friedland, M. Kaveh, A. Niknejad and H. Zare, An improved fixed rank approximation algorithm for missing value estimation for DNA microarray data, submitted to the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.
[16] S. Friedland, J. Nocedal and M. Overton, The formulation and analysis of
numerical methods for inverse eigenvalue problems, SIAM J. Numer. Anal.
24 (1987), 634-667.
[17] S. Friedland and A. Niknejad, Fast Monte-Carlo low rank approximations for
matrices, preprint, 9 pp.
[18] S. Friedland, A. Niknejad and L. Chihara, A Simultaneous Reconstruction of
Missing Data in DNA Microarrays, Linear Algebra Appl., to appear, (Institute
for Mathematics and its Applications, Preprint Series, No. 1948).
[19] A. Frieze, R. Kannan and S. Vempala, Fast Monte-Carlo algorithms for finding low rank approximations, Proceedings of the 39th Annual Symposium on Foundations of Computer Science, 1998.
[20] X. Gan, A.W.-C. Liew and H. Yan, Missing microarray data estimation based on projection onto convex sets method, Proc. 17th International Conference on Pattern Recognition, 2004.
[21] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins Univ. Press, 1983.
[22] R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge Univ. Press, 1987.
[23] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov,
H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and
E.S. Lander, Molecular classification of cancer: class discovery and class pre-
diction by gene expression monitoring, Science 286 (1999), 531-537.
[24] D.A. Jackson, Stopping rules in principal component analysis: a comparison
of heuristical and statistical approaches, Ecology 74 (1993), 2204-2214.
[25] D. Jiang, C. Tang and A. Zhang, Cluster analysis for gene expression data: a survey, IEEE Transactions on Knowledge and Data Engineering 16 (2004), 1370-1386.
[26] R.A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis,
Prentice Hall, New Jersey, 4th edition (1998).
[27] H. Kim, G.H. Golub and H. Park, Missing value estimation for DNA microar-
ray gene expression data: local least squares imputation, Bioinformatics 21
(2005), 187-198.
[28] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara and S. Ishii, A Bayesian missing value estimation method for gene expression profile data, Bioinformatics 19 (2003), 2088-2096.
[29] C.C. Paige and M. A. Saunders, Towards a generalized singular value decom-
position, SIAM J. Numer. Anal. 18 (1981), 398-405.
[30] C.M. Procopiuc, P.K. Agarwal, M. Jones and T.M. Murali, A Monte Carlo
algorithm for fast projective clustering, Proc. of ACM SIGMOD Intl. Conf.
Management of Data 2002.
[31] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus et al., Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning, Nat. Med. 8 (2002), 68-74.
[32] A. Schulze and J. Downward, Nature Cell Biol. 3:190 (2001).
[33] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen,
P.O. Brown, D. Botstein and B. Futcher, Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization, Mol. Biol. Cell, 9 (1998), 3273-3297.
[34] G.W. Stewart, A method for computing the generalized singular value decomposition, in: B. Kagstrom and A. Ruhe (Eds.), Matrix Pencils, Lecture Notes in Mathematics 973 (1982), 207-220.
[35] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani,
D. Botstein and R. Altman, Missing value estimation for DNA microarrays,
Bioinformatics 17 (2001), 520-525.
[36] S. Vempala, The Random Projection Method, DIMACS Vol. 65, American
Mathematical Society, 2004.