Application of Singular Value Decomposition
to DNA Microarray
Amir Niknejad
M.S. Claremont Graduate School, 1997
M.S. Northeastern University, 1983
Thesis
Submitted in partial fulfilment of the requirements
for the degree of Doctor of Philosophy
in the Graduate College of the
University of Illinois at Chicago
Chicago, Illinois
August, 2005
Dedication: Dedicated to Dr. Iraj Broomand, Founder of the National Iranian
Organization for Gifted and Talented Education, and to the memory of my father.
Acknowledgement: I am grateful to my advisor Professor Shmuel Friedland
for his kind supervision, patience and support. I would like to thank my committee
members Professor Calixto Calderon, Professor Louis Kauffman, Professor Stanley
Sclove and Professor Jan Verschelde, for kindly agreeing to be on my committee
despite their busy schedules. I would like to thank Kari Dueball and Darlette Willis
for their assistance during my stay at UIC, the UIC mathematics department and
the Institute of Mathematics and its Applications (IMA) for their generous support
and fellowship, Ali Shaker and Marcus Bishop for kindly helping me with La-
TeX, and Dr. Laura Chihara and Hossein Zare for coding and simulation of FRAA
and IFRAA. Finally I would like to acknowledge the generous support of Hossein
Andikfar, Hossein Basiri, Vali Siadat and Mahyad Zaerpour.
Summary
In chapter 1 we introduce the singular value decomposition (SVD) of matrices and
its extensions. We then mention some applications of SVD in analyzing gene ex-
pression data, image processing and information retrieval. We also introduce the
low rank approximation of matrices and present our Monte Carlo algorithm to
achieve this approximation along with other algorithms.
Chapter 2 deals with clustering methods and their application in analyzing gene
expression data. We introduce the most common clustering methods, such as the
KNN, SVD, and weighted least squares methods.
Chapter 3 deals with imputing missing data in gene expression microarrays.
We use some of the methods mentioned in Chapter 2 for this purpose. Then we
introduce our fixed rank approximation algorithm (FRAA) for imputing missing
data in the DNA gene expression array. Finally, we use simulation to compare
FRAA with other methods, indicate its advantages and shortcomings, and show
how the shortcomings can be overcome.
Contents
1 Introduction and Background 8
1.1 Missing gene imputation in DNA microarrays . . . . . . . . . . . . 8
1.2 Some biological background . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Genes and Gene Expression . . . . . . . . . . . . . . . . . 10
1.2.2 Transcription and Translation . . . . . . . . . . . . . . . . 12
1.2.3 DNA microarrays (chips) . . . . . . . . . . . . . . . . . . . 13
1.3 Some linear algebra background . . . . . . . . . . . . . . . . . . . 14
1.3.1 Inner Product Spaces (IPS) . . . . . . . . . . . . . . . . . . 14
1.3.2 Definition and Properties of Positive Definite and Semidef-
inite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 Various Matrix Factorizations . . . . . . . . . . . . . . . . 16
1.3.4 Gram-Schmidt Process . . . . . . . . . . . . . . . . . . . . 16
1.3.5 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . 17
1.3.6 The QR Decomposition . . . . . . . . . . . . . . . . . . . 18
1.3.7 An Introduction to the Singular Value Decomposition . . . . 18
1.3.8 SVD on inner product spaces . . . . . . . . . . . . . . . . . 22
1.4 Extended Singular Value decomposition . . . . . . . . . . . . . . . 24
1.5 Some Applications of SVD . . . . . . . . . . . . . . . . . . . . . . 28
1.5.1 Analysing DNA gene expression data via SVD . . . . . . . 28
1.5.2 Low rank approximation of matrices . . . . . . . . . . . . 29
2 Randomized Low Rank Approximation of a Matrix 30
2.1 Random Projection Method . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Fast Monte-Carlo Rank Approximation . . . . . . . . . . . . . . . 35
3 Cluster Analysis 43
3.1 Clusters and Clustering . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Various Gene Expression Clustering Procedures . . . . . . . . . . . 43
3.2.1 Proximity (Similarity) Measurement . . . . . . . . . . . . . 44
3.2.2 Hierarchical Cluster Analysis . . . . . . . . . . . . . . . . 45
3.2.3 Clustering Using K-means Algorithm . . . . . . . . . . . . 46
3.2.4 Mixture Models and EM Algorithm . . . . . . . . . . . . . 47
4 Various Methods of Imputation of Missing Values in DNA Microarrays 50
4.1 The Gene Expression Matrix . . . . . . . . . . . . . . . . . . . . . 50
4.2 SVD and gene clusters . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Missing Data in the Gene Expression Matrix . . . . . . . . . . . . . 53
4.3.1 Reconsideration of 4.2 . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Bayesian Principal Component (BPCA): . . . . . . . . . . 54
4.3.3 The least square imputation method . . . . . . . . . . . . . 57
4.3.4 Iterative method using SVD . . . . . . . . . . . . . . . . . 59
4.3.5 Local least squares imputation (LLSimpute) . . . . . . . . . 59
4.4 Motivation for FRAA . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Additional matrix theory background for FRAA . . . . . . . 61
4.4.2 The Optimization Problem . . . . . . . . . . . . . . . . . . 63
4.5 Fixed Rank Approximation Algorithm . . . . . . . . . . . . . . . . 65
4.5.1 Description of FRAA . . . . . . . . . . . . . . . . . . . . . 65
4.5.2 Explanation and justification of FRAA . . . . . . . . . . . 66
4.5.3 Algorithm for (4.14) . . . . . . . . . . . . . . . . . . . . . 70
4.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Discussion of FRAA . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.8 IFRAA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8.2 Computational comparisons of BPCA, FRAA, IFRAA and LLSimpute . . . . . . . 77
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.10 Matlab code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
1 Introduction and Background
1.1 Missing gene imputation in DNA microarrays
In the last decade, molecular biologists have been using DNA microarrays as a
tool for analyzing information in gene expression data. During the laboratory pro-
cess, some spots on the array may be missing due to various factors (for example,
machine error). Because it is often very costly or time consuming to repeat the
experiment, molecular biologists, statisticians, and computer scientists have made
attempts to recover the missing gene expressions by some ad-hoc and systematic
methods.
In this thesis we introduce various imputation methods for missing data in
gene expression matrices. In particular, we review two recent imputation methods,
namely Bayesian Principal Component Analysis (BPCA) and Local Least Square
Impute (LLSImpute). Then we introduce our fixed rank approximation algorithm
(FRAA). Finally, in the last part of this thesis we present the improved fixed rank
approximation algorithm (IFRAA) and use simulation to compare IFRAA with BPCA
and LLSimpute.
More recently, microarray gene expression data have been formulated as a gene
expression matrix E with n rows, which correspond to genes, and m columns,
which correspond to experiments. Typically n is much larger than m. In this setting,
the analysis of missing gene expressions on the array translates to recovering
the missing entries of the gene expression matrix.
The most common methods for recovery are [35]:
(a) Zero replacement method;
(b) Row sum mean;
(c) Cluster analysis methods, such as K-nearest neighbor clustering and hierarchical
clustering;
(d) SVD - Singular Value Decomposition (which is closely related to Principal
Component Analysis).
In these methods, the recovery of missing data is done independently, i.e., the
estimation of each missing entry does not influence the estimation of the other miss-
ing entries. The iterative method using SVD suggested in [35] takes into account
implicitly the influence of the estimation of one entry on the other ones. See also
[9].
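As a concrete illustration, the simplest of these approaches, methods (a) and (b), can be written in a few lines. The sketch below assumes numpy, marks a missing entry by NaN, and uses function names chosen here purely for illustration:

```python
import numpy as np

def impute_zero(E):
    """Method (a): replace every missing entry by zero."""
    F = E.copy()
    F[np.isnan(F)] = 0.0
    return F

def impute_row_mean(E):
    """Method (b): replace each missing entry by the mean of the
    observed entries in its row (gene)."""
    F = E.copy()
    for i in range(F.shape[0]):
        row = F[i]                      # view into F, so edits persist
        miss = np.isnan(row)
        if miss.any():
            row[miss] = row[~miss].mean()
    return F

# Tiny 2-gene, 3-experiment example with two missing spots.
E = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
```

Both estimates are computed independently for each missing entry, which is exactly the limitation discussed above.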
In this thesis we suggest a new method in which the estimation of missing en-
tries is done simultaneously, i.e., the estimation of one missing entry influences the
estimation of the other missing entries. If the gene expression matrix E has missing
data, we want to complete its entries to obtain a matrix Ẽ, such that the rank of Ẽ is
equal to (or does not exceed) d, where d is taken to be the number of significant singular
values of E. The completion of the entries of E to a matrix with a prescribed
rank is a variation of the problem of communality (see [16, p. 637]). We give an
optimization algorithm for finding Ẽ using the techniques for inverse eigenvalue
problems discussed in [14].
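The phrase "significant singular values" can be made concrete in several ways; one common, hedged convention is to take d as the number of singular values needed to capture a fixed fraction of the squared spectral energy. A minimal sketch, assuming numpy, with a 90% threshold chosen purely for illustration:

```python
import numpy as np

def significant_rank(E, energy=0.90):
    """Smallest d such that the first d singular values carry at least
    the given fraction of the total squared spectral energy of E."""
    s = np.linalg.svd(E, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# A matrix with singular values 3, 2 and a negligible 0.01:
# two values already carry almost all the energy.
A = np.diag([3.0, 2.0, 0.01])
d = significant_rank(A)
```

The threshold (and the function name) are assumptions for this sketch, not part of the FRAA definition.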
1.2 Some biological background
1.2.1 Genes and Gene Expression
Cells and organisms are divided into two classes: procaryotic (such as bacteria) and
eucaryotic [10]. The latter have a nucleus. The cell is enclosed by its membrane;
embedded in the cell’s cytoplasm is its nucleus, surrounded and protected by its
own membrane. The nucleus contains DNA, a one dimensional molecule, made of
two complementary strands, coiled around each other as a double helix. Each strand
consists of a backbone to which a linear sequence of bases is attached. There are
four kinds of bases, denoted by C, G, A, T. The two strands contain complementary
base sequences and are held together by hydrogen bonds that connect matching
pairs of bases: G-C (three hydrogen bonds) and A-T (two).
A gene is a segment of DNA, which contains the formula for the chemical com-
position of one particular protein. Proteins are the working molecules of life; most
biological processes that take place in a cell are carried out by proteins. Topolog-
ically, a protein is also a chain; each link is an amino acid, with neighbors along
the chain connected by covalent bonds. All proteins are made of 20 different amino
acids - hence the chemical formula of a protein of length N is an N-letter word,
whose letters are taken from a 20-letter alphabet. A gene is nothing but an alpha-
betic cookbook recipe, listing the order in which the amino acids are to be strung
when the corresponding protein is synthesized. Genetic information is encoded in
the linear sequence in which the bases on the two strands are ordered along the DNA
molecule. The genetic code is a universal translation table, with specific triplets of
consecutive bases coding for every amino acid.
The genome contains the collection of all the genes that code the chemical for-
mulae of all the proteins (and RNA) that an organism needs and produces. The
genome of a simple organism such as yeast contains about 6000 genes; the hu-
man genome has between 30,000 and 40,000. An overwhelming majority (98%)
of human DNA consists of non-coding regions, i.e., strands that do not code for any
particular protein (but play a role in regulating the level of synthesis of the different
proteins).
Here is an amazing fact: every cell of a multicellular organism contains its entire
genome! That is, every cell has the entire set of recipes the organism may ever need;
the nucleus of each of the reader's cells contains every piece of information needed
to make a copy ("clone") of him or her! Even though each cell contains the same set
of genes, there is differentiation: cells of a complex organism, taken from different
organs, have entirely different functions and the proteins that perform these func-
tions are very different. Cells in our retina need photosensitive molecules, whereas
our livers do not make much use of these. A gene is expressed in a cell when the
protein it codes for is actually synthesized. In an average human cell about 10,000
genes are expressed. The set of (say 10,000) numbers that indicate the expression
level of each of these genes is called the expression profile of the cell.
The large majority of abundantly expressed genes are associated with common
functions, such as metabolism, and hence are expressed in all cells. However there
will be differences between the expression profiles of different cells, and even
in a single cell, expression will vary with time, in a manner dictated by external and
internal signals that reflect the state of the organism and the cell itself.
Synthesis of proteins takes place at the ribosomes. These are enormous ma-
chines (also made of proteins) that read the chemical formulae written on the DNA
and synthesize the protein according to the instructions. The ribosomes are in the
cytoplasm, whereas the DNA is in the protected environment of the nucleus. This
poses an immediate logistic problem - how does the information get transferred
from the nucleus to the ribosome?
1.2.2 Transcription and Translation
The obvious solution of information transfer would be to rip out the piece of DNA
that contains the gene that is to be expressed, and transport it to the cytoplasm.
When a gene receives a command to be expressed, the corresponding segment of
the double helix of DNA opens, and a precise copy of the information, as written
on one of the strands, is prepared, which is called the messenger RNA, and the pro-
cess of its production is called transcription. The subsequent reading of mRNA,
deciphering the message (written using base triplets) into amino acids and synthe-
sis of the corresponding protein at the ribosomes is called translation [32]. In fact,
when many molecules of a certain protein are needed, the cell produces many cor-
responding mRNAs, which are transferred through the nucleus’ membrane to the
cytoplasm, and are ”read” by several ribosomes. Thus the single master copy of
the instructions, contained in the DNA, generates many copies of the protein. This
transcription strategy is prudent and safe, preserving the precious master copy; at
the same time it also serves as a remarkable amplifier of the genetic information.
1.2.3 DNA microarrays (chips)
A DNA Chip is the instrument that measures simultaneously the concentration of
thousands of different mRNA molecules. It is also referred to as a DNA microarray
[32]. DNA microarrays produced by Affymetrix can measure simultaneously the
expression levels of up to 20,000 genes; the less expensive spotted arrays do the
same for several thousand. Schematically, the Affymetrix arrays are produced as
follows. Divide a chip (a glass plate about 1 cm across) into "pixels", each
dedicated to one gene g. Millions of 25 base-pair long pieces (oligonucleotides) of
single strand DNA, copied from a particular segment of gene g, are photolithographically
synthesized on the dedicated pixel (these are referred to as "probes").
The mRNA molecules are extracted from cells taken from the tissue of interest
(such as a tumor tissue) and their concentration is enhanced. Next, the resulting
DNA is transcribed back into fluorescently marked single strand RNA, and the
labeled RNA diffuses over the dense forest of single strand DNA probes. When
such an mRNA encounters a bit of the probe, of which the RNA is a perfect copy,
it hybridizes to this strand, i.e. attaches to it with a high affinity (considerably
higher than to a probe of which the target is not a perfect copy). When the mRNA
solution is washed off, only those molecules that found their perfect match remain
stuck to the chip. Now the chip is illuminated by a laser, and these stuck "targets"
fluoresce; by measuring the light intensity emanating from each pixel, one obtains
a measure of the number of targets that stuck, for each of the Ng genes placed on
the chip. These Ng numbers represent
the expression levels of these genes in that tissue. A typical experiment provides the
expression profiles of several tens of samples (say Ns ≈ 100), over several thou-
sand (Ng) genes. These results are summarized in an Ng × Ns expression table;
each row corresponds to one particular gene and each column to a sample. Entry
Egs of such an expression table stands for the expression level of gene g in sample
s. For example, the experiment on colon cancer, first reported by Alon et al. [3]
contains Ng = 2000 genes whose expression levels passed some threshold, over
Ns = 62 samples, 40 of which were taken from tumor and 22 from normal colon
tissue.
1.3 Some linear algebra background
Since DNA microarray data involves real values only, we will be restricting our
treatment of relevant facts from linear algebra to matrices and vector spaces over R.
1.3.1 Inner Product Spaces (IPS)
Let V be a vector space over R. Then 〈•, •〉 : V × V → R is a real inner product if
1. 〈x,x〉 ≥ 0 for all x ∈ V and 〈x,x〉 = 0 iff x = 0.
2. 〈x,y〉 = 〈y,x〉 for all x,y ∈ V .
3. 〈x, αy_1 + βy_2〉 = α〈x, y_1〉 + β〈x, y_2〉 for all x, y_1, y_2 ∈ V and for all α, β ∈ R.
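These axioms are satisfied not only by the standard dot product: for any symmetric positive definite matrix P, 〈x, y〉 := y^T P x defines an inner product on R^n, a fact used later for the extended SVD. A quick numerical check, with P chosen arbitrarily for illustration:

```python
import numpy as np

P = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric positive definite (illustrative choice)

def ip(x, y):
    """Inner product <x, y> = y^T P x induced by P."""
    return y @ P @ x

x = np.array([1.0, -2.0])
y = np.array([0.5, 4.0])
a, b = 2.0, -3.0

sym = np.isclose(ip(x, y), ip(y, x))                          # axiom 2
lin = np.isclose(ip(x, a*y + b*x), a*ip(x, y) + b*ip(x, x))   # axiom 3
pos = ip(x, x) > 0                                            # axiom 1 (x != 0)
```

Symmetry relies on P being symmetric, and positivity on P being positive definite; for a general P the bilinear form above would fail axioms 1 or 2.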
1.3.2 Definition and Properties of Positive Definite and Semidefinite Matrices
In the following, x will denote a vector in R^n. A symmetric matrix A ∈ R^{n×n} is
1. positive definite if and only if x^TAx > 0 for all nonzero x. We write A > 0.
2. nonnegative definite (or positive semi-definite) if and only if x^TAx ≥ 0 for all x. We write A ≥ 0.
3. negative definite if −A is positive definite.
4. nonpositive definite (or negative semidefinite) if −A is nonnegative definite. We write A ≤ 0.
We have the following facts.
• Any principal submatrix of a positive (nonnegative) definite matrix is positive (nonnegative) definite.
• The diagonal elements of a positive definite matrix are positive.
• A is positive definite if and only if its eigenvalues are positive. (It is positive
semidefinite if its eigenvalues are nonnegative.)
• If A is positive definite, its determinant is positive. If A is positive semidefi-
nite, its determinant is nonnegative.
• If A is positive definite, then A is nonsingular and A−1 is positive definite.
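These facts are easy to check numerically; the sketch below (numpy assumed, sample matrix chosen for illustration) verifies the eigenvalue criterion, the determinant and diagonal facts, and positive definiteness of the inverse:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])  # symmetric, chosen to be positive definite

eigvals = np.linalg.eigvalsh(A)                    # eigenvalues of a symmetric matrix
is_pd = bool(np.all(eigvals > 0))                  # A > 0 iff all eigenvalues positive
det_pos = np.linalg.det(A) > 0                     # hence det(A) > 0
diag_pos = bool(np.all(np.diag(A) > 0))            # diagonal entries are positive
inv_pd = bool(np.all(np.linalg.eigvalsh(np.linalg.inv(A)) > 0))  # A^{-1} > 0
```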
1.3.3 Various Matrix Factorizations
In the following section, we introduce several matrix factorizations useful through-
out this thesis.
Theorem 1.1 (Spectral Decomposition) Let A = AT ∈ Rn×n. Then there
exists an orthogonal matrix Q ∈ Rn×n (whose columns are orthogonal eigenvectors
of A) such that
A = QDQ^T = ∑_{i=1}^n λ_i q_i q_i^T,

where the λ_i are the eigenvalues of A, the q_i are the columns of Q, and D = diag(λ_1, . . . , λ_n).
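Theorem 1.1 can be verified directly with a symmetric eigensolver; the rank-one sum reassembles A:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric

lam, Q = np.linalg.eigh(A)  # columns of Q are orthonormal eigenvectors of A
# Reassemble A as the sum of rank-one terms lambda_i q_i q_i^T.
A_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(3))
ok = np.allclose(A, A_rebuilt) and np.allclose(Q.T @ Q, np.eye(3))
```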
1.3.4 Gram-Schmidt Process
The Gram-Schmidt process is a recursive procedure for generating an orthonor-
mal set of vectors from a given set of linearly independent vectors. Let x1, . . .xn
be linearly independent vectors in an IPS V . Then generate an orthonormal set
u1, . . .un ∈ V from x1, . . .xn such that:
span(x1, . . . ,xk) = span(u1, . . . ,uk), k = 1, . . . , n,
by the following procedure known as the Gram-Schmidt algorithm:
• u_1 := x_1/‖x_1‖
• p_1 := 〈x_2, u_1〉u_1,  u_2 := (x_2 − p_1)/‖x_2 − p_1‖
• p_2 := 〈x_3, u_1〉u_1 + 〈x_3, u_2〉u_2,  u_3 := (x_3 − p_2)/‖x_3 − p_2‖
...
• p_k := 〈x_{k+1}, u_1〉u_1 + 〈x_{k+1}, u_2〉u_2 + · · · + 〈x_{k+1}, u_k〉u_k, and u_{k+1} := (x_{k+1} − p_k)/‖x_{k+1} − p_k‖
In practice the Gram-Schmidt Process (GSP) is numerically unstable: there is a
severe loss of orthogonality among the computed u_1, u_2, . . . as we proceed. In
computations one uses either a Modified GSP or Householder orthogonalization
[21].
The Modified Gram-Schmidt Process (MGSP) consists of the following steps:
• Initialize j = 1.
• u_j := x_j/‖x_j‖
• p_i := 〈x_i, u_j〉u_j and replace x_i by x_i := x_i − p_i for i = j + 1, . . . , n.
• Let j := j + 1 and repeat the process.
MGSP is numerically stable; it needs mn^2 flops, which is more time consuming than GSP.
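Both procedures are short to implement; the sketch below (numpy assumed; the helper names gsp and mgsp are illustrative) contrasts the loss of orthogonality of classical GSP with MGSP on a set of nearly parallel columns:

```python
import numpy as np

def gsp(X):
    """Classical Gram-Schmidt: orthonormalize the columns of X."""
    n, k = X.shape
    U = np.zeros((n, k))
    for j in range(k):
        # Subtract all projections at once, then normalize.
        w = X[:, j] - sum((X[:, j] @ U[:, i]) * U[:, i] for i in range(j))
        U[:, j] = w / np.linalg.norm(w)
    return U

def mgsp(X):
    """Modified Gram-Schmidt: subtract each projection as soon as
    the new unit vector is available."""
    X = X.astype(float).copy()
    n, k = X.shape
    U = np.zeros((n, k))
    for j in range(k):
        U[:, j] = X[:, j] / np.linalg.norm(X[:, j])
        for i in range(j + 1, k):
            X[:, i] -= (X[:, i] @ U[:, j]) * U[:, j]
    return U

# Nearly parallel columns: a classically ill conditioned example.
X = np.array([[1.0, 1.0, 1.0],
              [1e-4, 0.0, 0.0],
              [0.0, 1e-4, 0.0],
              [0.0, 0.0, 1e-4]])
loss_gsp = np.abs(gsp(X).T @ gsp(X) - np.eye(3)).max()
loss_mgsp = np.abs(mgsp(X).T @ mgsp(X) - np.eye(3)).max()
```

On this example MGSP retains orthogonality to roughly machine precision while classical GSP loses several digits, illustrating the instability noted above.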
1.3.5 Cholesky Decomposition
Let A be a positive definite matrix. Then A = U^T U, where U is an upper triangular
matrix with positive diagonal elements. The above factorization is called the
Cholesky decomposition. The matrix U is called the Cholesky factor of A; with the
requirement that its diagonal elements be positive, U is unique.
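Numerical libraries typically return the lower triangular factor L with A = LL^T; transposing gives the upper triangular U with A = U^TU as above. A minimal check (numpy assumed):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 5.0]])   # symmetric positive definite

L = np.linalg.cholesky(A)    # lower triangular, A = L L^T
U = L.T                      # upper triangular with positive diagonal, A = U^T U
ok = np.allclose(A, U.T @ U) and bool(np.all(np.diag(U) > 0))
```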
1.3.6 The QR Decomposition
Let X be an n × m matrix with n ≥ m. Then there is an orthogonal n × n matrix Q
and an upper triangular m × m matrix R with nonnegative diagonal elements such that

X = Q [ R ; 0 ],

where [ R ; 0 ] denotes R stacked on top of an (n − m) × m zero block.
This factorization is called the QR decomposition of X. The matrix Q is called the
Q-factor of X and the matrix R is the R-factor of X.
If we partition Q = (Q_X, Q_⊥), where Q_X has m columns, then X = Q_X R,
which is called the QR factorization of X.
If X is of rank m, then the QR factorization is unique up to a factor of ±1. The
columns of Q_X form an orthonormal basis for the column space of X, and those of
Q_⊥ form an orthonormal basis for the orthogonal complement of the column space
of X; both can be obtained by the Gram-Schmidt process. If X is of rank m, then
A = X^TX is positive definite and R is a Cholesky factor of A. For more details see [34].
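A library QR routine returns the factors directly; the sketch below (numpy assumed) checks X = Q_X R, the orthonormality of the columns of Q_X, and the Cholesky-type relation R^TR = X^TX:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])   # n = 3, m = 2, full column rank

QX, R = np.linalg.qr(X)      # "reduced" QR: Q_X is n x m, R is m x m upper triangular
ok = (np.allclose(X, QX @ R)
      and np.allclose(QX.T @ QX, np.eye(2))
      and np.allclose(R.T @ R, X.T @ X))   # R^T R = X^T X = A
```

Note that numpy does not normalize the signs of the diagonal of R, which is the ±1 ambiguity mentioned above.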
1.3.7 An Introduction to the Singular Value Decomposition
In many science and engineering problems, one would like to approximate a given
matrix by a lower rank matrix and calculate the distance between them.
Without loss of generality, let E be an n × m matrix where n ≥ m (the following
procedure is also valid when n < m). Denote by R(E) and N(E) the range and
the null space of E.
The idea of the singular value decomposition is to factor the matrix E into the
product of matrices U, Σ, V such that

E = UΣV^T,  (1.1)

where U is n × n, V is m × m, both orthogonal, and Σ is an n × m diagonal matrix
with diagonal entries

σ_1 ≥ σ_2 ≥ · · · ≥ σ_{min(m,n)} ≥ 0.

In this factorization the σ_i's are unique but U and V are not. The σ_i's are called
the singular values of E and are the square roots of the eigenvalues of EE^T or,
equivalently, of E^TE. Also, in this setting the columns of U are eigenvectors of EE^T
and the columns of V are eigenvectors of E^TE. The rank of the matrix E
is the number of nonzero singular values, and the magnitudes of the nonzero singular
values indicate the proximity of E to the closest lower rank matrix.
Computationally, one brings E to upper bidiagonal form A using Householder
matrices. Then one applies implicitly the QR algorithm to A^TA to find the positive
eigenvalues σ_1^2, . . . , σ_r^2 and the corresponding orthonormal eigenvectors v_1, . . . , v_r of
the matrix E^TE [21]. Next we set

u_q := (1/σ_q) E v_q.  (1.2)
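In practice one calls a library SVD rather than coding the bidiagonalization; the relation u_q = (1/σ_q)Ev_q of (1.2), and the link between the singular values and the eigenvalues of E^TE, can then be checked directly:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.standard_normal((6, 4))            # n x m with n >= m

U, s, Vt = np.linalg.svd(E, full_matrices=False)
# Each left singular vector satisfies u_q = (1/sigma_q) E v_q,
# where v_q is the q-th row of Vt.
ok = all(np.allclose(U[:, q], E @ Vt[q] / s[q]) for q in range(4))
# Singular values are the square roots of the eigenvalues of E^T E.
lam = np.sort(np.linalg.eigvalsh(E.T @ E))[::-1]
ok = ok and np.allclose(s**2, lam)
```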
Theorem 1.2 If E is an m × n matrix, then E has a singular value decompo-
sition.
The proof can be found in [21].
We summarize the facts about the singular value decomposition of a matrix. Let
E be an m × n matrix with singular value decomposition UΣV^T.
1. The singular values σ1, σ2, . . . , σn of E are unique, however the matrices U
and V are not unique. If

σ_1 > σ_2 > · · · > σ_r > 0,

then the columns of V are determined up to ±1.
2. Since V diagonalizes ET E, it follows that vj’s are eigenvectors of ET E.
3. Since EET = UΣΣT UT , it follows that U diagonalizes EET and that the
uj’s are the eigenvectors of EET .
4. Comparing the jth columns of each side of the equation

EV = UΣ

we get

Ev_j = σ_j u_j, j = 1, . . . , n.

Similarly,

E^TU = VΣ^T,

and hence

E^Tu_j = σ_j v_j, j = 1, . . . , n,
E^Tu_j = 0, j = n + 1, . . . , m.
The v_j's are called the right singular vectors of E and the u_j's are called the
left singular vectors of E.

5. If E has rank r, then

(a) v_1, v_2, . . . , v_r form an orthonormal basis for R(E^T),
(b) v_{r+1}, v_{r+2}, . . . , v_n form an orthonormal basis for N(E),
(c) u_1, u_2, . . . , u_r form an orthonormal basis for R(E),
(d) u_{r+1}, u_{r+2}, . . . , u_m form an orthonormal basis for N(E^T).
6. The rank of E is equal to the number of its nonzero singular values (where
singular values are counted according to their multiplicities). We can group the
σ_i's by their distinct values:

σ_1 = · · · = σ_{i_1} > σ_{i_1+1} = · · · = σ_{i_2} > · · · > σ_{i_{p−1}+1} = · · · = σ_{i_p} = σ_r,

and for every 1 ≤ j ≤ p the vectors v_{i_{j−1}+1}, . . . , v_{i_j} (with i_0 := 0) form an
orthonormal basis for the eigenspace of E^TE corresponding to the eigenvalue σ_{i_j}^2.
Let U_r, Σ_r, V_r be matrices obtained from U, Σ, V, respectively, as follows: U_r is
the n × r matrix obtained by deleting the last n − r columns of U, V_r is the m × r
matrix obtained by deleting the last m − r columns of V, and Σ_r is obtained by
deleting the last m − r columns and n − r rows of Σ. Then

E = U_r Σ_r V_r^T,  U_r^T U_r = V_r^T V_r = I_r,  Σ_r = diag(σ_1, . . . , σ_r),  σ_1 ≥ · · · ≥ σ_r > 0.  (1.3)

In this setting U_r, Σ_r, V_r are all rank r matrices; the last n − r columns of U and
the last m − r rows of V^T are arbitrary, up to the condition that they form
orthonormal bases of the orthogonal complements of the column space and the row
space of E, respectively. Hence (1.3) is sometimes called a reduced SVD of E.
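The reduced SVD (1.3) is what one forms in practice: keeping only the r terms with nonzero singular values reproduces a rank r matrix exactly. A sketch (numpy assumed; the rank tolerance is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((8, 3))
E = B @ rng.standard_normal((3, 5))        # 8 x 5 matrix of rank 3

U, s, Vt = np.linalg.svd(E, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))          # numerical rank = # nonzero singular values
Ur, Sr, Vr_t = U[:, :r], np.diag(s[:r]), Vt[:r]
ok = (r == 3) and np.allclose(E, Ur @ Sr @ Vr_t)
```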
1.3.8 SVD on inner product spaces
In this section we discuss briefly the standard notion of the SVD decomposition of a
linear operator that maps one finite dimensional inner product space to another finite
dimensional inner product space. For the utmost generality we consider here the
inner products over the complex numbers C. In this section we denote by Mnm(C)
the space of n × m matrices with complex entries.
Let U_i be an m_i-dimensional inner product space over C, equipped with 〈·, ·〉_i,
for i = 1, 2. Let T : U_1 → U_2 be a linear operator, and let T* : U_2 → U_1 be the
adjoint operator of T, i.e. 〈Tx, y〉_2 = 〈x, T*y〉_1 for all x ∈ U_1, y ∈ U_2.
Equivalently, let [a_1, . . . , a_{m_1}] and [b_1, . . . , b_{m_2}] be orthonormal bases of U_1
and U_2 respectively, and let A ∈ M_{m_2 m_1}(C) be the representation matrix of T in
these bases: [Ta_1, . . . , Ta_{m_1}] = [b_1, . . . , b_{m_2}]A. Then the matrix A* := Ā^T
represents T* in these bases. The SVD decomposition A = UΣV*, where U, V are
unitary matrices of dimensions m_2, m_1 respectively, and Σ ∈ M_{m_2 m_1}(R) is a
diagonal matrix with diagonal entries σ_1 ≥ · · · ≥ σ_{min(m_2, m_1)} ≥ 0, corresponds
to the following base free concepts of T.
Consider the operators S1 := T ∗T : U1 → U1 and S2 := TT ∗ : U2 → U2.
Then S_1, S_2 are self-adjoint, i.e. S_1* = S_1, S_2* = S_2, and nonnegative definite:
〈S_i x_i, x_i〉_i ≥ 0 for all x_i ∈ U_i, i = 1, 2. The positive eigenvalues of S_1 and S_2,
counted with their multiplicities and arranged in decreasing order, are σ_1^2 ≥ · · · ≥
σ_r^2 > 0, where r = rank T = rank T*. Let

S_1 v_i = σ_i^2 v_i, i = 1, . . . , r,  〈v_i, v_j〉_1 = δ_{ij}, i, j = 1, . . . , r.

Then u_i := σ_i^{−1} Tv_i, i = 1, . . . , r, is an orthonormal set of eigenvectors of S_2,
with u_i corresponding to the eigenvalue σ_i^2. Complete the orthonormal systems
v_1, . . . , v_r and u_1, . . . , u_r to orthonormal bases [v_1, . . . , v_{m_1}] and
[u_1, . . . , u_{m_2}] in U_1 and U_2 respectively. Then the unitary matrices U, V are the
transition matrices from the basis [b_1, . . . , b_{m_2}] to [u_1, . . . , u_{m_2}] and from the
basis [a_1, . . . , a_{m_1}] to [v_1, . . . , v_{m_1}]:

[u_1, . . . , u_{m_2}] = [b_1, . . . , b_{m_2}]U,  [v_1, . . . , v_{m_1}] = [a_1, . . . , a_{m_1}]V.
Let A ∈ M_{m_2 m_1}(C). Then A can be viewed as a linear operator A : C^{m_1} →
C^{m_2}, x ↦ Ax. Let P_i be an m_i × m_i Hermitian positive definite matrix for
i = 1, 2. We define the following inner products on C^{m_1} and C^{m_2}:

〈x, y〉_i := y*P_i x,  x, y ∈ C^{m_i}, i = 1, 2.  (1.4)

It is straightforward to show that the SVD decomposition of A, viewed as the above
operator, with respect to the inner products given by (1.4) is

A = UΣV*,  U*P_2 U = I_{m_2},  V*P_1^{−1} V = I_{m_1},  (1.5)
where Σ is an m_2 × m_1 diagonal matrix with nonnegative diagonal entries in
decreasing order. A simple way to deduce this decomposition is to observe that
〈x, y〉_i = (P_i^{1/2} y)*(P_i^{1/2} x), where P_i^{1/2} is the unique Hermitian positive definite square
root of P_i. The decomposition (1.5) is called the extended singular value decom-
position of A, ESVD for short. The diagonal entries of Σ are called the extended
singular values of A. The ESVD of A corresponds to the standard SVD decomposition
of P_2^{1/2} A P_1^{−1/2}.
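The last remark yields a practical recipe for computing the ESVD: take the ordinary SVD of P_2^{1/2} A P_1^{−1/2} and transform the factors back. A sketch for the real case (numpy assumed; the helper names are illustrative):

```python
import numpy as np

def psd_sqrt(P):
    """Unique symmetric positive definite square root of P."""
    lam, Q = np.linalg.eigh(P)
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

def esvd(A, P1, P2):
    """Extended SVD of A w.r.t. the inner products induced by P1, P2:
    A = U S V^T with U^T P2 U = I and V^T P1^{-1} V = I."""
    S1, S2 = psd_sqrt(P1), psd_sqrt(P2)
    Ut, s, Vt_t = np.linalg.svd(S2 @ A @ np.linalg.inv(S1))
    U = np.linalg.inv(S2) @ Ut     # undo the change of basis on the left
    V = S1 @ Vt_t.T                # and on the right
    return U, s, V

A = np.array([[1.0, 2.0], [0.0, 1.0]])
P1 = np.array([[2.0, 0.0], [0.0, 1.0]])
P2 = np.array([[1.0, 0.5], [0.5, 2.0]])
U, s, V = esvd(A, P1, P2)
m = len(s)
ok = (np.allclose(U @ np.diag(s) @ V.T, A)
      and np.allclose(U.T @ P2 @ U, np.eye(m))
      and np.allclose(V.T @ np.linalg.inv(P1) @ V, np.eye(m)))
```

The checks confirm exactly the three conditions of (1.5) in the real case.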
1.4 Extended Singular Value decomposition
In its utmost generality, the SVD deals with a linear operator T : Vs → Vt, where
Vs (the source space) and Vt (the target space) are finite dimensional vector spaces
over the real numbers R (or more generally over the complex numbers C), which
are endowed with the inner products 〈·, ·〉s and 〈·, ·〉t respectively. Choose bases
e_1, . . . , e_M of V_s and f_1, . . . , f_N of V_t. Then V_s and V_t can be identified with
the standard vector spaces R^M and R^N respectively, and T is represented by the
N × M matrix A = (a_{ij}), i = 1, . . . , N, j = 1, . . . , M, as explained later on.
Usually one chooses the inner products 〈·, ·〉_s and 〈·, ·〉_t to be given by the standard
inner products on R^M and R^N. That is, one assumes

〈e_j, e_j〉_s = 1, 〈e_j, e_k〉_s = 0 for j ≠ k, j, k = 1, . . . , M,  (1.6)
〈f_i, f_i〉_t = 1, 〈f_i, f_l〉_t = 0 for i ≠ l, i, l = 1, . . . , N.  (1.7)

Then the SVD of T is the singular value decomposition A = UΣV^T: U, Σ, V
are N × d, d × d, M × d matrices respectively, where U^TU = V^TV are equal to
the d × d identity matrix I_d, and Σ is a diagonal matrix with diagonal entries
σ_1 ≥ σ_2 ≥ · · · ≥ σ_d ≥ 0 called the singular values of A. Furthermore

rank A ≤ d ≤ min(M, N),  0 < σ_{rank A} and σ_m = 0 for m > rank A.  (1.8)
In general, one does not have to assume that e_1, . . . , e_M and f_1, . . . , f_N are orthonormal
bases. By letting P_s := (〈e_j, e_k〉_s)_{j,k=1}^M and P_t := (〈f_i, f_l〉_t)_{i,l=1}^N be any positive
definite matrices, one obtains the corresponding SVD of T:

A = UΣV^T,  U^TP_tU = V^TP_s^{−1}V = I_d,  Σ = diag(σ_1, . . . , σ_d),  (1.9)

where U and V are N × d and M × d matrices whose columns form orthonormal
sets of d vectors with respect to P_t and P_s^{−1} respectively, and Σ is a diagonal matrix
with the singular values σ_1 ≥ · · · ≥ σ_d ≥ 0 of T satisfying (1.8). We call (1.9)
the singular value decomposition of A with respect to P_s, P_t, or the Extended Singular
Value Decomposition (ESVD) of A. We call σ_1, . . . , σ_d the extended singular values
of A, or simply singular values when no ambiguity arises. Let a_1, . . . , a_M and b_1, . . . , b_N
be orthonormal bases in V_s and V_t respectively. Let T : V_s → V_t. Then the
conjugate operator T* : V_t → V_s is defined by the equalities

〈Ta_j, b_i〉_t = 〈a_j, T*b_i〉_s, for i = 1, . . . , N, j = 1, . . . , M.  (1.10)
Assume that

Ta_j = ∑_{i=1}^N c_{ij} b_i,  j = 1, . . . , M.  (1.11)

Then

T*b_i = ∑_{j=1}^M c_{ij} a_j,  i = 1, . . . , N.  (1.12)

That is, the N × M matrix C := (c_{ij}), i = 1, . . . , N, j = 1, . . . , M, represents T in
the orthonormal bases a_1, . . . , a_M and b_1, . . . , b_N, and C^T represents T* in these
bases. In particular

T*T : V_s → V_s,  TT* : V_t → V_t  (1.13)

are represented by C^TC and CC^T in the bases a_1, . . . , a_M and b_1, . . . , b_N respectively.
Clearly

rank T = rank C = rank C^T = rank T* = rank C^TC = rank CC^T = rank T*T = rank TT*.
Since T*T is self-adjoint ((T*T)* = T*T) and nonnegative definite (〈T*Tx, x〉_s ≥
0 for any x ∈ V_s), it follows that there exists an orthonormal basis c_1, . . . , c_M in V_s
such that

T*Tc_i = σ_i^2 c_i, σ_i ≥ 0,  〈c_i, c_k〉_s = δ_{ik}, i, k = 1, . . . , M.  (1.14)

Let d_i := (1/σ_i) Tc_i, i = 1, . . . , rank T. Then d_1, . . . , d_{rank T} is an orthonormal system
of vectors. Extend this system to an orthonormal basis d_1, . . . , d_N of V_t. Then the SVD
of T is given by

Tc_i = σ_i d_i, i = 1, . . . , M,  〈d_j, d_l〉_t = δ_{jl}, j, l = 1, . . . , N,  where d_i := 0 if i > N.  (1.15)
Lemma 1.3 Let V_s, V_t be finite dimensional vector spaces over R of dimensions
M, N, with the inner products 〈·, ·〉_s, 〈·, ·〉_t respectively. Let e_1, . . . , e_M and
f_1, . . . , f_N be bases in V_s and V_t with the corresponding positive definite matrices
P_s = (〈e_j, e_k〉_s)_{j,k=1}^M and P_t = (〈f_i, f_l〉_t)_{i,l=1}^N respectively. Let T : V_s → V_t be a linear
transformation given by (1.11). Then the extended singular value decomposition of
A = (a_{ij}), i = 1, . . . , N, j = 1, . . . , M, with respect to the positive definite matrices P_s, P_t, corresponding
to the singular value decomposition of T in the bases e_1, . . . , e_M and f_1, . . . , f_N, is given
by (1.9). Furthermore, the operator T* : V_t → V_s is represented by the matrix
A† = (a†_{ji}), j = 1, . . . , M, i = 1, . . . , N, in the bases e_1, . . . , e_M and f_1, . . . , f_N:

T*f_i = ∑_{j=1}^M a†_{ji} e_j, i = 1, . . . , N,  A† = P_s^{−1} A^T P_t.  (1.16)
Proof. Let

e_j = Σ_{k=1}^{M} v_jk c_k,   d_l = Σ_{i=1}^{N} u_il f_i,   j = 1, ..., M,  l = 1, ..., N.   (1.17)

Hence

T e_j = Σ_{k=1}^{M} v_jk T c_k = Σ_{k=1}^{d} v_jk σ_k d_k = Σ_{k=1}^{d} Σ_{i=1}^{N} v_jk σ_k u_ik f_i.

Compare the above equalities to (1.11) to deduce the first part of the equality (1.9), A = U Σ V^T, where
U = (u_ik)_{i,k=1}^{N,d},   Ū = (u_il)_{i,l=1}^{N,N} = (U, U_r),
Û = (û_il)_{i,l=1}^{N,N} := (Ū)^{-1},   U_0 := (û_il)_{i,l=1}^{d,N},   Û^T = (U_0^T, U_d^T),   (1.18)

V = (v_jk)_{j,k=1}^{M,d},   V̄ = (v_jl)_{j,l=1}^{M,M} = (V, V_r),
V̂ = (v̂_jl)_{j,l=1}^{M,M} := (V̄)^{-1},   V_0 := (v̂_jl)_{j,l=1}^{d,M},   V̂^T = (V_0^T, V_d^T).

(Here Ū, V̄ denote the extensions of U, V to square invertible matrices, and Û, V̂ their inverses.)
The second part of (1.9), U^T P_t U = I_d, is equivalent to the orthonormality of the system d_1, ..., d_d. We now show the third part of (1.9), V^T P_s^{-1} V = I_d. Since c_1, ..., c_M is an orthonormal basis in V_s it follows that P_s = V̄ V̄^T. Hence P_s^{-1} = (V̄^T)^{-1} (V̄)^{-1} = V̂^T V̂. Note that since V̂ V̄ = I_M it follows that V̂ V = Q is the M × d matrix whose d columns are the first d columns of the identity matrix I_M. Hence V^T P_s^{-1} V = Q^T Q = I_d.
From the definition of T* (1.12) it follows that

T* d_i = σ_i c_i,   i = 1, ..., N,   σ_i = 0 and c_i = 0 for i > M.   (1.19)
As f_i = Σ_{l=1}^{N} û_li d_l we obtain

T* f_i = Σ_{l=1}^{N} û_li T* d_l = Σ_{l=1}^{d} û_li σ_l c_l = Σ_{l=1}^{d} Σ_{j=1}^{M} û_li σ_l v̂_lj e_j.
Compare the above equalities with the definition of A† in (1.16) to deduce that

A† = V_0^T Σ U_0.   (1.20)

Since d_1, ..., d_N is an orthonormal basis it follows that P_t = Û^T Û. As P_s^{-1} = V̂^T V̂ and A = U Σ V^T, a straightforward calculation shows that A† = P_s^{-1} A^T P_t.
1.5 Some Applications of SVD
1.5.1 Analysing DNA gene expression data via SVD
We now give another form of (1.3) which has a significant interpretation in microarray data. Let u_1, ..., u_m denote the columns of U and v_1, ..., v_m denote the columns of V. Then (1.1) and (1.3) can be written as

E = Σ_{q=1}^{m} σ_q u_q v_q^T = Σ_{q=1}^{r} σ_q u_q v_q^T.   (1.21)
If σ_1 > ... > σ_r then u_q and v_q are determined up to the sign ±1 for q = 1, ..., r. Namely, u_q and v_q are length 1 eigenvectors of E E^T and E^T E, respectively, corresponding to the common eigenvalue σ_q^2. (Note that the choice of a sign in v_q forces the unique choice of the sign in u_q.) The vectors u_1, ..., u_r are called eigengenes, the vectors v_1, ..., v_r are called eigenarrays and σ_1, ..., σ_r are called eigenexpressions. The rank r can be viewed as the number of different biological functions of the n genes observed in the m experiments. The eigenarrays v_1, ..., v_r give the principal r orthogonal directions in R^m corresponding to σ_1, ..., σ_r. The eigengenes u_1, ..., u_r give the principal r orthogonal directions in R^n corresponding to σ_1, ..., σ_r. The eigenexpressions describe the relative significance of each biological function. From the data given in [4], it seems that the number of significant singular values never exceeds m/2. See the discussion on the number of significant singular values in the beginning of §3. The essence of the FRAA algorithm, suggested in this thesis, is based on this observation.
1.5.2 Low rank approximation of matrices
A well-known technique for dimension reduction is the low rank approximation by
the singular value decomposition [21].
Let E be the data matrix, and let E = U Σ V^T be the SVD of E, where U and V are orthogonal and Σ is diagonal. Then, for a given κ, the optimal rank κ approximation of E is given by

E_κ = U_κ Σ_κ V_κ^T,

where U_κ and V_κ are the matrices formed by the first κ columns of the matrices U and V respectively, and Σ_κ is the κ-th principal submatrix of Σ. A key property of this rank κ approximation is that it achieves the best possible approximation, with respect to the Frobenius norm (and more generally with respect to any unitarily invariant norm), among all matrices of rank at most κ.
Denote by ||E||_F the Frobenius (ℓ_2) norm of E. It is the Euclidean norm of E viewed as a vector with nm coordinates. Each term u_q v_q^T in (1.21) is a rank one matrix with ||u_q v_q^T||_F = 1. Let R(n, m, k) denote the set of n × m matrices of rank at most k (m ≥ k). Then for each k ≤ r, the SVD of E gives the solution to the following approximation problem:

min_{F ∈ R(n,m,k)} ||E − F||_F = ||E − Σ_{q=1}^{k} σ_q u_q v_q^T||_F = √( Σ_{q=k+1}^{r} σ_q^2 ).   (1.22)
If σ_k > σ_{k+1} then Σ_{q=1}^{k} σ_q u_q v_q^T is the unique solution to the above minimization problem. For our purposes, it will be convenient to assume that σ_q = 0 for any q > m.
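Equation (1.22) can be checked directly: truncating the SVD gives the best Frobenius-norm rank-k approximation, with error equal to the square root of the sum of the discarded σ_q^2. A small numerical sketch (the matrix values are made up):

```python
import numpy as np

# Sketch of the best rank-k approximation: truncate the SVD and verify that the
# Frobenius error equals the right-hand side of (1.22).
E = np.array([[4.0, 0.0, 1.0],
              [2.0, 3.0, 0.0],
              [0.0, 1.0, 5.0],
              [1.0, 0.0, 2.0]])
U, s, Vt = np.linalg.svd(E, full_matrices=False)

k = 2
E_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # E_k = U_k Sigma_k V_k^T
err = np.linalg.norm(E - E_k, 'fro')             # should equal sqrt(sum of s[k:]^2)
```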
2 Randomized Low Rank Approximation
of a Matrix
In many applied settings where one is dealing with a very large data set, it is important to reduce the size of the data in order to make inferences about the features of the data set in a timely manner. Assume that the data set is represented by an m × n matrix A ∈ R^{m×n}. It is important to find an approximation B ∈ R^{m×n} of a specified rank k to A, where k is much smaller than m and n.
Here are several motivations for obtaining such a B. First, the storage space needed for B is k(m + n), which is much smaller than the storage space mn needed for A. Indeed, B can be represented as

B = x_1 y_1^T + x_2 y_2^T + ... + x_k y_k^T,   x_1, ..., x_k ∈ R^m,   y_1, ..., y_k ∈ R^n,   (2.23)

where x_1, ..., x_k and y_1, ..., y_k are k column vectors with m and n coordinates respectively. We store the vectors x_1, ..., x_k, y_1, ..., y_k, which need the storage k(m + n), and compute the entries of B, when needed, using the above expression for B.
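The storage argument can be sketched as follows; the class and method names here are ours, purely for illustration.

```python
import numpy as np

# Sketch of the storage argument around (2.23): keep only the factors x_i, y_i
# (k(m+n) numbers) and reconstruct individual entries of B only when needed.
class FactoredMatrix:
    """Rank-k matrix B = sum_i x_i y_i^T stored in factored form."""
    def __init__(self, xs, ys):
        self.xs = xs  # list of k vectors in R^m
        self.ys = ys  # list of k vectors in R^n

    def entry(self, i, j):
        # B[i, j] = sum over the k rank-one terms, computed on demand.
        return sum(x[i] * y[j] for x, y in zip(self.xs, self.ys))

x1, y1 = np.array([1.0, 2.0]), np.array([3.0, 0.0, 1.0])
x2, y2 = np.array([0.0, 1.0]), np.array([1.0, 1.0, 0.0])
B = FactoredMatrix([x1, x2], [y1, y2])
B_dense = np.outer(x1, y1) + np.outer(x2, y2)    # the same matrix, stored densely
```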
The second most common application is to clustering algorithms, as in [1], [2], [11], [13] and [30]. Assume that our data represents n points and each point has m coordinates; that is, each point is represented by a column of the matrix A. We want to cluster the points in such a way that the distance between points in the same cluster is much smaller than the distance between any two points from different clusters. One way to do this is to project all the points on the k main orthonormal directions encoded by the first k left singular vectors of A, and then cluster using either k one dimensional subspaces or the whole k dimensional subspace. The k-rank approximation B gives an approximation to this k dimensional subspace and to the first k singular vectors. Another way to cluster is to use projective clustering. It is known that a fast SVD, i.e. a fast k-rank approximation, is a main tool in this area [2], [11].
The third application is in DNA microarrays, in particular in data imputation [18]. Let A be a gene expression matrix. The m rows of the matrix A are indexed by the genes, while the n columns are indexed by the experiments. It is known that the effective rank k of A is usually much less than n. The SVD decomposition is the main ingredient of the FRAA (Fixed Rank Approximation Algorithm), which is successfully implemented in [18] to impute the corrupted entries of A. The fast k-rank approximation algorithm suggested in this thesis, combined with the clustering of similar genes, can be implemented to improve the FRAA algorithm.
One way to find a fast k-rank approximation is to choose at random l ≥ k
columns or rows of A and obtain from them k-rank approximations of A. This is
basically the spirit of the algorithm suggested by Frieze, Kannan and Vempala in
[19]. We call this algorithm the FKV algorithm. Assuming a statistical model for
the distribution of the entries of A, the authors give some error bounds on their
k-rank approximation.
In order to grasp the nature of the FKV algorithm, we first need to discuss the random projection method; Santosh S. Vempala gives a thorough discussion of this method in [36].
2.1 Random Projection Method
Let u = (u_1, ..., u_n)^T ∈ R^n. In order to obtain the projection of u, we consider a random n × k matrix R with orthonormal columns, chosen uniformly at random. We then scale R^T u by √(n/k) to obtain the projection v = √(n/k) R^T u. The rationale for choosing the scaling factor √(n/k) is to have the expected value E(‖v‖²) = ‖u‖². There are various strategies for coming up with the random matrix R, which are data and application dependent.
One of the interesting properties of random projection is that while it reduces dimensionality, it preserves pairwise distances with high probability. This property is made precise by the following lemma.
Lemma 2.1 (Johnson and Lindenstrauss) For any 0 < ε < 1/2 and any set of points S in R^n with |S| = m, upon projection to a uniformly random κ-dimensional subspace with

κ ≥ 1 + (9 ln m)/(ε² − (2/3)ε³),

the following property holds: with probability at least 1/2, for every pair u, u′ in S,

(1 − ε)‖u − u′‖² ≤ ‖f(u) − f(u′)‖² ≤ (1 + ε)‖u − u′‖²

where f(u) and f(u′) are the projections of u and u′.
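The distance-preservation property of Lemma 2.1 is easy to observe empirically. The sketch below projects a few points from R^500 onto a random 100-dimensional subspace with the scaling √(n/κ); the sizes and the tolerance used are our choices for illustration.

```python
import numpy as np

# Empirical sketch of Lemma 2.1: project points onto a uniformly random
# kappa-dimensional subspace, rescale by sqrt(n/kappa), and compare pairwise
# squared distances before and after the projection.
rng = np.random.default_rng(1)
n, kappa, n_points = 500, 100, 10
S = rng.standard_normal((n_points, n))          # the point set S, one point per row

# A uniformly random subspace: orthonormalize a Gaussian matrix via QR.
Q, _ = np.linalg.qr(rng.standard_normal((n, kappa)))
f = np.sqrt(n / kappa) * (S @ Q)                # f(u) = sqrt(n/kappa) R^T u for each u

ratios = []
for a in range(n_points):
    for b in range(a + 1, n_points):
        d_orig = np.sum((S[a] - S[b]) ** 2)     # ||u - u'||^2
        d_proj = np.sum((f[a] - f[b]) ** 2)     # ||f(u) - f(u')||^2
        ratios.append(d_proj / d_orig)
```

All pairwise squared-distance ratios concentrate around 1, as the lemma predicts.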
Random Projection Algorithm. Random projection preserves distances approximately while reducing dimensionality. We can use the following steps to achieve these goals.

1. Project each data point to a lower dimensional space using a random matrix.

2. Find the k largest singular values to obtain the low-rank approximation of the original data matrix.

Suppose we project an m × n matrix A using a random m × l matrix R with orthonormal columns, setting B = √(n/l) R^T A. Then the SVDs of A and B can be written as

A = Σ_{i=1}^{r} σ_i u_i v_i^T,   B = Σ_{i=1}^{t} λ_i u′_i v′_i^T,

where u_i and u′_i are the left singular vectors and v_i and v′_i are the right singular vectors of A and B respectively. The following lemma indicates that the singular values are approximately preserved.
Lemma 2.2 Let ε > 0. If l ≥ C log n / ε² for a sufficiently large constant C, then

Σ_{i=1}^{κ} λ_i² ≥ (1 − ε) Σ_{i=1}^{κ} σ_i².
We close this discussion of random projection with the following theorem, which emphasizes that the matrix obtained by random projection is nearly as good as the best κ-rank approximation.

Theorem 2.3 Let A_κ be the best κ-rank approximation of A, and let Ã_κ be the κ-rank approximation of A obtained by random projection. If l > C log n/ε² for a large enough constant C, then

‖A − Ã_κ‖²_F ≤ ‖A − A_κ‖²_F + 2ε‖A_κ‖²_F.
Low rank approximation (FKV method).

Given an m × n matrix A, we would like to approximate a few of the top singular values and their associated singular vectors. Here we present the FKV algorithm [12].

The FKV algorithm starts by choosing l rows of A and forming an l × n matrix L. Let p_1, ..., p_m be real numbers such that 0 ≤ p_i ≤ 1 and Σ_{i=1}^{m} p_i = 1. The algorithm takes as input a matrix A ∈ R^{m×n} and integers l ≤ m and k ≤ l.

We then construct the matrix L in the following way: choose i ∈ {1, ..., m}, where the probability of choosing i is equal to p_i; the scaled row A_(i)/√(l p_i) then constitutes one row of L. Repeating this l times gives the l rows of L.
In the next step we compute L L^T and its spectral decomposition

L L^T = Σ_{t=1}^{l} σ_t² u_(t) u_(t)^T.

Let H be the matrix whose rows are

h_(t)^T,   h_(t) := L^T u_(t)/‖L^T u_(t)‖,   t = 1, ..., k.

Notice that the h_(t) are right singular vectors of L, corresponding to the singular values σ_t, and the SVD of L is

L = Σ_{t=1}^{l} σ_t u_(t) h_(t)^T.
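The row-sampling step just described can be sketched as follows. This is a simplified illustration, not the authors' implementation: we take uniform sampling probabilities p_i (our choice), and use the top-k right singular vectors of L to build a k-rank approximation of A.

```python
import numpy as np

# Simplified FKV-style sketch: sample l rows of A with probabilities p_i, rescale
# each chosen row by 1/sqrt(l * p_i), and take the top-k right singular vectors of
# the small matrix L as approximate right singular vectors of A.
def fkv_sketch(A, l, k, rng):
    m, n = A.shape
    p = np.full(m, 1.0 / m)                       # p_i >= 0, sum p_i = 1 (uniform here)
    idx = rng.choice(m, size=l, p=p)              # draw l row indices, i.i.d. with prob. p_i
    L = A[idx, :] / np.sqrt(l * p[idx, None])     # rows A_(i) / sqrt(l p_i)
    _, _, Ht = np.linalg.svd(L, full_matrices=False)
    H = Ht[:k, :]                                 # rows approximate the h_(t)
    return A @ H.T @ H                            # project A onto span of the h_(t)

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 30))  # a rank-8 matrix
B = fkv_sketch(A, l=50, k=8, rng=rng)
rel_err = np.linalg.norm(A - B, 'fro') / np.linalg.norm(A, 'fro')
```

Since A has rank 8 here and 50 sampled rows span its row space, the sketch recovers A essentially exactly.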
The weak point of the FKV algorithm is its inability to improve the FKV approximation iteratively by incorporating additional parts of A. In the following section, we present our algorithm [17], which overcomes this shortcoming.
2.2 Fast Monte-Carlo Rank Approximation
As noted above, the weak point of the FKV algorithm is its inability to improve the FKV approximation iteratively by incorporating additional parts of A. In fact, in the recent paper [11], which uses the FKV randomized algorithm, this point is mentioned specifically at the end of the paper: “..., it would be interesting to design an algorithm that improves this approximation by accessing A (or parts of A) again.”
We provide here a sampling framework for iterative updates of k-rank approximations of A, reading additional columns (rows) of A at each iteration, which is guaranteed to improve the approximation B each time it is updated. The quality of the approximation B is measured by the Frobenius norm ||B||_F, whose successive values form a nondecreasing sequence under these iterations. The rate of increase of the norms ||B||_F can be used to give a stopping rule for terminating the algorithm. Also, the updating algorithm for the k-rank approximation gives approximations to the first k singular values of A, and to the first k left and right singular vectors of A. We believe that this algorithm will have many applications in data mining, data storage and data analysis.
The following set of theorems ([17]) will be useful in the construction of the algorithm.

Theorem 2.4 Let A ∈ R^{m×n} and k ∈ [1, min(m, n)]. Let x_1, ..., x_k and y_1, ..., y_k be two sets of orthonormal vectors in R^m and R^n respectively. Then

Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i) ≤ Σ_{i=1}^{k} σ_i²,   Σ_{i=1}^{k} (A y_i)^T (A y_i) ≤ Σ_{i=1}^{k} σ_i².   (2.24)

Equality in the first or second inequality occurs if span(x_1, ..., x_k) or span(y_1, ..., y_k) contains k linearly independent left or right singular vectors of A, corresponding to the k maximal singular values of A, respectively.
The above characterization follows from the maximal (Ky Fan) characterization of the sum of the k largest eigenvalues of a real symmetric matrix:

Theorem 2.5 Let S ∈ S_p(R) be a real p × p symmetric matrix. Let λ_1 ≥ ... ≥ λ_p be the eigenvalues of S arranged in decreasing order and listed with their multiplicities. Let w_1, ..., w_p ∈ R^p be an orthonormal set of corresponding eigenvectors of S: S w_i = λ_i w_i, i = 1, ..., p. Let k ∈ [1, p] be an integer. Then for any orthonormal set x_1, ..., x_k ∈ R^p

Σ_{i=1}^{k} x_i^T S x_i ≤ Σ_{i=1}^{k} λ_i = Σ_{i=1}^{k} w_i^T S w_i.   (2.25)

Equality holds if and only if the subspace span(x_1, ..., x_k) contains k linearly independent eigenvectors of S corresponding to the eigenvalues λ_1, ..., λ_k.
See for example [14] for proofs and references. Note that x_1, ..., x_k ∈ R^m is a system of orthonormal vectors in R^m if and only if the m × k matrix (x_1, ..., x_k) is in O_mk(R).

To obtain Theorem 2.4 from Theorem 2.5 we let p = m (or n) and let S be equal to A A^T (or A^T A). In (2.24) we emphasized the computational complexity of the left-hand sides of the inequalities. See also [18] for applications of Theorem 2.5 to data imputation in DNA microarrays.
Corollary 2.6 Let A ∈ R^{m×n} and let k ∈ [1, min(m, n)] be an integer. Then for any two orthonormal systems x_1, ..., x_k ∈ R^m and y_1, ..., y_k ∈ R^n the following equalities hold:

||A − Σ_{i=1}^{k} x_i (x_i^T A)||²_F = ||A||²_F − Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i),   (2.26)

||A − Σ_{i=1}^{k} (A y_i) y_i^T||²_F = ||A||²_F − Σ_{i=1}^{k} (A y_i)^T (A y_i).   (2.27)

In particular the best k-rank approximation E_k of A is given by Σ_{i=1}^{k} u_i (u_i^T A) and by Σ_{i=1}^{k} (A v_i) v_i^T, where u_1, ..., u_k ∈ R^m and v_1, ..., v_k ∈ R^n are orthonormal sets of left and right singular vectors of A corresponding to σ_1, ..., σ_k.
The next theorem is the key theorem for updating the k-rank approximation Σ_{i=1}^{k} x_i (x_i^T A), (or Σ_{i=1}^{k} (A y_i) y_i^T), for some (x_1, ..., x_k) ∈ O_mk(R), (or some (y_1, ..., y_k) ∈ O_nk(R)).
Theorem 2.7 Let x_1, ..., x_k ∈ R^m, (or y_1, ..., y_k ∈ R^n), be an orthonormal system in R^m, (or in R^n). Let w_1, ..., w_l ∈ R^m, (or z_1, ..., z_l ∈ R^n), be a given set of vectors in R^m, (or in R^n). Perform the Modified Gram-Schmidt process on x_1, ..., x_k, w_1, ..., w_l, (or on y_1, ..., y_k, z_1, ..., z_l), to obtain an orthonormal set x_1, ..., x_p ∈ R^m, (or y_1, ..., y_p ∈ R^n), where k ≤ p ≤ k + l. Assume that k < p, i.e. span(w_1, ..., w_l) ⊄ span(x_1, ..., x_k), (or span(z_1, ..., z_l) ⊄ span(y_1, ..., y_k)). Form the p × p real symmetric matrix S := ((A^T x_i)^T (A^T x_j))_{i,j=1}^{p}, (or S := ((A y_i)^T (A y_j))_{i,j=1}^{p}), and assume that λ_1 ≥ ... ≥ λ_k are the k largest eigenvalues of S, with corresponding k orthonormal eigenvectors o_1, ..., o_k ∈ R^p. Let O := (o_1, ..., o_k) ∈ O_pk(R) and define the k orthonormal vectors x′_1, ..., x′_k ∈ R^m, (or y′_1, ..., y′_k ∈ R^n), as follows:

(x′_1, ..., x′_k) = (x_1, ..., x_p) O,   (or (y′_1, ..., y′_k) = (y_1, ..., y_p) O).   (2.28)

Then

Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i) ≤ Σ_{i=1}^{k} (A^T x′_i)^T (A^T x′_i),   (2.29)

(or Σ_{i=1}^{k} (A y_i)^T (A y_i) ≤ Σ_{i=1}^{k} (A y′_i)^T (A y′_i)).

Furthermore,

λ_i = (A^T x′_i)^T (A^T x′_i), i = 1, ..., k,   (A^T x′_i)^T (A^T x′_j) = 0 for i ≠ j,   (2.30)

(or λ_i = (A y′_i)^T (A y′_i), i = 1, ..., k,   (A y′_i)^T (A y′_j) = 0 for i ≠ j).
We now explain the essence of Theorem 2.7. View x_1, ..., x_k as an approximation to the first k left singular vectors of A, and Σ_{i=1}^{k} x_i (A^T x_i)^T as the k-rank approximation to A. Hence (A^T x_i)^T (A^T x_i) is an approximation to σ_i² of A for i = 1, ..., k. Read additional vectors w_1, ..., w_l ∈ R^m such that at least one of these vectors is not in the subspace spanned by x_1, ..., x_k. Let X be the subspace spanned by x_1, ..., x_k and w_1, ..., w_l. Hence x_1, ..., x_k, ..., x_p is the orthonormal basis of X obtained from the vectors x_1, ..., x_k, w_1, ..., w_l using the Modified Gram-Schmidt process. Note that k < p ≤ k + l, and in general p = k + l. Consider the p × p nonnegative definite matrix S = ((A^T x_i)^T (A^T x_j))_{i,j=1}^{p}. Find its first k eigenvectors to obtain x′_1, ..., x′_k ∈ R^m. Then C := Σ_{i=1}^{k} x′_i (A^T x′_i)^T is the best approximation of A by a matrix B of rank at most k whose columns are in the subspace X. In particular, the approximation C is better than the previous approximation Σ_{i=1}^{k} x_i (A^T x_i)^T, which is equivalent to (2.29). A similar situation holds in the second case for y_1, ..., y_k, z_1, ..., z_l ∈ R^n.
Outline of the Proof of Theorem 2.7. Let S = (s_ij)_{i,j=1}^{p}, and let e_i = (δ_i1, ..., δ_ip)^T, i = 1, ..., p, be the standard orthonormal basis in R^p. Assume that we are dealing with the orthonormal set x_1, ..., x_p ∈ R^m. Use the definition of S and the Ky Fan characterization of the sum of the maximal k eigenvalues of the symmetric matrix S to deduce

Σ_{i=1}^{k} (A^T x_i)^T (A^T x_i) = Σ_{i=1}^{k} e_i^T S e_i ≤ Σ_{i=1}^{k} λ_i = Σ_{i=1}^{k} o_i^T S o_i.

Let C := A^T (x_1, ..., x_p). Then S = C^T C. Hence √λ_1 ≥ ... ≥ √λ_p are the singular values of C and o_1, ..., o_p are the right singular vectors of C. Thus C o_i = √λ_i t_i ∈ R^n, where t_i is the left singular vector of C corresponding to the singular value √λ_i, for i = 1, ..., p. A straightforward calculation shows that λ_i = o_i^T S o_i = (A^T x′_i)^T (A^T x′_i) for i = 1, ..., k. Hence the first inequality in (2.29) holds. Furthermore, we deduce the first set of equalities in (2.30) for i = 1, ..., k. The second set of equalities in (2.30) follows from the orthonormality of t_1, ..., t_k. Similar arguments apply when dealing with the orthonormal system y_1, ..., y_p.
Algorithm [17]

One starts the algorithm by choosing the first k-rank approximation to A as follows. Let c_1, ..., c_n ∈ R^m and r_1^T, ..., r_m^T ∈ R^n be the n columns and the m rows of A. (We view the rows of A as row vectors.) Choose k integers 1 ≤ n_1 < ... < n_k ≤ n, (or 1 ≤ m_1 < ... < m_k ≤ m). Let x_1, ..., x_q ∈ R^m, (or y_1, ..., y_q ∈ R^n), be the orthonormal set obtained from c_{n_1}, ..., c_{n_k}, (or r_{m_1}^T, ..., r_{m_k}^T), using the Modified Gram-Schmidt process. Then let

B_0 := Σ_{i=1}^{q} x_i (A^T x_i)^T,   (or B_0 := Σ_{i=1}^{q} (A y_i) y_i^T).   (2.31)

That is, the first k-rank approximation B_0 to A is obtained as follows. Choose at random k columns, (or rows), of the data matrix A. Apply the Modified Gram-Schmidt process to them to obtain the orthonormal column vectors x_1, ..., x_q with m coordinates, (or the orthonormal column vectors y_1, ..., y_q with n coordinates). In general q = k, but if the chosen columns, (or rows), are linearly dependent, then q < k. Assume for simplicity of the exposition that q = k. Then B_0 is of the form (2.23) with y_i = A^T x_i, i = 1, ..., k, (or x_i = A y_i, i = 1, ..., k).
Now update iteratively the k-rank approximation B_{t−1} of A to B_t, using Theorem 2.7, by letting w_1 := c_{j_1}, ..., w_l := c_{j_l} ∈ R^m, (or z_1 := r_{j_1}^T, ..., z_l := r_{j_l}^T ∈ R^n), for some l integers 1 ≤ j_1 < ... < j_l ≤ n, (or 1 ≤ j_1 < ... < j_l ≤ m). That is, we choose another set of l columns of A, (or rows of A), preferably ones that were not chosen before, and update the k-rank approximation using the algorithm suggested by Theorem 2.7 to obtain an improved k-rank approximation B_t of A.
Furthermore, one can use the k-rank approximation B_t from the above algorithm to approximate the first k singular values σ_1, ..., σ_k, and the left and right singular vectors u_1, ..., u_k and v_1, ..., v_k, as follows. First, the square roots √λ_1(S), ..., √λ_k(S) of the eigenvalues of the matrix S are approximations of σ_1, ..., σ_k. If S was obtained using x_1, ..., x_p ∈ R^m, then the vectors x′_1, ..., x′_k approximate u_1, ..., u_k. Let v′_i := A^T x′_i, i = 1, ..., k. Then ||v′_i|| = √λ_i(S) is an approximation to σ_i of A for i = 1, ..., k, and the renormalized vectors (1/||v′_i||) v′_i approximate the right singular vectors v_i for i = 1, ..., k.

If S was obtained using y_1, ..., y_p ∈ R^n, then the vectors y′_1, ..., y′_k approximate v_1, ..., v_k, and the vectors A y′_1/||A y′_1||, ..., A y′_k/||A y′_k|| approximate u_1, ..., u_k.
Fast k-rank approximation and SVD algorithm

Input: positive integers m, n, k, l, N, an m × n matrix A, and ε > 0.

Output: an m × n k-rank approximation B_f of A, with the ratios ||B_0||/||B_t|| and ||B_{t−1}||/||B_t||, approximations to the k singular values, and approximations to the k left and right singular vectors of A.

1. Choose a k-rank approximation B_0 using k columns (rows) of A.

2. For t = 1 to N:

- Choose l columns (rows) from A at random and update B_{t−1} to B_t.

- Compute the approximations to the k singular values, and the k left and right singular vectors of A.

- If ||B_{t−1}||/||B_t|| > 1 − ε, let f = t and finish.
We now explain briefly the main steps of our algorithm. We read the dimensions m, n of the data matrix A. We set N as the maximal number of iterations we are going to execute to find the k-rank approximation of A. We read the entries of the data matrix A, and finally the small parameter ε > 0. We choose the k-rank approximation B_0 using (2.31) as explained above. Assume that B_{t−1} is the current k-rank approximation to A. Then we pick an additional l columns, (or rows), of A and update B_{t−1} to B_t using Theorem 2.7 as explained. Recall that ||B_{t−1}|| ≤ ||B_t||. If the relative improvement in ||B_t|| is less than ε, i.e. ||B_{t−1}||/||B_t|| > 1 − ε, we are satisfied with the approximation B_t and finish the algorithm. If this does not happen, then our algorithm stops after the N-th iteration.
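A simplified sketch of the column version of this algorithm is given below. QR factorization stands in for the Modified Gram-Schmidt step, the function names are ours, and the monotonicity of ||B_t||_F follows from Theorem 2.7.

```python
import numpy as np

# Simplified sketch of the fast k-rank approximation algorithm (column version).
# update() follows Theorem 2.7: form S = ((A^T x_i)^T (A^T x_j)) over the enlarged
# orthonormal set and rotate onto the top-k eigenvectors of S.
def update(A, X, new_cols):
    Xp, _ = np.linalg.qr(np.hstack([X, new_cols]))  # orthonormal basis of span(X, new cols)
    C = A.T @ Xp                                    # columns are A^T x_i, so S = C^T C
    S = C.T @ C
    _, O = np.linalg.eigh(S)                        # eigenvalues in ascending order
    k = X.shape[1]
    return Xp @ O[:, ::-1][:, :k]                   # top-k eigenvectors give x'_1, ..., x'_k

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 40))
k, l, steps = 5, 5, 6
X, _ = np.linalg.qr(A[:, rng.choice(40, size=k, replace=False)])  # B_0 from k random columns
norms = []
for _ in range(steps):
    B = X @ (X.T @ A)                               # current k-rank approximation B_t
    norms.append(np.linalg.norm(B, 'fro'))
    X = update(A, X, A[:, rng.choice(40, size=l, replace=False)])
```

As the theory guarantees, the recorded values ||B_t||_F never decrease.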
3 Cluster Analysis
3.1 Clusters and Clustering
Clustering is the process of grouping data objects into a set of disjoint classes called clusters, so that the objects within the same class have high similarity to each other, while objects in separate clusters are more dissimilar [25]. Clustering is an example of
unsupervised classification. Classification refers to a procedure that assigns data
objects to a set of classes. Unsupervised means that clustering does not rely on
predefined classes and training examples while classifying the data objects. Thus,
clustering is distinguished from pattern recognition or the areas of statistics known
as discriminant analysis and decision analysis, which seek to find rules for classify-
ing objects from a given set of pre-classified objects.
3.2 Various Gene Expression Clustering Procedures
Usually a microarray experiment involves on the order of 10^6 genes. One of the characteristics of gene expression data is that it is meaningful to cluster both genes and samples. We can group co-expressed genes into a cluster according to their expression patterns. In such gene based clustering, the genes are treated as objects, while the samples are the features. On the other hand, the samples can be partitioned into homogeneous groups. Each group may correspond to some particular cancer type.
Such sample based clustering regards the samples as the objects and the genes as
the features.
3.2.1 Proximity (Similarity) Measurement
Proximity measurement for gene expression data computes the distance between two data points. As mentioned before, a gene profile in a gene expression matrix can be formalized as a vector x_i = (x_ij : 1 ≤ j ≤ m), where x_ij is the value of the j-th feature of the i-th data object, and m is the number of features or experiments. The similarity between two expression profiles x_i and x_j is measured by a distance function of the corresponding vectors x_i and x_j.
Euclidean distance is one of the most commonly used measures of the distance between two gene expression profiles. The distance between two gene expression profiles x_i and x_j in m-dimensional space is defined as

d_Euclidean(x_i, x_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² ).

However, for gene expression data, the overall Euclidean distance does not perform well for shifted or scaled patterns (profiles). To circumvent this, each gene profile is standardized to zero mean and unit variance before calculating the distance.
An alternative measure of similarity is the Pearson correlation coefficient, which measures the similarity between the shapes of two expression profiles:

Pearson(x_i, x_j) = Σ_{k=1}^{m} (x_ik − μ_{x_i})(x_jk − μ_{x_j}) / ( √(Σ_{k=1}^{m} (x_ik − μ_{x_i})²) · √(Σ_{k=1}^{m} (x_jk − μ_{x_j})²) ),

where μ_{x_i} and μ_{x_j} are the means of the gene profiles x_i and x_j respectively. Pearson's correlation coefficient views each gene profile as a random variable, and measures the similarity between two gene profiles by calculating the linear relationship between their distributions. The shortcoming of the Pearson correlation is that it is not robust to outliers.
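Both proximity measures, together with the standardization step mentioned above, can be written out directly; the function names are ours.

```python
import numpy as np

# The two proximity measures above, plus zero-mean unit-variance standardization.
def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

def standardize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# A profile and a shifted, scaled copy: the raw Euclidean distance is large, while
# Pearson correlation (and Euclidean distance on standardized profiles) sees the
# same shape in both.
a = np.array([1.0, 3.0, 2.0, 5.0])
b = 2.0 * a + 10.0
```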
3.2.2 Hierarchical Cluster Analysis
One possibility for clustering objects is their hierarchical aggregation. Here the objects are combined according to their distances from, or similarities to, each other. Cluster formation may start by splitting the whole set of objects into individual clusters; with the more frequently used agglomerative clustering, one starts with single objects and successively merges them into larger groups.

To decide the number of clusters, different criteria can be used. Very often, the number of clusters is unknown. For example, in given clinical data, the number of clusters might be predefined by a given number of diseases. In some cases, the number of clusters can be obtained from a predetermined distance measure or difference between clusters. In general, when clusters A and B are merged into a new cluster K, the distance from K to an object or cluster i is computed from the distances of A and B to i; for the weighted average linkage, d_Ki = (d_Ai + d_Bi)/2, where the sizes of the clusters and their weights are assumed to be equal. Several other formulas exist for the construction of the distance matrix. The most important ones are considered now for the case of aggregating two clusters.
Single Linkage

Here the shortest distance between the opposite clusters is calculated, i.e.

d_Ki = (d_Ai + d_Bi)/2 − |d_Ai − d_Bi|/2 = min(d_Ai, d_Bi).

As a result, the clusters that are formed are loosely bound. The clusters are often linearly elongated, in contrast to the usual spherical clusters. This chaining is caused by the fusion of single objects to a cluster.
3.2.3 Clustering Using K-means Algorithm
K-means algorithm. Of all clustering procedures for microarray data, K-means is among the simplest and most widely used, and has probably the cleanest probabilistic interpretation as a form of expectation maximization (EM) on the underlying mixture model. In a typical implementation of the K-means algorithm, the number of clusters is fixed at some value K based, for instance, on the expected number of regularity patterns. K representative points or centers, also called centroids or prototypes, are initially chosen for each cluster more or less at random. Then at each step, we have the following:
• Each point in the data is assigned to the cluster associated with the closest
representative.
• After the assignment, new representative points are computed, for instance
by computing the center of gravity of each computed cluster.
• The two procedures above are repeated until the system converges or the fluctuations remain small.
We notice that using the K-means clustering method requires choosing the number of clusters, being able to compute a distance or similarity between points, and being able to compute a representative for each cluster given its members. The general idea behind K-means clustering can lead to different software implementations depending on how the initial centroids are chosen, how symmetries are broken, and so forth. A good implementation has to run the algorithm multiple times with different initial conditions, and possibly also try different values of K automatically. When the cost function corresponds to an underlying probabilistic mixture model, K-means is an online approximation to the classical EM algorithm, and as such it is in general bound to converge towards a solution that is at least a local maximum of the likelihood or of the posterior. A classical case is when Euclidean distances are used in conjunction with a Gaussian mixture model.
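The two alternating steps above can be sketched as a minimal K-means implementation. The fixed initial centroids and toy points are our choices, to keep the run deterministic; as noted, a real implementation would rerun with several initializations.

```python
import numpy as np

# A minimal K-means sketch following the two alternating steps above.
def kmeans(points, centroids, n_iter=20):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iter):
        # Step 1: assign each point to the cluster of the closest representative.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: recompute each representative as the center of gravity of its cluster.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = points[labels == k].mean(axis=0)
    return labels, centroids

pts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(pts, centroids=[[0.0, 0.0], [10.0, 10.0]])
```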
3.2.4 Mixture Models and EM Algorithm
Let us consider a data set D = (d_1, ..., d_N) and an underlying mixture model with K components of the form

P(d) = Σ_{k=1}^{K} P(M_k) P(d|M_k) = Σ_{k=1}^{K} λ_k P(d|M_k),

where λ_k ≥ 0, Σ_k λ_k = 1, and M_k is the model for cluster k [7]. Mixture distributions provide a flexible way of modelling complex distributions, combining together simple building blocks such as Gaussian distributions. The Lagrangian associated with the log-likelihood and the normalization constraint on the mixing coefficients is given by

L = Σ_{i=1}^{N} log( Σ_{k=1}^{K} λ_k P(d_i|M_k) ) − μ( Σ_{k=1}^{K} λ_k − 1 ),

with the corresponding critical equation

∂L/∂λ_k = Σ_{i=1}^{N} P(d_i|M_k)/P(d_i) − μ = 0.
Multiplying each critical equation by λ_k and summing over k gives the value of the Lagrange multiplier μ = N.

Multiplying the critical equation again by P(M_k) = λ_k and using Bayes' theorem in the form

P(M_k|d_i) = P(d_i|M_k) P(M_k) / P(d_i)

gives

λ*_k = (1/N) Σ_{i=1}^{N} P(M_k|d_i).

Therefore the maximum likelihood estimate of the mixing coefficient for class k is the sample mean of the conditional probabilities that d_i comes from the model M_k. Consider now that each model M_k has its own vector of parameters (w_kj).
Differentiating the Lagrangian with respect to w_kj gives

∂L/∂w_kj = Σ_{i=1}^{N} (λ_k / P(d_i)) · ∂P(d_i|M_k)/∂w_kj.

Combining the above equations we get

Σ_{i=1}^{N} P(M_k|d_i) · ∂ log P(d_i|M_k)/∂w_kj = 0

for each k and j. The maximum likelihood equations for estimating the parameters are thus weighted averages of the single-model maximum likelihood equations ∂ log P(d_i|M_k)/∂w_kj = 0. Notice that the weights are the probabilities of membership of d_i in each class.
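The updates derived above translate directly into an EM iteration. Below is a sketch for a two-component one-dimensional Gaussian mixture with a fixed common variance; the data and initialization are made up for illustration.

```python
import numpy as np

# Sketch of the EM updates derived above for a two-component 1-D Gaussian mixture
# with fixed common standard deviation sigma. The E-step computes P(M_k | d_i) via
# Bayes' theorem; the M-step sets lambda_k to the sample mean of the memberships
# (the estimate lambda*_k above) and re-estimates the means as weighted averages.
def em_step(d, lam, mu, sigma):
    # E-step: responsibilities P(M_k | d_i), one row per data point d_i.
    dens = np.exp(-0.5 * ((d[:, None] - mu[None, :]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    num = lam[None, :] * dens                      # lambda_k P(d_i | M_k)
    resp = num / num.sum(axis=1, keepdims=True)    # normalize by P(d_i)
    # M-step: mixing coefficients and means.
    lam_new = resp.mean(axis=0)                    # lambda*_k = (1/N) sum_i P(M_k | d_i)
    mu_new = (resp * d[:, None]).sum(axis=0) / resp.sum(axis=0)
    return lam_new, mu_new, resp

d = np.array([-2.1, -1.9, -2.0, 3.9, 4.1, 4.0, 4.2])   # made-up 1-D data, two clumps
lam = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
for _ in range(30):
    lam, mu, resp = em_step(d, lam, mu, sigma=1.0)
```

After a few iterations, the means settle near the two clumps and λ_k near the cluster proportions 3/7 and 4/7.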
4 Various Methods of Imputation of Missing Values
in DNA Microarrays
4.1 The Gene Expression Matrix
In this section we will view E ∈ R^{n×m}, with n ≥ m, as the gene expression matrix

E =
[ g_11 g_12 . . . g_1m ]
[ g_21 g_22 . . . g_2m ]
[   .    .   . . .  .  ]
[ g_n1 g_n2 . . . g_nm ]
 = [c_1 c_2 . . . c_m],   (4.1)

with rows and columns

g_j^T := (g_j1, g_j2, ..., g_jm), j = 1, ..., n,   c_i := (g_1i, g_2i, ..., g_ni)^T, i = 1, ..., m.

The row vector g_j^T corresponds to the (relative) expression levels of the j-th gene in the m experiments. The column vector c_i corresponds to the (relative) expression levels of the n genes in the i-th experiment.
Consider the SVD of the gene expression matrix E = U Σ V^T. In the terminology of [4], the columns of U are eigenarrays, the columns of V are eigengenes, and the singular values of E are eigenexpression levels.

In many microarray data sets, researchers have found that only a few eigengenes are needed to capture the overall gene expression pattern. (Here, by a “few” we mean less than half of the number of experiments m.) The number of these significant eigengenes is a fundamental problem in principal component analysis [24]. Let us mention explicitly three methods to estimate the number of significant eigengenes. The fraction criterion can be stated simply as follows. Let

p_q := σ_q² / Σ_{t=1}^{m} σ_t²,   q = 1, ..., m,   p := (p_1, ..., p_m)^T.   (4.2)
Thus p_q represents the fraction of the expression level contributed by the q-th eigengene. Then we choose the l eigengenes that contribute about 70%–90% of the total expression level. Another method is to use scree plots for the σ_q². (In principal component analysis, the p_q are proportional to the variances of the principal components, so we choose the principal components of maximum variability [26].) According to [24], the most consistent estimates of the number of significant eigengenes are achieved by the broken-stick model.
If E has l significant singular values, we view σ_q as effectively equal to zero for q > l. We define the matrix

E_l := Σ_{q=1}^{l} σ_q u_q v_q^T   (4.3)

as the filtered part of E and consider E − E_l as the noise part of E.
Let
1 ≥ h(p) := −(1/log m) ∑_{q=1}^m p_q log p_q ≥ 0.   (4.4)
Then h(p) is the rescaled entropy of the probability vector p. We have h(p) = 1 if and only if p_q = 1/m for q = 1, ..., m; in other words, all the eigengenes are equally expressed. On the other hand, h(p) = 0 if and only if p_q(1 − p_q) = 0 for q = 1, ..., m, which corresponds to rank r = 1; in other words, the gene expression is captured by a single eigengene (and eigenarray).
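The rescaled entropy (4.4) can be sketched as follows (Python/NumPy; the test matrices are hypothetical). A rank-one matrix gives h(p) near 0, while the identity, whose eigengenes are equally expressed, gives h(p) = 1.

```python
import numpy as np

def eigen_entropy(E):
    """Rescaled entropy h(p) of eq. (4.4); lies in [0, 1]."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s**2 / np.sum(s**2)
    m = len(p)
    p = p[p > 1e-15]                 # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p)) / np.log(m)

rng = np.random.default_rng(2)
rank_one = np.outer(rng.standard_normal(30), rng.standard_normal(5))

print(eigen_entropy(rank_one))       # near 0: one eigengene captures everything
print(eigen_entropy(np.eye(5)))      # 1: all eigengenes equally expressed
```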
The following example points out a potential weakness of SVD theory in trying
to detect groups of genes with similar properties.
4.2 SVD and gene clusters
Suppose the set of genes gTj , j ∈ [n], can be grouped into k + 1 disjoint subsets, [n] = G_1 ∪ · · · ∪ G_{k+1}, with G1, ..., Gk nonempty and m ≥ k (usually m > k). In particular,
consider the genes in each group Gq ( q = 1, ..., k) to have similar characteristics (in
other words, Gq is a cluster). Genes that have no similar characteristics are placed
in Gk+1. Denote by #Gq the cardinality of the set Gq for q = 1, ..., k + 1. Suppose
that our m experiments do not distinguish between any two genes belonging to the
same group Gq for q = 1, ..., k + 1. More precisely we assume:
g_{ji} = a_{qi} for each j ∈ G_q, q = 1, ..., k, i = 1, ..., m,   (4.5)
g_{ji} = 0 for each j ∈ G_{k+1}, i = 1, ..., m.
Let A = (a_{qi}) ∈ R^{k×m} be the corresponding k × m matrix with rows r_1^T, ..., r_k^T.
Then the row rq appears exactly #Gq times in E for q = 1, ..., k. In addition
E has #Gk+1 zero rows. Clearly the row space of E is the row space of A. So
k ≥ rank E = rank A. Hence if rank A = k then
σ1(E) ≥ ... ≥ σk(E) > σk+1(E) = ... = σm(E) = 0.
However, there is no simple formula relating the singular values of E and A. It
may happen that the rows of A are linearly dependent which indicates that several
groups out of G1, ..., Gk are somehow related, and the number of the significant
singular values of E is less than k.
Conclusion: The number of gene clusters is no less than the number of significant
singular values of the gene expression matrix.
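This conclusion admits a quick numerical check (Python/NumPy; the cluster sizes and dimensions are arbitrary choices): a matrix built from k = 3 repeated row profiles plus zero rows has exactly 3 nonzero singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
k, m = 3, 6
A = rng.standard_normal((k, m))      # one distinct expression profile per cluster

# E repeats profile q exactly #G_q times (sizes 40, 30, 20), plus 10 zero rows.
E = np.vstack([np.repeat(A, [40, 30, 20], axis=0), np.zeros((10, m))])

s = np.linalg.svd(E, compute_uv=False)
print("nonzero singular values:", np.sum(s > 1e-10))
```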
4.3 Missing Data in the Gene Expression Matrix
We now consider the problem of missing data in the gene expression matrix E.
(Our analysis can be applied to any matrix E.) Let N ⊂ [n] denote the set of rows
of E that contain at least one missing entry. Thus for each j ∈ N c := [n]\N , the
gene gTj has all of its entries. Let n′ denote the size of N c so that the size of N is
n − n′. We want to complete the missing entries of each gTj , j ∈ N , under some assumptions.
We first describe the reconstruction of the missing data in E using the SVD as
given in [4].
4.3.1 Reconsideration of §4.2
Let us reconsider the example of §4.2. Assume that rank A = k. Let j ∈ N and assume that the gene j is in the cluster Gq. Then we can reconstruct all missing entries of gTj if Gq \ N ≠ ∅. Indeed, if for some gene p ∈ Gq we have the results of all m experiments, then gj = gp and we have reconstructed the missing entries of gj. In this example we can reconstruct all the missing entries in E if E′ has the same rank as E; equivalently, if the equality l = l′ holds, where l and l′ are the ranks of E and E′ respectively.
4.3.2 Bayesian Principal Component Analysis (BPCA)
BPCA consists of three processes [28]:
• Principal Component (PC) regression
• Bayesian estimation
• Iterative expectation-maximization (EM) algorithm.
Here we give a summary of each process.
PC regression
In order to explain PC regression, we first recall principal component analysis (PCA).
Consider a D×N gene matrix Y where D represents the number of arrays (samples)
and N the number of genes. Let µ be the mean vector of the columns y_i:

µ := (1/N) ∑_{i=1}^N y_i.

Then construct the covariance matrix

S := (1/N) ∑_{i=1}^N (y_i − µ)(y_i − µ)^T.
Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_D be the eigenvalues of S and u_1, u_2, . . . , u_D the corresponding eigenvectors. Each eigenvector is scaled by the square root of its eigenvalue: the l-th principal axis vector w_l is defined by w_l = √λ_l u_l, and the l-th factor score by x_l = (w_l/λ_l)^T y.

Then the variation of the gene expression y can be represented as a linear combination of the principal axis vectors w_l (1 ≤ l ≤ K), K < D (only a few w_l are needed):

y = ∑_{l=1}^K x_l w_l + ε.
This is the spirit of PC regression. We now consider the missing values in the gene expression matrix. We can estimate the missing part y^miss of a gene expression y from its observed part y^obs, using PCA. Let w_l^obs and w_l^miss denote the parts of the principal axis w_l corresponding to the observed and missing coordinates respectively, and construct the matrices

W^obs = (w_1^obs, . . . , w_K^obs) and W^miss = (w_1^miss, . . . , w_K^miss).

We obtain the factor scores x = (x_1, . . . , x_K)^T by minimizing the residual error err = ‖y^obs − W^obs x‖^2. By the least squares method,

x = [(W^obs)^T W^obs]^{-1} (W^obs)^T y^obs.

Then the missing part of the gene expression y is estimated by y^miss = W^miss x.
Bayesian estimation

The probabilistic extension of principal component analysis (PPCA) has found useful applications in supervised learning (Tipping). In this setting the factor scores and the residual error are normally distributed:

x ∼ N(0, I_K)   and   ε ∼ N(0, (1/τ) I_D).

An interesting fact is that the maximum likelihood estimates of PPCA and PCA are identical. The log-likelihood is

ln p(y, x|θ) = ln p(y, x|W, µ, τ) = −(τ/2)‖y − Wx − µ‖^2 − (1/2)‖x‖^2 + (D/2) ln τ − ((K + D)/2) ln 2π,

where θ ≡ {W, µ, τ} is the parameter set. Using Bayes' theorem, we can obtain the posterior probability distribution of θ and X:

p(θ, X|Y) ∝ p(Y, X|θ) p(θ).
Iterative EM algorithm

If the true parameter θ_true were known, the posterior probability of the missing values would be given by

q(Y^miss) = p(Y^miss | Y^obs, θ_true).

Given instead a parameter posterior q(θ),

q(Y^miss) = ∫ q(θ) p(Y^miss | Y^obs, θ) dθ.

The variational Bayes (VB) algorithm can be used to estimate the model parameter θ and the missing values Y^miss. The VB algorithm is similar to EM: it obtains the posterior distributions q(θ) and q(Y^miss) by iteration.

Summary of the VB algorithm

1. Initialize q(Y^miss) by replacing missing values by gene-wise averages.
2. Estimate q(θ) using Y^obs and the current q(Y^miss).
3. Estimate q(Y^miss) using the current q(θ).
4. Update the hyperparameter α using both the current q(θ) and the current q(Y^miss).
5. Repeat (2)-(4) until convergence.

Then we impute the missing values in the gene expression matrix by

Ŷ^miss = ∫ Y^miss q(Y^miss) dY^miss.
4.3.3 The least square imputation method
Let E′ be the n′ × m matrix containing the rows gTj , j ∈ N c, of E which do not have any missing entries, and let l′ be the number of significant singular values of E′. Let X ⊂ R^m be the invariant subspace of the symmetric matrix (E′)^T E′ corresponding to the eigenvalues σ_1(E′)^2, ..., σ_{l′}(E′)^2, and let x_1, ..., x_{l′} be the corresponding orthonormal eigenvectors of (E′)^T E′. Then x_1, ..., x_{l′} is a basis of X.
Let M ⊂ [m] be a subset of cardinality m − m′. Consider the projection π_M : R^m → R^{m′} obtained by deleting the coordinates i ∈ M of a vector x = (x_1, ..., x_m)^T ∈ R^m. Then π_M(X) is spanned by π_M(x_1), ..., π_M(x_{l′}).

Fix j ∈ N and let M ⊂ [m] be the set of experiments (columns) where the gene gTj has missing entries. Let y ∈ π_M(X) be the least squares approximation to π_M(g_j). Then any g_j ∈ π_M^{-1}(y) is a completion of g_j. If π_M|X is 1-1 then g_j is unique; otherwise one can choose the g_j ∈ π_M^{-1}(y) of least norm. Note that to find y ∈ π_M(X) one needs to solve a least squares problem for the subspace π_M(X); in principle, for each j ∈ N one solves a different least squares problem.
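The completion of one gene can be sketched as follows (Python/NumPy; dimensions are hypothetical). The subspace X is taken from the right singular vectors of the complete rows E′, and the missing coordinates are filled from the least squares fit of the observed ones.

```python
import numpy as np

rng = np.random.default_rng(5)
m, l_prime = 8, 2

# Complete rows E' with an l'-dimensional row space.
E_prime = rng.standard_normal((40, l_prime)) @ rng.standard_normal((l_prime, m))

# x_1, ..., x_l': orthonormal eigenvectors of (E')^T E' spanning X.
basis = np.linalg.svd(E_prime, full_matrices=False)[2][:l_prime].T   # m x l'

g = basis @ rng.standard_normal(l_prime)   # an incomplete gene lying in X ...
M = np.array([1, 4])                       # ... with entries missing here
keep = np.setdiff1d(np.arange(m), M)

# Least squares fit of pi_M(g) within the projected subspace pi_M(X), then lift.
coeff = np.linalg.lstsq(basis[keep], g[keep], rcond=None)[0]
completed = basis @ coeff

print(np.allclose(completed[M], g[M]))     # the missing entries are recovered
```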
The crucial assumption of this method is

l = l′,   (4.1)

that is, the completed matrix E and its submatrix E′ have the same number of significant singular values. This follows from the observation that the completion of each row g_j, j ∈ N, lies in the subspace X. (Note that the inequalities (4.5) imply that (4.1) can be a very restrictive assumption.)

The significant singular values of E′ and of the reconstructed E are joint functions of all the rows (genes). By reconstructing the missing data in each gene gTj , j ∈ N, separately, we ignore any correlation between gTj and the other genes gTq , q ∈ N; consequently, this will have an impact on the singular values of E.
4.3.4 Iterative method using SVD
In the recent papers [35] and [9], the following iterative method using SVD to
impute missing values in a gene expression matrix is suggested. First, replace the
missing values with 0 or with values computed from another method. Call the
estimated matrix Ep, where p = 0. Find the lp significant singular values of Ep,
and let Ep,lp be the filtered part of Ep (4.3). Replace the missing values in E by the
corresponding values in Ep,lp to obtain the matrix Ep+1. Continue this process until
Ep converges to a fixed matrix (within a given precision). This algorithm takes into
account implicitly the influence of the estimation of one entry on the other ones.
But it is not clear if the algorithm converges, nor what are the features of any fixed
point(s) of this algorithm.
4.3.5 Local least squares imputation (LLSimpute)
Kim et al. [27] introduced an algorithm based on a least squares method. The method considers the k genes which are most similar to the gene with missing data. In the following, we review the essence of their method. We are concerned with two matrices A and B and a vector w. The matrix A is constructed by eliminating the entries of the k nearest-neighbor genes at the q missing locations of the given gene; the matrix B collects the eliminated entries; and the row vector w consists of the known values of the original gene. The recovery of the missing values can then be formulated as the least squares problem

min_x ‖A^T x − w‖_2.

Denote by u = (α_1, α_2, . . . , α_q)^T the vector of q missing values. Then u can be estimated as

u = B^T x = B^T (A^T)† w,

where (A^T)† is the pseudoinverse of A^T.
Now we use an example, borrowed from [27], to explain the model. Suppose the target gene g has two missing values, in the first and tenth positions out of ten experiments, and let gTs1 , . . . , gTsk be the k most similar genes. The target gene and the k similar genes can be arranged as

g^T      = (α_1, w_1, w_2, · · · , w_8, α_2),
g_{s1}^T = (B_{1,1}, A_{1,1}, A_{1,2}, · · · , A_{1,8}, B_{1,2}),
  ...
g_{sk}^T = (B_{k,1}, A_{k,1}, A_{k,2}, · · · , A_{k,8}, B_{k,2}).

The vector w of known values can be written as

w ≈ x_1 a_1 + x_2 a_2 + · · · + x_k a_k,

where a_i^T denotes the ith row of A. Then the missing values are estimated by

α_1 = B_{1,1} x_1 + B_{2,1} x_2 + · · · + B_{k,1} x_k,
α_2 = B_{1,2} x_1 + B_{2,2} x_2 + · · · + B_{k,2} x_k,

where α_1 and α_2 are the two missing values in the original gene.
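The recovery formula u = B^T (A^T)† w can be sketched as follows (Python/NumPy; the neighbor genes and missing positions are hypothetical, and the target gene is placed exactly in the neighbors' span so that the recovery can be verified):

```python
import numpy as np

rng = np.random.default_rng(7)
k, m = 5, 10                          # k similar genes, m experiments

G = rng.standard_normal((k, m))       # the k nearest-neighbor genes (rows)
target = rng.standard_normal(k) @ G   # target gene: exact combination of neighbors

miss = np.array([0, 9])               # q = 2 missing positions in the target gene
obs = np.setdiff1d(np.arange(m), miss)

A = G[:, obs]                         # neighbors at the known positions
B = G[:, miss]                        # neighbors at the missing positions
w = target[obs]                       # known values of the target gene

u = B.T @ np.linalg.pinv(A.T) @ w     # u = B^T (A^T)^dagger w

print(np.allclose(u, target[miss]))   # exact for this consistent example
```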
4.4 Motivation for FRAA
We suggest a new method in which the estimation of missing entries is done simul-
taneously, i.e., the estimation of one missing entry influences the estimation of the
other missing entries. If the gene expression matrix E has missing data, we want to complete its entries to obtain a matrix Ē such that the rank of Ē is equal to (or does not exceed) d, where d is taken to be the number of significant singular values of E. The completion of the entries of E to a matrix with a prescribed rank is a variation of the problem of communality (see [16, p. 637]). We give an optimization algorithm for finding Ē using the techniques for inverse eigenvalue problems discussed in [14].
It is likely that our algorithm can be used to estimate missing entries in data sets
other than gene expression data. Such a data set should be represented by an n×m
matrix whose rank is smaller than min(m,n).
4.4.1 Additional matrix theory background for FRAA
To compute the decomposition (1.21), it is enough to know vq and σquq. If σq
repeats k > 1 times in the sequence σ1 ≥ ... ≥ σr > 0, then the choice of the
corresponding k eigenvectors vj is not unique: any choice of the orthonormal basis
in the eigenspace of ETE corresponding to the eigenvalue σ2q is a legitimate choice.
In what follows we will use yet another equivalent definition of the singular
values of E. Let Rn×m denote the space of all real n × m matrices and let Sm(R)
denote the space of all real m × m symmetric matrices. For A ∈ Sm(R), we let
λ1(A) = λ1 ≥ ... ≥ λm(A) = λm, Azq = λqzq, zTq zt = δqt, q, t = 1, ...,m,
(4.2)
be the eigenvalues and corresponding eigenvectors of A, where the eigenvalues are
counted with their multiplicities, and the eigenvectors form an orthonormal basis in
Rm.
Consider the following (n + m) × (n + m) real symmetric matrix:

E^s := [ 0    E
         E^T  0 ].   (4.3)

It is known [22, §7.3.7] that

σ_q(E) := σ_q = λ_q(E^s) = −λ_{n+m+1−q}(E^s)  for q = 1, ..., m,   (4.4)
λ_q(E^s) = 0  for q = m + 1, ..., n.
The Cauchy interlacing property for E^s implies the following proposition [22, §7.3.9]. Let [n] := {1, 2, . . . , n}, and let N ⊂ [n], M ⊂ [m] denote sets of cardinalities n − n′ and m − m′ ≥ 0 respectively.

Proposition 4.1 Let E ∈ R^{n×m} and denote by E′ ∈ R^{n′×m′} the matrix obtained from E by deleting all rows i ∈ N and all columns j ∈ M. Then

σ_q(E) ≥ σ_q(E′)  for q = 1, ..., m,   (4.5)
σ_q(E′) ≥ σ_{q+n−n′+m−m′}(E)  for q = 1, ..., m′ + n′ − n.
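Proposition 4.1 can be checked numerically (Python/NumPy; the sizes and the deleted rows and columns are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n, m = 12, 6
E = rng.standard_normal((n, m))

# Delete rows N = {3, 7} and column M = {2}: n' = 10, m' = 5.
E_sub = np.delete(np.delete(E, [3, 7], axis=0), [2], axis=1)

s = np.linalg.svd(E, compute_uv=False)          # sigma_q(E)
s_sub = np.linalg.svd(E_sub, compute_uv=False)  # sigma_q(E')

# First inequality: sigma_q(E) >= sigma_q(E') for all defined q.
assert np.all(s[:len(s_sub)] + 1e-12 >= s_sub)

# Second inequality: sigma_q(E') >= sigma_{q+(n-n')+(m-m')}(E), a shift of 3 here.
shift = (12 - 10) + (6 - 5)
for q in range(len(s) - shift):
    assert s_sub[q] + 1e-12 >= s[q + shift]

print("interlacing inequalities hold")
```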
The significance of this proposition is explained in §4 and §5.
4.4.2 The Optimization Problem
We now show that the estimation problem discussed in the previous section can be cast as the following optimization problem:
Problem 4.2 Let S be a given subset of [n] × [m]. (S is the set of uncorrupted entries of the gene expression matrix E given by (4.1).) Let e(S) := {e_{ji} : (j, i) ∈ S} be a given set of real numbers. (e(S) is the set of uncorrupted (known) values of the entries of E.) Let M(e(S)) ⊂ R^{n×m} be the affine subset of all matrices A = (a_{ji}) ∈ R^{n×m} such that a_{ji} = e_{ji} for all (j, i) ∈ S. (M(e(S)) is the set of all possible choices for the completed E.) Let ℓ be a positive integer not exceeding m. Find E ∈ M(e(S)) with minimal σ_ℓ.
Let E = (g_{ji}) denote the gene expression matrix with missing values. We choose the S in Problem 4.2 to be the set of coordinates (j, i) for which the entry g_{ji} is not missing. Recall that N ⊂ [n] denotes the set of rows of E such that each row j ∈ N contains at least one missing entry; the cardinality of N is n − n′. Thus the set S contains all the elements (j, 1), ..., (j, m) for each j ∈ N c. The complement of S is the set of coordinates S^c = {(j, i) | g_{ji} is missing} ⊂ N × [m]. Let o denote the total number of missing entries in E. Then o ≥ n − n′.
Let E ′ be the matrix as in §4.3.3 with l′ significant singular values. Note that
(4.5) yields σq(E) ≥ σq(E′) for q = 1, ...,m. Thus if we want to complete E such
that the resulting matrix still has exactly l′ significant singular values, we should
consider Problem 4.2 with ` = l′ + 1.
A more general possibility is to assume that the number of significant singular
values of a possible estimation of E is l = l′ + k where k is a small integer, e.g.
k = 1 or 2. That is, the group of genes gTj for j ∈ N contributes to l′ + 1, ..., l′ + k
significant eigengenes of E. Then one considers Problem 4.2 with ` = l′ + k + 1.
We now consider a modification of Problem 4.2 which has a nice numerical
algorithm.
Problem 4.3 Let S ⊂ [n]× [m] and denote by e(S) a given set of real numbers
eji for (j, i) ∈ S . Let M(e(S)) ⊂ Rn×m be the affine subset of all matrices A =
(a_{ji}) ∈ R^{n×m} such that a_{ji} = e_{ji} for all (j, i) ∈ S. Let ℓ be a positive integer not exceeding m. Find E ∈ M(e(S)) such that ∑_{q=ℓ}^m σ_q^2 is minimal.

Clearly, we can find E ∈ M(e(S)) with a “small” σ_ℓ^2(E) if and only if we can find E ∈ M(e(S)) with a “small” ∑_{q=ℓ}^m σ_q^2(E).
4.5 Fixed Rank Approximation Algorithm
4.5.1 Description of FRAA
We now describe one of the standard algorithms to solve Problem 4.3. Mathemati-
cally it is stated as follows:
Algorithm 4.4 Fixed Rank Approximation Algorithm (FRAA)
Let E_p ∈ M(e(S)) be the pth approximation to a solution of Problem 4.3. Let A_p := E_p^T E_p and find an orthonormal set of eigenvectors v_{p,1}, ..., v_{p,m} for A_p as in (4.2). Then E_{p+1} is a solution to the following minimization of a convex nonnegative quadratic function:

min_{E ∈ M(e(S))} ∑_{q=ℓ}^m (E v_{p,q})^T (E v_{p,q}).   (4.1)
The flow chart of this algorithm can be given as:
Fixed Rank Approximation Algorithm (FRAA)
Input: integers m,n, L, iter, the locations of non-missing entries S , initial approx-
imation E0 of n × m matrix E.
Output: an approximation Eiter of E.
for p = 0 to iter − 1
- Compute Ap := ETp Ep and find an orthonormal set of eigenvectors for Ap,
vp,1, ...,vp,m.
- Ep+1 is a solution to the minimum problem (4.1) with ` = L.
4.5.2 Explanation and justification of FRAA
We now explain the algorithm and show that in each step we decrease the value of the function we minimize:

∑_{q=ℓ}^m σ_q^2(E_p) ≥ ∑_{q=ℓ}^m σ_q^2(E_{p+1}).   (4.2)
For any integer k ∈ [m], let Ω_k denote the set of all k-tuples of orthonormal vectors y_1, ..., y_k in R^m. Let A be an m × m real symmetric matrix and assume (4.2). Then the minimal principle (the Ky Fan characterization for −A) is:

∑_{q=ℓ}^m λ_q(A) = ∑_{q=ℓ}^m z_q^T A z_q = min_{(y_ℓ,...,y_m) ∈ Ω_{m−ℓ+1}} ∑_{q=ℓ}^m y_q^T A y_q.   (4.3)

See for example [14].
Let E = E_p + X ∈ M(e(S)). Then X = (x_{ji})_{j,i=1}^{n,m}, where x_{ji} = 0 if (j, i) ∈ S and x_{ji} is a free variable if (j, i) ∉ S.

Let x = (x_{j_1 i_1}, x_{j_2 i_2}, . . . , x_{j_o i_o})^T denote the o × 1 vector whose entries are indexed by S^c, the coordinates of the missing values in E. Then there exists a unique o × o real symmetric nonnegative definite matrix B_p which satisfies the equality

x^T B_p x = ∑_{q=ℓ}^m v_{p,q}^T X^T X v_{p,q}.   (4.4)
Let F(j, i) be the n × m matrix with 1 in the (j, i) entry and 0 elsewhere. Then the (s, t) entry of B_p is given by

b_p(s, t) = (1/2) ∑_{q=ℓ}^m v_{p,q}^T (F(j_s, i_s)^T F(j_t, i_t) + F(j_t, i_t)^T F(j_s, i_s)) v_{p,q},  s, t = 1, . . . , o.   (4.5)
Let N ⊂ [n]. Let S(j) denote the set of coordinates in row j with known values in E, so that S(j)^c denotes the set of coordinates of the missing values in row j:

S^c = ∪_{j∈N} S(j)^c,   S(j)^c = {(j, i(j, 1)), ..., (j, i(j, o_j))},   (4.6)
m ≥ i(j, o_j) > ... > i(j, 1) ≥ 1  for j ∈ N,
o := ∑_{j∈N} o_j.   (4.7)

Note that the set O_j described just after (4.13) is given by O_j := {i(j, 1), ..., i(j, o_j)}.
Theorem 4.5 The o × o symmetric nonnegative definite matrix B_p given by (4.4) decomposes into a direct sum of #N = n − n′ symmetric nonnegative definite matrices indexed by the set N:

B_p = ⊕_{j∈N} B_{p,j},  where B_{p,j} = (b_{p,j}(q, r))_{q,r=1}^{o_j} is o_j × o_j for j ∈ N,   (4.8)

and

x^T B_p x = ∑_{j∈N} x_j^T B_{p,j} x_j.   (4.9)

More precisely, let v_{p,k} = (v_{p,k,1}, ..., v_{p,k,m})^T, k = 1, ..., m, be given as in Algorithm 4.4. Then

b_{p,j}(q, r) = ∑_{k=ℓ}^m v_{p,k,i(j,q)} v_{p,k,i(j,r)},  q, r = 1, ..., o_j.   (4.10)

Equivalently, let W_p be the following m × m idempotent symmetric matrix (W_p^2 = W_p) of rank m − ℓ + 1:

W_p = ∑_{k=ℓ}^m v_{p,k} v_{p,k}^T = T_p T_p^T,   T_p = [v_{p,ℓ}, ..., v_{p,m}] ∈ R^{m×(m−ℓ+1)}.   (4.11)

Then B_{p,j} is the submatrix of W_p of order o_j with respect to the rows and columns in the set O_j, for j ∈ N. In particular, if each row of E has at most one missing entry, then B_p is a diagonal matrix.
Proof. View the rows and the columns of B_p as indexed by (s, i(s, q)) and (t, i(t, r)) respectively, where s, t ∈ N and q = 1, ..., o_s, r = 1, ..., o_t. (For the purposes of this proof, the notation here is slightly different from that in the body of the paper.) So B_p = (b_p((s, i(s, q)), (t, i(t, r)))). Let F(j, i) be the n × m matrix which has 1 in the (j, i) place and all other entries equal to zero. Then

b_p((s, i(s, q)), (t, i(t, r))) = (1/2) ∑_{k=ℓ}^m v_{p,k}^T (F(s, i(s, q))^T F(t, i(t, r)) + F(t, i(t, r))^T F(s, i(s, q))) v_{p,k},   (4.12)
s, t ∈ N, q = 1, ..., o_s, r = 1, ..., o_t.

It is straightforward to show that F(s, i(s, q))^T F(t, i(t, r)) = 0 if s ≠ t. Furthermore, for s = t the matrix F(s, i(s, q))^T F(t, i(t, r)) + F(t, i(t, r))^T F(s, i(s, q)) has 1 in the places (i(s, q), i(t, r)) and (i(t, r), i(s, q)) for r ≠ q, has 2 in the place (i(s, q), i(s, q)) if r = q, and is zero in all other positions. Hence b_p((s, i(s, q)), (t, i(t, r))) = 0 unless s = t. If s = t then a straightforward calculation yields (4.10). The other claims of the theorem follow straightforwardly from the equality (4.10).
Observe that B_p decomposes into the direct sum of n − n′ symmetric nonnegative definite matrices indexed by N. Hence the function minimized in (4.1) is
given by

∑_{q=ℓ}^m v_{p,q}^T E^T E v_{p,q} = ∑_{q=ℓ}^m v_{p,q}^T (A_p + E_p^T X + X^T E_p + X^T X) v_{p,q}
  = x^T B_p x + 2 w_p^T x + ∑_{q=ℓ}^m λ_q(A_p)
  = ∑_{j∈N} (x_j^T B_{p,j} x_j + 2 w_{p,j}^T x_j) + ∑_{q=ℓ}^m λ_q(A_p),   (4.13)

where w_p := (w_{p,1}, . . . , w_{p,o})^T and

w_{p,t} = ∑_{q=ℓ}^m v_{p,q}^T E_p^T F(j_t, i_t) v_{p,q},  t = 1, ..., o.
For j ∈ N the vector x_j ∈ R^{o_j} contains all o_j missing entries of E in row j, of the form x_{j i_t}, i_t ∈ O_j, for the corresponding set O_j ⊂ [m] of cardinality o_j. (See Appendix.) Since the expression in (4.1), and hence in (4.13), is always nonnegative, it follows that w_p is in the column space of B_p. Hence the minimum of the function given in (4.13) is achieved at the critical point

B_p x_{p+1} = −w_p,   (4.14)

and this system of equations is always solvable. (If B_p is not invertible, we find the least squares solution.)
We now show (4.2). The vector x_{p+1} contains the entries of the matrix X_{p+1}, and E_{p+1} := E_p + X_{p+1}. From the definition of A_{p+1} := E_{p+1}^T E_{p+1} and the minimality of x_{p+1} we obtain

∑_{q=ℓ}^m σ_q(E_p)^2 = ∑_{q=ℓ}^m v_{p,q}^T (E_p + 0)^T (E_p + 0) v_{p,q}
  ≥ ∑_{q=ℓ}^m v_{p,q}^T (E_p + X_{p+1})^T (E_p + X_{p+1}) v_{p,q} = ∑_{q=ℓ}^m v_{p,q}^T A_{p+1} v_{p,q}
  ≥ ∑_{q=ℓ}^m λ_q(A_{p+1}) = ∑_{q=ℓ}^m σ_q(E_{p+1})^2.
We conclude this section by remarking that to solve Problem 4.2, one may use
the methods of [16].
4.5.3 Algorithm for (4.14)
From Theorem 4.5, the system of equations B_p x = −w_p in o unknowns is equivalent to n − n′ smaller systems

B_{p,j} x_{p+1,j} = −w_{p,j},  j ∈ N.   (4.1)

Thus the big system (4.14) in o unknowns, the coordinates of x_{p+1}, splits into n − n′ independent systems given by (4.1). That is, in the iterative update of the unknown entries of E given by the matrix E_{p+1}, the values in row j ∈ N in the places S(j)^c are determined by the values of the entries of E_p in the places S(j)^c and the eigenvectors v_{p,ℓ}, ..., v_{p,m} of E_p^T E_p.

We now show how to solve the system (4.14) efficiently.

Algorithm 4.6 For j ∈ N let T_{p,j} be the o_j × (m − ℓ + 1) matrix obtained from T_p, given by (4.11), by deleting all rows except the rows i(j, 1), ..., i(j, o_j). Then (4.1) is equivalent to

T_{p,j} T_{p,j}^T x_{p+1,j} = −w_{p,j},  j ∈ N,   (4.2)

which can be solved efficiently by the QR algorithm as follows. Write T_{p,j} as Q_{p,j} R_{p,j} P_{p,j}, where Q_{p,j} is an o_j × d_{p,j} matrix with d_{p,j} orthonormal columns, R_{p,j} is an upper triangular d_{p,j} × (m − ℓ + 1) matrix with d_{p,j} nonzero rows, d_{p,j} = rank T_{p,j}, and P_{p,j} is a permutation matrix. (The columns of Q_{p,j} are obtained from the columns of T_{p,j} using the modified Gram-Schmidt process.) Then

Q_{p,j}^T x_{p+1,j} = −(R_{p,j} R_{p,j}^T)^{−1} Q_{p,j}^T w_{p,j}

and

x_{p+1,j} = −Q_{p,j} (R_{p,j} R_{p,j}^T)^{−1} Q_{p,j}^T w_{p,j},  j ∈ N,   (4.3)

is the least squares solution for x_{p+1,j}.
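Putting the pieces together, one FRAA sweep per iteration can be sketched as follows (Python/NumPy). This is a hedged reading of Algorithm 4.4 combined with the row-wise splitting of Theorem 4.5: B_{p,j} is the submatrix of W_p = T_p T_p^T on the missing columns O_j, w_{p,j} is read off from (4.13), and a generic least squares solve stands in for the QR step of Algorithm 4.6. The test matrix and missing pattern are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m, L = 40, 6, 3                # L plays the role of ell in Problem 4.3
C = rng.standard_normal((n, 2)) @ rng.standard_normal((2, m))   # low-rank "truth"

mask = np.zeros((n, m), dtype=bool)
mask[rng.choice(n, 8, replace=False), rng.integers(0, m, 8)] = True  # missing slots
E = np.where(mask, 0.0, C)        # initial approximation E_0

for _ in range(50):
    lam, V = np.linalg.eigh(E.T @ E)          # eigenvectors of A_p = E_p^T E_p
    T = V[:, ::-1][:, L - 1:]                 # T_p = [v_L, ..., v_m], eq. (4.11)
    for j in np.flatnonzero(mask.any(axis=1)):
        O = np.flatnonzero(mask[j])           # missing columns O_j of row j
        B = T[O] @ T[O].T                     # B_{p,j}: submatrix of W_p
        w = T[O] @ (T.T @ E[j])               # w_{p,j}, read off from (4.13)
        x = np.linalg.lstsq(B, -w, rcond=None)[0]
        E[j, O] += x                          # update the missing entries of row j

err0 = np.linalg.norm(C[mask])                # error of the all-zeros initial guess
print("error on missing entries:", np.linalg.norm((E - C)[mask]), "vs initial", err0)
```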
4.6 Simulation
We implemented the Fixed Rank Approximation Algorithm (FRAA) in Matlab and
tested it on the microarray data Saccharomyces cerevisiae [33] as provided at
http://genome-www.stanford.edu (the elutriation data set). The dimension of the
complete gene expression matrix is 5986×14. We randomly deleted a set of entries
and ran FRAA on this “corrupted” matrix to obtain estimates for the deleted entries.
The FRAA requires four inputs: the matrix E with N rows and M columns with missing entries, an initial guess for the missing entries, a parameter L (the number of significant singular values), and the number of iterations. We set the initial guess to the missing-data matrix with 0's replacing the missing values, set the number of significant singular values to L = 2, and ran the algorithm through 5 iterations. (There was no significant change in the estimates when we replaced L = 2 with L = 3.)
We compared our estimates to estimates obtained by three other methods: re-
placing missing values with 0’s (zeros method), row means (row means method),
or the values obtained by the KNNimpute program [35]. We used a normalized
root mean square as the metric for comparison: if C represents the complete matrix and E_p represents an estimate of the corrupted matrix E, then the root mean square (RMS) of the difference D = C − E_p is ‖D‖_F / √N. We normalized the root mean square by dividing the RMS by the average value of the entries in C.
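The comparison metric can be sketched as follows (Python/NumPy; the matrices are synthetic placeholders, not the elutriation data):

```python
import numpy as np

def normalized_rms(C, E_p):
    """Normalized RMS: (||C - E_p||_F / sqrt(N)) divided by the mean entry of C."""
    N = C.shape[0]
    rms = np.linalg.norm(C - E_p) / np.sqrt(N)
    return rms / np.mean(C)

rng = np.random.default_rng(10)
C = rng.random((100, 14)) + 1.0          # positive entries, as for intensity data
E_p = C + 0.01 * rng.standard_normal(C.shape)
print(normalized_rms(C, E_p))
```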
In simulations where 1% − 20% of the entries were randomly deleted from
the complete matrix C, the FRAA performed slightly better than the row means
method, and significantly better than the zeros method. However, the KNNimpute
algorithm (with parameters k= 15, d= 0) produced the most accurate estimates,
with normalized RMS errors that were smaller than the normalized RMS errors
from the other three methods. Figure 7.1 displays the results of one set of experiments estimating the elutriation matrix when each of 1, 5, 10, 15, 20% of the entries was removed: the normalized RMS errors are plotted against the percent missing. When 25 simulations of deleting and then estimating 5% of the entries were conducted, we found the average normalized RMS to be approximately 0.19 for KNNimpute and 0.24 for FRAA, with standard deviations of approximately 0.02 for both methods. Not surprisingly, normalized RMS errors increase with increasing percentage of missing values.
[Plot: Elutriation data; normalized RMS error vs. percent missing.]

Fig. 7.1 Comparison of normalized RMS against percent missing for three methods: FRAA, KNNimpute, and row means. The normalized RMS for the zeros method is not displayed, but the values are 0.397, 0.870, 1.24, 1.52, 1.76 for 1, 5, 10, 15, 20 percent missing, respectively.
In [35], the authors caution against using KNNimpute for matrices with fewer than 6 columns. We randomly selected four columns from the elutriation data set to form a truncated data set, then randomly deleted 1%-20% of the entries from this newly formed matrix. Figure 7.2 gives a comparison of the normalized RMS errors against percent missing, in one run of the simulation at each percentage. When 25 simulations at 10% missing were run, we found the average normalized RMS to be approximately 0.143 for FRAA and 0.166 for KNNimpute, with standard deviations of approximately 0.001 and 0.003, respectively.
[Plot: Elutriation, 4 columns; normalized RMS error vs. percent missing for FRAA, KNNimpute, and row means.]

Fig. 7.2 Four columns of the full elutriation matrix were randomly selected. Entries were then randomly deleted from this truncated matrix. Plot of normalized RMS against percent missing.
For one simulation in which we randomly deleted and then estimated 10% (4200) of the entries from the full elutriation matrix, we compared the raw errors (true value minus estimated value) for each of the 4200 imputed entries obtained using either KNNimpute or FRAA. Figure 7.3 shows a scatter plot of the raw errors from the KNNimpute estimate against the raw errors from the FRAA estimate. This plot suggests that KNNimpute and FRAA are rather consistent in how they estimate the missing values.
74
FRAA: raw errors
KN
Nim
pute
: raw
err
ors
-1 0 1 2 3 4 5
01
23
45
Scatter plot of raw errors 10% missing
Fig. 7.3 Scatter plot of the raw errors (true - estimate) of each of the 4200 imputed
entries in one simulation using KNNimpute and FRAA. The correlation between
the two sets of raw errors is .84.
We ran similar simulations on the Cdc15 data set available on the web (http://genome-www.stanford.edu/SVD/htmls/spie.html) and on subsets of this data set (using 4 columns). We also ran a couple of simulations on one of the data sets included in [28]. The outcomes were similar to those for the elutriation data set, with the FRAA algorithm outperforming KNN on the matrices with a small number of columns.
4.7 Discussion of FRAA
The Fixed Rank Approximation Algorithm uses the singular value decomposition
to obtain estimates of missing values in a gene expression matrix. It uses all the
known information in the matrix to simultaneously estimate all missing entries.
Preliminary tests indicate that, under a normalized root mean square metric, FRAA
is more accurate than replacing missing values with 0’s or with row means. The
KNNimpute algorithm was more accurate when estimating missing entries deleted
from the full elutriation matrix, but FRAA might be a feasible alternative in cases
when the number of columns is small.
FRAA is another option, in addition to KNN, Bayesian estimations or local least
squares imputations, for estimating missing values in gene expression data. FRAA
by itself is a very useful tool for gene data analysis without using clustering meth-
ods. Experimental results on various data sets show that FRAA is robust. FRAA
has been used by several computational biologists, who confirmed the accessibility
of the algorithm.
To improve the results given by FRAA one needs to combine it with an algo-
rithm for gene clustering. A possible implementation is as follows: First, apply
FRAA to the corrupted data set; next, using this estimated data set, subdivide the
genes into clusters of genes with similar traits; now apply FRAA again to the miss-
ing entries of genes in each cluster. We intend to apply these steps in a future paper.
Our final remark is that the biology of the data should guide the researcher in
determining the best method to use for imputing missing values in these data sets.
4.8 IFRAA
4.8.1 Introduction
The aim of this section is to introduce IFRAA, an improved version of FRAA which significantly improves its performance. IFRAA is a successful combination of FRAA and a good clustering algorithm. IFRAA works as follows. First we
use FRAA to find a completion G. Then we use a clustering algorithm (we used
K-means here) to find a reasonable number of clusters of similar genes. For each
cluster of genes we apply FRAA separately to recover the missing entries in this
cluster. It turns out that this modification results in a very efficient algorithm for
reconstructing the missing values of the gene expression matrix. We also note that
FRAA and IFRAA are very effective in reconstructing missing values of n × m
matrices, which are expected to have low effective ranks.
4.8.2 Computational comparisons of
BPCA, FRAA, IFRAA and LLSimpute
For the comparison of different imputation algorithms, five different types of data sets were used, including two microarray gene expression data sets and three randomly generated synthetic data sets [15]. The microarray data sets were obtained from studies for the identification of cell-cycle regulated genes in yeast (Saccharomyces cerevisiae) [33]. The first gene expression data set is a complete matrix of 5986 genes and 14 experiments based on the Elutriation data set in [33]. This data set does not have any missing values, since all the genes that originally had missing values were deleted, in order to assess the performance of the imputation algorithms. The second microarray data set is based on the Cdc15 data set in [33], and contains 5611 genes and 24 experiments. We obtained the complete data set in the same way as for the first data set. The three synthetic data sets were randomly generated matrices of size 2000 × 20 and ranks 2, 4, 8.
To assess the performance of missing value estimation methods, we performed77
simulations where 1%, 5%, 10% and 20% of the entries were randomly deleted from
the complete matrix C. Then we estimated the various completions of the missing
values by BPCA, FRAA, IFRAA and LLSimpute. We used the L2-norm version of
LLSimpute. We set the K parameter (the number of similar genes) for LLSimpute large enough that increasing it further did not improve the accuracy of LLSimpute.
We used a normalized root mean square error (NRMSE) as a metric for compar-
ison. If C denotes the complete matrix and C̃ the completed matrix obtained by estimating the corrupted entries of C, then the root mean square error (RMSE) is ‖D‖_F / √N, where D = C − C̃ and N is the number of entries of C. We normalized the root mean square error by dividing the RMSE by the average value of the entries in C.
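A short Python function (an illustrative helper, not part of our simulation code) makes this metric concrete:

```python
import numpy as np

def nrmse(C, C_est):
    """NRMSE of a completed matrix C_est against the true matrix C.

    RMSE = ||D||_F / sqrt(N) with D = C - C_est and N the number of
    entries of C; the RMSE is then divided by the mean entry of C.
    """
    D = C - C_est
    rmse = np.linalg.norm(D, "fro") / np.sqrt(C.size)
    return rmse / C.mean()
```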
The random matrices of order 2000 × 20 and of rank k = 2, 4, 8 appearing in
Figures 1,2 and 3 were generated as follows. One generates 2k random column
vectors x1, . . . ,xk ∈ R2000,y1, . . . ,yk ∈ R20, where the entries of these vectors
are chosen according to a uniform distribution on the interval [0, 1]. Then C = x_1 y_1^T + · · · + x_k y_k^T.
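This construction is easy to reproduce; a NumPy sketch (the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 2000, 20, 4

# 2k random vectors with entries uniform on [0, 1]:
# the columns of X are x_1, ..., x_k, the columns of Y are y_1, ..., y_k.
X = rng.uniform(0.0, 1.0, size=(n, k))
Y = rng.uniform(0.0, 1.0, size=(m, k))

# C = sum_i x_i y_i^T, an n x m matrix of rank k (with probability 1)
C = X @ Y.T
```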
Figure 1 represents the comparisons of BPCA, FRAA and LLSimpute for a 2000 × 20 random matrix of rank 2. In this case FRAA completely reconstructed C. The performance of IFRAA was identical to that of FRAA, so we did not plot the performance
of IFRAA. BPCA performed excellently for 1% missing data, and its performance somewhat deteriorated as the percentage of missing data increased. The performance of LLSimpute was also excellent for 1% missing data, but it deteriorated significantly as the percentage of missing data increased.
Figure 2 represents the comparisons of BPCA, FRAA, IFRAA and LLSimpute
for a 2000 × 20 random matrix of rank 4. Here BPCA and IFRAA performed extremely well. IFRAA slightly outperformed BPCA, in particular in the case with 20% of missing data. FRAA performed reasonably well, but not as well as IFRAA or BPCA. The behavior of LLSimpute was similar to its performance in Figure 1.
Figure 3 represents the comparisons of BPCA, FRAA, IFRAA and LLSimpute
for a 2000 × 20 random matrix of rank 8. The behavior of BPCA, IFRAA and LLSimpute is similar to their behavior in Figure 2. In this case, however, FRAA performs well when the percentage of missing entries is less than 20%, but it does not perform well when 20% of the entries are missing.
Figures 4 and 5 compare BPCA, IFRAA and LLSimpute on the gene expression data matrices, the Elutriation data set and the Cdc15 data set, respectively. Since some of the methods are local and others global, for a fair comparison we first clustered each data set using the K-means clustering algorithm, and then performed the missing value estimation on the clustered data. In our simulations we randomly deleted 1%, 5%, 10%, 15%, and 20% of the entries of the data matrix. In IFRAA we chose the parameter L, the number of significant singular values plus one, to be 2 for the Elutriation data set and 3 for the Cdc15 data set. The initial guess for the missing entries of each gene was chosen to be the average of its corresponding row. FRAA did not perform as well as the other three methods, so we do not include FRAA graphs in Figures 4 and 5.
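This row-average initialization can be sketched in NumPy (an illustrative helper, not the thesis code; as in the Matlab code of Section 4.10, rows with all values missing are assumed to have been removed):

```python
import numpy as np

def row_average_init(E):
    # Replace each missing (NaN) entry by the mean of the observed
    # entries in its row; every row must have at least one observed value.
    G = E.copy()
    row_means = np.nanmean(G, axis=1)
    r, c = np.where(np.isnan(G))
    G[r, c] = row_means[r]
    return G
```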
Figure 4 depicts the comparison of BPCA, IFRAA and LLSimpute on the Elutriation data set in [33]. The corresponding gene expression data set is a complete matrix of n = 5986 genes and m = 14 experiments. In this case BPCA performed the best,
Figure 1: Comparison of NRMSE against percent of missing entries for three meth-
ods: FRAA, BPCA and LLS. Data set was a 2000 × 20 randomly generated matrix
of rank 2.
IFRAA was slightly inferior to BPCA, and LLSimpute was weaker than IFRAA.
Figure 5 depicts the comparison of BPCA, IFRAA and LLSimpute on the Cdc15 data set in [33], which contains 5611 genes and 24 experiments. In this case BPCA performed the best, and again IFRAA performed slightly better than LLSimpute.
Figure 2: Comparison of NRMSE against percent of missing entries for four methods: FRAA, IFRAA, BPCA and LLS. Data set was a 2000 × 20 randomly generated matrix of rank 4.
Figure 3: Comparison of NRMSE against percent of missing entries for four meth-
ods: FRAA, IFRAA, BPCA and LLS. Data set was a 2000×20 randomly generated
matrix of rank 8.
Figure 4: Comparison of NRMSE against percent of missing entries for three methods: IFRAA, BPCA and LLS. Data set was clustered data from the Elutriation data set in [33] with 14 samples.
Figure 5: Comparison of NRMSE against percent of missing entries for three methods: IFRAA, BPCA and LLS. Data set was clustered data from the Cdc15 data set in [33] with 24 samples.
4.9 Conclusions
We applied the four algorithms to several data matrices, including two microarray data sets. We corrupted, at random, certain percentages of these data sets and let the four algorithms recover them. We found BPCA and IFRAA to be the most reliable
approaches. LLSimpute also performed quite well. FRAA by itself was inferior to the other three methods, unless the gene expression matrix had small rank. We also
applied the four algorithms to synthetic data sets, which were random 2000 × 20
matrices of small ranks k = 2, 4, 8, where we again corrupted at random certain
percentages of these data sets. Not surprisingly, IFRAA and BPCA were able to recover the data quite well. When the percentage of corrupted data was not too large, or the matrix had rank 2, FRAA was able to reconstruct the data almost perfectly. LLSimpute performed well when the data set had 1% of missing entries; its performance deteriorated gradually with increasing percentage of missing entries. In conclusion, BPCA and IFRAA, combined with a good
clustering algorithm such as K-means used here, appear to be reliable methods for
recovering DNA microarray gene expression data, or any other noisy data matrix
which is effectively low-rank.
These results and the results in [27] and [28] show that the performances of BPCA, IFRAA and LLSimpute depend on the structure and also on the size of the data matrices, and that none of the three algorithms outperforms the others in all cases.
Our studies also show that the performance of FRAA, and hence of IFRAA, depends on the distribution of the missing entries. Since the performance of FRAA depends on the effective rank of the submatrix formed by the rows that have no missing entries, it is important to have a sufficient number of rows with this property. We can
argue that the reason BPCA in some cases works slightly better than IFRAA is this dependency of IFRAA on the distribution of the missing entries. However, in a real application, when the distribution of the missing entries is not uniform, contrary to what we had in our simulations, and not many rows have missing entries, we expect IFRAA to perform better than BPCA. For local methods like KNN and LLSimpute, where similar genes should not have missing entries in the same column, a uniform distribution of missing entries has no degrading effect, because it is then unlikely that similar genes have missing entries in the same column.
4.10 Matlab code
function Ep1 = fraa(E,Ep,L,iter)
%Fixed rank algorithm -- estimate missing values
%Usage: fraa(E,Ep,L,iter)
%E: matrix with missing values
%Ep: initial solution
%L: parameter (number of significant singular values + 1)
%iter: number of iterations to perform
%Note: Any rows with all missing values must be removed
%%%%%%%%%% THIS IS THE SET-UP
%Get size of E
[N,M]=size(E);
if (L > M)
error('need L<=#columns of E')
end;
%get index of missing values
missing=find(isnan(E));
%Number of missing values
m=length(missing);
m2=m*m;
%%%%%%%%%%% NOW WE WORK WITH THE ALGORITHM
Xp1=zeros(N,M);
track=iter;
while(iter > 0)
A=Ep’*Ep;
%Find singular value decomposition of A
[U,S,V]=svd(A);
%Squared singular values of Ep
sigma2=diag(S);
singular=sqrt(sigma2);
partial_sig2=sum(sigma2(L:M));
total_sig2=sum(sigma2(1:M));
fprintf('\n iteration %3.0f \n', track-iter+1)
fraction=partial_sig2/total_sig2;
fprintf(' partial sum/total sum of sq. singular values \n %1.8f', fraction)
fprintf(’\n’)
%Construct B=Bp
B=sparse(m,m); %pre-allocate space
[is,js]=ind2sub([N,M],missing(1:m));
for s=1:m
for t=s:m
if (is(s)==is(t))
B(s,t)=sum(U(js(s),L:M)*U(js(t),L:M)’);
B(t,s)=B(s,t); %B is symmetric
end %end if
end %end For t
end %end for s
%%%NOW CONSTRUCT THE VECTOR Wp
W=sparse(m,1); %pre-allocate space
for t=1:m
K=sparse(N,M);
K(missing(t))=1;
W(t)=sum(diag(U(:,L:M)’*Ep’*K*U(:,L:M)));
end %end for
%Solve Bx_(p+1) = -W
xp1=-B\W;
%Create the update matrix X_(p+1)
Xp1(missing)=xp1;
%Update solution
Ep=Ep+Xp1;
%set counter
iter=iter-1;
end %End while
fprintf('\n')
fprintf(' singular values (final iteration):\n')
fprintf('%16.6f',singular)
Ep1=Ep;
For the Matlab m file or a version of this algorithm for R, see
http://people.carleton.edu/~lchihara/LMCProf.html
References
[1] C.C. Aggarwal, C.M. Procopiuc, J.L. Wolf, P.S. Yu and J.S. Park, Fast algorithms for projected clustering, Proc. of ACM SIGMOD Intl. Conf. Management of Data, 1999, 61-72.
[2] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proc. ACM SIGMOD Conf. on Management of Data, 1998, 94-105.
[3] U. Alon et al., Broad patterns of gene expression revealed by clustering analy-
sis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc.
Natl. Acad. Sci. USA 96 (1999), 6745-6750.
[4] O. Alter, P.O. Brown and D. Botstein, Processing and modelling gene expression data using the singular value decomposition, Proceedings SPIE, vol. 4266 (2001), 171-186.
[5] O. Alter, P.O. Brown and D. Botstein, Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms, Proc. Nat. Acad. Sci. USA 100 (2003), 3351-3356.
[6] O. Alter, G.H. Golub, P.O. Brown and D. Botstein, Novel genome-scale cor-
relation between DNA replication and RNA transcription during the cell cycle
in yeast is predicted by data-driven models, 2004 Miami Nature Winter Sym-
posium, Jan. 31 - Feb. 4, 2004.
[7] P. Baldi and G. Wesley Hatfield, DNA Microarrays and Gene Expression,
Cambridge University Press, 2002.
[8] T.H. Bo, B. Dysvik and I. Jonassen, LSimpute, Accurate estimation of missing
values in microarray data with least squares methods, Nucleic Acids Research,
32 (2004), e34.
[9] H. Chipman, T.J. Hastie and R. Tibshirani, Clustering microarray data, in: T. Speed (Ed.), Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall/CRC, 2003, pp. 159-200.
[10] E. Domany, Cluster Analysis of Gene Expression Data, Journal of Statistical
Physics 110 (2003), 1117-1139.
[11] P. Drineas, A. Frieze, R. Kannan, S. Vempala and V. Vinay, Clustering large
graphs via the singular value decomposition, Journal of Machine Learning,
56 (2004), 9-33.
[12] P. Drineas, E. Drinea and P.S. Huggins, An experimental evaluation of a Monte-Carlo algorithm for singular value decomposition, Panhellenic Conference on Informatics, 2001, 279-296.
[13] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. 2nd Intl. Conf. Knowledge Discovery and Data Mining, 1996, 226-231.
[14] S. Friedland, Inverse eigenvalue problems, Linear Algebra Appl., 17 (1977),
15-51.
[15] S. Friedland, M. Kaveh, A. Niknejad and H. Zare, An improved fixed rank approximation algorithm for missing value estimation for DNA microarray data, submitted to the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.
[16] S. Friedland, J. Nocedal and M. Overton, The formulation and analysis of
numerical methods for inverse eigenvalue problems, SIAM J. Numer. Anal.
24 (1987), 634-667.
[17] S. Friedland and A. Niknejad, Fast Monte-Carlo low rank approximations for
matrices, preprint, 9 pp.
[18] S. Friedland, A. Niknejad and L. Chihara, A Simultaneous Reconstruction of
Missing Data in DNA Microarrays, Linear Algebra Appl., to appear, (Institute
for Mathematics and its Applications, Preprint Series, No. 1948).
[19] A. Frieze, R. Kannan and S. Vempala, Fast Monte-Carlo algorithms for finding low rank approximations, Proceedings of the 39th Annual Symposium on Foundations of Computer Science, 1998.
[20] X. Gan, A.W.-C. Liew and H. Yan, Missing microarray data estimation based on projection onto convex sets method, Proc. 17th International Conference on Pattern Recognition, 2004.
[21] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins Univ. Press, 1983.
[22] R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge Univ. Press, 1987.
[23] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov,
H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and
E.S. Lander, Molecular classification of cancer: class discovery and class pre-
diction by gene expression monitoring, Science 286 (1999), 531-537.
[24] D.A. Jackson, Stopping rules in principal component analysis: a comparison
of heuristical and statistical approaches, Ecology 74 (1993), 2204-2214.
[25] D. Jiang, C. Tang and A. Zhang, Cluster analysis for gene expression data: a survey, IEEE Transactions on Knowledge and Data Engineering 16 (2004), 1370-1386.
[26] R.A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis,
Prentice Hall, New Jersey, 4th edition (1998).
[27] H. Kim, G.H. Golub and H. Park, Missing value estimation for DNA microar-
ray gene expression data: local least squares imputation, Bioinformatics 21
(2005), 187-198.
[28] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara and S. Ishii, A Bayesian missing value estimation method for gene expression profile data, Bioinformatics 19 (2003), 2088-2096.
[29] C.C. Paige and M. A. Saunders, Towards a generalized singular value decom-
position, SIAM J. Numer. Anal. 18 (1981), 398-405.
[30] C.M. Procopiuc, P.K. Agarwal, M. Jones and T.M. Murali, A Monte Carlo
algorithm for fast projective clustering, Proc. of ACM SIGMOD Intl. Conf.
Management of Data 2002.
[31] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus et al., Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning, Nat. Med. 8 (2002), 68-74.
[32] A. Schulze and J. Downward, Nature Cell Biol. 3:190 (2001).
[33] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen,
P.O. Brown, D. Botstein and B. Futcher, Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization, Mol. Biol. Cell, 9 (1998), 3273-3297.
[34] G.W. Stewart, A method for computing the generalized singular value decomposition, in: B. Kagstrom and A. Ruhe (Eds.), Matrix Pencils, Lecture Notes in Mathematics 973 (1982), 207-220.
[35] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani,
D. Botstein and R. Altman, Missing value estimation for DNA microarrays,
Bioinformatics 17 (2001), 520-525.
[36] S. Vempala, The Random Projection Method, DIMACS Vol. 65, American
Mathematical Society, 2004.