
Source: mtamura/MakioTamuraMasterProject.pdf

Missing Value Expectation of Matrix Data by Fixed Rank Approximation Algorithm

Master Project Report
University of Illinois at Chicago
Computer Science Department

May 2006

Makio Tamura
Doctor of Engineering in Biotechnology

Tokyo Institute of Technology, Japan 2001

Approved by

Dr. Shmuel Friedland
University of Illinois at Chicago
Mathematics, Statistics, and Computer Science Department

Dr. Robert Sloan
University of Illinois at Chicago
Computer Science Department

Dr. Ugo Buy
University of Illinois at Chicago
Computer Science Department

Page 2: Missing Value Expectation of Matrix Data by Fixed Rank ...mtamura/MakioTamuraMasterProject.pdf · Missing Value Expectation of Matrix Data by Fixed Rank Approximation Algorithm Master

Makio Tamura, Master of Science in Computer Science

University of Illinois at Chicago

Summary

Missing values make statistical and machine learning analysis less reliable and sometimes impossible, and unstable learning and analysis methods are easily influenced by the tentatively estimated values of missing entries. The Fixed Rank Approximation Algorithm (FRAA), proposed by Professor Shmuel Friedland et al., predicts multiple missing values in a matrix simultaneously, and it shows strength comparable to other prediction methods, such as BPCAimpute and LLSimpute, that have been reported in recent years for missing value prediction in DNA/RNA microarray data. A modification of FRAA, the Scanned Fixed Rank Approximation Algorithm (SFRAA), is proposed in this report.

Section 1 introduces the background of the missing value prediction methods that were developed for DNA/RNA microarray data. The microarray experiment itself is a relatively new biological experimental technique, and several missing value prediction methods have been introduced since 2000.

Section 2 describes the theoretical basis of FRAA for predicting missing values in matrix data. A new method developed during this project to utilize FRAA, called the Scanned Fixed Rank Approximation Algorithm, is introduced.

Section 3 discusses the comparison of prediction accuracy of SFRAA and FRAA against BPCAimpute and LLSimpute, which use mathematically distinctive approaches. A systematic experiment demonstrates that SFRAA performs better than the other methods.

Section 4 describes the functions and implementation of the SFRAA application running in the Microsoft Windows environment, called “Seed”. The subsections on requirements and functions can be used as a user manual for Seed. The design of the application is depicted by a model-based object-oriented representation.


Index

1 Introduction
2 Theory
    2.1 High Rank Eigengenes
    2.2 The Number of High Rank Eigengenes and Expression Cluster
    2.3 Missing Value Prediction by Singular Value Decomposition
    2.4 Optimization with Fixed High Rank Eigengenes
    2.5 Fixed Rank Approximation Algorithm - FRAA
    2.6 Scanned Fixed Rank Approximation Algorithm - SFRAA
3 Comparison of Prediction Accuracy
    3.1 Method
    3.2 Result
4 SFRAA Implementation
    4.1 Requirements and Functions
    4.2 Design
    4.3 Source Codes
5 Conclusion

Acronyms

BPCA    Bayesian Principal Component Analysis
FRAA    Fixed Rank Approximation Algorithm
IFRAA   Improved Fixed Rank Approximation Algorithm
KNN     K Nearest Neighbor
LLS     Local Least Squares
NRMSE   Normalized Root Mean Square Error
SFRAA   Scanned Fixed Rank Approximation Algorithm
SVD     Singular Value Decomposition


1. Introduction

The microarray experiment enables us to get an overview of the on-off switching of gene activities over a series of different conditions, such as the time course after a certain drug dosage, consecutive environmental stimulation changes, or different physiological conditions such as normal and cancer cells or different cell development states. A microarray data set is usually a large matrix of gene expression quantities, where each row is a gene and each column is a different condition. Recent technology improvements make it possible to quantify the activity of about 500,000 genes in a single microarray experiment, a number almost equivalent to the number of protein coding genes of human beings.

Since one microarray contains a huge number of spots, there are often missing or unreliable values due to insufficient image resolution, image corruption, or dust or scratches on a plate. Standard statistical microarray analyses such as hierarchical clustering, k-means clustering, support vector machine classification, principal component analysis, or singular value decomposition analysis cannot be applied to a data set with missing values. One solution to deal with the missing values is to repeat the same experiment and replicate the data. This extra-labor strategy has been used by many experimental scientists and wet laboratories so far. If the cost of the experiment is not high, it may be a practical solution, but certain types of experiments, such as patient-specific time course experiments, are very expensive or impossible to reproduce.

A simpler tentative solution with less labor is to fill the missing values with zero, the average of the gene's expression, or the average of the overall expression values. Of course, these values may not be optimal and can be unreliable. Recently, more sophisticated methods have been proposed, of two types. One type uses similarity in the plain expression patterns; its representatives are KNNimpute [O. Troyanskaya et al.], which uses k-nearest neighbor clustering, and LLSimpute [H. Kim et al.], which uses k-nearest neighbor clustering and least squares. The basic strategy of this type is to find, by clustering methods, expression patterns similar to the expression pattern having missing values, and then to predict the missing values from the corresponding values in the same cluster. The other type uses high rank Eigengenes in a hidden concept space to predict the missing values; its representatives are SVDimpute [O. Alter et al.], which uses singular value decomposition, and BPCAimpute [S. Oba et al.], which uses principal component analysis and Bayesian optimization. The basic strategy of this type is to find bases of the expression space, and then to reconstruct the matrix with the dominant bases; during the reconstruction process, the missing values are filled. Each basis is called an Eigengene, and the Eigengenes represent gene expression fluctuations that are orthogonal to each other in the expression pattern space. There is no consensus about which type of algorithm is better; past experiments [S. Oba et al., H. Kim et al.] show that BPCAimpute and LLSimpute generally predict better than the others, and the performance of these two methods is almost comparable, depending on the data set.

The Fixed Rank Approximation Algorithm (FRAA) is a method [S. Friedland et al., A. Nikneijad] that predicts missing entries by using Eigengenes, and in this respect FRAA is similar to the SVDimpute and BPCAimpute type of prediction. The number of high rank Eigengenes should be close to the rank of the perfect matrix, but it is hard to guess the correct
number of high rank Eigengenes from data with missing entries. To find the optimal number, BPCAimpute chooses it using Bayesian statistics, while SVDimpute uses a given fixed number. FRAA also requires a fixed number of major Eigengenes, but the uniqueness of FRAA is its iteration process, which can increase the importance of these high rank Eigengenes in the reconstructed matrix at each step.

However, it is difficult to guess the correct number of Eigengenes or the rank of the perfect matrix, and therefore FRAA by itself, although powerful, is not useful in practical cases. The other drawback of FRAA is that the result heavily depends on the initial tentative values for the missing entries. To deal with these problems, Scanned FRAA (SFRAA) is proposed in this project. SFRAA automatically finds the optimal number of high rank Eigengenes and avoids locally optimal solutions by scanning that number from small to large. SFRAA shows better prediction accuracy than BPCAimpute and LLSimpute on various 100 × 100 synthetic matrices.

To make the SFRAA algorithm available to wet laboratory scientists, I implemented the algorithm as a Windows application called “Seed” in the C# language. Since a spreadsheet application such as Microsoft Excel is well suited to representing the numerical values of matrix data, the application is integrated with Microsoft Excel. The user can take advantage of the various analysis methods provided by Microsoft Excel as well as SFRAA. Other prediction algorithms can be added to “Seed”, since the prediction algorithm code is implemented as an independent module.


2. Theory

A whole data set of gene expression profiles is represented by a numerical m × n matrix G, where m is the number of genes, n is the number of experiments, and m > n.

$G \in \mathbb{R}^{m \times n}$ is called an expression matrix. The (i, j) entry of the matrix, $g_{ij}$, denotes the expression level of the i-th gene in the j-th experiment, which is typically the logarithm of the expression ratio between the control and the objective samples. The i-th row vector of G, $g_i^T$, represents the expression of the i-th gene over the n experiments.

$$
G = \begin{pmatrix}
g_{11} & g_{12} & \cdots & g_{1n} \\
g_{21} & g_{22} & \cdots & g_{2n} \\
\vdots &        &        & \vdots \\
g_{m1} & g_{m2} & \cdots & g_{mn}
\end{pmatrix}
= \begin{pmatrix} g_1^T \\ \vdots \\ g_i^T \\ \vdots \\ g_m^T \end{pmatrix}
\in \mathbb{R}^{m \times n} \tag{2.1}
$$

The expression matrix G is decomposed by Singular Value Decomposition (SVD) into

$$
G = U \Sigma V^T = \sum_{q=1}^{r} \sigma_q u_q v_q^T \tag{2.2}
$$

where r is the rank of G and the singular values satisfy $\sigma_q \ge \sigma_{q+1}$. Each column vector $u_q$ of U is called an Eigenarray, each column vector $v_q$ of V is called an Eigengene, and each singular value $\sigma_q$ of G is called an Eigenexpression.

2.1 High Rank Eigengenes

The Frobenius ($l_2$) norm $\|G\|_F$ is the Euclidean norm of G viewed as a vector with mn coordinates. Each term of equation (2.2) is a rank one matrix with $\|u_q v_q^T\|_F = 1$. Let $G_l$ be the dimensionally reduced expression matrix of G built from the l largest Eigenexpressions and the corresponding Eigengenes and Eigenarrays:

$$
G_l = \sum_{q=1}^{l} \sigma_q u_q v_q^T \tag{2.3}
$$

The Frobenius distance between G and Gl is


$$
\| G - G_l \|_F
= \Big\| \sum_{q=1}^{r} \sigma_q u_q v_q^T - \sum_{q=1}^{l} \sigma_q u_q v_q^T \Big\|_F
= \Big( \sum_{q=l+1}^{r} \sigma_q^2 \Big)^{1/2} \tag{2.4}
$$

We can approximate the original expression matrix G by $G_l$ when the Frobenius distance is small enough. The fractional contribution $p_q$ of each Eigengene $v_q$ to the whole expression matrix can be stated in terms of the corresponding Eigenexpression $\sigma_q$:

$$
p_q = \frac{\sigma_q^2}{\sum_{t=1}^{r} \sigma_t^2}, \quad q = 1, \dots, r, \qquad
\mathbf{p} = (p_1, \dots, p_r)^T \tag{2.5}
$$

Then we choose the l Eigengenes that contribute about 70%-90% of the total expression level. These Eigengenes are called high rank Eigengenes. Another method is to use a scree plot of the $\sigma_q$. In principal component analysis, $p_q$ is proportional to the variance of the corresponding principal component of maximum variability. The most consistent estimation of the number of significant Eigengenes is achieved by the broken-stick model (D. A. Jackson).

The rescaled entropy h(p) of the expression matrix is derived from the vector p:

$$
0 \le h(\mathbf{p}) := -\frac{1}{\log r} \sum_{q=1}^{r} p_q \log p_q \le 1 \tag{2.6}
$$

When all the Eigengenes contribute equally to the expression matrix, the rescaled entropy becomes one. On the other hand, when a single Eigengene contributes, the rescaled entropy becomes zero.
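As a concrete illustration (my sketch, not from the report), the quantities (2.5) and (2.6) can be computed directly from the singular values with NumPy; the matrix here is a random stand-in for an expression matrix, and the 80% cutoff is one choice from the 70%-90% range mentioned above:

```python
import numpy as np

# Random stand-in for an expression matrix (20 genes x 8 experiments).
rng = np.random.default_rng(1)
G = rng.standard_normal((20, 8))

s = np.linalg.svd(G, compute_uv=False)   # Eigenexpressions sigma_q, descending
p = s**2 / np.sum(s**2)                  # fractional contributions p_q, eq. (2.5)

# Smallest l whose Eigengenes contribute about 80% of the total expression.
l = int(np.searchsorted(np.cumsum(p), 0.80)) + 1

# Rescaled entropy of the expression matrix, eq. (2.6).
h = -np.sum(p * np.log(p)) / np.log(len(p))
```

For a random Gaussian matrix the contributions are fairly spread out, so h lands strictly between 0 and 1.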

2.2 The number of High Rank Eigengenes and Expression Cluster

Suppose that sets of gene expression patterns with similar expression can be grouped into clusters $C_q$, and that there are k such clusters in the expression matrix G. Gene expression patterns with no similar characteristics are placed into the cluster $C_{k+1}$, and therefore the cluster sizes sum to the total number of genes, $m = \sum_{q=1}^{k+1} |C_q|$. Now we assume that it is possible to reduce a set of gene expression patterns in the same cluster to a single expression pattern, and that the expression level is zero for the genes in $C_{k+1}$:

$$
g_{ij} = a_{qj} \quad \text{for each } i \in C_q,\; q = 1, \dots, k \text{ and } j = 1, \dots, n
$$
$$
g_{ij} = 0 \quad \text{for each } i \in C_{k+1} \text{ and } j = 1, \dots, n \tag{2.7}
$$

Let $A = (a_{q,j}) \in \mathbb{R}^{k \times n}$ be the corresponding k × n matrix with rows $a_1^T, \dots, a_k^T$. The row $a_q$ appears exactly $|C_q|$ times in the original expression matrix G. Clearly the row space of A is the row space of G, and therefore k ≥ rank(G) = rank(A). The equality is not attained
when the rows of A are linearly dependent, which means that several of $a_1^T, \dots, a_k^T$ are linearly related. In this case, the number of significant singular values of G is less than the number of clusters k.

2.3 Missing Value Prediction by Singular Value Decomposition

Let $L_{Miss}$ denote the set of indices of genes that contain at least one missing entry, and $L_{Comp}$ the set of indices of genes that contain no missing entry in the expression matrix G. We show how to fill the missing values of the gene expressions $g_i^T$, $i \in L_{Miss}$, by Singular Value Decomposition, as given in [O. Alter].

Let G' be the $|L_{Comp}| \times n$ matrix containing the gene expressions $g_i^T$, $i \in L_{Comp}$, let $\mathrm{eigGene'}_1, \dots, \mathrm{eigGene'}_r$ be the Eigengenes of G', and let l' be the number of high rank Eigengenes of G'. For a gene expression $g_{miss}$, $miss \in L_{Miss}$, let S be the set of indices of the known entries of $g_{miss}$ and S' the set of indices of its missing entries, so that $g_{miss}$ contains the known experimental entries $g_{miss}(s)$, $s \in S$, and the missing experimental entries $g_{miss}(s')$, $s' \in S'$. Consider the projection $T_S: \mathbb{R}^n \to \mathbb{R}^S$ that deletes all the coordinates $s' \in S'$ of any vector $x = (x_1, \dots, x_n)^T$. Then $T_S(g_{miss})$ can be represented by a linear combination of $T_S(\mathrm{eigGene'}_1), \dots, T_S(\mathrm{eigGene'}_{l'})$. Let EigGene' be the matrix $[\, T_S(\mathrm{eigGene'}_1), \dots, T_S(\mathrm{eigGene'}_{l'}) \,]$; the coefficients of the linear combination are the least squares solution y:

$$
\min_{y} \; \| \mathrm{EigGene'} \times y - T_S(g_{miss}) \|^2 \tag{2.8}
$$

The least squares solution is $y = (\mathrm{EigGene'})^{\dagger} \times T_S(g_{miss})$, where $(\mathrm{EigGene'})^{\dagger}$ is the Moore-Penrose generalized inverse of EigGene'. Then the missing value at experiment $s' \in S'$ of $g_{miss}$, $g_{miss}(s')$, is estimated by the linear relation

$$
g_{miss}(s') = [\, \mathrm{eigGene'}_1(s'), \dots, \mathrm{eigGene'}_{l'}(s') \,] \times y \tag{2.9}
$$

where $\mathrm{eigGene'}_k(s')$ is the s'-th element of $\mathrm{eigGene'}_k$.
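A minimal NumPy sketch of steps (2.8)-(2.9), under the assumption that the complete rows are gathered into G' and one gene row is filled at a time (the function name and the rank one test data below are mine, not from the report):

```python
import numpy as np

def svd_fill_gene(G_comp, g_miss, known, l):
    """Fill the missing entries of one gene row from the Eigengenes of G'.

    G_comp : rows of G with no missing entries (the matrix G')
    g_miss : one gene row; its unknown entries hold any placeholder value
    known  : boolean vector, True at the known coordinates S
    l      : number of high rank Eigengenes l'
    """
    _, _, Vt = np.linalg.svd(G_comp, full_matrices=False)
    E = Vt[:l].T                              # columns eigGene'_1 .. eigGene'_l
    # Least squares coefficients y of T_S(g_miss) in T_S(eigGene'_q), eq. (2.8).
    y, *_ = np.linalg.lstsq(E[known], g_miss[known], rcond=None)
    filled = g_miss.copy()
    filled[~known] = E[~known] @ y            # eq. (2.9)
    return filled
```

For a rank one example, where every row is a multiple of the same pattern, the fill is exact with l = 1.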

The critical assumption of this method is

$$
l = l' \tag{2.10}
$$

that is, that the reconstructed matrix $G_{recon}$ and its submatrix G' have the same number of high rank Eigengenes, since the completion of $g_{miss}$, $miss \in L_{Miss}$, lies in the subspace spanned by $\mathrm{eigGene'}_1, \dots, \mathrm{eigGene'}_{l'}$. However, this is a very restrictive assumption.

H. Chipman et al. and O. Troyanskaya et al. proposed an iterative method using SVD to predict missing values in a gene expression matrix. First, replace the missing values with zero or with values computed by another method. Let $G_t$ be the t-th iterate of the reconstructed matrix, with initial iterate t = 0. Find the $l_t$ high rank singular values of $G_t$, and let $G_t^{l_t}$ be the matrix reconstructed from the $l_t$ high rank singular values and the corresponding
Eigengenes and Eigenarrays. Replace the missing values in $G_0$ with the values from $G_t^{l_t}$ to obtain $G_{t+1}$. Continue this process until $G_t$ converges to a fixed matrix. This algorithm implicitly takes into account the influence of the estimation of one entry on the other ones. However, it is neither clear whether the algorithm converges, nor what the features of a fixed point of this algorithm are.
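This iterative scheme can be sketched in NumPy as follows (the function name and the convergence test are my choices, not from the cited papers):

```python
import numpy as np

def svd_impute(G_obs, mask, l, iters=500, tol=1e-9):
    """Iterative SVD imputation in the style of Chipman et al. / Troyanskaya et al.

    G_obs : matrix whose missing entries hold any placeholder value
    mask  : boolean matrix, True where the entry is missing
    l     : number of high rank singular values kept at each step
    """
    G = np.where(mask, 0.0, G_obs)            # initial fill: zero
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        G_l = (U[:, :l] * s[:l]) @ Vt[:l]     # rank-l reconstruction G_t^{l_t}
        G_next = np.where(mask, G_l, G)       # overwrite only the missing entries
        if np.linalg.norm(G_next - G) < tol:  # stop when G_t is (nearly) fixed
            return G_next
        G = G_next
    return G
```

On a rank one matrix with a single deleted entry, the iteration converges to the exact completion, while the observed entries are never altered.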

2.4 Optimization with Fixed High Rank Eigengenes

Proposition. Let $E \in \mathbb{R}^{m \times n}$ and let $E' \in \mathbb{R}^{m' \times n'}$ ($m' \le m$, $n' \le n$) be the matrix obtained from E by deleting rows and columns. Then the q-th singular values of E, $\sigma_q(E)$, and of E', $\sigma_q(E')$, satisfy

$$
\sigma_q(E) \ge \sigma_q(E') \quad \text{for } q = 1, \dots, m'
$$
$$
\sigma_q(E') \ge \sigma_{q+(n-n')+(m-m')}(E) \quad \text{for } q = 1, \dots, m'+n'-n \tag{2.11}
$$

If we complete $G_{recon}$ such that the resulting matrix still has exactly l' high rank Eigengenes, its l'+1-th singular value, $\sigma_{l'+1}(G_{recon})$, must be small enough. A more general possibility is to assume that the number of significant singular values of a possible reconstructed matrix $G_{recon}$ may be l'+k, where k is a small integer, e.g. k = 1 or 2. This may happen when the $g_i$, $i \in L_{Miss}$, contribute to the l'+1-th, ..., l'+k-th high rank Eigengenes of the reconstructed matrix $G_{recon}$. Then the l'+k+1-th singular value, $\sigma_{l'+k+1}(G_{recon})$, should be small enough. We can reconstruct $G_{recon}$ with small $\sigma_{l'+k+1}(G_{recon})$ if and only if we can reconstruct $G_{recon}$ with small $\sum_{q=l'+k+1}^{n} \sigma_q^2(G_{recon})$.

2.5 Fixed Rank Approximation Algorithm

Let $G_a$ be the affine set of all matrices $G^a = (g^a_{i,j})$ such that $g^a_{i,j} = g_{i,j}$ for $(i, j) \in H$, the set of indices of all observed entries.

The Fixed Rank Approximation Algorithm (FRAA) is an iterative algorithm. Let $G_p \in G_a$ be the reconstructed matrix after the p-th step; the next iterate, $G_{p+1}$, is the optimal matrix satisfying the following objective function:

$$
\min_{G_{p+1} \in G_a} \; \sum_{q=l'+k+1}^{n} v_{p,q}^T \times G_{p+1}^T \times G_{p+1} \times v_{p,q} \tag{2.12}
$$

where $v_{p,q}$ is the q-th Eigengene of $G_p$.

Let $G_{p+1} = G_p + X$, where $X = (x_{ij})_{i,j=1}^{m,n}$ and $x_{ij} = 0$ if $(i, j) \in H$, the set of indices of all
observed entries in G, and $x_{ij}$ is a free variable if $(i, j) \notin H$. Let $x = (x_{i_1 j_1}, x_{i_2 j_2}, \dots, x_{i_o j_o})^T$ denote the o × 1 vector whose entries are the coordinates of the missing values of G indexed by $H^c$. Then there exists a unique o × o real valued symmetric nonnegative definite matrix $B_p$ which satisfies the equality

$$
x^T B_p x = \sum_{q=l'+k+1}^{n} v_{p,q}^T X^T X \, v_{p,q} \tag{2.13}
$$

Let $F(i, j)$, $(i, j) \in H^c$, be the m × n matrix with a 1 in the (i, j) entry and zeros elsewhere. Then the (s, t) entry of $B_p$, $b_p(s, t)$, is given by

$$
b_p(s, t) = \frac{1}{2} \sum_{q=l'+k+1}^{n} v_{p,q}^T \left( F(i_s, j_s)^T F(i_t, j_t) + F(i_t, j_t)^T F(i_s, j_s) \right) v_{p,q}, \quad s, t = 1, \dots, o \tag{2.14}
$$

The crucial observation is that $B_p$ can be decomposed into the direct sum of symmetric nonnegative definite matrices indexed by $L_{Miss}$, the set of indices of genes that contain at least one missing entry. The objective function (2.12) is given by

$$
\sum_{q=l'+k+1}^{n} v_{p,q}^T G_{p+1}^T G_{p+1} v_{p,q}
= \sum_{q=l'+k+1}^{n} v_{p,q}^T (G_p + X)^T (G_p + X) \, v_{p,q}
$$
$$
= \sum_{q=l'+k+1}^{n} v_{p,q}^T \left( G_p^T G_p + G_p^T X + X^T G_p + X^T X \right) v_{p,q}
= x^T B_p x + 2 w_p^T x + \sum_{q=l'+k+1}^{n} \sigma_q^2(G_p) \tag{2.15}
$$

where $w_p = (w_{p,1}, \dots, w_{p,o})^T$ and

$$
w_{p,t} := \sum_{q=l'+k+1}^{n} v_{p,q}^T G_p^T F(i_t, j_t) \, v_{p,q}, \quad t = 1, \dots, o.
$$

For $i \in L_{Miss}$, the vector $x_i \in \mathbb{R}^{o_i}$ contains all $o_i$ missing entries of G in the row i. Since the expression in (2.12) and (2.15) is always nonnegative, it follows that $w_p$ is in the column space of $B_p$, and therefore the minimum of the function given in (2.15) is achieved at the critical point of $x^T B_p x + 2 w_p^T x$. Setting the derivative to zero indicates that $\sum_{q=l'+k+1}^{n} v_{p,q}^T G_{p+1}^T G_{p+1} v_{p,q}$ has its minimum value under the following condition:

$$
B_p x_{p+1} = -w_p \tag{2.16}
$$

This system of equations is always solvable (if $B_p$ is not invertible, we find the least squares solution by the Moore-Penrose generalized inverse). The vector $x_{p+1}$ contains the
entries of the matrix X. This suggests

$$
\sum_{q=l'+k+1}^{n} \sigma_q^2(G_p)
= \sum_{q=l'+k+1}^{n} v_{p,q}^T G_p^T G_p \, v_{p,q}
\ge \sum_{q=l'+k+1}^{n} v_{p,q}^T G_{p+1}^T G_{p+1} \, v_{p,q} \tag{2.17}
$$

By the Ky-Fan characterization,

$$
\sum_{q=l'+k+1}^{n} v_{p,q}^T G_{p+1}^T G_{p+1} \, v_{p,q}
\ge \sum_{q=l'+k+1}^{n} \lambda_q (G_{p+1}^T G_{p+1})
= \sum_{q=l'+k+1}^{n} \sigma_q^2(G_{p+1}) \tag{2.18}
$$

where $\lambda_q$ denotes the q-th eigenvalue.

Therefore, by (2.17) and (2.18), the iteration step (2.12) of the Fixed Rank Approximation Algorithm does not increase $\sum_{q=l'+k+1}^{n} \sigma_q^2(G_p)$, and so it can update $G_p$ toward the optimization point suggested by the previous section; however, it does not guarantee that $G_p$ reaches that point. At the optimum point, the following value is minimized:

$$
\min_{G_{recon} \in G_a} \; \frac{\sum_{q=l'+k+1}^{n} \sigma_q^2(G_{recon})}{\sum_{q=1}^{n} \sigma_q^2(G_{recon})} \tag{2.19}
$$

However, the iteration step is designed for the objective function (2.12), which is only the numerator of (2.19). Hence, the value of (2.19) should be monitored during the iteration steps.

FRAA Algorithm

Input: L = l'+k+1, G0, iter (iteration count)
Output: reconstructed matrix Giter

Pseudo code:
for p = 0 to iter - 1
    update Gp to solve the objective function (2.12) with l'+k+1 = L
endfor
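One FRAA update can be sketched in NumPy. Since $F(i,j)^T F(i',j') = \delta_{ii'} e_j e_{j'}^T$, the entries of (2.14) and (2.15) reduce to $b_p(s,t) = \delta_{i_s i_t} M[j_s, j_t]$ and $w_{p,t} = (G_p M)[i_t, j_t]$, where $M = \sum_{q \ge l'+k+1} v_q v_q^T$; the function below is my sketch of one iteration under that simplification, not the report's actual implementation:

```python
import numpy as np

def fraa_step(G, mask, L):
    """One FRAA iteration: solve (2.16) and set G_{p+1} = G_p + X.

    G    : current iterate G_p, with tentative values in the missing entries
    mask : boolean matrix, True where the entry is missing (the set H^c)
    L    : l' + k + 1, first index of the trailing Eigengenes (1-based)
    """
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    V_low = Vt[L - 1:].T                 # Eigengenes v_{p,q}, q = L..n
    M = V_low @ V_low.T                  # sum_q v_q v_q^T

    miss = np.argwhere(mask)             # coordinates (i_s, j_s), s = 1..o
    rows, cols = miss[:, 0], miss[:, 1]
    # B_p(s,t) = delta(i_s, i_t) * M[j_s, j_t]      (from (2.14))
    B = (rows[:, None] == rows[None, :]) * M[np.ix_(cols, cols)]
    # w_p(t) = (G_p M)[i_t, j_t]                    (from (2.15))
    w = (G @ M)[rows, cols]
    # B_p x_{p+1} = -w_p, least squares if B_p is singular   (2.16)
    x, *_ = np.linalg.lstsq(B, -w, rcond=None)

    G_next = G.copy()
    G_next[rows, cols] += x              # add the correction X
    return G_next
```

By (2.17)-(2.18), the trailing singular value energy $\sum_{q \ge L} \sigma_q^2(G_p)$ is non-increasing over these iterations.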


2.6 Scanned Fixed Rank Approximation Algorithm - SFRAA

An improved version of FRAA (IFRAA), developed by N. Amir, demonstrates a significant improvement in prediction performance compared to FRAA. The IFRAA algorithm consists of a combination of FRAA and the k-nearest neighbor clustering method. It works as follows:

1. Apply k-nearest neighbor clustering to find clusters of similar expression patterns.
2. For each cluster, which is itself a matrix, apply FRAA to predict the missing entries.

IFRAA is a heuristic combination of FRAA and a clustering technique, and FRAA can indeed be combined with LLSimpute or BPCAimpute as well. The Scanned Fixed Rank Approximation Algorithm introduced in this project may maximize the usability of FRAA without the help of clustering or statistical methods. By its nature, though, SFRAA can also be combined with clustering techniques.

One of the drawbacks of FRAA is that it requires the fixed rank L as input, and L must be close to the number of high rank Eigengenes of the perfect matrix to obtain an accurate matrix. However, this initial input may simply depend on our guess or luck. SFRAA does not require the fixed rank L as input and needs only the iteration count, while IFRAA requires extra parameters for the clustering. The algorithm automatically scans the rank L of the FRAA input from low to high while conducting FRAA at each iteration step, and consequently reconstructs the matrix with an optimal number of high rank Eigengenes.

Another drawback of FRAA is that the initial values have a non-negligible influence on the outcome. Let us discuss this point using an example. Assume that you have the following matrix, where NaN denotes a missing entry. The perfect matrix has 5 and 9 in the missing entries, respectively; it has rank 3, yet its number of high rank Eigengenes is 2.

Example matrix

 1    2    3
 6   NaN   4
 7    8   NaN
12   11   10

You may guess that 2 is the number of high rank Eigengenes, take 0 as the initial value for the missing entries, and apply FRAA to the matrix. The outcome is

 1.0000    2.0000      3.0000
 6.0000    5.5861      4.0000
 7.0000    8.0000   -129.5012
12.0000   11.0000     10.0000


The matrix may be one of the locally optimal solutions, but it is an unacceptable result. This simple example shows the dependency on the initial values of the missing entries. Even though the correct matrix has 2 high rank Eigengenes, if you use 1 instead of 2 as the number of high rank Eigengenes with 0 as the initial value, FRAA reconstructs the matrix as follows:

 1.0000    2.0000    3.0000
 6.0000    5.3221    4.0000
 7.0000    8.0000    6.4441
12.0000   11.0000   10.0000

This is much better than the first result, yet the number of high rank Eigengenes is smaller than the correct one. In fact, if you use this matrix as the initial input values with 2 as the number of high rank Eigengenes, FRAA returns the perfectly correct matrix. This observation gives me a hint that the result of FRAA with a smaller number of high rank Eigengenes may have higher tolerance against falling into a local optimum, may stay around the global optimum, and is rather insensitive to the initial guess for the missing values. On the contrary, the result with the correct number of high rank Eigengenes may fall into a local optimum easily, depending on the initial guess. Based on this hypothesis, SFRAA is designed to scan the number L from a small to a large value, monitoring the gradient of the decrease of the ratio (2.19) from the previous step to decide when to update the value of L.

SFRAA Algorithm

Input: G0 ∈ R^(m×n), iter (iteration count)
Output: reconstructed matrix Giter

Pseudo code:
L = 1;
Lmax = min(n, m);
previous_ratio = 1;
for p = 0 to iter - 1
    Gp+1 = FRAA(L, Gp);
    ratio = calculate (2.19);
    # the gradient of the decrease of the ratio decides whether L is updated
    gradient = (previous_ratio - ratio) / previous_ratio;
    if (gradient is small enough and gradient > 0)
        L = L + 1;
        previous_ratio = 1;
    elseif (gradient < 0)
        Giter ← Gp;
        break;
    else
        previous_ratio = ratio;
    end
    if (L == Lmax)
        Giter ← Gp;
        break;
    end
endfor


3. Comparison of Prediction Accuracy

S. Friedland et al. reported that the prediction accuracy of IFRAA was better than that of FRAA, and that the performance of IFRAA was comparable with LLSimpute and BPCAimpute. For rank 2, 4, and 8 random 2000 × 20 matrices with missing value percentages up to 20%, IFRAA demonstrated better prediction accuracy than plain FRAA and LLSimpute, while BPCAimpute showed the same magnitude of prediction error as IFRAA. However, the prediction accuracies of IFRAA, BPCAimpute, and LLSimpute on a real 5986 × 14 microarray data set were almost the same, while that of FRAA was behind those algorithms.

This may be because the array size (column count: 14) is too small relative to the gene count (row count: 5986) in the expression matrix. The maximum possible rank of 14 may be too small to represent the independent real gene expression variations of the 5986 genes. In this case, algorithms using clustering techniques may fit the data set, since they can take advantage of finding similar patterns in the excess amount of data (5986 rows) relative to the limited number of independent variations (14) in the given matrix. Biological experimental data are often unbalanced in this way, with the row or column count much larger than the other. In this context, LLSimpute or IFRAA, which use clustering, may have some advantage.

However, FRAA and SFRAA can be applied to any kind of matrix data, and their accuracy on the unbalanced microarray data alone does not show their true prediction performance. I am planning to apply these methods to predict stock price changes in stock markets such as the New York Stock Exchange, NASDAQ, or the Tokyo Stock Exchange. We can construct a matrix where each entry is a price change percentage, each row is a stock, and each column is a day. In this case, the number of rows may be almost the same as the number of columns, and the rank of the matrix may reflect the real number of variations in the row and column space vectors. Such data may be more favorable for FRAA and SFRAA than the microarray data.

3.1 Methods

The prediction performance of these methods may differ on matrices of the same row and column size but different rank. A higher rank indicates that the patterns in the row or column vectors consist of many independent factors, while a smaller rank reflects patterns governed by a small number of factors. To evaluate the dependency of the prediction error on the rank of the matrix and the percentage of missing values, a two-dimensional systematic experiment is performed using a series of 100 × 100 synthetic matrices, where the rank is varied from 5 up to 45 and entries are randomly deleted at percentages varying from 2% up to 20%. I use square matrices to eliminate any advantage due to an imbalance between row and column size.
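One way such a test case can be generated is sketched below (this construction is mine; the report does not specify how its synthetic matrices were drawn):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_case(n=100, rank=15, miss_frac=0.10):
    """A random n x n matrix of the given rank, with entries randomly deleted."""
    # The product of two Gaussian factor matrices has the given rank (almost surely).
    G_true = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
    mask = rng.random((n, n)) < miss_frac        # True = deleted entry
    G_obs = np.where(mask, np.nan, G_true)
    return G_true, G_obs, mask
```

Sweeping `rank` over [5, 15, 30, 45] and `miss_frac` over [0.02, 0.05, 0.10, 0.20] reproduces the two-dimensional grid of the experiment described above.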

IFRAA uses a clustering technique, which is a kind of heuristic. LLSimpute and BPCAimpute could also be combined with other techniques to increase their accuracy, so IFRAA may not be suitable for comparing the performance of FRAA against other prediction methods that use mathematically unique and distinctive approaches. SFRAA uses only the FRAA algorithm, without a clustering technique such as in LLSimpute or a probabilistic technique such as in BPCA, and is therefore a good candidate for comparing FRAA with other algorithms.

The prediction error is evaluated by the normalized root mean square error (NRMSE), where x_predict is the predicted value and x_correct is the original value before deletion:

    NRMSE = sqrt( mean[(x_predict - x_correct)^2] / variance[x_correct] )    (4.1)
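The NRMSE of equation (4.1) can be computed directly; this is a small sketch with illustrative variable names.

```python
import numpy as np

def nrmse(x_predict, x_correct):
    """Normalized root mean square error, eq. (4.1):
    sqrt( mean[(x_predict - x_correct)^2] / variance[x_correct] ).
    """
    x_predict = np.asarray(x_predict, dtype=float)
    x_correct = np.asarray(x_correct, dtype=float)
    return np.sqrt(np.mean((x_predict - x_correct) ** 2) / np.var(x_correct))
```

A perfect prediction gives NRMSE 0, while always predicting the mean of the correct values gives NRMSE 1, which is why values well below 1 indicate genuine prediction power.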

3.2 Results

The following figure shows the NRMSE of 100 × 100 matrices reconstructed by SFRAA and LLSimpute as the missing-value percentage is varied over [2%, 5%, 10%, 20%] and the rank of the correct matrix is varied over [5, 15, 30, 45]. The mesh with round markers is the SFRAA prediction and the mesh with triangle markers is the LLSimpute prediction.

In contrast to the prediction error of LLSimpute, which increases with both the percentage of missing values and the rank of the matrix, the prediction error of SFRAA remains very low throughout the experiment. The following figure overlays the BPCAimpute result on the previous figure; the mesh with square markers and broken lines is the BPCAimpute prediction.

The performance of BPCAimpute is very unstable, and its prediction error is generally higher than that of the other methods on our randomly generated synthetic matrices, while the performance of SFRAA is very stable. The matrices reconstructed by SFRAA have lower NRMSE, which indicates that SFRAA has higher prediction power than LLSimpute and BPCAimpute. SFRAA demonstrates that it can predict the missing values even of a matrix whose rank is almost half of its row and column size. From this observation, SFRAA may have much stronger prediction power than LLSimpute and BPCAimpute.


4 SFRAA Implementation

4.1 Requirements and Functions

The name of the application is “Seed”, with the hope that the application may be a seed that grows into a big tree.

The application must be easy for users to use yet provide the maximum usability of the Scanned Fixed Rank Approximation Algorithm (SFRAA) for predicting missing values. In the future, other algorithms such as LLSimpute and BPCA will be implemented in Seed.

The application is implemented in the C# language so that it can run on the Microsoft Windows environment, which is the most popular operating system among scientists. Given the nature of matrices, a spreadsheet is a good format for showing numerical matrix data, and therefore the application uses Microsoft Excel for this purpose.

The application also uses a Graphic Sheet to represent matrix data, where each colored rectangular spot is a data entry and the column and row indices in a Graphic Sheet are the same as those in the Excel sheet. A vivid green spot in a Graphic Sheet shows the entry with the maximum positive value, a vivid red spot shows the entry with the minimum negative value, and a black spot shows an entry with value zero.
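The coloring rule can be expressed as a simple mapping from an entry's value to an RGB triple. This is a sketch of the idea only, not the Seed implementation (which is in C#); the function name and linear fading are assumptions.

```python
def value_to_rgb(x, vmax, vmin):
    """Map a matrix entry x to an (r, g, b) triple in 0..255.

    Positive values fade from black toward vivid green at the maximum vmax,
    negative values fade from black toward vivid red at the minimum vmin,
    and zero maps to black, as in the Graphic Sheet.
    """
    if x >= 0:
        g = int(round(255 * (x / vmax))) if vmax > 0 else 0
        return (0, g, 0)
    r = int(round(255 * (x / vmin))) if vmin < 0 else 0
    return (r, 0, 0)
```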

Start the Program

Click the green-red matrix icon named SeedIcon to start the Seed application.

Open a Microarray Data

By Using MatrixAnalyzer

The data file is a simple tab-delimited text file, with each row representing the dependent variable measurements (e.g., expression levels or ratios) for one set of observations. Each column represents an observation or sample (e.g., a microarray). Missing values should be indicated by blanks, not by "NA" or other non-numerical characters. Every column has an experiment name, and every row has a gene name. The file format is as follows:


Sample File Format

cdc15_10 cdc15_30 cdc15_50 cdc15_70

YAL001C -0.16 0.09 -0.23 0.03

YAL002W -0.58

YAL003W -0.37 -0.22 -0.16 0.04

YAL004W -1.5

YAL005C -0.43 -1.33 -1.53 -1.53

YAL007C 0.14

YAL008W -0.2 0.04 -0.27 -0.4

YAL009W -0.02 0 0.07

YAL010C -0.24 -0.06 -0.11 -0.25

YAL011W -0.29

One can open an expression matrix file via the “Open” pull-down menu in the “File” menu of the MatrixAnalyzer window. A missing entry in the Excel sheet is given the value 999.99.
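A reader for this tab-delimited format, with blank fields standing for missing values, can be sketched as follows. The function is illustrative and is not part of Seed.

```python
import math

def read_expression_file(lines):
    """Parse the tab-delimited expression format described above.

    The first line holds the experiment names; each later line holds a gene
    name followed by its values, with blank fields for missing entries
    (returned here as NaN). Rows shorter than the header are padded.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    experiments = lines[0].split("\t")
    n_cols = len(experiments)
    genes, rows = [], []
    for ln in lines[1:]:
        fields = ln.split("\t")
        genes.append(fields[0])
        vals = [float(f) if f.strip() else math.nan for f in fields[1:]]
        vals += [math.nan] * (n_cols - len(vals))  # pad trailing missing values
        rows.append(vals)
    return experiments, genes, rows
```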

Open File

By Using Microsoft Excel

In this case, you can open a matrix data file in any format that Microsoft Excel can read.

1st Create a new MatrixAnalyzer Graphic Sheet.

You can create a new MatrixAnalyzer Graphic Sheet via the “New” pull-down menu in the “File” menu of the MatrixAnalyzer window.

2nd Open a file by Excel and select the target matrix range

Open the target file with Excel, which starts automatically when you start the Seed application. Then select the entries of the matrix data, excluding the headings (the experiment names and gene names).


3rd Update the selected range into the MatrixAnalyzer Graphic Sheet

One can update the empty MatrixAnalyzer Graphic Sheet via the “Update Graphic Sheet...” pull-down menu in the “Excel” menu of the MatrixAnalyzer window.

A vivid green spot shows the entry with the maximum positive value, a vivid red spot shows the entry with the minimum negative value, and a black spot shows an entry with value zero. A missing entry in the Excel sheet has the value 999.99.


SFRAA to Reconstruct a Matrix

1st Update the Graphic Sheet

The sample data in the Excel sheet. Each row has at least one missing value.

Corresponding Graphics Sheet


2nd Apply the SFRAA

One can predict the missing entries in the data via the “Scanned Fixed Rank Approximation Algorithm” pull-down menu in the “Expectation” menu of the MatrixAnalyzer window.

Then the SFRAA control window appears. Input the iteration number (in this sample, 100) and click the “Run” button to execute the SFRAA algorithm on the data.


You will see the reconstructed matrix in both the Graphic Sheet and the Excel sheet.

Reconstructed data in the Excel Sheet

Reconstructed data in the Graphics Sheet


Coloring missing entries of data in the Excel Sheet

One can color the missing entries in the Excel sheet via the “Coloring Missing Value...” pull-down menu in the “Excel” menu of the MatrixAnalyzer window.


4.2 Design

The application source consists of 5 packages: ExcelDriver, FileIO, GraphicsSheet, Matrix, and Seed.

Package Dependency Diagram. The arrows indicate dependencies.


Class Diagrams

Matrix Package Class Diagram. The Matrix, SingularValueDecomposition, and EigenvalueDecomposition classes are from Mapack for .NET by Lutz Roeder (http://www.aisto.com/roeder/dotnet/).

FileIO Package Class Diagram

FileManager is in the FileIO package; it uses the MatrixWithTitle class in the Matrix package, referenced as Matrix.MatrixWithTitle.


ExcelDriver Package Class Diagram

Microsoft Interop.Excel.dll is used to access the Excel application and its sheets.

GraphicSheet Class Diagram

The ColorMatrix class is the entity of the green-red colored graphical matrix in the MatrixAnalyzer window, and it is the container of the MatrixWithTitle class.


Seed Package Class Diagram

This package is the kernel of the application. The MainInterface class is the main user interface window and lives for the lifetime of the application, while the Control class controls the information flow for each function.


Collaboration Diagrams

SFRAA to Reconstruct a Matrix


4.3 Source Code

MATLAB source code of SFRAA

function Ep1 = sfraa(E, iter)
%% Scanned Fixed Rank Approximation Algorithm
% Usage: sfraa(E, iter)
%   E    = matrix with missing values, where missing entries are NaN
%   iter = number of iterations to perform
%
% April 2006, Makio Tamura
% Based on the FRAA version of 8 September 2005 by Laura Chihara

%%%%%%%%%% SET-UP
[N, M] = size(E);
maxL = min(N, M);   % final rank
minL = 2;           % starting rank
L = minL;

% Indices of the missing values (vectorized)
missing = find(isnan(E));
m = length(missing);    % number of missing values

% Initialize every missing entry with the mean of the known entries
tmpCount = 0;
tmpTotal = 0;
for i = 1:N
    for j = 1:M
        if ~isnan(E(i, j))
            tmpTotal = tmpTotal + E(i, j);
            tmpCount = tmpCount + 1;
        end
    end
end
tmpV = tmpTotal / tmpCount;
Ep = E;
Ep(missing) = tmpV;

%%%%%%%%%% THE ALGORITHM
Xp1 = zeros(N, M);
track = iter;
previousFraction = 2;

while iter > 0
    A = Ep' * Ep;
    [U, S, V] = svd(A);     % singular value decomposition of A
    clear A;

    sigma2 = diag(S);       % squared singular values of Ep; length M
    singular = sqrt(sigma2);

    partial_sig2 = sum(sigma2(L:M));
    total_sig2 = sum(sigma2(1:M));
    fraction = partial_sig2 / total_sig2;
    reduction = (previousFraction - fraction) / previousFraction;
    fprintf('\n iteration %3.0f \n', track - iter + 1);
    fprintf(' fixed rank %3.0f \n', L);
    fprintf(' partial sum/total sum of sq. singular values: %1.8f\n', fraction);
    fprintf(' reduction ratio: %1.8f\n', reduction);

    % If the reduction gradient is small enough, the rank L is increased by 1
    if 0 <= reduction && reduction < 0.0004
        if L ~= maxL
            L = L + 1;
            previousFraction = 1;
        else
            break;
        end
    elseif reduction < 0
        fprintf('\nReached the optimum\n');
        break;
    else
        previousFraction = fraction;
    end

    % Construct the matrix B = B_p
    B = zeros(m, m);                           % pre-allocate space
    [is, js] = ind2sub([N, M], missing(1:m));  % is = i0, js = j0 in the paper
    for s = 1:m
        for t = s:m
            if is(s) == is(t)                  % same original row index
                B(s, t) = sum(U(js(s), L:M) * U(js(t), L:M)');  % U = V in [U S V] = svd(A)
                B(t, s) = B(s, t);             % B is symmetric
            end
        end
    end

    % Construct the vector W_p
    W = sparse(m, 1);       % pre-allocate space
    for t = 1:m
        K = sparse(N, M);
        K(missing(t)) = 1;
        W(t) = sum(diag(U(:, L:M)' * Ep' * K * U(:, L:M)));
        clear K;
    end

    % The Moore-Penrose generalized inverse of B is used
    pack;
    [Ub, Sb, Vb] = svd(B, 0);
    rnk = 0;
    for r = 1:m
        if Sb(r, r) == 0
            break;
        end
        rnk = rnk + 1;
    end

    % xp1 is the least-squares solution via the Moore-Penrose generalized inverse
    Bt = Vb(:, 1:rnk) * inv(Sb(1:rnk, 1:rnk)) * (Ub(:, 1:rnk))';
    xp1 = -Bt * W;
    % xp1 = -B\W;  % not used: precision problems make this fail sometimes

    % Update the solution E_{p+1}
    Xp1(missing) = xp1;
    Ep = Ep + Xp1;
    iter = iter - 1;
end  % while

fprintf('\n singular values (final iteration):\n');
fprintf('%16.6f', singular);

Ep1 = Ep;
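The least-squares step above (xp1 = -Bt*W, built from a truncated SVD of B) has a direct NumPy analogue in numpy.linalg.pinv, which computes the same Moore-Penrose minimum-norm least-squares solution. This sketch illustrates only that step, with a toy rank-deficient B and W rather than real SFRAA data.

```python
import numpy as np

# Solve B x = -W in the least-squares sense via the Moore-Penrose
# pseudoinverse, as the MATLAB code does with a truncated SVD of B.
B = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])   # symmetric, rank 2
W = np.array([2.0, 2.0, 2.0])

x = -np.linalg.pinv(B) @ W       # minimum-norm least-squares solution
```

Because B is singular here, a plain solve of B x = -W would fail, which is exactly why the MATLAB code falls back to the generalized inverse instead of the backslash operator.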


5. Conclusion

FRAA is a unique and powerful method for predicting missing values in matrix data; however, it may not be practical for many real cases. One obstacle is that the algorithm requires the rank as an input, which must be close to the rank of the correct matrix to get an acceptable result. Another drawback is that the initial values of the iteration have a non-negligible influence on the overall outcome.

SFRAA is a modification of FRAA that does not require these initial guesses. The missing value prediction experiment on 100 × 100 matrices with varying missing-value percentage and matrix rank demonstrates that the matrices reconstructed by SFRAA are much more reliable than those produced by LLSimpute and BPCAimpute, which were thought to be the most accurate methods so far.

I implemented SFRAA in the C# language as an application, Seed, that runs on the Microsoft Windows environment. One can use the SFRAA method to predict multiple missing values in any matrix data file that can be opened by Microsoft Excel.


References

A. Niknejad, Application of Singular Value Decomposition to DNA Microarray, Ph.D. thesis, University of Illinois at Chicago, 2005.

D. A. Jackson, Stopping rules in principal component analysis: a comparison of heuristical and statistical approaches, Ecology, 1993, 74, 2204.

H. Chipman, T. J. Hastie, and R. Tibshirani, Clustering microarray data, in T. Speed (Ed.), Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall/CRC, 2003, 159.

H. Kim, G. H. Golub, and H. Park, Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics, 2005, 21, 187.

O. Alter, P. O. Brown, and D. Botstein, Processing and modeling gene expression data using singular value decomposition, Proceedings of SPIE, 2001, 4266, 171.

O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics, 2001, 17, 520.

S. Friedland, A. Niknejad, and L. Chihara, A simultaneous reconstruction of missing data in DNA microarrays, to appear in Linear Algebra and Its Applications.

S. Friedland, A. Niknejad, M. Kaveh, and H. Zare, An algorithm for missing value estimation for DNA microarray data, Proc. ICASSP, 2006.

S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii, A Bayesian missing value estimation method for gene expression profile data, Bioinformatics, 2003, 19, 2088.


Acknowledgement

I am happy to express my sincerest thanks to Professor Shmuel Friedland for his advice and support throughout this project. It was very fortunate for me to take his unique classes in the mathematics department at the University of Illinois at Chicago and to carry out this master's project based on the theories I learned in those classes. I believe that this work under his advice is a solid basis for my future career development. Discussions with him were very exciting, and I really enjoyed working on this project with him. I also thank Professor Robert Sloan for his suggestions on managing this project in the computer science department.

I would like to thank Professor Robert Grossman for offering me the opportunity to work at the National Center for Data Mining as a research assistant.

Lastly, I thank my parents for encouraging me to pursue this master's degree. I also want to say many thanks to my friends for sharing pleasant moments and giving me invaluable support.

May 24 2006

Makio Tamura
