Imputing missing values for genetic interaction data



Methods xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Methods

journal homepage: www.elsevier.com/locate/ymeth


http://dx.doi.org/10.1016/j.ymeth.2014.03.032
1046-2023/© 2014 Elsevier Inc. All rights reserved.

⇑ Corresponding author at: School of Mathematical Sciences, Peking University, Beijing 100871, China.

E-mail address: [email protected] (M. Deng).

Please cite this article in press as: Y. Wang et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.03.032

Yishu Wang a, Lin Wang a, Dejie Yang b, Minghua Deng a,c,d,⇑

a Center for Quantitative Biology, Peking University, Beijing 100871, China
b Institute of Computing Technology, Chinese Academy of Science, Beijing 100190, China
c School of Mathematical Sciences, Peking University, Beijing 100871, China
d Center for Statistical Sciences, Peking University, Beijing 100871, China

Article info

Article history: Received 2 October 2013; Accepted 27 March 2014; Available online xxxx

Keywords: Soft-SVD; Imputation; EMAP; Genetic interaction

Abstract

Background: Epistatic Miniarray Profiles (EMAP) enable the study of genetic interactions and are an important method for constructing large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis, since such missing values hinder downstream analysis. While some imputation approaches are available for EMAP data, we adopted an improved SVD modeling procedure to impute the missing values in EMAP data, which resulted in higher accuracy than existing methods.

Results: The improved SVD imputation method applies an effective soft threshold to the SVD approach and proved to be the best model for imputing genetic interaction data when compared with a number of advanced imputation methods. Imputation also improves the clustering results of EMAP datasets. Thus, after applying our imputation method to the EMAP datasets, more meaningful modules, known pathways and protein complexes could be detected.

Conclusion: While missing data unavoidably complicate EMAP data, our results show that the original dataset can be completed by the Soft-SVD approach to accurately recover genetic interactions.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Genetic interactions refer to the phenomenon whereby the phenotype of a double mutant differs from the combined effect expected from the two single mutations [1]. In budding yeast and fission yeast, genetic interactions can be measured using the high-throughput technology known as the Epistatic Miniarray Profile (EMAP) platform [2]. EMAP constructs double deletion strains systematically by crossing query strains with a library of test strains. Afterwards, the colony size of the double mutant strains is measured to obtain the S score, which indicates the genetic interaction, either synthetic sick/lethal or alleviating [3]. In EMAP datasets, each gene in the query (or library) has a genetic interaction spectrum formed by its S scores with the genes in the library (or query). By clustering the S score matrix, researchers can explore biological pathways and reveal protein complexes and, as a result, uncover cellular organization and gene functions.

However, one common characteristic of EMAP experiments is a high proportion of missing values, up to 35%, which can reduce the effectiveness of data analysis techniques such as cluster analysis and even prevent the use of some matrix factorization techniques, such as SVD or PCA. Missing entries can arise because the high-throughput technology fails to measure some genetic interaction strengths. In addition, some measured genetic interactions are subsequently filtered out as unreliable.

The problem of missing values in genetic interaction datasets has been discussed before, but few techniques have been used to impute quantitative epistasis values in EMAP datasets [4]. Some previous papers reported improvements to techniques used on gene expression datasets and subsequently applied them to EMAP data. Four general strategies have been considered for EMAP data: three nearest-neighbor-based methods, namely k-Nearest Neighbors imputation (kNN) [5], Local Least Squares imputation (LLS) [6], and Iterated Local Least Squares imputation (iLLS) [7]; and one global method known as Bayesian Principal Component Analysis imputation (BPCA) [8]. The term "local" here refers to algorithms that impute missing values using local information around the missing value. Previously, in order to improve the accuracy of missing value predictions, these original imputation techniques incorporated the symmetric character of the datasets [4,9]. However, with the recent development of EMAP technology and the need for practical application, most EMAP datasets are asymmetric [10–12]. Therefore, the symmetry property is not appropriate for predicting missing entries in an increasing number of new EMAP experimental datasets. We extended the original SVD method by giving it a soft threshold and changing some optimization functions and constraints [13]. This method, which can be called Soft-SVD, has been used in the "Netflix" competition [13], image recovery [13] and eQTL [14], and it has been demonstrated as the most efficient algorithm in these fields. The Soft-SVD algorithm is not restricted by symmetry. As such, it can be used on a wide range of EMAP datasets. We introduced this methodology to impute missing values in EMAP datasets and, hereinafter, we call it Soft-Impute.

The Soft-Impute methodology applies a soft threshold to the SVD algorithm and uses the nuclear norm, which results in a convex optimization problem. It exploits the correlation structure of a given dataset to impute missing entries by finding a low-rank matrix that is close to the original one on the observed data. This algorithm is suitable for datasets that contain modules of highly correlated entries. In Section 4, we constructed a synthetic dataset with a low-rank matrix to test the Soft-Impute algorithm and compared its imputation accuracy with that of other imputation methods. EMAP datasets store genetic interaction spectra, where genes in the same protein complex or biological pathway tend to have similar genetic interaction spectra. Accordingly, EMAP data matrices contain several highly correlated modules. As a whole, the matrix is low-rank, which satisfies the requirement of the Soft-Impute algorithm.

In this paper, we systematically describe how the Soft-Impute method is applied to EMAP dataset imputation and then apply this method, along with four other imputation methods, to three public EMAP datasets. We then conduct a detailed comparison among these imputation techniques, showing the marked improvement of the Soft-Impute algorithm in the estimation of missing entries. Beyond imputation accuracy, we also evaluate these methods in terms of their ability to detect genetic interaction modules in which genes have similar interaction profiles, are involved in the same physical complex or pathway, and are also enriched in GO terms. After imputing missing entries in the EMAP score matrix, we show in the downstream analysis that hierarchical clustering results are substantially improved and that more significant genetic interaction modules, which are enriched in known discoveries, can be found.

2. Materials

2.1. Epistatic Miniarray Profile (EMAP)

The genetic interaction datasets used here are from EMAP analysis. Three EMAP datasets are used in our analysis: the early secretory pathway (ESP) [15] and chromosome biology (CHR) [16] datasets for budding yeast, and a genome-wide EMAP profile for fission yeast [17]. The first two datasets are symmetric matrices for which the query genes are the same as the library genes. There are 424 genes with about 80,000 pairwise measurements in the ESP dataset, containing about 7.5% missing entries, while there are 743 genes with about 200,000 pairwise measurements in the CHR dataset, containing about 34% missing entries. In contrast, the genome-scale genetic interaction matrix of fission yeast is not symmetric and contains 953 alleles of 876 genes against a mutant library of more than 2000 deletions, resulting in an EMAP profile with 1.6 million genetic interactions and 16% missing entries.


2.2. Synthetic data

We created a low-rank synthetic dataset to test the Soft-Impute algorithm and compare it with other imputation methods. We assumed k modules in the synthetic dataset, such that entries in the same module are highly correlated, whereas entries in different modules are weakly correlated. We constructed a dataset of 250 elements representing query genes in 500 dimensions standing for library genes as follows:

1. We first constructed a vector of 500 elements randomly chosen from the ESP dataset, denoted by $\vec{A}_1 = \{a_1, a_2, \ldots, a_{500}\}$.
   (a) Multiply every element $a_i$ of $\vec{A}_1$ by Gaussian noise randomly drawn from $N(1, |a_i|)$; the result is denoted by $\vec{A}_2$.
   (b) Repeat (a) k times, which results in k vectors $\{\vec{A}_1, \ldots, \vec{A}_k\}$.
2. Generate a matrix with rank k using the above k vectors.
   (a) Generate k random numbers $n_1, n_2, \ldots, n_k$, ranging from 3 to 30, such that their sum is 250.
   (b) Generate a matrix with 250 vectors by repeating vector $\vec{A}_1$ $n_1$ times, vector $\vec{A}_2$ $n_2$ times, ..., vector $\vec{A}_k$ $n_k$ times. Such a matrix is of rank k.
3. Add Gaussian noise drawn from $N(0, 0.5)$ to each entry of the above matrix.
4. We have now constructed a matrix of 250 vectors with 500 dimensions.
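The construction steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `make_synthetic` is hypothetical, the second argument of each normal distribution is read as a variance, every module profile is taken to be a perturbed copy of the base vector, and a random placeholder stands in for the 500 values sampled from the ESP dataset.

```python
import numpy as np

def make_synthetic(base, k, n_rows=250, noise_var=0.5, seed=0):
    """Build a rank-k matrix and its noisy version (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base, dtype=float)
    if not (3 * k <= n_rows <= 30 * k):
        raise ValueError("k group sizes in [3, 30] cannot sum to n_rows")

    # Step 1: k module profiles, each the base vector multiplied element-wise
    # by noise from N(1, |a_i|) (assumption: |a_i| is the variance).
    factors = rng.normal(1.0, np.sqrt(np.abs(base)), size=(k, base.size))
    profiles = base * factors

    # Step 2(a): k group sizes between 3 and 30 that sum to n_rows.
    sizes = np.full(k, 3)
    while sizes.sum() < n_rows:
        j = rng.integers(k)
        if sizes[j] < 30:
            sizes[j] += 1

    # Step 2(b): repeat profile j sizes[j] times, giving a rank-k matrix.
    clean = np.repeat(profiles, sizes, axis=0)

    # Step 3: additive Gaussian noise N(0, noise_var) on every entry.
    noisy = clean + rng.normal(0.0, np.sqrt(noise_var), size=clean.shape)
    return clean, noisy, sizes

# Placeholder base vector; the paper samples these 500 values from ESP.
base = np.random.default_rng(1).normal(size=500)
clean, noisy, sizes = make_synthetic(base, k=10)
```

With group sizes restricted to the range 3 to 30 and a total of 250 rows, k must lie between 9 and 83, which is why the sketch checks the bounds before sampling.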

3. Model and algorithm

3.1. Soft-Impute model

EMAP data can be represented by a matrix $X_{m \times n}$, where $m$ and $n$ represent the numbers of query genes and library genes, respectively. To represent the missing entries in the EMAP data matrix, $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$ denotes the indices of the observed entries. Therefore, $X$ is the original data with observed entries indexed by $\Omega$ and missing values indexed by $\Omega^{\perp}$. To impute this matrix, we aim to find a complete matrix $Z$ that is close to $X$ on the observed entries $\Omega$ and has low rank. Here, the low-rank assumption is based on the observation that the genetic interaction profiles of cofunctional genes are highly correlated [15,16]. Srebro et al. have studied generalization error bounds for learning low-rank matrices [18]. Their work also showed, theoretically, that the true underlying matrix could be recovered with very high accuracy under certain assumptions on the entries of the matrix and on the locations and proportion of unobserved entries [19–21]. Mazumder et al. formulated the above problem as the following optimization problem [13]:

$$\text{minimize} \ \operatorname{rank}(Z) \quad \text{subject to} \ \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2 \le \delta \tag{1}$$

where $\delta \ge 0$ is a regularization parameter that controls the error tolerance.

However, in the above optimization, the rank constraint makes the problem combinatorially hard for general $\Omega$ [22]. One small modification to (1) is [13]:

$$\text{minimize} \ \|Z\|_* \quad \text{subject to} \ \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2 \le \delta \tag{2}$$

where $\|Z\|_*$ is the nuclear norm of $Z$ ($\|Z\|_* = \sum_{i=1}^{r} \sigma_i$, where $\sigma_1, \ldots, \sigma_r$ are the singular values of $Z$ and $r$ is the rank of $Z$). This modification turns the problem into a convex optimization problem [23]. We can reformulate (2) in Lagrange form [13]:

$$\underset{Z}{\text{minimize}} \ \frac{1}{2} \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2 + \lambda \|Z\|_* \tag{3}$$

Here $\lambda \ge 0$ is a regularization parameter controlling the nuclear norm of the estimate $\tilde{Z}_\lambda$ of (3).

Suppose that we only observe a subset of $X$, indexed by $\Omega$, and that the missing entries are indexed by $\Omega^{\perp}$. If we define an orthogonal projection operator $P$, the matrix $X$ can be projected onto the linear space of matrices supported by $\Omega$ [19]:

$$P_\Omega(X)_{(i,j)} = \begin{cases} X_{ij} & \text{if } (i,j) \in \Omega \\ 0 & \text{if } (i,j) \notin \Omega \end{cases} \tag{4}$$

Now the matrix completion problem in Lagrange form (3) can be written compactly as:

$$\underset{Z}{\text{minimize}} \ \frac{1}{2} \|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda \|Z\|_* \tag{5}$$
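The projection operator and the penalized objective can be evaluated directly, as in this small sketch. The names `project` and `objective` are hypothetical helpers, and encoding the index set of observed entries as a boolean mask with NaN marking missing values is an assumption of the illustration.

```python
import numpy as np

def project(X, observed):
    """P_Omega: keep observed entries, set all other entries to zero."""
    return np.where(observed, X, 0.0)

def objective(X, Z, observed, lam):
    """0.5 * ||P_Omega(X) - P_Omega(Z)||_F^2 + lam * ||Z||_* ."""
    resid = project(X, observed) - project(Z, observed)
    # ord="nuc" gives the nuclear norm (sum of singular values).
    return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, ord="nuc")

# Tiny worked example: one missing entry, marked with NaN.
X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
observed = ~np.isnan(X)

value_at_zero = objective(X, np.zeros((2, 2)), observed, lam=1.0)
value_at_eye = objective(X, np.eye(2), observed, lam=1.0)
```

For the zero matrix the objective is 0.5 × (1 + 9 + 16) = 13; for the identity matrix the observed residuals are 0, 3 and 3 and the nuclear norm is 2, giving 9 + 2 = 11.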

3.2. Lemma

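To solve the optimization problem (5), we first present the following lemma (the proof can be found in [13]).

Lemma. If the matrix $W_{m \times n}$ has rank $r$, then the optimization problem

$$\min_{Z} \ \frac{1}{2} \|W - Z\|_F^2 + \lambda \|Z\|_* \tag{6}$$

has the solution $\hat{Z} = S_\lambda(W)$, where

$$S_\lambda(W) = U D_\lambda V^T \quad \text{with} \quad D_\lambda = \operatorname{diag}[(d_1 - \lambda)_+, \ldots, (d_r - \lambda)_+] \tag{7}$$

Here $U D V^T$ is the singular value decomposition (SVD) of $W$, with $D = \operatorname{diag}[d_1, \ldots, d_r]$, and $t_+ = \max(t, 0)$. The notation $S_\lambda(W)$ refers to soft-thresholding [24].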
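The soft-thresholding operator of the lemma can be sketched in a few lines of NumPy. The function name `soft_threshold_svd` is a hypothetical choice for this illustration.

```python
import numpy as np

def soft_threshold_svd(W, lam):
    """S_lambda(W): shrink every singular value of W by lam, floored at 0."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)      # (d_i - lam)_+
    return (U * d_shrunk) @ Vt               # U diag(d_shrunk) V^T

# Singular values 3 and 1 with lam = 2 become 1 and 0.
W = np.diag([3.0, 1.0])
Z_hat = soft_threshold_svd(W, lam=2.0)
```

Singular values below the threshold are set exactly to zero, which is how a larger value of the regularization parameter lowers the rank of the solution.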

3.3. Soft-Impute algorithm

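We now introduce the Soft-Impute algorithm. First, we rewrite (5) as follows:

$$\begin{aligned} & \underset{Z}{\text{minimize}} \ \frac{1}{2} \|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda \|Z\|_* \\ = {} & \underset{Z}{\text{minimize}} \ \frac{1}{2} \|P_\Omega(X) - [Z - P_{\Omega^{\perp}}(Z)]\|_F^2 + \lambda \|Z\|_* \\ = {} & \underset{Z}{\text{minimize}} \ \frac{1}{2} \|[P_\Omega(X) + P_{\Omega^{\perp}}(Z)] - Z\|_F^2 + \lambda \|Z\|_* \end{aligned} \tag{8}$$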

By the lemma in Section 3.2, the optimization problem (8) can be solved by iteratively updating $Z$ using

$$Z \leftarrow S_\lambda(P_\Omega(X) + P_{\Omega^{\perp}}(Z)) \tag{9}$$

with an arbitrary initialization.

We propose a cross-validation-like strategy to tune the parameter $\lambda$. The idea is as follows. $\Omega$ is the index set of observed entries of $X$. First, we randomly introduce 5% more artificial deletions in $\Omega$ by deleting originally observed entries to get a test dataset. Then we solve (5) on a grid of $\lambda$ values on the test dataset. We start from a large $\lambda_{\max}$, which equals the second largest singular value of the matrix $P_\Omega(X)$. We set the maximum rank of $Z$, denoted $\operatorname{rank}_{\max}(Z)$, equal to $\min(m, n)$. If $\operatorname{rank}(Z) < \operatorname{rank}_{\max}(Z)$, we continue to solve (5) and reduce $\lambda$ by a factor $\eta = 0.9$ until $\operatorname{rank}(Z) \ge \operatorname{rank}_{\max}(Z)$. Finally, to select the optimal parameter, we evaluate the prediction error between the actual data and the predicted data on the grid of $\lambda$ on the test dataset. Here the Pearson correlation and the normalized root mean squared error (NRMSE) (10) are used as the evaluation criteria. We choose the parameter $\lambda^*$ that minimizes the prediction error (the highest Pearson correlation or lowest NRMSE).

$$\mathrm{NRMSE} = \sqrt{\frac{\operatorname{mean}[(ij_{\text{answer}} - ij_{\text{guess}})^2]}{\operatorname{variance}[ij_{\text{answer}}]}} \tag{10}$$
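The two evaluation criteria can be computed as in the sketch below. The function names are hypothetical, and the use of the population variance (`ddof=0`) in the denominator is an assumption, since the text does not say which variance estimator was used.

```python
import numpy as np

def nrmse(answer, guess):
    """Normalized root mean squared error between true and imputed values."""
    answer = np.asarray(answer, dtype=float)
    guess = np.asarray(guess, dtype=float)
    # Assumption: population variance (ddof=0) in the denominator.
    return np.sqrt(np.mean((answer - guess) ** 2) / np.var(answer))

def pearson(answer, guess):
    """Pearson correlation between true and imputed values."""
    return np.corrcoef(answer, guess)[0, 1]

answer = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nrmse(answer, answer)                         # exact imputation
mean_fill = nrmse(answer, np.full(4, answer.mean()))    # impute the mean
```

Under this definition, imputing every held-out entry with the mean of the true values gives an NRMSE of 1, so values well below 1 indicate an imputation that does better than mean filling.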


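Now we have the algorithm:

Algorithm: Soft-Impute

1. Initialize $Z^{\text{old}} = 0$.
2. For each $\lambda_i$ in the grid of $\lambda$, starting from $\lambda_{\max}$:
   (a) Repeat:
       i. Compute $Z^{\text{new}} \leftarrow S_{\lambda_i}(P_\Omega(X) + P_{\Omega^{\perp}}(Z^{\text{old}}))$.
       ii. Define the energy function $f_{\lambda_i}(Z) = \frac{1}{2} \|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda_i \|Z\|_*$; if $\frac{|f_{\lambda_i}(Z^{\text{new}}) - f_{\lambda_i}(Z^{\text{old}})|}{f_{\lambda_i}(Z^{\text{old}})} < \varepsilon$, exit.
       iii. $Z^{\text{old}} \leftarrow Z^{\text{new}}$.
   (b) Assign $\hat{Z}_{\lambda_i} \leftarrow Z^{\text{new}}$. Then set $\lambda_{i+1} = \lambda_i \times 0.9$.
   Repeat (a) to (b) until $\operatorname{rank}(Z) \ge \operatorname{rank}_{\max}(Z)$.
3. Output the solutions: $\hat{Z}_{\lambda_1}, \ldots, \hat{Z}_{\lambda_i}, \hat{Z}_{\lambda_{i+1}}, \ldots, \hat{Z}_{\lambda_{\max}}$.
4. Choose the optimal solution: $\hat{Z}_{\lambda_{\text{optimal}}}$.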
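The algorithm above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' Matlab implementation: the function names, the default tolerance `eps`, the iteration cap `max_iter` and the step cap `max_steps` are hypothetical additions, and observed entries are encoded by a boolean mask.

```python
import numpy as np

def soft_impute(X, observed, lam, Z0=None, eps=1e-5, max_iter=500):
    """Inner loop (step 2a) for one fixed lambda; returns the fitted Z."""
    PX = np.where(observed, X, 0.0)                  # P_Omega(X)
    Z_old = np.zeros_like(PX) if Z0 is None else Z0

    def energy(Z):
        resid = np.where(observed, PX - Z, 0.0)
        return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, ord="nuc")

    f_old = energy(Z_old)
    for _ in range(max_iter):                        # max_iter: safety cap
        # P_Omega(X) + P_Omega_perp(Z_old): observed data plus current guess.
        filled = np.where(observed, PX, Z_old)
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        Z_new = (U * np.maximum(d - lam, 0.0)) @ Vt  # soft-thresholded SVD
        f_new = energy(Z_new)
        # Relative energy change, written without a division by f_old.
        done = abs(f_new - f_old) <= eps * f_old
        Z_old, f_old = Z_new, f_new
        if done:
            break
    return Z_old

def soft_impute_path(X, observed, eta=0.9, max_steps=50):
    """Warm-started solutions on a decreasing lambda grid (steps 1 to 3)."""
    PX = np.where(observed, X, 0.0)
    lam = np.linalg.svd(PX, compute_uv=False)[1]     # second largest value
    rank_max = min(X.shape)
    Z = np.zeros_like(PX)
    path = []
    for _ in range(max_steps):                       # max_steps: safety cap
        Z = soft_impute(X, observed, lam, Z0=Z)
        path.append((lam, Z))
        if np.linalg.matrix_rank(Z) >= rank_max:
            break
        lam *= eta
    return path

# Demo on a noise-free rank-2 matrix with about 20% of entries hidden.
rng = np.random.default_rng(0)
M = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(M.shape) < 0.8
X = np.where(observed, M, np.nan)
Z = soft_impute(X, observed, lam=1.0)
rel_err = np.linalg.norm((Z - M)[~observed]) / np.linalg.norm(M[~observed])
```

Step 4 would then pick, from the returned path, the solution with the best Pearson correlation or NRMSE on the artificially deleted entries.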

4. Results and discussion

4.1. Assessing the accuracy of quantitative imputation

We applied Soft-Impute to three genetic interaction datasets and one synthetic dataset. The three genetic interaction datasets investigated here are ESP-EMAP, CHR-EMAP and the S. pombe global genetic interaction map.

To assess the effectiveness of imputation techniques for genetic interaction datasets (EMAP), we artificially introduce additional missing values into an existing incomplete EMAP matrix. We randomly delete originally observed data in the EMAP or synthetic dataset to get a new matrix with artificial missing entries. This process is repeated multiple times, so that we get a series of test matrices for every original EMAP or synthetic data matrix. After applying the different imputation methods to these test matrices, we evaluate the prediction error between the actual data and the predicted data. This yields one imputation accuracy distribution for each EMAP dataset and each imputation method.

For the ESP, CHR and synthetic datasets, we repeat the artificial deletions twenty times, resulting in twenty test matrices for each dataset. The S. pombe set, which contains about 1.6 million pairwise measurements, is too computationally expensive; therefore, we repeat the deletions only 15 times.
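The construction of test matrices can be sketched as follows. The helper name `delete_observed` is hypothetical, missing values are assumed to be stored as NaN, and the deletion rate is interpreted as a fraction of the observed entries, which the text does not state explicitly.

```python
import numpy as np

def delete_observed(X, frac, rng):
    """Hide a random fraction `frac` of the observed (non-NaN) entries."""
    observed_idx = np.flatnonzero(~np.isnan(X))
    n_hide = int(round(frac * observed_idx.size))
    hidden_idx = rng.choice(observed_idx, size=n_hide, replace=False)
    X_test = X.copy()
    X_test.flat[hidden_idx] = np.nan     # artificial missing entries
    return X_test, hidden_idx

# Toy matrix with 10 original missing values out of 100 entries.
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(10, 10)
X[0, :] = np.nan

# Twenty test matrices with 10% of the observed entries deleted.
tests = [delete_observed(X, 0.10, rng) for _ in range(20)]
X_test, hidden_idx = tests[0]
```

The returned indices identify the held-out entries, so the true values `X.flat[hidden_idx]` can later be compared with the imputed ones.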

Fig. 1a and b show the results when 10% artificial missing values are generated 20 times in the synthetic dataset. The Soft-Impute algorithm performs best on all test matrices, whether evaluated by Pearson correlation or by NRMSE. The BPCA algorithm could not produce a stable result. As demonstrated in [8], BPCA does not perform well when genes have dominant local similarity structures. Our test matrices derived from the synthetic dataset have high local correlation within each gene module, and the BPCA algorithm is limited by this characteristic. In order to demonstrate the performance of the Soft-Impute algorithm on datasets with different rates of missing values, artificial missing rates ranging from 1% to 37% are used. Fig. 1c and d give the imputation results for these gradient missing rates in 20 synthetic data test matrices. To determine whether the accuracy differences seen in the synthetic dataset are statistically significant, we performed a t-test for every pair of distributions and derived the statistical significance (P-value) (see Fig. 2a and b). The running times shown in Table 1 are measured when these imputation methods are applied to 20 synthetic data test matrices with 10% artificial deletions and with gradient artificial missing rates.

Fig. 1. Capability of the imputation methods to reproduce the original measurements in the synthetic datasets. (a and b) The imputation accuracy (Pearson correlation and NRMSE) on synthetic datasets with 10% of values deleted. (c and d) The results on the same original synthetic dataset with gradient missing rates. The x axis represents the missing rate. Methods compared: kNN, iLLS, LLS, BPCA and Soft-Impute.

Imputation accuracies in terms of Pearson correlation and NRMSE for the different imputation methods on 20 ESP-EMAP test matrices with 10% artificial deletions are shown in Fig. 3a and b. Within each matrix, the best imputation accuracy is obtained by Soft-Impute, while the BPCA algorithm is not stable enough to achieve the best result on any matrix. The second best performing methods are LLS and iLLS, which are more time-consuming (see Table 2). To demonstrate the imputation results across different missing rates on ESP-EMAP, we also constructed gradient artificial missing rates from 1% to 25% by introducing artificial missing entries in the observed subset (see Fig. 3c and d). For each method, the imputation accuracy decreases as the missing value rate increases; meanwhile, the Soft-Impute method almost always achieves the best performance across the different missing rates. Because of the high rate of missing values in the original CHR-EMAP and S. pombe datasets, we did not evaluate the imputation accuracy with gradient missing rates. For the CHR-EMAP and S. pombe datasets, we introduced 5% artificial missing values. The BPCA method depends heavily on the properties of the dataset being imputed and, therefore, it was difficult to find a stable optimal point. For the S. pombe dataset, BPCA could not converge; hence, no BPCA result is reported.

Running times of the different imputation methods on these three EMAP datasets are presented in Table 2. k Nearest Neighbor imputation is faster than the more advanced imputation methods, but it has the worst imputation accuracy (Fig. 3). The Soft-Impute algorithm takes an intermediate amount of time, but it achieves the best accuracy among the different methods on all datasets. Its ability to impute the EMAP datasets is supported by the results shown in Fig. 3.

We have described the process of constructing EMAP and synthetic test matrices with random artificial missing values. Thus, for every EMAP dataset (ESP, CHR, S. pombe) and the synthetic dataset, there are several distributions of imputation accuracy, one for each imputation method. To determine whether the differences in imputation accuracy between methods on the EMAP datasets are statistically significant, we derived the statistical significance (P-value) of a t-test between the accuracy distribution of Soft-Impute and that of each other method for every EMAP dataset. This result is shown in Fig. 2, in which the X axis represents methods and the Y axis represents the mean imputation accuracy.

For example, in Fig. 2a and b, a set of 20 test matrices with artificial deletions is derived from the original synthetic dataset. After imputing this set with the different imputation methods, two accuracy distributions result for each imputation method: Pearson correlation (Fig. 2a) and NRMSE (Fig. 2b). The means of the accuracy distributions of the different imputation methods are plotted in Fig. 2, where the red bar represents the Soft-Impute method and the blue bars represent the other imputation methods. The P-values of the t-tests are also shown in these figures.

Fig. 2. Comparison of Soft-Impute with other methods on means of accuracy distributions. Each panel compares Soft-Impute with another method (Soft.kNN, Soft.LLS, Soft.iLLS, Soft.BPCA); the P-values are listed as printed under these labels.
(a) Synthetic dataset, mean of Pearson correlation. P-values: <2.2e−16, 5.6e−07, 1.3e−13, 4.4e−4.
(b) Synthetic dataset, normalized root mean squared error (NRMSE). P-values: <2.2e−16, 7.3e−3, 4.5e−9, 0.01.
(c) ESP-EMAP dataset, mean of Pearson correlation. P-values: <2.2e−16, 2.8e−8, <2.2e−16, 9.3e−4.
(d) ESP-EMAP dataset, mean of NRMSE. P-values: <2.2e−16, 2.0e−7, <2.2e−16, 5.8e−4.

Fig. 2 shows the mean imputation accuracy, in terms of Pearson correlation and NRMSE, of the different methods on the three EMAP datasets and the synthetic dataset. The t-tests show that the higher imputation accuracy of the Soft-Impute algorithm compared with the other methods is statistically significant on all three EMAP datasets and on the synthetic dataset.

Algorithms such as kNN, LLS and iLLS focus on the local information around missing values. For example, kNN depends only on the k nearest neighbors of each missing entry. BPCA does not perform well when genes have dominant local similarity structures [8]. In contrast, the Soft-Impute method applies the SVD to control the rank of the whole matrix and chooses one optimal parameter to control that rank. This methodology takes full advantage of the correlation among genetic interaction profiles to predict unknown genetic interactions.

4.2. Agreement with the original clustering results

We were also motivated to impute EMAP datasets in order to improve the efficiency of downstream analysis. A widely used analysis technique applied to EMAP datasets is average-linkage hierarchical clustering, here performed in R. In order to assess the effect of different imputation methods on clustering and downstream biological analysis, we compared clustering results on the three EMAP datasets after imputation by the methods presented above. As expected, the Soft-Impute algorithm improved the clustering results more than the other imputation methods. We used the Jaccard Index to determine how well the predicted cluster modules correspond to benchmark gene sets (GO terms). The Jaccard Index [25] between two sets $M_i$ and $B_j$ is defined as:

$$\frac{\#\{M_i \cap B_j\}}{\#\{M_i \cup B_j\}} \tag{11}$$

where $\#\{A\}$ denotes the number of elements in set $A$.

For module $M_i$, the Jaccard Index between $M_i$ and each gene set $B_j$ in the benchmark is computed, and the Jaccard Index of $M_i$ and the benchmark gene sets is defined as the maximum of the Jaccard Index between $M_i$ and any gene set in the benchmark:

$$\text{Jaccard Index}(M_i, B) = \max_j \{\text{Jaccard Index}(M_i, B_j)\} \tag{12}$$

Thus, the average Jaccard Index of the predicted modules and the benchmark gene sets can be computed as:

$$\text{Jaccard Index}(M, B) = \frac{\sum_{i \in 1, \ldots, k} \text{Jaccard Index}(M_i, B)}{k} \tag{13}$$
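Definitions (11) to (13) translate directly into set operations, as in this small sketch. The function names and the toy gene identifiers are hypothetical; returning 0 for two empty sets is a convention chosen for the illustration.

```python
def jaccard(module, benchmark_set):
    """Eq. (11): size of the intersection over size of the union."""
    a, b = set(module), set(benchmark_set)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def module_score(module, benchmark_sets):
    """Eq. (12): best Jaccard Index of one module over all benchmark sets."""
    return max(jaccard(module, b) for b in benchmark_sets)

def average_jaccard(modules, benchmark_sets):
    """Eq. (13): mean of the per-module scores over the k modules."""
    return sum(module_score(m, benchmark_sets) for m in modules) / len(modules)

# Toy example with made-up gene identifiers.
modules = [{"g1", "g2", "g3"}, {"g4", "g5"}]
benchmark = [{"g1", "g2"}, {"g4", "g5", "g6"}, {"g7"}]
score = average_jaccard(modules, benchmark)
```

Here each module shares two genes with its best-matching benchmark set out of a union of three, so both per-module scores and the average equal 2/3.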

Fig. 2 (continued). (e) CHR-EMAP dataset, mean of Pearson correlation, and (f) CHR-EMAP dataset, mean of NRMSE: all four printed P-values (Soft.kNN, Soft.LLS, Soft.iLLS, Soft.BPCA) are <2.2e−16. (g) S. pombe dataset, mean of Pearson correlation, and (h) S. pombe dataset, mean of NRMSE, comparing Soft.kNN, Soft.LLS and Soft.iLLS only: all printed P-values are <2.2e−16.

Table 1. Running time of imputation methods on synthetic datasets.

Imputation methods | Implementation | CPU running time (a)
Soft-Impute | Matlab | 616.0 s / 650.1 s
Local Least Squares (LLS) | Matlab | 732.1 s / 4407.8 s
Iterated Local Least Squares (iLLS) | Matlab | 2367.2 s / 19987.3 s
Bayesian Principal Component (BPCA) | Matlab | 10198.0 s / 11037.8 s
k Nearest Neighbor (kNN) | R | 159.9 s / 170.2 s

(a) Running time (10% missing values / gradient missing rate from 1% to 37%).


The accuracy of the clustering result is evaluated by the average Jaccard Index of the predicted modules and benchmark gene sets. In the ideal situation where the predicted modules perfectly match the benchmark gene sets, the Jaccard Index is 1. The larger the Jaccard Index, the better the predictions are. The hierarchical clustering algorithm is used to predict the gene clusters in the three kinds of original EMAP gene sets after different kinds of imputation. The benchmark ("theoretical") gene sets are GO terms. The results are presented as the Jaccard Index, the number of predicted gene modules, and the number of predicted gene modules enriched in GO (see Table 3). Table 3 (A-B) presents the results for the S. cerevisiae EMAP datasets. The GO-slim terms of S. cerevisiae were downloaded from SGD. Table 3 (C) presents the result for the S. pombe EMAP dataset, and the GO-slim terms of S. pombe were downloaded from the homepage of Prof. Krogan (http://kroganlab.ucsf.edu/).
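The Jaccard Index evaluation can be illustrated with a minimal Python sketch. The text does not state how predicted modules are paired with benchmark gene sets before averaging, so the best-match rule used here, like the function names `jaccard` and `average_best_jaccard`, is an assumption made only for illustration.

```python
from typing import Iterable, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard Index |A ∩ B| / |A ∪ B|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_best_jaccard(modules: Iterable[Set[str]],
                         benchmarks: List[Set[str]]) -> float:
    """Average, over predicted modules, of the best Jaccard Index
    against any benchmark gene set.

    Assumption: "best match per module" is an illustrative pairing rule,
    not necessarily the one used to produce Table 3.
    """
    scores = [max((jaccard(m, b) for b in benchmarks), default=0.0)
              for m in modules]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    predicted = [{"g1", "g2", "g3"}, {"g4", "g5"}]
    go_terms = [{"g1", "g2", "g3"}, {"g4", "g5", "g6", "g7"}]
    # First module matches perfectly (1.0); second overlaps 2 of 4 genes (0.5).
    print(average_best_jaccard(predicted, go_terms))
```

A perfect prediction gives 1, matching the ideal case described above; many small modules that each overlap a benchmark set only partially pull the average down.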

Here we adopted the average-linkage hierarchical clustering algorithm in which the distance between gene A and gene B is defined as 1 − |cor(A, B)|, where |cor(A, B)| is the absolute value of the correlation between the genetic interaction profiles of gene A and gene B. To have a consistent criterion for cutting the hierarchical clustering tree, we adopted the same "height" in the hierarchical clustering algorithm in R, in which "height" stands for the fusion levels of the clustering method. For average-linkage hierarchical clustering, the height values are the average distances among clusters. Here, the cutoff of the height value is set to 0.8 for the three EMAP datasets. According to our definition of distance, the height value ranges from 0 to 1. From Table 3 (No Filter), we can see that the numbers of clusters obtained after Soft-Impute imputation are much smaller than those obtained after the other methods. This result can be explained by the low-rank property of the dataset matrix after Soft-Impute imputation. Although the group interval (height value) is big enough for the other imputation methods, they produce too many clusters containing very few genes, indicating that these are meaningless clusters. Therefore, we filtered the clustering results by cutting the subtrees with fewer than 3 genes, which are regarded as non-tight groups (see Table 3, Filter).
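The clustering procedure described above (distance 1 − |cor(A, B)|, average linkage, tree cut at height 0.8, clusters with fewer than 3 genes discarded) can be sketched in Python with SciPy. The original analysis was run in R; this is only an equivalent-in-spirit sketch, and the function name `cluster_profiles` and its parameters are hypothetical. It assumes the input matrix is already imputed (no missing values) and that no profile is constant.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_profiles(scores: np.ndarray, height: float = 0.8,
                     min_size: int = 3) -> list:
    """Cluster the rows (gene interaction profiles) of an imputed matrix.

    Returns one array of row indices per retained cluster.
    """
    corr = np.corrcoef(scores)                    # gene-by-gene Pearson correlation
    dist = np.clip(1.0 - np.abs(corr), 0.0, 1.0)  # distance = 1 - |cor(A, B)|
    dist = (dist + dist.T) / 2.0                  # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)                   # remove rounding noise
    tree = linkage(squareform(dist, checks=False), method="average")
    # Cut the tree at a fixed fusion level, analogous to a fixed "height" in R.
    labels = fcluster(tree, t=height, criterion="distance")
    clusters = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    # Drop non-tight groups with fewer than `min_size` genes.
    return [c for c in clusters if len(c) >= min_size]
```

Because the distance uses the absolute correlation, strongly anti-correlated profiles are grouped together just like strongly correlated ones.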

[Fig. 3. The imputation accuracy of imputation methods on EMAP datasets. Each panel plots one curve per method (kNN, iLLS, LLS, BPCA, Soft-Impute; BPCA is absent from the S. pombe panels), with the vertical axis running from 0.0 to 1.5. (a) ESP datasets with 10% artificial missing values: Pearson correlation. (b) ESP datasets with 10% artificial missing values: normalized root mean squared error (NRMSE). (c) ESP datasets with gradient rate of artificial missing values (0.00 to 0.25): Pearson correlation. (d) ESP datasets with gradient rate of artificial missing values (0.00 to 0.25): NRMSE. (e) CHR datasets with 5% artificial missing values: Pearson correlation. (f) CHR datasets with 5% artificial missing values: NRMSE. (g) S. pombe datasets with 5% artificial missing values: Pearson correlation. (h) S. pombe datasets with 5% artificial missing values: NRMSE.]



Table 2. Running time of imputation methods.

| EMAP datasets | Soft-Impute | k-Nearest Neighbors (kNN) | Local Least Squares (LLS) | Iterated Local Least Squares (iLLS) | Bayesian Principal Component (BPCA) |
|---|---|---|---|---|---|
| ESP with 10% missing values | 1537.4 s | 172.6 s | 3756.6 s | 7572.4 s | 46558.3 s |
| ESP with grad missing values | 2006.3 s | 288.9 s | 4961.3 s | 9015.3 s | 55126.7 s |
| CHR with 5% missing values | 31681.3 s | 1568.5 s | 38789.3 s | 52817.9 s | 668890.7 s |
| Pombe with 5% missing values | 138979.3 s | 2327.8 s | 82647 s | 1.3141e+05 s | NA |

Table 3. Clustering results.

| Methods | No Filter: # Modules | No Filter: JC-Index | No Filter: # Enriched (a) | Filter: # Modules | Filter: JC-Index | Filter: # Enriched (a) |
|---|---|---|---|---|---|---|
| A: ESP-EMAP datasets | | | | | | |
| Soft-Impute | 32 | 0.112 | 27 | 24 | 0.108 | 22 |
| LLS | 122 | 0.100 | 98 | 51 | 0.089 | 39 |
| iLLS | 122 | 0.098 | 99 | 51 | 0.088 | 40 |
| BPCA | 116 | 0.095 | 98 | 52 | 0.101 | 41 |
| kNN | 118 | 0.097 | 95 | 47 | 0.078 | 34 |
| B: CHR-EMAP datasets | | | | | | |
| Soft-Impute | 52 | 0.064 | 51 | 37 | 0.0677 | 36 |
| LLS | 240 | 0.045 | 232 | 83 | 0.0633 | 78 |
| iLLS | 236 | 0.049 | 227 | 78 | 0.0621 | 74 |
| BPCA | 179 | 0.056 | 167 | 70 | 0.059 | 66 |
| kNN | 183 | 0.055 | 175 | 70 | 0.057 | 62 |
| C: S. pombe-EMAP datasets | | | | | | |
| Soft-Impute | 46 | 0.065 | 37 | 42 | 0.0627 | 35 |
| LLS | 356 | 0.0289 | 261 | 89 | 0.0113 | 59 |
| iLLS | 335 | 0.0299 | 249 | 96 | 0.0122 | 61 |
| kNN | 381 | 0.027 | 276 | 87 | 0.0101 | 58 |

Significance level: FDR ≤ 0.05. # Modules: the number of modules predicted by hierarchical clustering in EMAP datasets after imputation by different methods. # Enriched: the number of modules predicted by hierarchical clustering that are enriched in the GO-slim terms.

(a) Hyper-geometric test applied to test the enrichment of gene sets.
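The enrichment test named in the Table 3 footnote can be sketched as follows. This is a minimal illustration using SciPy's hypergeometric distribution; the function names are hypothetical, and the use of the Benjamini-Hochberg procedure to control the FDR is an assumption, since the text gives only the threshold (FDR ≤ 0.05) and not the correction method.

```python
import numpy as np
from scipy.stats import hypergeom


def enrichment_pvalue(n_universe: int, n_annotated: int,
                      module_size: int, n_overlap: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_universe:  number of genes in the dataset
    n_annotated: genes in the dataset annotated with the GO term
    module_size: genes in the predicted module
    n_overlap:   module genes that carry the annotation
    """
    # sf(k - 1) is P(X > k - 1) = P(X >= k).
    return float(hypergeom.sf(n_overlap - 1, n_universe,
                              n_annotated, module_size))


def benjamini_hochberg(pvalues) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out
```

Under this sketch, a module would be counted as enriched when at least one GO-slim term has an adjusted p-value of at most 0.05.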

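Table 4. Clustering results.

| Imputation methods | # Genes in published modules | # Published modules found by h-cluster | Gene number mean of published modules |
|---|---|---|---|
| A: ESP-EMAP datasets | | | |
| Soft-Impute | 67 | 6 | 11.67 |
| LLS | 37 | 4 | 9.25 |
| iLLS | 37 | 4 | 9.25 |
| BPCA | 39 | 4 | 9.75 |
| kNN | 6 | 1 | 6 |
| B: CHR-EMAP datasets | | | |
| Soft-Impute | 49 | 6 | 8.17 |
| LLS | 35 | 6 | 5.83 |
| iLLS | 36 | 6 | 6 |
| BPCA | 38 | 6 | 6.33 |
| kNN | 17 | 3 | 5.67 |
| C: S. pombe-EMAP datasets | | | |
| Soft-Impute | 55 | 11 | 5 |
| LLS | 38 | 8 | 4.75 |
| iLLS | 49 | 9 | 5.4 |
| kNN | 47 | 8 | 5.875 |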


The clustering results of EMAP datasets after imputation by different methods were compared by measuring their consistency with known gene modules. These known modules were extracted from previous papers for the ESP-EMAP dataset [15], CHR-EMAP dataset [16] and S. pombe dataset [17]. Here we used the hierarchical clustering algorithm with the same cutoff of the height value, i.e., 0.8, as that used for Table 3. We discarded meaningless clusters containing fewer than 3 genes. We compared the number of genes predicted by hierarchical clustering that are in known modules and the number of known modules that are predicted by hierarchical clustering. The results can be found in Table 4.

As indicated in Table 4, imputing EMAP datasets with our method before average-linkage hierarchical clustering is more informative: on all three datasets it recovers more genes in published modules than the other methods, and at least as many published modules. In particular, in the S. pombe dataset [17], two previously uncharacterized genes, SPAC1610.01 and SPAC18G6.13, were found to be clustered in an mRNA splicing module. Clustering after Soft-Impute imputation recovered this module, containing these two genes and other genes involved in mRNA splicing, whereas clustering after LLS, iLLS or kNN imputation did not.

These results demonstrate the ability of the Soft-Impute algorithm to improve downstream data analysis. The Soft-Impute algorithm takes advantage of the correlation of genetic interaction profiles to predict unknown genetic interactions. The Z matrix is the completed matrix produced by Soft-Impute imputation, and its rank is limited during the imputation. Mathematically, the Soft-Impute procedure shrinks the singular values of the matrix, eliminating the small ones and retaining the large ones. Such a procedure tends to denoise the data and enhance the strong correlation structure among genes in the data matrix. In other words, the Soft-Impute algorithm reorganizes and refines the original dataset while, at the same time, imputing the missing entries. As a consequence, this methodology can improve the effectiveness of downstream clustering while predicting the missing entries more accurately than the other imputation methods.
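The shrinkage step described above can be made concrete with a short NumPy sketch of the basic Soft-Impute iteration from [13]: fill the missing entries with the current estimate, compute an SVD, and soft-threshold the singular values. This is a simplified illustration and not the Matlab implementation timed in this paper; the function name, the stopping rule and the default parameter values are assumptions.

```python
import numpy as np


def soft_impute(x: np.ndarray, lam: float, max_iter: int = 500,
                tol: float = 1e-6) -> np.ndarray:
    """Fill the NaN entries of x by iterative soft-thresholded SVD.

    lam is the soft threshold applied to the singular values; larger
    values give a lower-rank (more strongly shrunk) estimate.
    """
    observed = ~np.isnan(x)
    z = np.zeros(x.shape, dtype=float)        # current low-rank estimate
    for _ in range(max_iter):
        # Keep observed entries; take missing ones from the current estimate.
        filled = np.where(observed, x, z)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)   # small singular values become 0
        z_new = (u * s_shrunk) @ vt
        # Stop when the relative squared Frobenius change is small.
        change = np.linalg.norm(z_new - z) ** 2 / max(np.linalg.norm(z) ** 2, 1e-12)
        z = z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; only gaps are imputed.
    return np.where(observed, x, z)
```

In this sketch the threshold `lam` must be chosen by the user, for example by hiding a fraction of the observed entries and selecting the value that predicts them best; a very small `lam` makes the iteration converge slowly, while a large one over-shrinks the imputed values.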


5. Conclusion

In this article, we have introduced a method named Soft-Impute to impute missing values of EMAP datasets. The Soft-Impute method uses the correlation among genes to impute the missing values. It adopts an efficient algorithm to solve the imputation problem, and its convergence is guaranteed [13]. It applies a soft threshold to the SVD algorithm, and this threshold can be tuned by choosing the regularization parameter λ. This methodology was proposed by Mazumder, Hastie and Tibshirani [13], and it has been used in image recovery and eQTL studies, but this is the first time it has been introduced in genetic interaction data imputation.

We have compared the Soft-Impute method with four other popular imputation methods on genetic interaction data, including three kinds of EMAP datasets and one synthetic dataset. First, the given datasets were imputed and the imputation accuracy was determined, followed by hierarchical clustering. Finally, we compared the clustering results against GO annotations and the published annotations. We demonstrated that the Soft-Impute method achieved better imputation accuracy and improved downstream data analysis more than the other existing methods.

To the best of our knowledge, this paper is the first attempt to introduce the Soft-Impute algorithm into imputing missing values in genetic interaction datasets. This algorithm is appropriate for datasets that contain modules whose entries are highly correlated. Consequently, it can be used to impute missing entries in many other fields whose data share this characteristic.

The imputation of missing values is the first step of data analysis, and it has a very important influence on the downstream analysis. The Soft-Impute method can improve the performance of downstream data analysis and promote further exploration of genetic interaction networks.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (31171262, 11021463), the National Key Basic Research Project of China (2009CB918503), and partly supported by MOST International Collaborative Project (2011DFA31860).

References

[1] R.A. Fisher, Trans. Roy. Soc. Edinburgh 52 (02) (1919) 399–433.
[2] C. Boone, H. Bussey, B.J. Andrews, Nat. Rev. Genet. 8 (6) (2007) 437–449.
[3] S.R. Collins, M. Schuldiner, N.J. Krogan, J.S. Weissman, Genome Biol. 7 (7) (2006) R63.
[4] C. Ryan, D. Greene, G. Cagney, P. Cunningham, BMC Bioinf. 11 (1) (2010) 197.
[5] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R.B. Altman, Bioinformatics 17 (6) (2001) 520–525.
[6] H. Kim, G.H. Golub, H. Park, Bioinformatics 21 (2) (2005) 187–198.
[7] Z. Cai, M. Heydari, G. Lin, in: APBC, 2006, pp. 159–168.
[8] S. Oba, M.-A. Sato, I. Takemasa, M. Monden, K.-I. Matsubara, S. Ishii, Bioinformatics 19 (16) (2003) 2088–2096.
[9] C. Ryan, G. Cagney, N. Krogan, P. Cunningham, D. Greene, in: Network Biology, Springer, 2011, pp. 353–361.
[10] O. Zuk, E. Hechter, S.R. Sunyaev, E.S. Lander, Proc. Natl. Acad. Sci. 109 (4) (2012) 1193–1198.
[11] C. Stark, B.-J. Breitkreutz, A. Chatr-Aryamontri, L. Boucher, R. Oughtred, M.S. Livstone, J. Nixon, K. Van Auken, X. Wang, X. Shi, et al., Nucleic Acids Res. 39 (suppl 1) (2011) D698–D704.
[12] A. Roguev, D. Talbot, G.L. Negri, M. Shales, G. Cagney, S. Bandyopadhyay, B. Panning, N.J. Krogan, Nat. Methods 10 (2013) 432–437.
[13] R. Mazumder, T. Hastie, R. Tibshirani, J. Mach. Learn. Res. 99 (2010) 2287–2322.
[14] C. Yang, L. Wang, S. Zhang, H. Zhao, Bioinformatics 29 (8) (2013) 1026–1034.
[15] M. Schuldiner, S.R. Collins, N.J. Thompson, V. Denic, A. Bhamidipati, T. Punna, J. Ihmels, B. Andrews, C. Boone, J.F. Greenblatt, et al., Cell 123 (3) (2005) 507–519.
[16] S.R. Collins, K.M. Miller, N.L. Maas, A. Roguev, J. Fillingham, C.S. Chu, M. Schuldiner, M. Gebbia, J. Recht, M. Shales, et al., Nature 446 (7137) (2007) 806–810.
[17] C.J. Ryan, A. Roguev, K. Patrick, J. Xu, H. Jahari, Z. Tong, P. Beltrao, M. Shales, H. Qu, S.R. Collins, et al., Mol. Cell 46 (5) (2012) 691–704.
[18] N. Srebro, N. Alon, T.S. Jaakkola, in: Advances in Neural Information Processing Systems, 2004, pp. 1321–1328.
[19] J.-F. Cai, E.J. Candès, Z. Shen, SIAM J. Optimizat. 20 (4) (2010) 1956–1982.
[20] E.J. Candès, T. Tao, IEEE Trans. Inf. Theory 56 (5) (2010) 2053–2080.
[21] R.H. Keshavan, A. Montanari, S. Oh, IEEE Trans. Inf. Theory 56 (6) (2010) 2980–2998.
[22] N. Srebro, T. Jaakkola, et al., in: ICML, vol. 3, AAAI Press, 2003, pp. 720–727.
[23] M. Fazel, Matrix rank minimization with applications, Ph.D. thesis, Stanford University (2002).
[24] D.L. Donoho, I.M. Johnstone, G. Kerkyacharian, D. Picard, J. R. Statist. Soc. B (1995) 301–369.
[25] J. Song, M. Singh, Bioinformatics 25 (23) (2009) 3143–3150.
