biclustering-based imputation in longitudinal data · biclustering-based imputation in longitudinal...

1

Biclustering-based imputation in longitudinal dataInes A. Nolasco, Alexandra M. Carvalho, Sara C. Madeira

Abstract—Amyotrophic Lateral Sclerosis (ALS) is a neurode-generative disorder that affects motor abilities. The patientswith ALS progress rapidly and die in few years mainly fromrespiratory failure, the importance of being able to identify whenone should start respiratory assistance is vital. The prediction ofthe need for respiratory assistance is approached by the analysisof longitudinal data consisting in clinical follow-ups of patientsthrough time. However, this data is very prone to the occurrenceof missing values. In this work the problem of missing valuesin longitudinal data is addressed by the application of biclustertechniques in order to find trends in the data and thus improvethe imputation of missing values. The proposed approach wastested together with several baseline imputation methods in bothsynthetic and real-world data (ALS dataset). Results indicate thatgenerally the use of biclustering-based imputation improves theresults.

Index Terms—Missing Values, Longitudinal data, Biclustering-based imputation, ALS disease

I. INTRODUCTION

AMYOTROPHIC lateral sclerosis (ALS) is a motor-neuron disease, that involves the degeneration of up-

per and lower motor neurons causing muscle weakness andatrophy throughout the body. ALS forces people to needcontinuous care and basic life functions become compromisednear the final stages of the disease. Without any knowncure, patient care is reduced to symptom relief, improvingquality of life and increasing life expectancy as much aspossible. With adequate care patient’s average life expectancyis between 2 to 5 years. A worsening in the lives of both patientand families happen when respiratory difficulties appear andthus respiratory assistance, Non Invasive Ventilation (NIV), isneeded. It’s a delicate stage since from this point onward thepatient will be much more dependent of machines and caregivers and it’s also a process that involves higher costs.

Several studies based on clinical follow-ups of patients,focus on this phase of ALS, with the purpose of predictingwhen will the patients need NIV in advance, as much possible.Clinical follow-ups are usually presented as longitudinal data,which consists in subjects being observed for some periodof time at several moments. A recurring problem in thiskind of studies is missing values, i.e., observations that, forsome reason, were not made or were not written down andthus no value is known for that person, for that feature, atthat moment. Missing values imply that some informationis missing, and thus, knowledge cannot be extracted fromthat data. It is intuitive that depending on the magnitude,and characteristics of these missing values, the conclusionstaken from the analysis of the data may be distorted. Studyinghow these missing values affect the conclusions taken, andunderstanding the methods that better deal with missing databecame a pressing matter in the context of studies predictingNIV. As an example, in the ongoing work by Andre Carreiro

et al. [1], the authors are able to predict if NIV will be neededor not by the time of the sixth clinical evaluation, based onthe five previous evaluations. It is this author’s belief thatby improving the predictive performance of these studies asignificant help may be provided to ALS patients in order tobetter cope with the conditions inflicted by this disease.

To tackle the problem of missing values in longitudinal dataand also in the specific context of the ALS disease, the presentwork explores the application of biclustering techniques withthe objective of finding trends among data and thus develop anovel and more adequate way to deal with missing values inlongitudinal data. Although not new, the Biclustering concept,only recently has been further explored, and new and robustalgorithms have been developed in recent years. Due to thisadvance it is now possible to explore the application of thesetechniques in other contexts, such as missing data.

A. Problem Formulation

The problem of predicting the need for NIV in ALSpatients is structured as a classification problem. Here, subjectsevaluated in ALS related features in several moments, arelabeled as Evol or noEvol considering the evolution or notfrom not needing NIV to needing NIV from the 5th to the6th evaluation moment. This work tackles the problem ofmissing values in the specific context of the prediction of theneed for NIV in ALS patients, in order to help the predictivepowers of these studies. In that sense it is important to (1)understand the implications that missing values have on theclassification accuracy, (2) understand the relation between thedata structure (longitudinal) and the capacity of the methodsto deal with missing values, and finally, (3) using the derivedknowledge to enhance the classification problem accuracy bytreating missing values in an intelligent and specially designedway.

II. BACKGROUND

A. Longitudinal data

Longitudinal studies, from where longitudinal data results,are designed to repeatedly observe the same subjects andmeasure the same features of interest for long periods oftime. The main advantage of this design is that it allowsto understand the evolution or change in behavior of somesubjects excluding time-invariant characteristics that couldcloud the conclusions.

The difference between this and other time-consideringdesigns is not always clear, in specific, time-series data alsoconsists of successive measurements made over some timeinterval, however the fields of research using one and otherdesign are very different. Usually time-series data presentsa large quantity of time points, and the rate at which the

2

observations are made is much faster. For instance the dataacquired from a sensor during a certain period of time maybe seen as a time-series.

B. Missing values

Missing data occurs when no data values are stored for somevariables. An example of missing data is the non-responseto some questions on a survey. This may have differentreasons, as Little and Rubin exemplify in [2] a person maynot answer because she refuses to, or because she does notknow the answer. These two different reasons represent distinctmechanisms that lead to data being missed and should betreated differently when analyzing the data.

Missing data can then be classified, referring to the mech-anisms that causes it to be missing. In [3] a formalizationis proposed in the following way: Y is the variable underobservation which may have missing values, X are othervariables in observation without missing values, and R is aresponse indicator which has the value 1 corresponding to amissing value of Y and 0 corresponding to an observed valueof Y. Given this, missing data is classified as:Not Missing at random (NMAR) If the reason why the

data is missing is related to the value itself, then thedata is said to be not missing at random. This meansthat the probability of observing a missing value inY is not independent from the variable Y itself, i.e.,P(R=1—X,Y)=P(R=1—Y).

Missing at random (MAR) Data missing at random hasmissing values unrelated with the value itself, butrelated to other variables under observation. There-fore, MAR data has the probability of having missingvalues in Y described as P(R=1—X,Y)=P(R=1—X).

Missing completely at random (MCAR) The values missingin this category are completely independent from anydimension of the data. Given this, the probability ofmissing values in MCAR data can be described asP(R=1—X,Y)=P(R=1).

1) Imputation methods: Following the need to predict miss-ing values, imputation methods were developed. These arebased on the assumption that the resulting inferences from arepresentative sample, should be similar if the same analysisis performed on a different, but also representative sample.Therefore, and as A. Rogier et al [4] state, if we change onesubject from the sample and this subject is drawn at randomfrom the population under study, the analysis results shouldnot be very different.

R. Little and D. Rubin [5], also propose an intuitive reasonfor applying these methods: if in our study we have some vari-able Yj which is highly correlated with some other observedvariable Yk, and if some values are missing in Yj then it isvery tempting to predict these Yj values from Yk.

Taking that into account, the overall scheme of these meth-ods is to imput, in the place of the missing values, values thatmake sense to be there. It is in the approach used to decidewhich values to input that the methods differ.

The general positive point with this kind of methods is that,after the imputation process is done, we may work with a full

sized sample, and perform the exact same analysis we woulddo if no missing values ever occurred. This is, however, notwithout problems which are discussed through this section.

Hot-deck imputation If a value is missing in the responses ofsome subject, a sub-sample of subjects with similarcharacteristics is formed and the value missing isimputed with a value taken at random from thatsubgroup. This approach results in biased estimatorsand misleadingly small standard errors since there isno new information being added. Another problemregarding this method is the high difficulty or evenimpossibility of being able to get a subset of sub-jects with similar characteristics. In practice, ad hocstrategies are employed.

Mean imputation In the place of missing values, this methodinputs the estimated sample mean of the variablethat shows missing values. This approach may begeneralized to use any metric of frequency or centraltendency to infer missing values. However, the sameproblems pointed before exists, as we are not imput-ing new information, (the imputed values are basedon the sample and do not increase the uncertainty as-sociated) and as we are using a full sized sample, thestandard error for the estimated parameters will bemisleadingly small. The bias problem also remainsspecially if the data is not MCAR.

Regression imputation Based on the sample, a regressionfunction is constructed where the variables withmissing values as the dependent ones and the otherobserved variables as the predictors. The missingslots are then imputed with values resulting from thatfunction. Several kinds of regression functions maybe derived, and should be used accordingly to thetype of data at hand, for instance linear or logisticregressions. The problems mentioned above are stillnot solved with general regression imputation meth-ods, since the biased estimators and small standarderrors continue to happen.

2) Imputation in longitudinal data: Throughout the courseof this work it became somewhat clear that the imputationprocess is very sensitive to data structure and design. Theworks mentioned above are an example of that, where specifictools are used to tackle specific aspects of the data.

The same conjecture appears to be true when dealing withlongitudinal data. Indeed, knowing the exact structure of adataset, and in this case, understanding that attributes presentan evolution over time, is of interest when developing methodsto deal with missing values in longitudinal data. The followingmethods appear to be extensions of the baseline methodsinstead of contributions on missing data management.

Individual mean imputation The missing value is imputedwith the mean of observed values in the distinct timepoints for the same instance.

Time mean imputation The missing value is imputed with themean of the observed values in distinct instances forthe same time-point. As before, the imputation withother metrics, e.g., median, is possible, and should be

3

considered according to the dataset’s characteristics.

Several studies try to understand if the use of speciallydeveloped methods for longitudinal data would result in betterimputed values than the general baseline methods. In thiscontext two studies [6] and [7], arrive to the same con-clusions: if we are dealing with longitudinal data then thelongitudinal aspects should enter into consideration in theimputation method and the longitudinal methods applied yieldbetter imputations in both cases when compared with baselinemethods.

C. Biclustering

Clustering stands for grouping together objects that in somesense are similar to each other or that have more in commonthan the rest of the objects. The rules that define whichobjects belong to the same cluster may be very different,may even be applied to discretized versions of data or tothe real values directly, depending on the problem at hand.In the simplest cases, measures of distance between objectsare used. Clustering may also be applied in more than onedimension. Consider a data matrix A of size n × m, whererows represent objects and columns are attributes of inter-est, clustering applied to objects will group together objectsthat present closest values for all attributes, and will resultin a sub-matrix of size s × m containing all the selectedobjects (s < n). But clustering may also be applied to thecolumns, (attributes) and that would group together attributesthat behave in a similar way for all objects, resulting in asub-matrix n× p with n objects and p attributes (p < m). Insome contexts, clustering simultaneously in both dimensions(rows and columns) is also of interest, This means that asubset of objects and attributes are selected to form a sub-matrix of size k × t with k < n and t < m, called bicluster.In [8] the authors express the underlying difference betweenthese clustering approaches, clustering methods derive a globalmodel for the data, since each object belonging to a clusteris selected using all attributes or vice-versa. Biclustering,however, produces local models since objects and attributesbelonging to a bicluster are selected considering only a subsetof attributes and objects.

Time-series data biclustering: Biclustering has also beenshown to be an interesting technique to be applied to time-series data with the objective of finding objects that behavein a similar manner over the same time-points, i.e., objectsthat present the same trends over time. For that purposeconsider the application of biclustering to a data matrix Awhere, instead of different attributes, columns would repre-sent different time-slices of the same feature. If the goalis to find local trends over time, then it makes sense thatcolumns, representing time-points, selected for each biclusterto be contiguous. This restriction was further explored bythe authors in [9] which proposed an algorithm that is ableto find such biclusters in linear time, the called ContiguousColumn Coherent Biclustering (CCC-Biclustering) algorithm.In this algorithm a discretized version (A) of the data matrix(A’) is generated, and rows and columns are grouped together

following two rules: first, columns must be contiguous; andsecond, discretized symbols in each column must be the samein every row selected.

III. METHODOLOGY

A. Biclustering-based imputation

Following the idea of imputation inside classes or subgroupsintroduced in Section II-B1, in this work it will be furtherexplored the use of biclustering techniques, applied to ourlongitudinal dataset, in order to create similarity groups andperform group-dependent imputation. The idea is that if theimputation is carried out based on groups that share somesimilarity, it is to be expected that the imputed values aremore accurate than if the imputation was made consideringthe whole dataset. Considering time-dependent aspects in theimputation process, the idea of grouping persons according tosome similarity can be described as looking for local trends inthe data, i.e., we want to group together patients that for somefeature show the same evolution over time. Then, it is imputedtherein any missing value taking into account the group trendfor that feature. The biclustering strategy used in the courseof this work is called CCC-biclustering, which as previouslyexplained (in Section II-C) only finds biclusters with con-tiguous time points. However, a slightly different version ofthis algorithm has to be used, the e-CCCbiclustering, whichallows the biclustering of objects with approximate similarityinstead of an exact one, i.e., selected patients in the definedtime-points do not need to be precisely equal between them,but only similar in some defined degree that may containmismatches or missing values. In the context of this work,allowing missing values inside bicluster is imperative to beable to group together subjects that present some missingvalues in some time-point, and generate biclusters as shownin Figure 1.

Fig. 1. Illustration of an e-CCCbicluster containing samples with missingvalues.

However, we are not interested in allowing other mis-matches/errors in the bi-cluster computation, so the e-CCCbiclustering algorithm was modified to allow only missingvalues as errors. Before applying the biclustering algorithm,the data needs to be processed in order to generate one-feature matrices for each longitudinal feature in the original

4

dataset. Each data matrix consists in the observations for theconsidered feature for all patients at different time pointsand it constitutes the base data matrix where biclustering isperformed. In Figure 2 a representation of the strategy usedto biclustering this data is presented step by step.

Fig. 2. Bicluster computation workflow. First construct the one-featurematrices by separating the dataset in sub-datasets of only one feature (but withall samples and all time points). Then apply the modified e-CCCbiclusteringalgorithm to find biclusters in samples that have the same discretized valueswith only missing values as the possible differences.

After the generation of each one-feature matrix, as describedin the Section II-C, the matrix needs to be discretized sincethis algorithm applies biclustering over a discretized version ofthe data. There are several options regarding the discretizationmethod, but the one that seems more convenient to ourproblem is the discretization in n symbols performed withequal width by subject.

This discretization approach looks into the values of eachsubject across the time and creates n bins of equal widththat correspond univocously to a symbol of the discretizationalphabet. Alphabets of 3 symbols are usually used, in par-ticular the following sequence of letters ordered in crescentorder: (D,N,U ). But, other alphabets are possible, in thiswork, in order to understand the effect of discretization onthe results of biclustering-based imputation the discretizationwith 5 symbols is also used, corresponding to the followingalphabet: (A,B,C,D,E).

A point to consider is that not all resulting biclusters areof interest, they may be too small or trivial1, they maybe statistically insignificant, or they may simply not makemuch sense in the context of the problem. For instance,we are interested in getting local trends in a six time-pointlongitudinal data and therefore a bicluster of only two timepoints it is not very consistent with that. To understand theimplications of these aspects in the imputations performed,and what kind of metric should be interesting to apply, itis considered the use of four different sets of biclusters inthe imputation process: (1) all non trivial biclusters, (2) onlysignificant biclusters, (3) all non trivial biclusters with morethan 3 time points, and (4) only significant biclusters withmore than 3 time points.

Once the biclusters are formed, the imputation may beperformed inside the biclusters exactly as it would be done

1Herein trivial means a bicluster with only one time point or relatively toonly one sample/person.

on a whole dataset except that it is performed on a smallergroup of data that forms the bicluster. The general imputationprocess inside the bicluster consists in the following steps. Toeach longitudinal feature there is a one-feature matrix withmissing values and a description of the bicluster found. Eachmissing value is attributed to one bicluster from which theimputation is to be performed. However, several biclustersmay contain the same missing value and a single biclustermust be selected. To solve this, in any of the four casesenumerated above, the biclusters are evaluated according totheir statistical significance and the most significant one isselected. Furthermore, some missing values don’t fall insideany bicluster. In these cases, the missing value will remainmissing, i.e., these missing values will not be imputed, or areimputed with an additional method that is applied to the wholeone-feature matrix.

After the univocal relation between missing value andbicluster is computed, the local imputation process takes placeand a single value to impute is found. This value is thusimputed, in the one-feature matrix, in the place of the missingvalue. A schematic representation of this process is presentedin Figure 3.

Fig. 3. Illustration of the biclustering-based imputation process. For eachmissing value, it finds a bicluster that contains it. Next, it takes the sub-matrix of the data contained in that bicluster and perform imputation on thatmissing value.

B. Imputation methods appliedIn this section a description of the imputation procedures

that are applied and compared are presented.Expectation maximization (EM) Imputation using EM

approach is performed with the Matlab software,specifically the EM imputation implementation de-scribed in [10]. This approach receives as input amatrix with missing values defined as Not a Number(NaN) and generates as output the same matrix withthe missing values imputed.

Median across subjects (MED) This approach was imple-mented in the Matlab environment. The procedurecomputes for each one-feature matrix the median foreach time point across all subjects, then it imputeseach missing value with the corresponding medianvalue computed.

Median longitudinal (MEDL) A variation on the previousimplementation was also developed where median

5

values to input are computed separately, not only ineach feature, but also in each subject, across all timepoints.

Bicluster-based imputation (BIC) Following the strategydescribed in Section III-A three imputation methodsapplied to the biclusters were explored: imputationwith Median cross Persons, imputation with EMand imputation by bicluster pattern. The first twoapproaches are simply direct applications of theprevious introduced implementations and differ onlyin the sense that they are applied to a much morerestricted group, the bicluster. The last method usesthe information of the bicluster pattern and the localvalues from the same person to predict the valueto impute. The strategy used is: first, select thecorresponding letter of the bicluster pattern for thatmissing value, and then apply reverse discretizationto it. Based on the other values available for that per-son, compute the value interval that letter representsand make the imputation with the mean value of thatinterval. An example of this process is presented inFigure 4.

Fig. 4. Illustration of the biclustering-based imputation using ”by pattern“approach. First determine which discretization letter corresponds to themissing value. Second, compute the mean value of the interval that letterrepresents for the given subject. The imputation is then performed with thismean value.

IV. RESULTS AND DISCUSSION

A. Synthetic data

Using the BiGen generator, four different sized matriceswere generated. The sizes defined were: 1000×150, 2000×200and 5000 × 200. Each generated matrix consists in integervalues ranging from 0 to 20 and planted with biclusters andmissing values. The biclusters planted are set to be ”orderpreserving across rows“ and their size is defined by an uniformdistribution for both rows and columns, for which the userdefines the minimum and maximum values. The defined sizeswere not extremely big in order to simulate what happens withthe real-world data and considering the dataset size. After thebiclusters are planted, missing values were also included in

each matrix and in different percentages: 10%, 20%, 30% and50%. This generates the final datasets where imputations areto be performed. In Figure 5 a summary and description ofeach dataset is presented.

Fig. 5. Description of datasets generated by BiGen, before imputation isperformed. Each dataset is here described by dataset size, percentage ofmissing values, bicluster parameters and number of missing values found inbiclusters.

Each dataset generated and imputed is evaluated by twometrics: percentage of missing values imputed and meanimputation error. These datasets are described in Figure 6.

Fig. 6. Imputation strategies description. Each imputation method that isused appears in green, methods not used are shown in grey. As an example,BICem MED imputes missing values inside biclusters with the EM approachfollowed by applying median imputation to the remaining missing values.

In the present section the results of such evaluation arepresented.

From the several questions one may want to answer fromthis analysis, the most interesting is whether the bicluster-based imputation approach derives better imputed values withrespect to alternative methods. To answer this, it is crucialto compare between imputation approaches that use bicluster-based imputation in one portion of data and an additionalmethod on the remaining missing values, with the imputationapproaches that use the same additional method to imputethe whole dataset. By doing this it is possible to directlyunderstand if the use of bicluster-based imputation, even ona small portion of data, results in a better imputation thanif this approach was not used. The specific approaches thatfall in the category described are MED versus BICmed MED,or versus BICem MED and also EM versus BICem EM.Such comparison may be observed in Figures 7 and 8 where

6

these approaches are compared relatively to their mean im-putation error. The conclusion is straightforward, when usingbicluster-based imputation methods, whichever be the imputa-tion method used inside the bicluster, the mean error is smallerthan if no bicluster imputation is applied. Also these results areconsistent in all nine synthetic datasets created, independentlyof size and percentage of missing values, i.e., the relativerelations between mean imputation errors for the methods inanalysis are maintained, confirming that these conclusions areindependent of percentage of missing values or dataset size.

Fig. 7. All nine synthetic datasets present the same relative results: biclusterimputation, enhances the imputation results. MED: median imputation in thewhole dataset.

Fig. 8. All datasets present the same relative results: bicluster imputation,enhances the imputation results. EM: EM imputation in the whole dataset.

These imputation methods are also robust to the amount ofmissing values in the synthetic datasets. As can be seen fromFigure 9, in general, for all data sizes, the mean imputationerror almost does not increase with a dramatic increase ofmissing values (from 10% to 50%).

Fig. 9. Mean imputation error for the smallest dataset (1000 × 150) withdiffering amount of missing values. All methods perform almost equally wellfor different amount of missing values, even when this amount rises to 50%.This result is consistent across the datasets of different sizes.

Finally, it is also possible to find which one of the testedimputation approach shows the most promising results in terms

of mean imputation error, in Figure 10 the mean imputationerror is represented for the smallest dataset (1000 × 150) forall the imputation approaches applied. As before, these resultsare also consistent for all datasets sizes and amount of missingvalues, thus only results from the datasets of size 1000× 150are shown here. Analyzing these results leads to the conclusionthat the BICem MED and BICem EM methods consistentlyachieve better imputations, i.e., , the predicted values are closerto the real ones.

Fig. 10. Mean imputation error in the smallest dataset (1000 × 150)for all methods, and different amount of missing values. For all datasets(sizes and amount of missing values) the best method was BICmed MEDfollowed closely by BICmed MED. BICem MED: EM imputation insideeach bicluster followed by median imputation across rows of the whole datasetfor the rest of the missing values. BICem EM: EM imputation inside eachbicluster followed by EM imputation in the whole dataset for the rest of themissing values. See above in the text for the definition of all methods.

B. Real-world data

For the real-world data, the ALS dataset, the imputationmethods were indirectly evaluated through the classificationresults. The classification problem, as described in Section I-A,can be described as predicting the evolution or not of needingassisted ventilation (NIV) by the time of the sixth visit usingall previous observations.

Classification is performed in a supervised way, but sincethe two classes Evol and noEvol are seriously unbalanced,it was necessary to apply Synthetic Minority OversamplingTechnique (SMOTE) in order to achieve better balance.

Concerning the classifiers, the ongoing work by AndreCarreiro [1], selects the Naive Bayes since this is the one thatyields better results. However, Naive Bayes (NB) implemen-tation in the WEKA data mining software is also known tobe a classifier that tolerates particularly well missing values,so it is not expected that classification results can be signifi-cantly improved by using better imputation methods. For thisreason, and in order to be able to highlight which imputationsreally improve the classification process, different classifierswere used, namely Decision Trees (DT), K-Nearest-Neighbor(KNN) and linear Support vector machine (LinearSVM).

For classification parameters, NB was applied with kernelestimator, KNN was applied with 1 neighbor, DT was per-formed with a confidence factor of 0.25 and without Laplacesmoothing, and LinearSVM was performed with complexityof 1.0.

7

Regarding the default method for dealing with missingvalues: the NB implementation in WEKA simply omits theconditional probabilities of the features with missing values intest instances; the KNN implementation assigns to the missingvalues the maximum distance when comparing instances withmissing values; the DT implementation simply does not con-sider the values of the missing attributes to compute gain andentropy; and the LinearSVM implementation treats missingvalues by imputing global means/modes.

The classification process was performed with cross valida-tion setup, where each dataset (Original, SMOTE 300%, andSMOTE 500%) was divided in five folds from which 4 wereused for training and one for testing. These experiments, foreach classifier and dataset, were repeated 10 times. Classifica-tions are evaluated with F-measure that balances the influenceof each class and integrates both precision and recall in a finalnumber.

1) Data description: This work was build upon clinical datacontaining information regarding ALS patient follow-ups col-lected by the Neuromuscular Unit at the Molecular MedicineInstitute of Lisbon. As mentioned, this dataset is constructedin a longitudinal fashion where each patient is observed inseveral moments through time. Although observations do notfollow a strict plan, they tend to average 3 month betweenconsecutive observations. The dataset contains demographicinformation, patient characteristics, neuropsychological anal-ysis, motor evaluations and also respiratory tests where theNIV requirement is included. In short, each patient evaluationconsists in the observation of 34 different features. There arestatic features, which are time invariant, and longitudinal oneswhich are time variant and may show some trend. From the34 features, 22 are longitudinal and are the focus of this work.In the context of the presented problem, each patient’s follow-up is labeled with Evol or noEvol considering if an evolutionin the NIV indicator exists or not. The higher the number offollow-ups the easier it should be to perceive and exploit trendsin the data. Therefore, only patients that presented at least fivefollow-ups were considered. From these only the patients thatdidn’t evolve from not needing NIV to needing NIV beforethe fifth moment are of interest to the classification problemat hand. Although other setups could be considered, this wasthe best option from a balance between number of resultingpatients and number of follow-ups, since more follow-upsresult in fewer patients fulfilling the needed conditions. Byfiltering the not interesting ones, the resulting dataset consistsof 159 patients observed in 34 different features at 5 differentmoments, which takes the form of a matrix of size 159*170,as depicted in the Figure 2.

The resulting dataset is quite unbalanced. It contains 31Evol samples and 128 noEvol samples. This means that onlyabout 20% of the cases are Evol samples.

Missing Values Analysis: Approximately 40% of the valuesin the present dataset are missing. These missing values occurin approximately 80% of the features, there is no singlepatient that does not present at least one missing value andare distributed unevenly throughout the two classes sincefrom the total number of values belonging to class Evolapproximately 80% are missing, against 20% missing values in

class noEvol. This represents a problem for the classification.The longitudinal features of this dataset are the ones presentingmore quantity f missing values.

2) Biclustering results : As previously mentioned, themodified version of the biclustering algorithm, e-CCC-Biclustering, was applied to longitudinal features that previ-ously were transformed in one-feature matrices. The resultsof such procedure are presented here. The discretization de-scribed in Section III-A, needed when applying this algorithm,was performed with two different number of symbols, 3 and5, that correspond to the following alphabets: U, N, D, and A,B, C, D, E.

To understand the ability of the e-CCC-Biclustering algo-rithm to find and group together patients that show the sametrends, it is necessary to analyze the amount and importanceof the found biclusters. Although the trivial biclusters havealready been filtered out, not all resulting biclusters are ofinterest for the present problem, and thus a characterizationrelating the size and significance of the biclusters is needed.For this reason, the biclusters to consider are grouped infour different categories (ALL, SIG, TP, and TPeSIG), asintroduced in Section III-A.

This analysis allows for a characterization of the biclustersfound and results in the general idea that, as expected, thenumber of biclusters that are both significant and have threeor more time-points are scarce. Also, it may be concluded thatthe higher the number of missing values in each feature thelower the number of biclusters found. This was expected sincemissings leads to lost of information, the higher the numberof missing values the scattered the data becomes, increasingthe difficulty of finding interesting biclusters.

An important aspect to be analyzed, since it is highlycorrelated with the imputation results, is the amount of missingvalues that are caught in biclusters.

The total number of missing values that are grouped inbiclusters do not ascend to 30%, and that is considering allthe non trivial biclusters. If a more restricted but also moreinteresting group of biclusters is to be considered then onlyapproximately 5% of the missing values are caught. Also,regarding the effect that the discretization options had onthe capability of finding biclusters and on their quality, wefind that using 3 or 5 symbols does not result in a concretedifference.

After the biclustering process, datasets were imputed gen-erating the datasets described in Figure 11.

3) Classification results: Because of the extreme data un-balance, SMOTE was imperative to be applied and only withthe application of SMOTE with 300% was it possible to obtaina balanced dataset. As it is the usual procedure, SMOTEwith 500% was also applied in order to obtain the invertedunbalanced dataset, i.e., the class Evol with as many moreinstances than the class noEvol had in the original dataset.These two procedures were applied to each dataset describedabove.

After these stage, all datasets described, together with theones created by SMOTE, are classified as explained before.The resulting classifications are evaluated trough the metricsTP rate, TN rate, Precision, Recall, K-statistics and F-measure.

8

Fig. 12. F-measure for all balanced datasets classified with NB.

Fig. 13. F-measure for all balanced datasets classified with NB.

Being unable to directly evaluate the imputation methodswith these data, the focus here is to understand which methodswork better with which classifiers, in order to help improve theclassification process addressed in other works. In Figures 13,14, 15 and 16, the F-measure metric for each classifier andbalanced datasets (with SMOTE300) are represented. Fromhere it is possible not only to observe which method is thebest for each classifier, but more importantly it is possible toinquire some aspects regarding the relation between imputationmethods particularities and the classifiers performance. It isalso important to have in mind that the results concerningthe original dataset (ORI) serve to evaluate the default man-agement of missing values that each classifier implementationuses, and compare it with the present imputation methods in

test. The default strategies are described in the Section IV-B.

Using the Naive Bayes classifier, the imputations applyingMedian to the whole dataset(MED) and Bbimputation withmedian (as an example the datasets imputed with Bic3SIG andBIC5SIG) improves over Weka’s default method for dealingwith missing values (ORI). Also when using KNN, the defaultmethod to deal with missing values in Weka performs worstthan Bbimputation approaches with median (BIC3SIG andBIC5SIG), EM applied to the whole dataset (EM) and Medianapplied to the whole dataset (MED). Regarding Decision Trees(DT), it is noticeable that the Bbimputation procedure with EMimproves the results over the EM imputation applied featureby feature. As to the linear SVM, the biclustering-basedimputation methods using the by pattern approach ( as an

9

Fig. 14. F-measure for all balanced datasets classified with KNN.

Fig. 15. F-measure for all balanced datasets classified with linearSVM.

example the datasets imputed with BIc3bypattern) and Medianapplied to the whole dataset (MED) improves results over thedefault missing treatment of WEKA. The Bbimputation withby pattern approach also improves results over EM appliedfeature by feature (EMbyfeature) and over Median applied tothe whole dataset(MED).

These conclusions are supported by the results of theWilcoxon signed-rank tests, that compares results of bothexperiments and is able to determine if the values are sig-nificantly different or not. The results of the applied tests in

the form of P values are presented, for each classifier in thetables I, II III and IV. A P value lower than 0.05 indicates thatthe F score mean values for both experiments are different andthus conclusions about performance may be drawn.

V. CONCLUSION

In this work it was studied the problem of missing values inlongitudinal data. Synthetic datasets were generated to test theperformance of imputation methods against ground truth, andto test the influence of the amount of missing values on the

10

Fig. 16. F-measure for all balanced datasets classified with Dt.

Fig. 11. For each dataset it is shown which are the imputation approachesuses (green), the order of its application, and the number of missings eachmethod is able to imput.

imputation methods. The problem was furthered evaluated ina real-world dataset of ALS patients, where the purpose is topredict the evolution of the need of Non Invasive Ventilation(NIV) for assisted breathing.

The problem of imputing missing values in longitudinaldata is approached here by the use of biclustering algorithms.The application of biclustering algorithms here allowed tofind trends in the data, from which better imputations can beperformed. The following conclusions were derived:

• The tested methods are robust to the number of missingvalues even when this amount rises dramatically to 50%of the total data.

• Using biclusters to impute in local portions of the datashows improvement on the quality of the imputation inthe synthetic data as well as in the performance of theclassifications in the real-world data.

• Regarding ALS data the biclustering algorithm requiresdiscretization before being applied and it was found that

TABLE IP VALUE RESULTS OF APPLYING THE WILCOXON SIGNED-RANK TEST TO

THE F SCORE RESULTS OF THE EXPERIMENTS DEFINED AT THE LEFTCOLUMN, TOGETHER WITH THE F SCORE MEAN VALUES FOR EACH

EXPERIMENT.

Naive Bayes

Mean1 Mean2 P value’MEDvsBIC3bypatternMED’ 0,898 0,8947 0,326751396

’MEDvsBIC5bypattern MED’ 0,898 0,8935 0,239517679’MEDvsBIC3SIGTP MED’ 0,898 0,9007 0,567822337

’ORIvsBIC3SIG’ 0,872 0,8777 5,18E-01’ORIvsBIC5SIG’ 0,872 0,8792 3,52E-01

’ORIvsBIC3bypattern’ 0,872 0,8754 0,797367247’ORIvsBIC5bypattern’ 0,872 0,8722 0,791341672

’ORIvsMED’ 0,872 0,898 2,36E-03’ORIvsEM’ 0,872 0,8802 0,400191719

’ORIvsEMbyfeature’ 0,872 0,8754 0,667968919’EMvsBIC3em EM’ 0,8802 0,8654 0,000728388’EMvsBIC5em EM’ 0,8802 0,8671 0,00199244

’EMbyfeature vs BIC3em EMbyfeature’ 0,8754 0,8701 0,820018494’EMbyfeature vs BIC5em EMbyfeature’ 0,8754 0,8763 0,979542991

using from 3 to 5 symbols in the discretization does notchange results significantly.

For the future, it is of interest to analyze how the proposedbiclustering-based imputation approaches as to their capacityto treat different mechanisms of missing values, i.e., comparethe performance in datasets with MCAR, MAR and NMARdata. Also of interest for the future is the application of thesame methods to other real-world datasets such as for theAlzheimer’s disease, to confirm the conclusions drawn in thiswork.

The prediction of the need for NIV in ALS patients basedon the present dataset may be considered a challenging case,where the amount of missing values is significant and the dis-tribution of these between the two classes is quite unbalanced.

11

TABLE IIP VALUE RESULTS OF APPLYING THE WILCOXON SIGNED-RANK TEST TO


EXPERIMENT.

KNN



’ORIvsBIC3SIG’ 0,6696 0,7272 9,60E-06’ORIvsBIC5SIG’ 0,6696 0,7262 1,94E-05


’ORIvsMED’ 0,6696 0,8613 7,56E-10’ORIvsEM’ 0,6696 0,7059 0,0008672



TABLE IIIP VALUE RESULTS OF APPLYING THE WILCOXON SIGNED-RANK TEST TO


EXPERIMENT.

DT



’ORIvsBIC3SIG’ 0,8072 0,8157 0,5657’ORIvsBIC5SIG’ 0,8072 0,8165 0,4418


’ORIvsMED’ 0,8072 0,8218 0,1809’ORIvsEM’ 0,8072 0,786 0,064



TABLE IVP VALUE RESULTS OF APPLYING THE WILCOXON SIGNED-RANK TEST TO


EXPERIMENT.

linear SVM



’ORIvsBIC3SIG’ 0,841 0,8442 0,8135’ORIvsBIC5SIG’ 0,841 0,8461 0,7981


’ORIvsMED’ 0,841 0,8727 0,0093’ORIvsEM’ 0,841 0,8546 0,2249



The general good classification results obtained here make usbelieve that this work is in the right track and contributedpositively to the solution of the problem at hand.

REFERENCES

[1] A. V. Carreiro, S. Pinto, A. M. Carvalho, M. de Carvalho, and S. C.Madeira, “Predicting non-invasive ventilation in als patients using timewindows,” ACM, 2011.

[2] R. J. A. Little and D. B. Rubin, “Missing data,” in Statistical Analysiswith Missing Data, 1987.

[3] P. D. Allison, Missing data. Sage publications, 2001, vol. 136.[4] A. R. T. Donders, G. J. van der Heijden, T. Stijnen, and K. G. Moons,

“Review: a gentle introduction to imputation of missing values,” Journalof clinical epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.

[5] R. J. A. Little and D. B. Rubin, “Filling the missing values,” in StatisticalAnalysis with Missing Data, 1987.

[6] J. M. Engels and P. Diehr, “Imputation of missing longitudinal data:a comparison of methods,” Journal of clinical epidemiology, vol. 56,no. 10, pp. 968–976, 2003.

[7] J. Twisk and W. de Vente, “Attrition in longitudinal studies: how to dealwith missing data,” Journal of clinical epidemiology, vol. 55, no. 4, pp.329–337, 2002.

[8] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biologicaldata analysis: a survey,” Computational Biology and Bioinformatics,IEEE/ACM Transactions on, vol. 1, no. 1, pp. 24–45, 2004.

[9] ——, “A linear time biclustering algorithm for time series gene expres-sion data,” in Algorithms in Bioinformatics. Springer, 2005, pp. 39–52.

[10] T. Schneider, “Analysis of incomplete climate data: Estimation of meanvalues and covariance matrices and imputation of missing values,”Journal of Climate, vol. 14, no. 5, pp. 853–871, 2001.

biclustering-based imputation in longitudinal data · biclustering-based imputation in longitudinal...

Documents