
Hindawi Publishing Corporation
Computational and Mathematical Methods in Medicine
Volume 2014, Article ID 781807, 11 pages
http://dx.doi.org/10.1155/2014/781807

Research Article
On Multilabel Classification Methods of Incompletely Labeled Biomedical Text Data

Anton Kolesov,1,2 Dmitry Kamyshenkov,2 Maria Litovchenko,1,2,3 Elena Smekalova,4 Alexey Golovizin,2 and Alex Zhavoronkov1,2,3

1 Center for Pediatric Hematology, Oncology, and Immunology, Moscow 117997, Russia
2 Moscow Institute of Physics and Technology, Moscow 117303, Russia
3 The Biogerontology Research Foundation, Reading W1J 5NE, UK
4 Chemistry Department, Moscow State University, Moscow 119991, Russia

Correspondence should be addressed to Alex Zhavoronkov; zhavoronkov@biogerontology.org

Received 9 September 2013; Revised 8 December 2013; Accepted 12 December 2013; Published 23 January 2014

Academic Editor: Dejing Dou

Copyright © 2014 Anton Kolesov et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multilabel classification is often hindered by incompletely labeled training datasets: for some items of such a dataset (or even for all of them), some labels may be omitted. In this case, we cannot know whether any item is labeled fully and correctly. When we train a classifier directly on an incompletely labeled dataset, it performs ineffectively. To overcome the problem, we added an extra step, training set modification, before training a classifier. In this paper, we try two algorithms for training set modification: weighted k-nearest neighbor (WkNN) and soft supervised learning (SoftSL). Both of these approaches are based on similarity measurements between data vectors. We performed the experiments on AgingPortfolio (a text dataset) and then rechecked them on Yeast (nontext genetic data). We tried SVM and RF classifiers on the original datasets and then on the modified ones. For each dataset, our experiments demonstrated that both classification algorithms performed considerably better when preceded by the training set modification step.

1. Background and Significance

Multilabel classification with supervised machine learning is a widespread problem in data analysis. However, very often we have to perform multilabel classification when we are not guaranteed that our training set itself is perfectly preclassified. This is especially relevant in the case of national biomedical grants with ambiguous classification schemes. A particular grant may belong to several classes or may be miscategorized in the case of a keyword-based classification scheme.

An interesting illustration is the project titled "Levels of Literacy of Men with Prostate Cancer." This project may be classified by an algorithm as "prostate cancer," "cancer biomarkers," or "cancer education," whereas a researcher would consider it appropriately in relation to literacy. This kind of context makes the generation of training sets more complicated and costly. Many experts need to collaborate extensively in the selection of the full set of document categories from the large number available for classification. Since such collaboration seldom happens, we end up assigning an incomplete set of categories to the training set documents.

When a document that is relevant to a particular class A does not bear its label, it turns into a negative instance of class A during the learning process. As a consequence, the decision rules are distorted and the classification performance degrades.

With an increase in the amount of textual information in the biomedical sphere, such problems become recurrent and need our attention. For example, about half a million new records are added to PubMed each year, and thousands of grant-funded research initiatives are conducted annually around the world. Grant application abstracts are usually made public, and the IARP project adds over 250 thousand new projects each year.



[Figure 1: Decision rules before and after missing label restoration. (a) Before label restoration; (b) after label restoration.]

In addition to classifying publication abstracts and grant databases, the methods described in this paper may be applied to other classification tasks in the biomedical sciences as well.

2. Objective

In this article, we address the problem of classification when, for each training object, some proper labels may be omitted. In order to understand the properties of an incompletely labeled training set and its impact on learning outcomes, let us consider an artificial example.

Figure 1(a) shows the initial training set for the multilabel classification task for 3 classes of points on the plane. In our work, we used Support Vector Machine (SVM) classification, which is a popular classical classification method with a broad range of applications, ranging from text to tumor selection [1], gene expression classification [2], and mitotic cell modeling [3] problems.

With the Binary Relevance [4] approach based on a linear SVM, we can obtain the decision rules for the classes of crosses, circles, and triangles. Please note that this is an example of an error-free classification. Let us assume that object a really belongs to "crosses" and "circles" and object b belongs to "circles" and "triangles." But in real life the training set is often incompletely labeled. Figure 1(a) shows us such a situation, when object a is labeled only as "circle" (the "cross" is missed) and object b is labeled only as "triangle" (the "circle" is missed).

In Figure 1(b), the missing tags and bold lines are added, and the new decision rules for the classes of crosses and circles after recovering the lost tags are depicted. This example shows that, in the case of an incompletely labeled dataset, a decision rule may be quite distorted and have a negative effect on the classification performance.

In order to reduce the negative impact of incompletely labeled datasets, we proposed a special approach based on training set modification that reduces contradictions. After applying the algorithm to a real collection of data, the results of the Support Vector Machine (SVM) and Random Forest (RF) classification schemes improved. Here, RF is known to be one of the most effective machine learning techniques [5-7].

To address the incompleteness of training sets, in this paper we shall describe a new strategy for constructing classification algorithms. On the one hand, the performance of this strategy is evaluated using data collections from the AgingPortfolio resource available on the Web [8]. On the other hand, its effectiveness is confirmed by applying it to the Yeast dataset described below.

Several methods for training set modification, like data cleaning [9], outlier detection [10], reference object selection [11], and hybrid classification algorithms [12], have been proposed to improve performance. To date, the ability of these approaches to support real text classification has not been sufficiently studied. Furthermore, none of these methods of training set modification is suitable for solving classification problems with an incompletely labeled training set.

3. Methods

The proposed algorithms are based on the following three assumptions about the input data.

(1) A large number of training set objects are assumed to have an incomplete set of labels. By definition, a complete set of labels is a set which leads to a perfect consensus among experts regarding the impossibility of further adding or removing a label from a document in the data collection.

(2) Experts are not expected to make an error in assigning category labels to documents. That is to say, the training set generation may involve errors of type 1 only (checking hypotheses of the type "object d belongs to category label cl").

(3) The compactness hypothesis is assumed to hold. This means similar objects are likely to belong to the same categories, forming compact subsets located in the object space. The solution of a classification problem under these assumptions requires that an algorithm treat document relevancy on the basis of data geometry.


We developed alternative approaches because the existing algorithms for training set modification were not designed to work under these assumptions. We used the following two algorithms in our experiments:

(1) the method based on a recent soft supervised learning approach [13] (labeled as "SoftSL");

(2) the weighted k-nearest neighbour classifier algorithm (labeled as "WkNN" [14-16]).

These algorithms use the nearest neighbour set of a document, which is in line with our third assumption.

The first step in the modification of the training set involves the generation of a set PC of document-category relevancy pairs overlooked by the experts:

$$\mathrm{PC} = \{(d, \mathrm{cl}) \mid \psi(d, \mathrm{cl}) = 1\}, \quad (1)$$

where d is a document, cl is a class label (category), and ψ is the function of our training set modification algorithm (WkNN or SoftSL):

$$\psi(d, \mathrm{cl}) = \begin{cases} 1, & \text{if our algorithm places document } d \text{ into class } \mathrm{cl}, \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

Then two possible outcomes are considered:

(1) complete inclusion of PC into the training set (this option was denoted as "add");

(2) exclusion of document d from the negative examples of the category cl for all relevancy pairs (d, cl) ∈ PC (this option was denoted as "del"); see the sketch below.
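A minimal sketch of the two options, assuming labels are stored as per-document sets and that ψ has already produced the set PC of restored pairs (all names here are hypothetical, for illustration only):

```python
# Sketch: applying the "add" and "del" training set modifications,
# given PC, the set of (document, class) relevancy pairs found by psi.

def modify_add(labels, pc_pairs):
    """'add': include every restored pair directly in the training labels."""
    modified = {d: set(cls) for d, cls in labels.items()}
    for d, cl in pc_pairs:
        modified[d].add(cl)
    return modified

def modify_del(labels, pc_pairs):
    """'del': keep the labels unchanged, but record that document d must be
    excluded from the negative examples of class cl for every pair (d, cl)."""
    excluded = {}  # class -> documents not to be used as negatives
    for d, cl in pc_pairs:
        excluded.setdefault(cl, set()).add(d)
    return excluded  # consumed later when building per-class training sets

labels = {"doc1": {"circle"}, "doc2": {"triangle"}}
pc = {("doc1", "cross"), ("doc2", "circle")}
print(modify_add(labels, pc))   # doc1 gains "cross", doc2 gains "circle"
print(modify_del(labels, pc))   # per-class negatives to exclude
```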

The modified training set will still contain the objects which, according to the algorithm ψ, do not belong to the set labeled by the experts. This makes it possible to find further missing labels in the documents of the training set.

3.1. SoftSL Algorithm for Finding Missing Labels. In this section, we outline the application of a new graph algorithm for soft-supervised learning, also called SoftSL [13]. Each document is represented by a vertex within a weighted undirected graph, and our proposed framework minimizes the weighted Kullback-Leibler divergence between distributions that encode the class membership probabilities of each vertex.

The advantages of this graph algorithm include direct applicability to the multilabel categorization problem as well as improved performance compared to alternatives [14]. The main idea of the SoftSL algorithm is the following.

Let D = D_l ∪ D_u be a set of labeled and unlabeled objects, with

$$D_l = \{(x_i, y_i)\}_{i=1}^{l}, \qquad D_u = \{x_i\}_{i=l+1}^{n}, \quad (3)$$

where x_i is the input vector representing the objects to be categorized and y_i is the category label.

Let G = (V, E) be a weighted undirected graph. Here V = {1, ..., n}, where n is the cardinality of D, E = V × V, and ω_ij is the weight of the edge linking objects i and j.

The weight of the edge is defined as

$$w_{ij} = \begin{cases} \mathrm{sim}(x_i, x_j), & \text{if } j \in K(i), \\ 0, & \text{if } j \notin K(i). \end{cases} \quad (4)$$

Here, sim(x_i, x_j) is the measure of similarity between the i-th and j-th objects (e.g., the cosine measure), and K(i) is the set of k nearest neighbours of object x_i.

Each object is associated with a set of probabilities p_i = (p_i^t), t = 1, ..., m, of belonging to each of the m classes L = {cl_1, ..., cl_m}. According to the information from D_l, we determine the probabilities r_i = (r_i^t), t = 1, ..., m, for i = 1, ..., l, that the documents x_1, ..., x_l belong to each of the m classes; thus r_i^t > 0 if (d_i, cl_t) ∈ D_l. Each labeled object therefore has a known set of probabilities r_i assigned by the experts. Our intention is to minimize the misalignment function C_1(p) over sets of probabilities:

$$C_1(p) = \sum_{i=1}^{l} D_{\mathrm{KL}}(r_i \,\|\, p_i) + \mu \sum_{i=1}^{n} \sum_{j \in K(i)} \omega_{ij} D_{\mathrm{KL}}(p_i \,\|\, p_j) - \nu \sum_{i=1}^{n} H(p_i),$$

$$D_{\mathrm{KL}}(p_i \,\|\, p_j) \stackrel{\mathrm{def}}{=} \sum_{t=1}^{m} p_i^t \log \frac{p_i^t}{p_j^t}, \qquad H(p_i) \stackrel{\mathrm{def}}{=} -\sum_{t=1}^{m} p_i^t \log p_i^t, \quad (5)$$

where D_KL(p_i ‖ p_j) denotes the Kullback-Leibler divergence and H(p_i) denotes the entropy.

μ and ν are the parameters of the algorithm, defining the contribution of each term to C_1(p). The meanings of all terms are listed below.

(1) The first term in the expression of C_1 shows how close the generated probabilities are to the ones assigned by the experts.

(2) The second term accounts for the graph geometry and guarantees that objects close to one another on the graph will have similar probability distributions over the classes.

(3) The third term acts as a regularizer where the other terms do not constrain the solution. Its purpose is to produce a regular, uniform probability distribution over the classes.

Numerically, the problem is solved using Alternating Minimization (AM) [13]. Note that D_u is absent in the case of fully labeled data. The minimization of the objective, min_p C_1(p), leads to a set of probabilities p_i = (p_i^1, ..., p_i^m) for each document d_i ∈ D. We introduce a threshold T ∈ [0, 1] to assign additional categories relevant to each document: if p_i^j ⩾ T, then d_i ∈ cl_j.
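As a concrete illustration, the sketch below evaluates objective (5) for given probability matrices. It is not the Alternating Minimization solver of [13], only the objective itself; all names are hypothetical:

```python
import numpy as np

def softsl_objective(p, r, w, labeled, mu=1.0, nu=0.0, eps=1e-12):
    """Evaluate C1(p) from equation (5).
    p: (n, m) class-probability rows for all objects;
    r: (l, m) expert-assigned rows for the labeled objects;
    w: (n, n) kNN edge-weight matrix; labeled: indices of labeled objects."""
    def kl(a, b):  # Kullback-Leibler divergence D_KL(a || b), rowwise
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)

    fit = kl(r, p[labeled]).sum()                    # term 1: match experts
    smooth = sum(w[i, j] * kl(p[i], p[j])            # term 2: graph geometry
                 for i in range(len(p)) for j in np.nonzero(w[i])[0])
    total_entropy = -np.sum(p * np.log(p + eps))     # sum of H(p_i)
    return fit + mu * smooth - nu * total_entropy    # term 3: regularizer
```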


3.2. Weighted kNN Algorithm for Finding Missing Labels. In this section, we shall briefly describe the weighted k-nearest neighbour algorithm [17], which is capable of directly solving the multilabel categorization problem.

Let ρ(d, d′) be a distance function between the documents d and d′. The function which assigns document d to class label cl ∈ L is then defined as

$$S(d, \mathrm{cl}) = \frac{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')\, I(d', \mathrm{cl})}{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')}, \qquad I(d', \mathrm{cl}) = \begin{cases} 1, & \text{if } d' \in \mathrm{cl}, \\ 0, & \text{otherwise.} \end{cases} \quad (6)$$

Here, kNN(d) is the set of k nearest neighbours of document d in the training set.

We introduce a threshold T such that if S(d, cl) ⩾ T, then d ∈ cl. The algorithm then computes S(d, cl) for all possible (d, cl) combinations. Every combination with S(d, cl) ⩾ T is considered to be a missing label and is used to modify the training set.
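A minimal sketch of this restoration step follows. It substitutes cosine similarity for the generic weighting ρ (a reasonable assumption given the normalized vector representation described in Section 4.2.1); the variable names are hypothetical:

```python
import numpy as np

def restore_labels_wknn(X, Y, k=10, T=0.3):
    """Score S(d, cl) from equation (6) for every document/class pair and
    return the pairs whose score clears the threshold T.
    X: (n, f) row-normalized document vectors; Y: (n, m) binary labels."""
    sims = X @ X.T                      # cosine similarity (rows normalized)
    np.fill_diagonal(sims, -np.inf)     # a document is not its own neighbour
    restored = set()
    for d in range(len(X)):
        knn = np.argsort(sims[d])[-k:]  # indices of the k nearest neighbours
        weights = sims[d, knn]
        scores = weights @ Y[knn] / weights.sum()  # S(d, cl) for each class
        for cl in np.nonzero((scores >= T) & (Y[d] == 0))[0]:
            restored.add((d, cl))       # a candidate missing label
    return restored
```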

3.3. Support Vector Machine. We use the linear Support Vector Machine as the classification algorithm in this case. Since SVM is mainly a binary classifier, the Binary Relevance approach is chosen to address multilabel problems. This method implies training a separate decision rule w_l x + b_l > 0 for every category l ∈ L. More details are available in our previous work on methods for structuring scientific knowledge [18]. In our study, as the implementation of SVM, we used the Weka binding of the LIBLINEAR library [19].
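A minimal sketch of the Binary Relevance scheme with per-class thresholds b_l. The study used the Weka binding of LIBLINEAR; scikit-learn's LinearSVC (also LIBLINEAR-backed) is substituted here purely for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_binary_relevance(X, Y, C=1.0):
    """One independent linear SVM per category (Binary Relevance).
    X: (n, f) feature matrix; Y: (n, m) binary label matrix."""
    return [LinearSVC(C=C).fit(X, Y[:, l]) for l in range(Y.shape[1])]

def predict_with_thresholds(models, X, b):
    """Assign label l when w_l . x + b_l > 0; the array b of per-class
    offsets shifts each decision threshold, as tuned in Section 4.4.1."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return (scores + b) > 0
```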

3.4. Random Forest. Random Forest is an ensemble machine learning method which combines tree predictors. In this combination, each tree depends on the values of a random vector sampled independently; all trees in the forest have the same distribution. More details about this method can be found in [20, 21]. In our study, we used the implementation of Random Forest from Weka [22].

4. Experimental Results

In this section, we describe how we performed the text classification experiments. We applied the classification algorithms to the initial (unmodified) training sets as well as to the training sets modified with the "add" or "del" methods.

We shall first discuss the scheme of training set transformation and its usefulness. Then we shall present the process of data generation. Finally, we shall consider the performance measures used in the experiments, the experimental setting, and the results of the parameter estimation and final validation.

4.1. Training Set Modification. The training set modification step is described in detail in Section 3 (Methods). However, it is important to note that, in both cases, documents that do not belong to the set PC according to the relevance algorithm ψ (those too far from the documents labeled with the given category) are still retained in the training set. The reason for this choice is that we assume that experts do not make mistakes of type II (giving a document an odd label); only the omission of a proper label is assumed to occur.

The omission of the relevance pair (d, A) in the training set makes document d move into the set of negative examples when learning a classifier for the class A. This problem alters the decision rule and negatively affects performance. The proposed set modification scheme is designed to avoid such problems during the training session of the classifier.

4.2. Datasets and Data Preprocessing

4.2.1. AgingPortfolio Dataset. The first experiment was carried out using data from the AgingPortfolio information resource [18, 23]. The AgingPortfolio system includes a database of projects related to aging that are funded by the National Institutes of Health (NIH) and the European Commission (EC CORDIS). This database currently contains more than one million projects. Each of its records, written in English, displays information related to the author's name, the title, a brief description of the motivation and research objectives, the name of the organization, and the funding period of the project. Some projects contain additional keywords, with an average description of 100 words. In this experiment, we used only the title, the brief description, and the tag fields.

A taxonomy containing 335 categories in 6 hierarchical levels is used for document classification. Detailed information about the taxonomy is available on the International Aging Research Portfolio Web site [23]. Biomedical experts manually put the category labels on the document training and test sets. In the process, they used a special procedure for labeling the document test set: two sets of categories carefully selected by different experts were assigned to each document of the test set, and a combination of these categories was then used to achieve a more complete category labeling. Different participants, like the AgingPortfolio resource users, created the training set with little control. A visual inspection suggests that the training set contained a significant number of projects with incomplete sets of category labels. The same conclusion is also reached by comparing the average number of categories per project: this average is 4.4 in the sample set compared to 9.79 in the more thoroughly designed test set. The total number of projects was 3,246 for the training set, 183 for the development set, and 1,000 for the test set.

Throughout our study, we used the vector model of text representation. The list of keywords and their combinations from the TerMine [24] system (National Centre for Text Mining, NaCTeM) provided the terms used in our study. The method used in this system combines linguistic and statistical information about candidate terms.

Later, we conducted the analysis and processing of the set of keyword combinations. Whenever short keyword combinations were present in longer ones, the latter were split into shorter ones. The algorithms and the code used for the keyword combination decomposition are available from the AgingPortfolio Web site [8]. According to the results of our previous experiments, the new vectorization method


provided a 3% increase in the F1-measure compared to the general "bag-of-words" model. We assigned feature weights according to the TF-IDF rule in the BM25 formulation [25, 26] and then normalized the vectors representing the documents in the Euclidean metric of the n-dimensional space.
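The paper does not spell out its BM25 parameter values, so the sketch below assumes the conventional Okapi defaults (k1 = 1.2, b = 0.75). It is a generic illustration of TF-IDF weighting in the BM25 formulation followed by Euclidean normalization, not the authors' exact pipeline:

```python
import numpy as np

def bm25_weights(tf, df, doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term weighting followed by Euclidean normalization.
    tf: (n_docs, n_terms) raw term counts; df: (n_terms,) document
    frequencies; doc_len: (n_docs,) document lengths."""
    n_docs = tf.shape[0]
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1 - b + b * doc_len / doc_len.mean())
    w = tf * (k1 + 1) / (tf + norm[:, None]) * idf
    # unit-length document vectors in the Euclidean metric
    return w / (np.linalg.norm(w, axis=1, keepdims=True) + 1e-12)
```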

4.2.2. Yeast Dataset. The Yeast dataset [27] is a biomedical dataset of Yeast genes divided into 14 different functional classes. Each instance in the dataset is a gene represented by a vector whose features are the microarray expression levels under various conditions. We used it to check whether our methods are suitable for the classification of genetic information as well as for textual data.

Let us describe the method for modeling an incomplete dataset. Since the dataset is well annotated and widely used, the objects (genes) have complete sets of category labels. By randomly deleting labels from documents, we modeled incomplete sets of labels in the training set. A parameter p was introduced as the fraction of deleted labels.

We deleted labels under the following conditions (a code sketch of this procedure is given after the list):

(1) for each class, p represents the fraction of deleted labels; we keep the distribution of labels over categories after modeling the incomplete sets of labels;

(2) the number of objects in the training set remains the same; at least one label is preserved on each object after the label deletion process.
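One way these conditions might be implemented (names hypothetical; the deletion can fall slightly short of the exact fraction p when removing a label would strip an object's last one):

```python
import random

def delete_labels(labels, p, seed=0):
    """Randomly delete about a fraction p of the labels of each class while
    keeping at least one label on every object.
    labels: dict mapping object id -> set of class labels."""
    rng = random.Random(seed)
    out = {o: set(cls) for o, cls in labels.items()}
    classes = {cl for cls in labels.values() for cl in cls}
    for cl in classes:
        members = [o for o in out if cl in out[o]]
        rng.shuffle(members)
        to_delete = int(p * len(members))  # per-class deletion quota
        for o in members:
            if to_delete == 0:
                break
            if len(out[o]) > 1:            # never strip the last label
                out[o].discard(cl)
                to_delete -= 1
    return out
```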

No preprocessing step was necessary because the data is supplied already prepared as a matrix of numbers [27].

4.3. Performance Measurements. The following characteristics were used to evaluate and compare the different classification algorithms:

(i) microaveraged precision, recall, and F1-measure [27];

(ii) CROC curves and their AUC values computed for selected categories [28]. A CROC curve is a modification of a ROC curve in which the x-axis is rescaled as x_new(x). We used a standard exponential scaling x_new(x) = (1 − e^(−αx))/(1 − e^(−α)) with α = 7.
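In code form, this rescaling of the false positive rate axis is a one-liner:

```python
import numpy as np

def croc_x(x, alpha=7.0):
    """Exponential CROC rescaling of the false positive rate axis."""
    return (1 - np.exp(-alpha * np.asarray(x))) / (1 - np.exp(-alpha))
```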

4.4. Experimental Setup. The procedures for selecting the important parameters of the outlined algorithms are described next.

4.4.1. Parameters for SVM

AgingPortfolio Dataset. The following SVM parameters were tuned for each decision rule:

(i) the cost parameter C, which controls a trade-off between maximization of the separation margin and minimization of the total error [15];

(ii) the parameter b_l, which plays the role of a classification threshold in the decision rule.

We performed the parameter tuning by using a sliding control method with 5-fold cross-validation according to the following strategy. The C-parameter was varied on a grid, followed by b_l-parameter tuning (for every value of C) for every category. A set of parameter pairs {(C, b_l)}, l ∈ L, was considered optimal if it maximized the F1-measure with averaging over documents. While C has the same value for all categories, the b_l threshold parameter was tuned (for a given value of C) for each class label l ∈ L.

Yeast Dataset. In the experiments with the Yeast dataset, the selection of SVM parameters was not performed (i.e., C = 1, b_l = 0 for all values of l ∈ L).

4.4.2. Parameters for RF. Thirty decision trees were used to build the Random Forest. The number of inputs to consider while splitting a tree node was set to the square root of the number of features. The procedure follows [29] by Leo Breiman, who developed the Random Forest algorithm.
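The same configuration, sketched with scikit-learn in place of the Weka implementation actually used (a substitution for illustration only):

```python
from sklearn.ensemble import RandomForestClassifier

# 30 trees; sqrt(number of features) candidate inputs per split,
# matching the stated configuration.
rf = RandomForestClassifier(n_estimators=30, max_features="sqrt")
```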

4.4.3. AgingPortfolio Dataset. The parameters k and T were tuned on a grid as follows. We prepared a validation set D_dev of 183 documents, as in the case of the test set. The performance metrics for SVM classifiers trained on modified document sets were then evaluated on D_dev. A combination of parameters was considered optimal if it maximized the F1-measure. The parameter μ of the SoftSL algorithm was tuned on a grid, keeping k fixed at its optimal value. A fixed category assignment threshold of T = 0.005 was used for the SoftSL training set modification algorithm. We used ν = 0, since all documents in the experiments contained some category labels and regularization was unnecessary.

4.4.4. Yeast Dataset. The method for selecting the parameters for the Yeast dataset is the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) randomly selected genes for training. The SoftSL training set modification algorithm was not used for this dataset.

4.5. A Comparison of Methods

4.5.1. AgingPortfolio. We evaluated the general performance based on a total of 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.

Also, all of our training set modification methods pay lower precision for higher recall values. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.

The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly; as a result, a larger number of relevancy tags are added. This is consistent with the higher recall and lower precision values of add+SoftSL and del+SoftSL, as compared to the WkNN-based methods, in Table 1.

Table 1: Microaveraged precision, recall, and F1-measure obtained on the AgingPortfolio dataset with different classification methods.

Method                      Precision  Recall  F1
SVM with fixed parameters   0.8649     0.1983  0.2977
SVM with parameter tuning   0.7727     0.3302  0.4159
SVM del+WkNN                0.5538     0.4452  0.4439
SVM add+WkNN                0.4664     0.5684  0.4707
SVM del+SoftSL              0.2132     0.6914  0.3259
SVM add+SoftSL              0.3850     0.5639  0.4576

Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

Modification method  Categories per doc.  Docs per category
No modification      4.4                  45.09
Add+WkNN             15.15                155.14
Add+SoftSL           16.6                 168.87

Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without training set modification, SVM with del+WkNN modification, and SVM with add+WkNN modification. We can notice that SVM classification with the incorporated training set modification outperforms plain SVM classification.

AUC values calculated for the del+WkNN curves are generally only slightly lower and in some cases even exceed the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.

CROC curves for the add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most of the cases, the results are nearly equivalent. Sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.

Additional data relevant to the algorithm comparison is presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for different categories taken with different methods. These results show in greater relief that add+WkNN outperforms the other methods.

Some values of the metrics of the Random Forest classification experiments are provided in Table 7. The results in Table 7 show that the modification of the training sets improves the classification performance in this case as well.

4.5.2. Yeast Dataset: A Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is related to one or more of the 14 labels (1st FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into training and test sets, so that the first set contains 1,500 examples and the second contains 917.

The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created 6 different training sets by deleting a varying fraction (p) of document-class pairs. The concrete document-class pairs for deletion were selected randomly.

Table 3: Microaveraged results for the category "experimental techniques, in vivo methods" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.6        0.12    0.2
With del+WkNN    0.7879     0.26    0.391
With add+WkNN    0.66       0.33    0.44
With del+SoftSL  0.2140     0.64    0.3208
With add+SoftSL  0.4653     0.47    0.4677

Table 4: Microaveraged results for the category "cancer and related diseases, malignant neoplasms including in situ" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.4444     0.4     0.4211
With del+WkNN    0.1277     0.9     0.2236
With add+WkNN    0.24       0.9     0.3789
With del+SoftSL  0.0549     1.0     0.1040
With add+SoftSL  0.1032     0.975   0.1866

Table 5: Microaveraged results for the category "aging mechanisms by anatomy, cell level" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.5167     0.3827  0.4397
With del+WkNN    0.5946     0.2716  0.3729
With add+WkNN    0.4123     0.5802  0.4821
With del+SoftSL  0.5342     0.4815  0.5065
With add+SoftSL  0.4182     0.5679  0.4817

Table 6: Microaveraged results for the category "aging mechanisms by anatomy, cell level, cellular substructures" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.52       0.2167  0.3059
With del+WkNN    1.0        0.0167  0.0328
With add+WkNN    0.4493     0.5167  0.4806
With del+SoftSL  0.75       0.15    0.25
With add+SoftSL  0.3667     0.55    0.44

Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification with different training set modifications.

Method         Precision  Recall  F1
RF only        0.4738     0.2033  0.2467
With del+WkNN  0.4852     0.2507  0.2870
With add+WkNN  0.3058     0.4255  0.3194

We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by the SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here, p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.

[Figure 2: CROC curves (recall versus transformed false positive rate) for different categories of AgingPortfolio; SVM classification with WkNN analysis. Each panel also shows the random-selection baseline. (a) "Experimental techniques, in vivo methods": SVM only AUC = 0.3919, SVM add+WkNN AUC = 0.4323, SVM del+WkNN AUC = 0.4181. (b) "Cancer and related diseases, malignant neoplasms (including in situ)": SVM only AUC = 0.6830, SVM add+WkNN AUC = 0.7484, SVM del+WkNN AUC = 0.7559. (c) "Aging mechanisms by anatomy, cell level": SVM only AUC = 0.4843, SVM add+WkNN AUC = 0.5340, SVM del+WkNN AUC = 0.5238. (d) "Aging mechanism by anatomy, cell level, cellular substructures": SVM only AUC = 0.4892, SVM add+WkNN AUC = 0.5603, SVM del+WkNN AUC = 0.5460.]

The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1 measure in comparison with the raw SVM results (Table 8).

A Notable Fact. Add+WkNN slightly reduced the precision for low p, but in the worst cases, with p = 0.3 and p = 0.4, the precision even rose. However, the significant improvement of recall in all cases is a good trade-off. Recall also significantly improved when the RF algorithm was used in addition to this method (Tables 10 and 11).

5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results on incompletely labeled datasets (i.e., when, for each document in our training set, some correct labels may be omitted).

[Figure 3: CROC curves (recall versus transformed false positive rate) for different categories of AgingPortfolio; SVM classification with SoftSL analysis. Each panel also shows the random-selection baseline. (a) "Experimental techniques, in vivo methods": SVM only AUC = 0.3904, SVM add+SoftSL AUC = 0.4747, SVM del+SoftSL AUC = 0.4498. (b) "Cancer and related diseases, malignant neoplasms (including in situ)": SVM only AUC = 0.6818, SVM add+SoftSL AUC = 0.6374, SVM del+SoftSL AUC = 0.7018. (c) "Aging mechanisms by anatomy, cell level": SVM only AUC = 0.4847, SVM add+SoftSL AUC = 0.5447, SVM del+SoftSL AUC = 0.5423. (d) "Aging mechanism by anatomy, cell level, cellular substructures": SVM only AUC = 0.4900, SVM add+SoftSL AUC = 0.5757, SVM del+SoftSL AUC = 0.5587.]

The case of an incompletely labeled dataset strikingly differs from the PU-learning setting (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in PU training, some of the dataset items are considered fully labeled, and the other items are not labeled at all.

To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL.

Both of these approaches are intended to restore the missing document labels using similarity measurements between each given document and other documents with similar labels.

We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than SoftSL: it has a better F1-measure, and it is simpler to implement.

[Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification; CROC curves for different categories of AgingPortfolio, with the random-selection baseline. (a) "Experimental techniques, in vivo methods": SVM add+SoftSL AUC = 0.4752, SVM add+WkNN AUC = 0.4311. (b) "Cancer and related diseases, malignant neoplasms (including in situ)": SVM add+SoftSL AUC = 0.6380, SVM add+WkNN AUC = 0.7375. (c) "Aging mechanisms by anatomy, cell level": SVM add+SoftSL AUC = 0.5454, SVM add+WkNN AUC = 0.5348. (d) "Aging mechanism by anatomy, cell level, cellular substructures": SVM add+SoftSL AUC = 0.5757, SVM add+WkNN AUC = 0.5541.]


Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.

One more problem is the evaluation of the classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result should be reviewed by the experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.


Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p    Precision  Recall  F1
0    0.7176     0.5707  0.6358
0.1  0.7337     0.5233  0.6109
0.2  0.7354     0.4056  0.5229
0.3  0.6260     0.2442  0.3513
0.4  0.3544     0.1191  0.1783
0.5  0          0       0
0.6  0          0       0

Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p    Optimal k  Optimal T  Precision  Recall  F1
0    —          —          —          —       —
0.1  10         0.3        0.6582     0.6847  0.6712
0.2  10         0.25       0.6525     0.6811  0.6665
0.3  10         0.15       0.6357     0.7137  0.6725
0.4  10         0.1        0.6604     0.669   0.6648
0.5  10         0.05       0.6225     0.7259  0.6702
0.6  10         0.05       0.6248     0.7261  0.6716

Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p    Precision  Recall  F1
0    0.6340     0.5087  0.5315
0.1  0.6081     0.4613  0.4959
0.2  0.5693     0.3648  0.4133
0.3  0.5788     0.3240  0.3873
0.4  0.5068     0.2471  0.3094
0.5  0.5194     0.2431  0.3104
0.6  0.4354     0.1621  0.2224

Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p    Optimal k  Optimal T  Precision  Recall  F1
0    —          —          —          —       —
0.1  10         0.3        0.5940     0.7120  0.6224
0.2  10         0.25       0.5626     0.7462  0.6146
0.3  10         0.15       0.5282     0.7992  0.6095
0.4  10         0.10       0.5223     0.7648  0.5940
0.5  10         0.05       0.4672     0.8388  0.5726
0.6  15         0.05       0.4547     0.8695  0.5707


Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset. We did this with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.

Moreover, the experimental results presented in Section 4.5.2 prove a notable aspect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.

Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.

Conflict of Interests

The authors declare that there is no conflict of interests

Acknowledgments

The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom, Inc., and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.

References

[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.

[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.

[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.

[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.

[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.

[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.

[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.

[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.

[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.

[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.

[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.

[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.

[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.

[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.

[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.

[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.

[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.

[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.

[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.

[23] International Aging Research Portfolio, http://agingportfolio.org.

[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multi-word terms," International Journal on Digital Libraries, vol. 3, pp. 117–132, 2000.

[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.

[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.

[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.

[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, pp. 1348–1356, 2010.

[29] L. Breiman and A. Cutler, Random Forests manual, 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.

[30] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems, vol. 14, 2001.


2 Computational and Mathematical Methods in Medicine

b c

a

L998400

o

L998400

x

L998400

Δ

(a) Before label restoration

b c

a

L998400

o

L998400

x

Lx

(b) After label restoration

Figure 1 Decision rules before and after missing label restoration

In addition to classifying publication abstracts and grantdatabases methods described in this paper may be applied toother classification tasks in biomedical sciences as well

2 Objective

In this article we address the problem of classification whenfor each training object some proper labels may be omittedIn order to understand the properties of incompletely labeledtraining set and its impact on learning outcomes let usconsider an artificial example

Figure 1(a) shows the initial training set for themultilabelclassification task for 3 classes of points on the plane In ourwork we used Support Vector Machine (SVM) classificationwhich is a popular classical classificationmethodwith a broadrange of applications ranging from text to tumor selection [1]gene expression classification [2] and mitotic cell modeling[3] problems

With the Binary Relevance [4] approach based on a linearSVM we can obtain the decision rules for the classes ofcrosses circles and triangles Please note that this is anexample of an error-free classification Let us assume thatobject a really belongs to ldquocrossesrdquo and ldquocirclesrdquo and object abelongs to ldquocrossesrdquo and ldquotrianglesrdquo But in real life the trainingset is often incompletely labeled Figure 1(a) shows us sucha situation when object a is labeled only as ldquocirclerdquo (ldquocrossrdquois missed) and object b is labeled only as ldquotrianglerdquo (missedldquocirclerdquo)

In Figure 1(b) the missing tags and bold lines are addedand the newdecision rules for the classes of crosses and circlesafter recovering lost tags are depicted This example showsthat in the case of incompletely labeled dataset a decisionrule may be quite distorted and have a negative effect on theclassification performance

In order to reduce the negative impact of incompletelylabeled datasets we proposed a special approach based ontraining set modification that reduces contradictions Afterapplying the algorithm to a real collection of data the resultsof the Support Vector Machine (SVM) and Random Forest(RF) classification schemes improved Here RF is known to

be one of the most effective machine learning techniques[5ndash7]

To address the incompleteness of training sets in thispaper we shall describe a new strategy for constructingclassification algorithms On the one hand the performanceof this strategy is evaluated using data collections from theAgingPortfolio resource available on the Web [8] On theother hand its effectiveness is confirmed by applying it to theYeast dataset described below

Several methods like data cleaning [9] outlier detection[10] reference object selection [11] and hybrid classifica-tion algorithms [12] for improving performance have beenproposed for training set modification To date the abilityof these approaches to provide real text classification hasnot been sufficiently studied Furthermore none of thesemethods of training set modification is suitable for solvingclassification problems with an incompletely labeled trainingset

3 Methods

The already-proposed algorithms are based on the followingthree assumptions about totally new input classifier data

(1) A large number of training set objects are assumedto have an incomplete set of labels By definitiona complete set of labels is a set which leads toa perfect consensus among experts regarding theimpossibility of further adding or removing a labelfrom a document in the data collection

(2) Experts are not expected tomake an error in assigningcategory labels to documents That is to say thetraining set generation may involve errors of type1 only (checking hypotheses of the type ldquoobject 119889belongs to category label clrdquo)

(3) The compactness hypothesis is assumed to hold Thismeans similar objects are likely to belong to the samecategories as compact subsets located in the objectspace The solution of a classification problem underthese assumptions requires that an algorithm treatdocument relevancy on the basis of data geometry

Computational and Mathematical Methods in Medicine 3

We developed alternative approaches because the algo-rithms for training set modification were not designed towork with these assumptions Then we used the followingtwo detailed algorithms in our experiments

(1) themethod based on a recent soft supervised learningapproach [13] (was labeled as ldquoSoftSLrdquo)

(2) weighted k-nearest neighbour classifier algorithm(labeled as ldquoWkNNrdquo [14ndash16])

These algorithms use the nearest neighbour set of a doc-ument which is in line with our third assumption

The first step in the modification of the training setinvolves the generation of a set PC of document-categoryrelevancy pairs overlooked by the experts

PC = (119889 cl) | 120595 (119889 cl) = 1 (1)

where 119889 is a document cl is a class label (category) and 120595

is the function of our training set modification algorithm(WkNN or SoftSL)

Consider

120595 (119889 cl)

= 1 if our algorithm places document 119889 into class cl0 otherwise

(2)

Then two possible outcomes are considered

(1) complete inclusion of PC into the training set (optionwas denoted as ldquoaddrdquo)

(2) exclusion of document 119889 from the negative examplesof the category cl for all relevancy pairs (119889 cl) isin PC(option was denoted as ldquodelrdquo)

The modified training set will still contain the objectswhich according to the algorithm 120595 do not belong to the setlabeled by the expert It is possible to find the next missinglabels in the documents of the training set

31 SoftSL Algorithm for Finding Missing Labels In thissection we outline the application of a new graph algorithmfor Soft-supervised learning also called SoftSL [13] Eachdocument is represented by a vertex within a weighted undi-rected graph and our proposed framework minimizes theweighted Kullback-Leibler divergence between distributionsthat encode the classmembership probabilities of each vertex

The advantages of this graph algorithm include directapplicability to the multilabel categorization problem as wellas improved performance compare to alternatives [14] Themain idea of the SoftSL algorithm is the following

Let $D = D_l \cup D_u$ be a set of labeled and unlabeled objects, with

$$D_l = \{(x_i, y_i)\}_{i=1}^{l}, \qquad D_u = \{x_i\}_{i=l+1}^{n}, \qquad (3)$$

where $x_i$ is the input vector representing the object to be categorized and $y_i$ is its category label.

Let $G = (V, E)$ be a weighted undirected graph. Here $V = \{1, \ldots, n\}$, where $n$ is the cardinality of $D$, $E = V \times V$, and $w_{ij}$ is the weight of the edge linking objects $i$ and $j$.

The weight of the edge is defined as

$$w_{ij} = \begin{cases} \mathrm{sim}(x_i, x_j), & \text{if } j \in K(i), \\ 0, & \text{if } j \notin K(i). \end{cases} \qquad (4)$$

Here $\mathrm{sim}(x_i, x_j)$ is the measure of similarity between the $i$th and $j$th objects (e.g., the cosine measure), and $K(i)$ is the set of $k$ nearest neighbours of object $x_i$.

Each object is associated with a set of probabilities $p_i = (p_i^t)_{t=1}^{m}$ of belonging to each of the $m$ classes $L = \{\mathrm{cl}_t\}_{t=1}^{m}$. From the information in $D_l$, we determine the probabilities $r_i = (r_i^t)_{t=1}^{m}$, $i = 1, \ldots, l$, that the labeled documents $\{x_i\}_{i=1}^{l}$ belong to each of the $m$ classes; thus $r_i^t > 0$ if $(d_i, \mathrm{cl}_t) \in D_l$. Each labeled object therefore has a known set of probabilities $r_i$ assigned by the experts. Our intention is to minimize the misalignment function $C_1(p)$ over the sets of probabilities:

$$C_1(p) = \sum_{i=1}^{l} D_{\mathrm{KL}}(r_i \parallel p_i) + \mu \sum_{i=1}^{n} \sum_{j \in K(i)} w_{ij}\, D_{\mathrm{KL}}(p_i \parallel p_j) - \nu \sum_{i=1}^{n} H(p_i),$$

$$D_{\mathrm{KL}}(p_i \parallel p_j) \stackrel{\text{def}}{=} \sum_{t=1}^{m} p_i^t \log \frac{p_i^t}{p_j^t}, \qquad H(p_i) \stackrel{\text{def}}{=} -\sum_{t=1}^{m} p_i^t \log p_i^t, \qquad (5)$$

where $D_{\mathrm{KL}}(p_i \parallel p_j)$ denotes the Kullback-Leibler divergence and $H(p_i)$ denotes the entropy.

The constants $\mu$ and $\nu$ are the parameters of the algorithm, defining the contribution of each term to $C_1(p)$. The meanings of the terms are listed below.

(1) The first term in the expression of $C_1$ shows how close the generated probabilities are to the ones assigned by the experts.

(2) The second term accounts for the graph geometry and guarantees that objects close to one another in the graph will have similar probability distributions over classes.

(3) The third term is a regularizer included for cases where the other terms in the expression do not constrain the solution; its purpose is to produce a regular, uniform probability distribution over classes.

Numerically, the problem is solved using Alternating Minimization (AM) [13]. Note that $D_u$ is empty in our setting, since there are no unlabeled objects. The minimization of the objective, $\min_p C_1(p)$, yields a set of probabilities $p_i = (p_i^1, \ldots, p_i^m)$ for each document $d_i \in D$. We introduce a threshold $T \in [0, 1]$ to assign additional relevant categories to each document: if $p_i^j \geqslant T$, then $d_i \in \mathrm{cl}_j$.
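For illustration, the objective (5) can be evaluated as follows. This is a sketch under an assumed dense-matrix representation and is not the Alternating Minimization solver itself.

```python
import numpy as np

def c1_objective(P, R, W, labeled_idx, mu=1.0, nu=0.0, eps=1e-12):
    """Sketch of the SoftSL misalignment function C1(p) from Eq. (5).

    P: (n, m) row-stochastic matrix of class probabilities p_i.
    R: (l, m) expert-assigned probabilities r_i for the labeled objects.
    W: (n, n) symmetric kNN similarity matrix, w_ij from Eq. (4).
    labeled_idx: indices of the l labeled objects within P.
    """
    def kl(a, b):  # Kullback-Leibler divergence D_KL(a || b)
        return np.sum(a * np.log((a + eps) / (b + eps)))

    def entropy(a):
        return -np.sum(a * np.log(a + eps))

    # term 1: closeness of the generated probabilities to the expert ones
    fit = sum(kl(R[t], P[i]) for t, i in enumerate(labeled_idx))
    # term 2: graph smoothness over the kNN edges
    smooth = sum(W[i, j] * kl(P[i], P[j])
                 for i in range(P.shape[0]) for j in np.nonzero(W[i])[0])
    # term 3: entropy regularizer (nu = 0 disables it, as in our experiments)
    reg = sum(entropy(P[i]) for i in range(P.shape[0]))
    return fit + mu * smooth - nu * reg
```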


3.2. Weighted kNN Algorithm for Finding Missing Labels. In this section we briefly describe the weighted k-nearest neighbour algorithm [17], which is capable of directly solving the multilabel categorization problem.

Let $\rho(d, d')$ be a distance function between the documents $d$ and $d'$. The function which scores the assignment of document $d$ to class label $\mathrm{cl} \in L$ is then defined as

$$S(d, \mathrm{cl}) = \frac{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')\, I(d', \mathrm{cl})}{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')}, \qquad I(d', \mathrm{cl}) = \begin{cases} 1, & \text{if } d' \in \mathrm{cl}, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

Here $\mathrm{kNN}(d)$ is the set of $k$ nearest neighbours of document $d$ in the training set.

We introduce a threshold $T$ such that if $S(d, \mathrm{cl}) \geqslant T$, then $d \in \mathrm{cl}$. The algorithm computes $S(d, \mathrm{cl})$ for all possible $(d, \mathrm{cl})$ combinations; every combination with $S(d, \mathrm{cl}) \geqslant T$ is considered a missing label and is used to modify the training set.
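A compact sketch of this restoration step is given below, with a Euclidean $\rho$ and brute-force neighbour search; the parameter defaults are illustrative only. Note that, exactly as in (6), the neighbour votes are weighted by the distance $\rho$ itself rather than by its inverse.

```python
import numpy as np

def wknn_missing_labels(X, label_sets, classes, k=10, T=0.3):
    """Sketch of WkNN label restoration (Eq. (6)).

    X: (n, d) array of document vectors; label_sets: list of label sets.
    Returns the set PC of (document index, class) pairs with S(d, cl) >= T.
    """
    # pairwise Euclidean distances, rho(d, d') in Eq. (6)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a document is not its own neighbour
    pc = set()
    for i in range(len(X)):
        nn = np.argsort(dists[i])[:k]   # k nearest neighbours of document i
        weights = dists[i, nn]
        denom = weights.sum()
        for cl in classes:
            votes = np.array([cl in label_sets[j] for j in nn], dtype=float)
            score = weights @ votes / denom   # S(d, cl)
            if score >= T and cl not in label_sets[i]:
                pc.add((i, cl))
    return pc
```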

3.3. Support Vector Machine. We use the linear Support Vector Machine as a classification algorithm. Since SVM is essentially a binary classifier, the Binary Relevance approach is chosen to address the multilabel problem. This method implies training a separate decision rule $w_l x + b_l > 0$ for every category $l \in L$. More details are available in our previous work on methods for structuring scientific knowledge [18]. As the implementation of SVM we used the Weka binding of the LIBLINEAR library [19].
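As an illustration only, the same Binary Relevance scheme can be reproduced with scikit-learn, whose LinearSVC is likewise backed by LIBLINEAR; here X_train, X_test, and train_label_sets are assumed to be prepared elsewhere.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC  # LIBLINEAR-backed linear SVM

# Binary Relevance: one independent linear decision rule per category.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_label_sets)   # (n_docs, n_classes) 0/1
clf = OneVsRestClassifier(LinearSVC(C=1.0))     # C is tuned in Section 4.4.1
clf.fit(X_train, Y_train)
predicted_label_sets = mlb.inverse_transform(clf.predict(X_test))
```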

3.4. Random Forest. Random Forest is an ensemble machine learning method which combines tree predictors. In this combination, each tree depends on the values of a random vector sampled independently, and all trees in the forest have the same distribution. More details about this method can be found in [20, 21]. In our study we used the implementation of Random Forest from Weka [22].

4. Experimental Results

In this section we describe how we performed the text classification experiments. We applied the classification algorithms to the initial (unmodified) training sets as well as to the training sets modified with the "add" or "del" methods.

We first discuss the scheme of training set transformation and its usefulness. Then we present the process of data generation. Finally, we consider the performance measures used in the experiments, the experimental setting, and the results of the parameter estimation and final validation.

4.1. Training Set Modification. The training set modification step is described in detail in Section 3 (Methods). However, it is important to notice that in both cases, documents that do not belong to the set PC according to the relevance algorithm $\psi$ (i.e., that are too far from the documents labeled with the given category) are still retained in the training set. The reason for this choice is our assumption that experts do not make mistakes of type II (assigning a document a spurious label); only the absence of a proper label is expected to occur.

The omission of the relevance pair $(d, A)$ in the training set makes document $d$ move into the set of negative examples when learning a classifier for class $A$. This distorts the decision rule and negatively affects performance. The proposed set modification scheme is designed to avoid such problems during classifier training.

4.2. Datasets and Data Preprocessing

4.2.1. AgingPortfolio Dataset. The first experiment was carried out using data from the AgingPortfolio information resource [18, 23]. The AgingPortfolio system includes a database of projects related to aging funded by the National Institutes of Health (NIH) and the European Commission (EC CORDIS). This database currently contains more than one million projects. Each of its records, written in English, contains the author's name, the title, a brief description of the motivation and research objectives, the name of the organization, and the funding period of the project. Some projects contain additional keywords, and the average description length is about 100 words. In this experiment we used only the title, brief description, and tag fields.

A taxonomy containing 335 categories arranged in 6 hierarchical levels is used for document classification. Detailed information about the taxonomy is available on the International Aging Research Portfolio Web site [23]. Biomedical experts manually assigned the category labels to the document training and test sets. A special procedure was used for labeling the test set: two sets of categories, carefully selected by different experts, were assigned to each document of the test set, and a combination of these categories was used to achieve a more complete category labeling. The training set, in contrast, was created with little control by various participants, such as users of the AgingPortfolio resource. A visual inspection suggests that the training set contained a significant number of projects with incomplete sets of category labels. The same conclusion is reached by comparing the average number of categories per project: 4.4 in the training set compared to 9.79 in the more thoroughly labeled test set. The total number of projects was 3,246 for the training set, 183 for the development set, and 1,000 for the test set.

Throughout our study we used the vector model of text representation. The terms used in our study were the keywords and keyword combinations provided by the TerMine system [24] (National Centre for Text Mining, NaCTeM). The method used in this system combines linguistic and statistical information about candidate terms.

We then analyzed and processed the set of keyword combinations: whenever short keyword combinations were present within longer ones, the latter were split into the shorter ones. The algorithms and the code used for the keyword combination decomposition are available from the AgingPortfolio Web site [8]. According to the results of our previous experiments, this vectorization method


provided a 3% increase in the F1-measure compared to the general "bag-of-words" model. We assigned feature weights according to the TF-IDF rule in the BM25 formulation [25, 26] and then normalized the vectors representing the documents in the Euclidean metric of the n-dimensional space.
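A sketch of this weighting step follows; the BM25 constants k1 and b are not stated in the paper, so the common defaults below are assumptions.

```python
import numpy as np

def bm25_tfidf(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    """BM25-style TF-IDF weighting with Euclidean (L2) normalization.

    tf: (n_docs, n_terms) raw term counts; df: (n_terms,) document
    frequencies; doc_len: (n_docs,) document lengths in terms.
    """
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_len)[:, None]
    weights = tf * (k1 + 1.0) / (tf + length_norm) * idf
    # normalize each document vector in the Euclidean metric
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.maximum(norms, 1e-12)
```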

4.2.2. Yeast Dataset. The Yeast dataset [27] is a biomedical dataset of yeast genes divided into 14 functional classes. Each instance in the dataset is a gene represented by a vector whose features are the microarray expression levels under various conditions. We used it to check whether our methods are suitable for the classification of genetic information as well as textual data.

Let us describe the method for modeling an incomplete dataset. Since the dataset is well annotated and widely used, the objects (genes) have complete sets of category labels. By randomly deleting labels from objects, we modeled incomplete sets of labels in the training set. The parameter p was introduced as the fraction of deleted labels.

We deleted labels under the following conditions:

(1) for each class, p represents the fraction of deleted labels, so the distribution of labels over categories is preserved after modeling the incomplete label sets;

(2) the number of objects in the training set remains the same: at least one label is preserved for every object during the label deletion process (see the sketch below).
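A minimal sketch of this per-class deletion procedure, assuming labels are stored as one set per object:

```python
import random

def delete_labels(label_sets, p, seed=0):
    """Randomly delete a fraction p of the labels of each class while
    keeping at least one label per object (Yeast incompleteness model)."""
    rng = random.Random(seed)
    out = [set(ls) for ls in label_sets]
    for cl in set().union(*label_sets):
        n_del = int(p * sum(cl in ls for ls in label_sets))
        # objects that can lose this label without becoming label-free
        candidates = [i for i, ls in enumerate(out)
                      if cl in ls and len(ls) > 1]
        for i in rng.sample(candidates, min(n_del, len(candidates))):
            out[i].discard(cl)
    return out
```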

No preprocessing step was necessary, because the data are supplied already prepared as a matrix of numbers [27].

4.3. Performance Measurements. The following characteristics were used to evaluate and compare the classification algorithms:

(i) microaveraged precision, recall, and F1-measure [27];
(ii) CROC curves and their AUC values, computed for selected categories [28]. A CROC curve is a modification of a ROC curve in which the $x$ axis is rescaled as $x_{\text{new}}(x)$. We used the standard exponential scaling $x_{\text{new}}(x) = (1 - e^{-\alpha x})/(1 - e^{-\alpha})$ with $\alpha = 7$.
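The axis rescaling itself is straightforward; a minimal sketch:

```python
import numpy as np

def croc_transform(fpr, alpha=7.0):
    """Exponential x-axis rescaling that turns a ROC curve into a CROC
    curve [28]; fpr is an array of false positive rates in [0, 1]."""
    return (1.0 - np.exp(-alpha * fpr)) / (1.0 - np.exp(-alpha))
```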

4.4. Experimental Setup. The procedures for selecting the important parameters of the algorithms are described next.

4.4.1. Parameters for SVM

AgingPortfolio Dataset. The following SVM parameters were tuned for each decision rule:

(i) the cost parameter C, which controls the trade-off between maximization of the separation margin and minimization of the total error [15];

(ii) the parameter b_l, which plays the role of a classification threshold in the decision rule.

We performed parameter tuning using a sliding control method with 5-fold cross-validation according to the following strategy. The C parameter was varied on a grid, followed by tuning of the b_l parameter (for every value of C) for every category. A set of parameter pairs {(C, b_l)}, l ∈ L, was considered optimal if it maximized the F1-measure with averaging over documents. While C has the same value for all categories, the b_l threshold parameter was tuned (for a given value of C) for each class label l ∈ L.
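A simplified sketch of this two-stage search is shown below, substituting a held-out split for 5-fold cross-validation and scikit-learn for Weka; the grids are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def tune_svm(X_tr, Y_tr, X_dev, Y_dev,
             C_grid=(0.01, 0.1, 1.0, 10.0),
             b_grid=np.linspace(-1.0, 1.0, 21)):
    """Grid search: one shared C, then a per-class threshold b_l."""
    best = {"f1": -1.0}
    for C in C_grid:
        clf = OneVsRestClassifier(LinearSVC(C=C)).fit(X_tr, Y_tr)
        scores = clf.decision_function(X_dev)   # (n_docs, n_classes)
        # tune the threshold b_l independently for each class label l
        b = np.array([max(b_grid, key=lambda t, l=l:
                          f1_score(Y_dev[:, l], scores[:, l] > t))
                      for l in range(Y_tr.shape[1])])
        f1 = f1_score(Y_dev, scores > b, average="micro")
        if f1 > best["f1"]:
            best = {"C": C, "b": b, "f1": f1}
    return best
```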

Yeast Dataset. In the experiments with the Yeast dataset, no selection of SVM parameters was performed (i.e., C = 1 and b_l = 0 for all l ∈ L).

4.4.2. Parameters for RF. Thirty decision trees were used to build the Random Forest, and the number of inputs considered when splitting a tree node was set to the square root of the number of features, following the recommendations of Leo Breiman, who developed the Random Forest algorithm [29].
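In scikit-learn terms (an illustrative stand-in for the Weka implementation used in the paper), this configuration reads:

```python
from sklearn.ensemble import RandomForestClassifier

# 30 trees; sqrt(n_features) candidate inputs per split, per Section 4.4.2
rf = RandomForestClassifier(n_estimators=30, max_features="sqrt")
```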

4.4.3. AgingPortfolio Dataset. Parameters k and T were tuned on a grid as follows. We prepared a validation set D_dev of 183 documents in the same way as the test set. The performance metrics for SVM classifiers trained on the modified document sets were then evaluated on D_dev. A combination of parameters was considered optimal if it maximized the F1-measure. The parameter μ of the SoftSL algorithm was tuned on a grid, keeping k fixed at its optimal value. A fixed category assignment threshold of T = 0.005 was used for the SoftSL training set modification algorithm. We used ν = 0, since all documents in the experiments contained some category labels and regularization was unnecessary.

4.4.4. Yeast Dataset. The method for selecting the parameters for the Yeast dataset is the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) genes randomly selected from the training set. The SoftSL training set modification algorithm was not used for this dataset.

4.5. A Comparison of Methods

4.5.1. AgingPortfolio. We evaluated the general performance based on a total of 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.

Also, all of our training set modification methods trade lower precision for higher recall. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.

The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly: a larger number of relevancy tags are added. This is consistent with the higher recall and lower precision of add+SoftSL and del+SoftSL, as compared to the WkNN-based methods in Table 1.

Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without training set modification, SVM with del+WkNN modification, and SVM with add+WkNN modification. We can see that SVM classification with incorporated training set modification outperforms plain SVM classification.


Table 1: Microaveraged precision, recall, and F1-measure obtained on the AgingPortfolio dataset with different classification methods.

Method | Precision | Recall | F1
SVM with fixed parameters | 0.8649 | 0.1983 | 0.2977
SVM with parameter tuning | 0.7727 | 0.3302 | 0.4159
SVM del+WkNN | 0.5538 | 0.4452 | 0.4439
SVM add+WkNN | 0.4664 | 0.5684 | 0.4707
SVM del+SoftSL | 0.2132 | 0.6914 | 0.3259
SVM add+SoftSL | 0.3850 | 0.5639 | 0.4576

Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

Modification method | Categories per doc | Docs per category
No modification | 4.4 | 45.09
Add+WkNN | 15.15 | 155.14
Add+SoftSL | 16.6 | 168.87


The AUC values calculated for the del+WkNN curves are generally only slightly lower than, and in some cases even exceed, the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.

CROC curves for the add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most cases the results are roughly equivalent. Sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.

Additional data relevant to the algorithm comparison are presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for different categories and methods. These per-category results are more revealing: it can be seen that add+WkNN outperforms the other methods.

Some metrics from the Random Forest classification experiments are provided in Table 7. The results show that the modification of the training sets improves the classification performance in this case as well.

4.5.2. Yeast Dataset: Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is related to one or more of the 14 labels (the first FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into a training set of 1,500 examples and a test set of 917 examples.

The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created six different training sets by deleting a varying fraction p of document-class pairs; the concrete document-class pairs for deletion were selected randomly.

Table 3: Microaveraged results for the category "experimental techniques: in vivo methods" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.6 | 0.12 | 0.2
With del+WkNN | 0.7879 | 0.26 | 0.391
With add+WkNN | 0.66 | 0.33 | 0.44
With del+SoftSL | 0.2140 | 0.64 | 0.3208
With add+SoftSL | 0.4653 | 0.47 | 0.4677

Table 4: Microaveraged results for the category "cancer and related diseases: malignant neoplasms including in situ" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.4444 | 0.4 | 0.4211
With del+WkNN | 0.1277 | 0.9 | 0.2236
With add+WkNN | 0.24 | 0.9 | 0.3789
With del+SoftSL | 0.0549 | 1.0 | 0.1040
With add+SoftSL | 0.1032 | 0.975 | 0.1866

Table 5: Microaveraged results for the category "aging mechanisms by anatomy: cell level" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.5167 | 0.3827 | 0.4397
With del+WkNN | 0.5946 | 0.2716 | 0.3729
With add+WkNN | 0.4123 | 0.5802 | 0.4821
With del+SoftSL | 0.5342 | 0.4815 | 0.5065
With add+SoftSL | 0.4182 | 0.5679 | 0.4817

Table 6: Microaveraged results for the category "aging mechanisms by anatomy: cell level, cellular substructures" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.52 | 0.2167 | 0.3059
With del+WkNN | 1.0 | 0.0167 | 0.0328
With add+WkNN | 0.4493 | 0.5167 | 0.4806
With del+SoftSL | 0.75 | 0.15 | 0.25
With add+SoftSL | 0.3667 | 0.55 | 0.44

Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification under different training set modifications.

Method | Precision | Recall | F1
RF only | 0.4738 | 0.2033 | 0.2467
With del+WkNN | 0.4852 | 0.2507 | 0.2870
With add+WkNN | 0.3058 | 0.4255 | 0.3194

We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.


Figure 2: CROC curves (transformed false positive rate vs. recall, with a random-selection baseline) for different categories of the AgingPortfolio dataset; SVM classification with WkNN-based training set modification. AUC values: (a) "experimental techniques: in vivo methods": SVM only 0.3919, SVM add+WkNN 0.4323, SVM del+WkNN 0.4181; (b) "cancer and related diseases: malignant neoplasms (including in situ)": SVM only 0.6830, SVM add+WkNN 0.7484, SVM del+WkNN 0.7559; (c) "aging mechanisms by anatomy: cell level": SVM only 0.4843, SVM add+WkNN 0.5340, SVM del+WkNN 0.5238; (d) "aging mechanisms by anatomy: cell level, cellular substructures": SVM only 0.4892, SVM add+WkNN 0.5603, SVM del+WkNN 0.5460.

The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1-measure in comparison with the raw SVM results (Table 8).

A notable fact: add+WkNN slightly reduced the precision for low p, but in the hardest cases, p = 0.3 and p = 0.4, the precision even rose. In any case, the significant improvement of recall in all settings is a good trade-off. Recall also improved significantly when the RF algorithm was combined with this method (Tables 10 and 11).

5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when, for each document in the training set, some correct labels may be omitted).


Figure 3: CROC curves (transformed false positive rate vs. recall, with a random-selection baseline) for different categories of the AgingPortfolio dataset; SVM classification with SoftSL-based training set modification. AUC values: (a) "experimental techniques: in vivo methods": SVM only 0.3904, SVM add+SoftSL 0.4747, SVM del+SoftSL 0.4498; (b) "cancer and related diseases: malignant neoplasms (including in situ)": SVM only 0.6818, SVM add+SoftSL 0.6374, SVM del+SoftSL 0.7018; (c) "aging mechanisms by anatomy: cell level": SVM only 0.4847, SVM add+SoftSL 0.5447, SVM del+SoftSL 0.5423; (d) "aging mechanisms by anatomy: cell level, cellular substructures": SVM only 0.4900, SVM add+SoftSL 0.5757, SVM del+SoftSL 0.5587.

The case of an incompletely labeled dataset strikingly differs from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in PU training, some of the dataset items are considered fully labeled and the other items are not labeled at all.

To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both approaches are intended to restore the missing document labels using similarity measurements between each given document and other documents with similar labels.

We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than SoftSL: it yields a better F1-measure and is also simpler to implement.

Figure 4: Comparison of the WkNN and SoftSL training set modifications with SVM classification: CROC curves (transformed false positive rate vs. recall, with a random-selection baseline) for different AgingPortfolio categories. AUC values: (a) "experimental techniques: in vivo methods": add+SoftSL 0.4752, add+WkNN 0.4311; (b) "cancer and related diseases: malignant neoplasms (including in situ)": add+SoftSL 0.6380, add+WkNN 0.7375; (c) "aging mechanisms by anatomy: cell level": add+SoftSL 0.5454, add+WkNN 0.5348; (d) "aging mechanisms by anatomy: cell level, cellular substructures": add+SoftSL 0.5757, add+WkNN 0.5541.

Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant were removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This aspect is particularly important for categories containing a small number of documents, where additional information about the category allows a better selection of the classification threshold.

One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform an additional manual post factum validation: any document classification result should be reviewed by the experts in order to reveal whether it was assigned any spurious labels. Otherwise, the observed results are guaranteed to be lower than the real ones.


Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p | Precision | Recall | F1
0 | 0.7176 | 0.5707 | 0.6358
0.1 | 0.7337 | 0.5233 | 0.6109
0.2 | 0.7354 | 0.4056 | 0.5229
0.3 | 0.6260 | 0.2442 | 0.3513
0.4 | 0.3544 | 0.1191 | 0.1783
0.5 | 0 | 0 | 0
0.6 | 0 | 0 | 0

Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T were acquired via grid search.

p | Optimal k | Optimal T | Precision | Recall | F1
0 | — | — | — | — | —
0.1 | 10 | 0.3 | 0.6582 | 0.6847 | 0.6712
0.2 | 10 | 0.25 | 0.6525 | 0.6811 | 0.6665
0.3 | 10 | 0.15 | 0.6357 | 0.7137 | 0.6725
0.4 | 10 | 0.1 | 0.6604 | 0.669 | 0.6648
0.5 | 10 | 0.05 | 0.6225 | 0.7259 | 0.6702
0.6 | 10 | 0.05 | 0.6248 | 0.7261 | 0.6716

Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p | Precision | Recall | F1
0 | 0.6340 | 0.5087 | 0.5315
0.1 | 0.6081 | 0.4613 | 0.4959
0.2 | 0.5693 | 0.3648 | 0.4133
0.3 | 0.5788 | 0.3240 | 0.3873
0.4 | 0.5068 | 0.2471 | 0.3094
0.5 | 0.5194 | 0.2431 | 0.3104
0.6 | 0.4354 | 0.1621 | 0.2224

Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T were acquired via grid search.

p | Optimal k | Optimal T | Precision | Recall | F1
0 | — | — | — | — | —
0.1 | 10 | 0.3 | 0.5940 | 0.7120 | 0.6224
0.2 | 10 | 0.25 | 0.5626 | 0.7462 | 0.6146
0.3 | 10 | 0.15 | 0.5282 | 0.7992 | 0.6095
0.4 | 10 | 0.10 | 0.5223 | 0.7648 | 0.5940
0.5 | 10 | 0.05 | 0.4672 | 0.8388 | 0.5726
0.6 | 15 | 0.05 | 0.4547 | 0.8695 | 0.5707


Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset, as we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.

Moreover, the experimental results presented in Section 4.5.2 demonstrate a notable effect: when we artificially produced an incompletely labeled training set and then applied our label restoration techniques to it, the F1-measure for SVM classification was even greater than for the original, completely labeled set.

Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.

Conflict of Interests

The authors declare that there is no conflict of interests

Acknowledgments

The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom, Inc., and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.

References

[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multi-word terms," International Journal on Digital Libraries, vol. 3, pp. 117–132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, pp. 1348–1356, 2010.
[29] L. Breiman and A. Cutler, "Random Forests," 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.


Computational and Mathematical Methods in Medicine 3

We developed alternative approaches because the algo-rithms for training set modification were not designed towork with these assumptions Then we used the followingtwo detailed algorithms in our experiments

(1) themethod based on a recent soft supervised learningapproach [13] (was labeled as ldquoSoftSLrdquo)

(2) weighted k-nearest neighbour classifier algorithm(labeled as ldquoWkNNrdquo [14ndash16])

These algorithms use the nearest neighbour set of a doc-ument which is in line with our third assumption

The first step in the modification of the training setinvolves the generation of a set PC of document-categoryrelevancy pairs overlooked by the experts

PC = (119889 cl) | 120595 (119889 cl) = 1 (1)

where 119889 is a document cl is a class label (category) and 120595

is the function of our training set modification algorithm(WkNN or SoftSL)

Consider

120595 (119889 cl)

= 1 if our algorithm places document 119889 into class cl0 otherwise

(2)

Then two possible outcomes are considered

(1) complete inclusion of PC into the training set (optionwas denoted as ldquoaddrdquo)

(2) exclusion of document 119889 from the negative examplesof the category cl for all relevancy pairs (119889 cl) isin PC(option was denoted as ldquodelrdquo)

The modified training set will still contain the objectswhich according to the algorithm 120595 do not belong to the setlabeled by the expert It is possible to find the next missinglabels in the documents of the training set

31 SoftSL Algorithm for Finding Missing Labels In thissection we outline the application of a new graph algorithmfor Soft-supervised learning also called SoftSL [13] Eachdocument is represented by a vertex within a weighted undi-rected graph and our proposed framework minimizes theweighted Kullback-Leibler divergence between distributionsthat encode the classmembership probabilities of each vertex

The advantages of this graph algorithm include directapplicability to the multilabel categorization problem as wellas improved performance compare to alternatives [14] Themain idea of the SoftSL algorithm is the following

Let119863 = 119863119897 119863119906 be a set of labeled and unlabeled objectswith

119863119897 = (119909119894 119910119894)119897

119894=1 119863119906 = (119909119894)

119899

119894=119897+1 (3)

where 119909119894 is the input vector representing the objects to becategorized and 119910119894 is the category label

Let119866 = (119881 119864) be a weighted undirected graph Here119881 =

1 119899 where 119899 is the cardinality of 119863 and 119864 = 119881 times 119881 and120596119894119895 is the weight of the edge linking objects 119894 and 119895

The weight of the edge is defined as

119908119894119895 = sim (119909119894 119909119895) if 119895 isin 119870 (119894)

0 if 119895 notin 119870 (119894) (4)

Here sim(119909119894 119909119895) is the measure of similarity between 119894thand 119895th objects (eg cosine measure) and119870(119894) is the set of knearest neighbours of object 119909119894

Each object is associated with a set of probabilities119901119894 = (119901

119905

119894)119898

119905=1of belonging to each of the 119898 classes 119871 =

cl119894119898

119894=1 According to information from 119863119897 we determined

the probabilities 119903119894 = (119903119905

119894)119898

119905=1119897

119894=1that documents 119909119894

119897

119894=1

belong for each of m classes thus 119903119905119894gt 0 if (119889119894 cl119905) isin 119863119897 Each

labeled object also has a known set of probabilities 119903119894 assignedby the expertsOur intention is tominimize themisalignmentfunction 1198621(119901) over sets of probabilities

1198621 (119901) =

119897

sum

119894=1

119863KL (119903119894 119901119894)

+ 120583

119899

sum

119894=1

sum

119895isin119870(119894)

120596119894119895119863KL (119901119894 119901119895) minus ]119899

sum

119894=1

119867(119901119894)

119863KL (119901119894 119901119895)def=

119898

sum

119905=1

119901119905

119894log119901119905119895

119867 (119901119894)def=

119898

sum

119905=1

119901119905

119894log119901119905119894

(5)

where 119863KL(119901119894 119901119895) means Kullback-Leibler distance and119867(119901119894)means entropy

120583 and ] are the parameters of the algorithm definingcontribution of each term into 1198621(119901) The meanings of allterms are listed below

(1) The first term in the expression of1198621 shows how closethe generated probabilities are to the ones assigned bythe experts

(2) The second term accounts for the graph geometry andguarantees that the objects close to one another on thegraph will have similar probability distributions overclasses

(3) The third term is included in case other terms in theexpression are not contradictory Its purpose is to pro-duce a regular and uniform probability distributionover classes

Numerically the problem is solved using AlternatingMinimization (AM) [13] Note that119863119906 is absent in the case ofunlabeled dataTheminimization of the objectivemin1199011198621(119901)leads to the set of probabilities 119901119894 = 119901

1

119894 119901

119898

119894 for each

document 119889119894 isin 119863 We introduce a threshold 119879 isin [0 1]

to assign additional categories relevant to each document if119901119895

119894⩾ 119879 then 119889119894 isin cl119895

4 Computational and Mathematical Methods in Medicine

32 Weighted kNN Algorithm for Finding Missing Labels Inthis section we shall briefly describe the weighted k-nearestneighbour algorithm [17] that is capable of directly solvingthe multilabel categorization problem

Let 120588(119889 1198891015840) be a distance function between the docu-ments 119889 and 1198891015840 The function which assigns document 119889 toclass label cl isin 119871 is then defined as

119878 (119889 cl) =sum1198891015840isinkNN(119889) 120588 (119889 119889

1015840) 119868 (1198891015840 cl)

sum1198891015840isinkNN(119889) 120588 (119889 119889

1015840)

119868 (1198891015840 cl) =

1 if 1198891015840 isin cl0 otherwise

(6)

Here kNN(119889) is the set of k nearest neighbours ofdocument 119889 in the training set

We introduce a threshold 119879 such that if 119878(119889 cl) ⩾ 119879 then119889 isin cl Then the algorithm counts 119878(119889 cl) for all possible(119889 cl) combinations When 119878(119889 cl) ⩾ 119879 every combinationis considered to be a missing label and used to modify thetraining set

33 Support Vector Machine We will use the Linear SupportVector Machine as a classification algorithm in this caseSince SVM is mainly a binary classifier the Binary Relevanceapproach is therefore chosen to address multilabel problemsThis method implies training a separate decision rule 119908119897119909 +119887119897 gt 0 for every category 119897 isin 119871 More details are availablein our previous work on methods for structuring scientificknowledge [18] In our study as the implementation of SVMwe used Weka binding of the LIBLINEAR library [19]

34 Random Forest Random Forest is an ensemble ofmachine learning methods which combines tree predictorsIn this combination each tree depends on the values ofa random vector sampled independently All trees in theforest have the same distribution More details about thismethod can be found in [20 21] In our study we used theimplementation of Random Forest fromWeka [22]

4 Experimental Results

In this section we describe how did we perform text classifi-cation experiments We applied the classification algorithmsto initial (unmodified) training sets as well as to the trainingsets modified with ldquoaddrdquo or ldquodelrdquo methods

We shall first discuss the scheme of training set transfor-mation and its usefulness Then we shall present the processof data generation Finally we shall consider the performancemeasures used in the experiments experimental setting andthe results of the parameter estimation and final validation

41 Training Set Modification Training set modification stepis described in detail in Section 3 (methods) However it isimportant to notice that in both cases documents that donot belong to set PC according to the relevance algorithm 120595

(too far from documents labeled to the given category) arestill retained in the training set The reason for this choice is

that we assume that experts are not supposed to make anymistake of type II (when they give a document an odd label)only the absence of proper label is supposed to encounter

The omission of the relevance pair (119889 119860) in the trainingset makes document 119889move into the set of negative examplesfor learning a classifier for the class 119860 This problem altersthe decision rule and negatively affects performance Theproposed set modification scheme is designed to avoid suchproblems during the training session of the classifier

42 Datasets and Data Preprocessing

421 AgingPortfolio Dataset The first experiment was car-ried out using data from the AgingPortfolio informationresource [18 23] The AgingPortfolio system includes adatabase of projects related to aging and is funded bythe National Institutes of Health (NIH) and the EuropeanCommission (ECCORDIS)This database currently containsmore than one million projects Each of its records writtenin English displays information related to the authorrsquos nametitle a brief description of the motivation and researchobjectives the name of the organization and the fundingperiod of the project Some projects contain additionalkeywords with an average description in 100 words In thisexperiment we used only the title a brief description andtag fields

A taxonomy contains 335 categories with 6 hierarchicallevels used for document classification A detailed informa-tion about the taxonomy is available on the Internationalaging research portfolio Web site [23] Biomedical expertsmanually put the category labels on the document trainingand test sets In the process they used a special procedurefor labeling the document test set Two sets of categoriescarefully selected by different experts were assigned to eachdocument of the test set Then a combination of these cate-gories was used to achieve amore complete category labelingDifferent participants like the AgingPortfolio resource userscreated the training set with little control A visual inspectionsuggests that the training set contained a significant numberof projects with incomplete sets of category labels The sameconclusion is also achieved by comparing the average numberof categories per project This average is 44 in the sample setcompared to 979 in the more thoroughly designed test setThe total number of projects was 3 246 for the training set183 for the development set and 1 000 for the test sets

Throughout our study we used the vector model of textrepresentation The list of keywords and their combinationsfrom the TerMine [24] system (National Centre for TextMining or NaCTeM) provided the terms used in our studyThe method used in this system combines linguistic andstatistical information of candidate terms

Later we conducted the analysis and processing of theset of keyword combinations Whenever the short keywordcombinations were present in longer ones the latter weresplit into shorter ones The algorithms and the code usedfor the keyword combinations decomposition are availablefrom theAgingPortfolioWeb site [8] According to the resultsof our previous experiments the new vectorization method

Computational and Mathematical Methods in Medicine 5

provided a 3 increase by the 1198651-measure compared to thegeneral ldquobag-of-wordsrdquo model We assigned feature weightsaccording to the TFIDF rule in the BM25 formulation [25 26]and then normalized vectors representing the documents inthe Euclidean metric of 119899-dimensional space

422 Yeast Dataset Yeast dataset [27] is a biomedical datasetof Yeast genes divided into 14 different functional classesEach instance in the dataset is a gene represented by a vectorwhose features are the microarray expression levels undervarious conditions We used it to reveal if our methods aresuitable for the classification of genetic information as well asfor textual data

Let us describe the methods for an incomplete datasetmodeling Since the dataset is well annotated andwidely usedthe objects (genes) have complete sets of category labels Byrandom deletion of labels from documents wemade amodelof the incomplete sets of labels in the training set Parameter119901 was introduced as the fraction of deleted labels

We deleted labels using the following conditions

(1) for each class 119901 represents the fraction of deletedlabelsWe keep the distribution of labels by categoriesafter modeling the incomplete sets of labels

(2) the number of objects in the training set remains thesame At least one label is preserved after the labeldeletion process

No preprocessing step was necessary because the data issupplied already prepared as a matrix of numbers [27]

43 Performance Measurements The following characteris-tics were used to evaluate and compare different classificationalgorithms

(i) Microaveraged precision recall and1198651-measure [27](ii) CROC curves and their AUC values computed for

selected categories [28] CROC curve is a modifi-cation of a ROC curve where 119909 axis is rescaledas 119909new(119909) We used a standard exponent scaling119909new(119909) = (1 minus 119890

minus120572119909)(1 minus 119890

minus120572) with 120572 = 7

44 Experimental Setup Theprocedures for selecting impor-tant parameters of the algorithms outlined are described next

441 Parameters for SVM

AgingPortfolio Dataset The following SVM parameters weretuned for each decision rule

(i) cost parameter 119862 controls a trade-off between maxi-mization of the separation margin and minimizationof the total error [15]

(ii) parameter 119887119897 that plays the role of a classification thre-shold in the decision rule

We performed a parameter tuning by using a slidingcontrol method with 5-fold cross-validation according to

the following strategyThe 119862-parameter was varied on a gridfollowed by 119887119897-parameter (for every value of 119862) tuning forevery category A set PC = (119862 119887119897)119897isin119871 of parameter pairswas considered optimal if it maximized the 1198651-measure withaveraging over documentsWhile119862 has the same value for allcategories the 119887119897 threshold parameter was tuned (for a givenvalue of 119862) for each class label 119897 isin 119871

Yeast Dataset In experiments with the Yeast dataset theselection of SVM parameters was not performed (ie 119862 = 1119887119897 = 0 for all values of 119897 isin 119871)

442 Parameters for RF The 30 solution trees were used tobuild the Random Forest The number of inputs to considerwhile splitting a tree node is the square root of featuresrsquonumber The procedure was done according to the [29] LeoBreiman who developed the Random Forest Algorithm

443 AgingPortfolio Dataset Parameters 119896 and119879were tunedon a grid as follows We prepared a validation set 119863devof 183 documents as in the case of the test set The per-formance metrics for SVM classifiers trained on modifieddocument sets were then evaluated on 119863dev A combinationof parameters was considered optimal if it maximized the 1198651-measure Parameter 120583 of the SoftSL algorithm was tuned ona grid keeping 119896 fixed at its optimal value A fixed categoryassignment threshold of 119879 = 0005 is used for the SoftSLtraining set modification algorithm We used ] = 0 sinceall documents in the experiments contained some categorylabels and regularization was unnecessary

444 Yeast Dataset Themethod for selecting the parametersfor the Yeast dataset is the same as in the AgingPortfolio119863devwas composed of 300 (20 of 1 500) randomly selected genesfor training The SoftSL training set modification algorithmwas not used for this dataset

45 A Comparison of Methods

451 AgingPortfolio We evaluated the general performancebased on a total of 62 categories that contained at least 30documents in the training set and at least 7 documents inthe test set The results for precision recall and 1198651-measureare presented in Table 1 It is evident that a parameter tuningsignificantly boosts both precision and recall

Also all of our training set modification methods paylower precision for higher recall values If we consider 1198651measure as a general quality function such trade-offmay lookquiet reasonable especially for add+WkNNmethod

The average numbers of training set categories per doc-ument and documents per category are listed in Table 2 Aswe can see the SoftSL approach alters the training set moresignificantly As a result a larger number of relevancy tagsare added This is consistent with higher recall and lowerprecision values of add+SoftSL and del+SoftSL as comparedto WkNN-based methods in Table 1

Figure 2 compares CROC curves for representative cate-gories of AgingPortfolio dataset computed for SVM without

6 Computational and Mathematical Methods in Medicine

Table 1 Microaveraged precision recall and 1198651-measure (1198651)obtained on AgingPortfilio dataset with different classificationmethods

Method Precision Recall 1198651

SVM with fixed parameters 08649 01983 02977SVM with parameter tuning 07727 03302 04159SVM del+WkNN 05538 04452 04439SVM add+WkNN 04664 05684 04707SVM del+SoftSL 02132 06914 03259SVM add+SoftSL 03850 05639 04576

Table 2 Average number of categories per document and docu-ments per category in AgingPortfolio training set before and aftermodification

Modification method Categories per doc Docs in categoryNo modification 44 4509Add+WkNN 1515 15514Add+SoftSL 166 16887

training set modifications SVM with del+WkNN modifi-cation and SVM with add+WkNN modification We cannotice that SVM classification with incorporate training setmodification outperforms simple SVM classification

AUC values calculated for the del+WkNN curves aregenerally only slightly lower and in some cases even exceedthe corresponding values for add+WkNNA similar situationcan be seen in Figure 3 where CROC curves are comparedwith SVM del+SoftSL and add+SoftSL

CROC curves for add+WkNN and add+SoftSL SVMclassifiers are compared in Figure 4 It is difficult to determinea ldquowinner hererdquo In most of the cases the results are prettyequivalent Sometimes add+WkNN looks slightly worse thanadd+SoftSL and sometimes add+WkNN has a good advan-tage against add+SoftSL

Additional data relevant to the algorithm comparison ispresented in Tables 3 4 5 and 6 There are precision recalland 1198651 measure for different categories taken with differentmethods These results are more relief it can be seen thatadd+WkNN outperforms the other methods

Some values of the metrics of the Random Forest Clas-sification experiments are provided in Table 7 The resultsin Table 7 show that the modification of the training setsimproves the classification performance in this case as well

452 Yeast Dataset The Comparison of the ExperimentalResults The dataset is made of 2 417 examples Each objectis related to one or more of the 14 labels (1st FunCat level)with 42 labels per example in average The standard method[30] is used to separate the objects into training and test setsso that the first kind of sets contains 1500 examples and thesecond contains 917

The method for modeling the incomplete dataset and thecomparison is described above in Section 422 We created6 different training sets by deleting a varying fraction (119901) ofdocument-class pairs The concrete document-class pairs fordeletion were selected randomly

Table 3 Microaveraged results for category ldquoexperimental tech-niques in vivo methodsrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 06 012 02With del+WkNN 07879 026 0391With add+WkNN 066 033 044With del+SoftSL 02140 064 03208With add+SoftSL 04653 047 04677

Table 4 Microaveraged results for category ldquocancer and relateddiseases malignant neoplasms including in siturdquo (AgingPortfiliodataset)

Method Precision Recall 1198651

SVM only 04444 04 04211with del+WkNN 01277 09 02236with add+WkNN 024 09 03789with del+SoftSL 00549 10 01040with add+SoftSL 01032 0975 01866

Table 5 Microaveraged results for category ldquoaging mechanisms byanatomy cell levelrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 05167 03827 04397With del+WkNN 05946 02716 03729With add+WkNN 04123 05802 04821With del+SoftSL 05342 04815 05065With add+SoftSL 04182 05679 04817

Table 6 Microaveraged results for category ldquoaging mechanisms byanatomy cell level cellular substructuresrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 052 02167 03059With del+WkNN 10 00167 00328With add+WkNN 04493 05167 04806With del+SoftSL 075 015 025With add+SoftSL 03667 055 044

Table 7 Microaveraged results for AgingPortfolio dataset obtainedwith Random Forest Classification with different training set modi-fications

Method Precision Recall 1198651

RF only 04738 02033 02467With del+WkNN 04852 02507 02870With add+WkNN 03058 04255 03194

We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.
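To make the experimental loop concrete, the following sketch trains one linear SVM per class under the Binary Relevance scheme of Section 3.3, with C = 1 and no threshold tuning, as in the Yeast runs (Section 4.4.1). We use scikit-learn's LinearSVC, which wraps the LIBLINEAR library used in the study [19]; the helper itself is our illustration, not the original code:

```python
import numpy as np
from sklearn.svm import LinearSVC  # scikit-learn's LinearSVC wraps LIBLINEAR [19]

def binary_relevance_svm(X_train, train_labels, X_test, classes, C=1.0):
    """One binary LinearSVC per class (Binary Relevance, Section 3.3).
    train_labels: list of per-object label sets.
    Returns predicted label sets for the rows of X_test."""
    predicted = [set() for _ in range(len(X_test))]
    for cl in classes:
        y = np.array([1 if cl in labs else 0 for labs in train_labels])
        if len(np.unique(y)) < 2:  # class has no positives (or no negatives) left
            continue
        clf = LinearSVC(C=C).fit(X_train, y)
        for i, out in enumerate(clf.predict(X_test)):
            if out == 1:
                predicted[i].add(cl)
    return predicted
```

Running this on the depleted training sets for each p in {0.1, ..., 0.6}, and again after label restoration, reproduces the kind of comparison reported in Tables 8 and 9.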

[Figure 2: CROC curves for different categories of AgingPortfolio (SVM classification with WkNN analysis). Each panel plots recall against the transformed false positive rate; legends list SVM only, SVM add+WkNN, SVM del+WkNN, and a random-selection baseline.
(a) "Experimental techniques: in vivo methods" (WkNN): SVM only AUC = 0.3919; SVM add+WkNN AUC = 0.4323; SVM del+WkNN AUC = 0.4181.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)" (WkNN): SVM only AUC = 0.6830; SVM add+WkNN AUC = 0.7484; SVM del+WkNN AUC = 0.7559.
(c) "Aging mechanisms by anatomy: cell level" (WkNN): SVM only AUC = 0.4843; SVM add+WkNN AUC = 0.5340; SVM del+WkNN AUC = 0.5238.
(d) "Aging mechanism by anatomy: cell level, cellular substructures" (WkNN): SVM only AUC = 0.4892; SVM add+WkNN AUC = 0.5603; SVM del+WkNN AUC = 0.5460.]

The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1 measure in comparison with the raw SVM results (Table 8).

A notable fact: add+WkNN slightly reduced the precision at low p, but in the worst cases, with p = 0.3 and p = 0.4, the precision even rose. However, the significant improvement of recall in all cases is a good trade-off. Recall also improved significantly when the RF algorithm was used in addition to this method (Tables 10 and 11).
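The micro-averaged scores quoted throughout Tables 1 and 3-11 pool true and false positives over all (object, class) pairs before computing precision, recall, and F1. A small self-contained sketch of that aggregation (our own illustration):

```python
def micro_prf(true_labels, predicted_labels, classes):
    """Micro-averaged precision, recall, and F1 over all (object, class) pairs.
    Both label arguments are lists of per-object label sets."""
    tp = fp = fn = 0
    for true, pred in zip(true_labels, predicted_labels):
        for c in classes:
            if c in pred and c in true:
                tp += 1          # label predicted and present
            elif c in pred:
                fp += 1          # label predicted but absent
            elif c in true:
                fn += 1          # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```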

5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when, for each document in our training set, some correct labels may be omitted).

[Figure 3: CROC curves for different categories of AgingPortfolio (SVM classification with SoftSL analysis). Each panel plots recall against the transformed false positive rate; legends list SVM only, SVM add+SoftSL, SVM del+SoftSL, and a random-selection baseline.
(a) "Experimental techniques: in vivo methods" (SoftSL): SVM only AUC = 0.3904; SVM add+SoftSL AUC = 0.4747; SVM del+SoftSL AUC = 0.4498.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)" (SoftSL): SVM only AUC = 0.6818; SVM add+SoftSL AUC = 0.6374; SVM del+SoftSL AUC = 0.7018.
(c) "Aging mechanisms by anatomy: cell level" (SoftSL): SVM only AUC = 0.4847; SVM add+SoftSL AUC = 0.5447; SVM del+SoftSL AUC = 0.5423.
(d) "Aging mechanism by anatomy: cell level, cellular substructures" (SoftSL): SVM only AUC = 0.4900; SVM add+SoftSL AUC = 0.5757; SVM del+SoftSL AUC = 0.5587.]

The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in PU training, some of the dataset items are considered fully labeled and the other items are not labeled at all.

To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both of these approaches are intended to restore the missing document labels using similarity measurements between each given document and other documents with similar labels.
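As a concrete illustration of the WkNN variant, the sketch below implements the add+WkNN rule of Section 3.2: for every document d and candidate class cl it computes a weighted vote S(d, cl) over the k nearest neighbours and adds the label whenever S(d, cl) >= T (k = 10 with T between 0.05 and 0.3 were the grid-searched optima in Tables 9 and 11). We assume cosine similarity on L2-normalized document vectors as the weighting function; the paper states the weight through a generic function rho:

```python
import numpy as np

def wknn_restore(X, labels, k=10, T=0.3):
    """add+WkNN label restoration (Section 3.2).
    X: (n, m) array of L2-normalized document vectors.
    labels: list of n label sets (possibly incomplete)."""
    sims = X @ X.T                       # cosine similarity for normalized rows
    restored = [set(ls) for ls in labels]
    for d in range(len(X)):
        order = np.argsort(-sims[d])
        neigh = [j for j in order if j != d][:k]   # k nearest neighbours of d
        weights = sims[d, neigh]
        if weights.sum() <= 0:
            continue
        for cl in set().union(*(labels[j] for j in neigh)):
            score = sum(w for j, w in zip(neigh, weights) if cl in labels[j])
            if score / weights.sum() >= T:         # S(d, cl) >= T: add the label
                restored[d].add(cl)
    return restored
```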

We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than SoftSL: it has a better F1-measure, and it is simpler to implement.

[Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification: CROC curves for different categories of AgingPortfolio. Each panel plots recall against the transformed false positive rate; legends list SVM add+SoftSL, SVM add+WkNN, and a random-selection baseline.
(a) "Experimental techniques: in vivo methods": SVM add+SoftSL AUC = 0.4752; SVM add+WkNN AUC = 0.4311.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM add+SoftSL AUC = 0.6380; SVM add+WkNN AUC = 0.7375.
(c) "Aging mechanisms by anatomy: cell level": SVM add+SoftSL AUC = 0.5454; SVM add+WkNN AUC = 0.5348.
(d) "Aging mechanism by anatomy: cell level, cellular substructures": SVM add+SoftSL AUC = 0.5757; SVM add+WkNN AUC = 0.5541.]


Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.

One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result should be reviewed by the experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.


Table 8. Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions p of deleted labels).

p     Precision   Recall   F1
0     0.7176      0.5707   0.6358
0.1   0.7337      0.5233   0.6109
0.2   0.7354      0.4056   0.5229
0.3   0.6260      0.2442   0.3513
0.4   0.3544      0.1191   0.1783
0.5   0           0        0
0.6   0           0        0

Table 9. Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions p of deleted labels), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k   Optimal T   Precision   Recall   F1
0     —           —           —           —        —
0.1   10          0.3         0.6582      0.6847   0.6712
0.2   10          0.25        0.6525      0.6811   0.6665
0.3   10          0.15        0.6357      0.7137   0.6725
0.4   10          0.1         0.6604      0.669    0.6648
0.5   10          0.05        0.6225      0.7259   0.6702
0.6   10          0.05        0.6248      0.7261   0.6716

Table 10. Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions p of deleted labels).

p     Precision   Recall   F1
0     0.6340      0.5087   0.5315
0.1   0.6081      0.4613   0.4959
0.2   0.5693      0.3648   0.4133
0.3   0.5788      0.3240   0.3873
0.4   0.5068      0.2471   0.3094
0.5   0.5194      0.2431   0.3104
0.6   0.4354      0.1621   0.2224

Table 11. Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions p of deleted labels), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k   Optimal T   Precision   Recall   F1
0     —           —           —           —        —
0.1   10          0.3         0.5940      0.7120   0.6224
0.2   10          0.25        0.5626      0.7462   0.6146
0.3   10          0.15        0.5282      0.7992   0.6095
0.4   10          0.10        0.5223      0.7648   0.5940
0.5   10          0.05        0.4672      0.8388   0.5726
0.6   15          0.05        0.4547      0.8695   0.5707


Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset, as we did with the Yeast dataset. Our experiments with the modification methods applied to the partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.

Moreover, the experimental results presented in Section 4.5.2 reveal a notable aspect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.

Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.

Conflict of Interests

The authors declare that there is no conflict of interests.

Acknowledgments

The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom Inc. and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in preparation of the paper. They would like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.

References

[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.

[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.

[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.

[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.

[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.

[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.

[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1-3, pp. 423-444, 2002.

[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.

[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29-41, Cambridge, UK, 2009.

[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45-84, 2007.

[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1-6, 2008.

[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841-1858, 2010.

[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090-1099, October 2008.

[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803-1811, December 2009.

[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in European Conference on Machine Learning (ECML '98), pp. 137-142, Springer, Berlin, Germany, 1998.

[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.

[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.

[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.

[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.

[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.

[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10-18, 2009.

[23] International Aging Research Portfolio, http://agingportfolio.org.

[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117-132, 2000.

[25] S. E. Robertson, S. Walker, S. Jones et al., "Okapi at TREC-3," in Proceedings of the 3rd Text Retrieval Conference, pp. 109-126, Gaithersburg, Md, USA, 1994.

[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109-129, 2005.

[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22-30, May 2004.

[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348-1356, 2010.

[29] L. Breiman and A. Cutler, Random Forests, 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.

[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.



training set modifications SVM with del+WkNN modifi-cation and SVM with add+WkNN modification We cannotice that SVM classification with incorporate training setmodification outperforms simple SVM classification

AUC values calculated for the del+WkNN curves aregenerally only slightly lower and in some cases even exceedthe corresponding values for add+WkNNA similar situationcan be seen in Figure 3 where CROC curves are comparedwith SVM del+SoftSL and add+SoftSL

CROC curves for add+WkNN and add+SoftSL SVMclassifiers are compared in Figure 4 It is difficult to determinea ldquowinner hererdquo In most of the cases the results are prettyequivalent Sometimes add+WkNN looks slightly worse thanadd+SoftSL and sometimes add+WkNN has a good advan-tage against add+SoftSL

Additional data relevant to the algorithm comparison ispresented in Tables 3 4 5 and 6 There are precision recalland 1198651 measure for different categories taken with differentmethods These results are more relief it can be seen thatadd+WkNN outperforms the other methods

Some values of the metrics of the Random Forest Clas-sification experiments are provided in Table 7 The resultsin Table 7 show that the modification of the training setsimproves the classification performance in this case as well

452 Yeast Dataset The Comparison of the ExperimentalResults The dataset is made of 2 417 examples Each objectis related to one or more of the 14 labels (1st FunCat level)with 42 labels per example in average The standard method[30] is used to separate the objects into training and test setsso that the first kind of sets contains 1500 examples and thesecond contains 917

The method for modeling the incomplete dataset and thecomparison is described above in Section 422 We created6 different training sets by deleting a varying fraction (119901) ofdocument-class pairs The concrete document-class pairs fordeletion were selected randomly

Table 3 Microaveraged results for category ldquoexperimental tech-niques in vivo methodsrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 06 012 02With del+WkNN 07879 026 0391With add+WkNN 066 033 044With del+SoftSL 02140 064 03208With add+SoftSL 04653 047 04677

Table 4 Microaveraged results for category ldquocancer and relateddiseases malignant neoplasms including in siturdquo (AgingPortfiliodataset)

Method Precision Recall 1198651

SVM only 04444 04 04211with del+WkNN 01277 09 02236with add+WkNN 024 09 03789with del+SoftSL 00549 10 01040with add+SoftSL 01032 0975 01866

Table 5 Microaveraged results for category ldquoaging mechanisms byanatomy cell levelrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 05167 03827 04397With del+WkNN 05946 02716 03729With add+WkNN 04123 05802 04821With del+SoftSL 05342 04815 05065With add+SoftSL 04182 05679 04817

Table 6 Microaveraged results for category ldquoaging mechanisms byanatomy cell level cellular substructuresrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 052 02167 03059With del+WkNN 10 00167 00328With add+WkNN 04493 05167 04806With del+SoftSL 075 015 025With add+SoftSL 03667 055 044

Table 7 Microaveraged results for AgingPortfolio dataset obtainedwith Random Forest Classification with different training set modi-fications

Method Precision Recall 1198651

RF only 04738 02033 02467With del+WkNN 04852 02507 02870With add+WkNN 03058 04255 03194

Weproceeded some classification experiments before andafter modifying the training set To compare the methods wealso included the classification results obtained by the SVMor RF on the original nonmodified training set with the com-plete set of labels (119901 = 0) Here 119901 isin 01 02 03 04 05 06

is used

Computational and Mathematical Methods in Medicine 7

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03919

SVM add+WkNN AUC = 04323

SVM delRandom selection

+WkNN AUC = 04181

(a) ldquoExperimental techniques in vivo methodsrdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

100806040200

Transformed false positive rate

SVM only AUC = 06830

SVM add+WkNN AUC = 07484

SVM delRandom selection

+WkNN AUC = 07559

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

0806040200

Transformed false positive rate

SVM only AUC = 04843

SVM add+WkNN AUC = 05340

SVM delRandom selection

+WkNN AUC = 05238

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

100806040200

Transformed false positive rate

SVM only AUC = 04892

SVM add+WkNN AUC = 05603

SVM delRandom selection

+WkNN AUC = 05460

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cate-gory (WkNN)

Figure 2 CROC curves for different categories of AgingPortfoliio (SVM classification with WkNN analysis)

The results of SVM classification with add+WkNNtraining set modification presented in Table 9 show thatthis modification significantly improves the 1198651 measure incomparison with raw SVM results (Table 8)

A Notable Fact Add+WkNN slightly reduced the precisionon low 119901 but in the worst cases with 119901 = 03 and 119901 = 04 theprecision even rose up However the significant improve ofrecall in all cases is a good trade-off Recall also significantly

improved when the RF algorithmwas used in addition to thismethod (Tables 10 and 11)

5 Discussion

Our experiments have shown that the direct applicationof SVM or RF gives unsatisfactory results for incompletelylabeled datasets (ie when for each document in our

8 Computational and Mathematical Methods in Medicine

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03904

SVM add+SoftSL AUC = 04747

SVM delRandom selection

+SoftSL AUC = 04498

(a) ldquoExperimental techniques in vivo methodsrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 06818

SVM add+SoftSL AUC = 06374

SVM delRandom selection

+SoftSL AUC = 07018

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04847

SVM add+SoftSL AUC = 05447

SVM delRandom selection

+SoftSL AUC = 05423

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04900

SVM add+SoftSL AUC = 05757

SVM delRandom selection

+SoftSL AUC = 05587

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cat-egory (SoftSL)

Figure 3 CROC curves for different categories of AgingPortfoliio (SVM classification with SoftSL analysis)

training set some correct labels may be omitted) The caseof incompletely labeled dataset strikingly differs from thePU-learning (learning with only positive and unlabeled data)httpciteseerxistpsueduviewdocsummarydoi=1011919914 httpdlacmorgcitationcfmid=1401920 approachin case of PU training some of dataset items are consideredfully labeled and the other items are not labeled at all

To overcome the problem we proposed two differentprocedures for training set modification WkNN and SoftSL

Both of these approaches are intended to restore the missingdocument labels using different similarity measurementsbetween each given document and other documents withsimilar labels

We trained both SVM and RF on several incompletelylabeled datasets with pretraining label restoration and with-out it According to our experimental results the labelrestorationmethods were able to improve the performance ofboth SVM and RF In our opinion WkNN works better than

Computational and Mathematical Methods in Medicine 9

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC =

=

04752

SVM add+WkNN AUC 04311

(a) ldquoExperimental techniques in vivo methodsrdquo category

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 06380

= 07375SVM add+WkNN AUC

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category

=

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 05454

SVM add+WkNN AUC 05348

(c) ldquoAging mechanisms by anatomy cell levelrdquo category

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 05757

= 05541SVM add+WkNN AUC

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquocategory

Figure 4 Comparison of WkNN and SoftSL analysis with SVM classification CROC curves for different categories of AgingPortfolio

SoftSL it has a better 1198651-measure than SoftSL and at last itis simpler to implement

Furthermore the comparison of CROC curves for thedifferent methods demonstrated that the classifiers performslightly worse for some categories and better for others Thispattern appears for classifiers trained on document sets whereelements identified as relevant are removed from the nega-tive examples These observations can be attributed to bettertuning of the classification threshold as additional relevant

documents are added This is a particularly important aspectfor categories containing a small number of documentswhereadditional information about a given category allows betterselection of the classification threshold

One more problem is the evaluation of the incompletelylabeled dataset classification results and performance sincethe labels in the test set are incomplete as well One way toovercome this problem is to perform the additional manualpost factum validation any document classification result

10 Computational and Mathematical Methods in Medicine

Table 8 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)

119901 parameter Precision Recall 1198651

0 07176 05707 0635801 07337 05233 0610902 07354 04056 0522903 06260 02442 0351304 03544 01191 0178305 0 0 006 0 0 0

Table 9 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search

119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651

0 mdash mdash mdash mdash mdash01 10 03 06582 06847 0671202 10 025 06525 06811 0666503 10 015 06357 07137 0672504 10 01 06604 0669 0664805 10 005 06225 07259 0670206 10 005 06248 07261 06716

Table 10 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)

119901 parameter Precision Recall 1198651

0 06340 05087 0531501 06081 04613 0495902 05693 03648 0413303 05788 03240 0387304 05068 02471 0309405 05194 02431 0310406 04354 01621 02224

Table 11 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search

119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651

0 mdash mdash mdash mdash mdash01 10 03 05940 07120 0622402 10 025 05626 07462 0614603 10 015 05282 07992 0609504 10 010 05223 07648 0594005 10 005 04672 08388 0572606 15 005 04547 08695 05707

should be reviewed by the experts in order to reveal if it wasassigned any odd labels Otherwise the observed results areguaranteed to be lower than the real ones

Another way to evaluate the classification results andperformance is to artificially ldquodepleterdquo a completely-labeleddataset We did it with the Yeast dataset Our experimentswith the modification methods applied to artificially par-tially delabeled Yeast biological dataset confirmed that ourapproach significantly improves classification performance ofSVM on incompletely labeled datasets

Moreover the experimental results presented inSection 452 prove a notable aspect When we artificiallymade an incompletely labeled training set and then usedour label restoration techniques on it the 1198651 measure forSVM classification was even greater then for the originalcompletely-labeled set

Hence we are confident that combining the WkNNtraining set modification procedure with the SVM or RFalgorithms will be practically useful to scientists and analystswhen addressing the problem of incompletely labeled train-ing sets

Conflict of Interests

The authors declare that there is no conflict of interests

Acknowledgments

The authors thank Dr Charles R Cantor (Chief ScientificOfficer at Sequenom Inc and Professor at Boston University)for his contribution to this work The authors would like tothank the UMA Foundation for its help in preparation ofthe paper They would like to thank the reviewers for manyconstructive and meaningful comments and suggestions thathelped improve the paper and laid the foundation for furtherresearch

References

[1] S M Hosseini M J Abdi and M Rezghi ldquoA novel weightedsupport vector machine based on particle swarm optimizationfor gene selection and tumor classificationrdquo Computational andMathematicalMethods inMedicine vol 2012 Article ID 3206987 pages 2012

[2] S C Li J Liu and X Luo ldquoIterative reweighted nonintegernorm regularizing svm for gene expression data classificationrdquoComputational and Mathematical Methods in Medicine vol2013 Article ID 768404 10 pages 2013

[3] Z Gao Y Su A Liu T Hao and Z Yang ldquoNonnegative mixed-norm convex optimization for mitotic cell detection in phasecontrast microscopyrdquo Computational and Mathematical Meth-ods in Medicine vol 2013 Article ID 176272 10 pages 2013

[4] I IKatakis G Tsoumakas and I VlahavasMining Multi-LabelData in Data Mining and Knowledge Discovery HandbookSpringer US 2010

[5] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquo Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003

[6] G Yizhang Methods for Pattern Classification New Advancesin Machine Learning 2010

[7] E Leopold and J Kindermann ldquoText categorization with sup-port vector machines How to represent texts in input spacerdquoMachine Learning vol 46 no 1ndash3 pp 423ndash444 2002

Computational and Mathematical Methods in Medicine 11

[8] International Aging Research Portfolio httpwwwagingport-folioorgwikidokuphpid=datapage

[9] A Esuli and F Sebastiani ldquoTraining data cleaning for textclassificationrdquo inProceedings of the 2nd International Conferenceon the Theory of Information Retrieval (ICTIR rsquo09) pp 29ndash41Cambridge UK 2009

[10] J Tang Z Chen A W Fu and D W Cheung ldquoCapabilitiesof outlier detection schemes in large datasets framework andmethodologiesrdquoKnowledge and Information Systems vol 11 no1 pp 45ndash84 2007

[11] N G Zagoruiko I A Borisova V V Dyubanov and O AKutnenko ldquoMethods of recognition based on the function ofrival similarityrdquo Pattern Recognition and Image Analysis vol 18no 1 pp 1ndash6 2008

[12] L H Lee C H Wan T F Yong and H M Kok ldquoA review ofnearest neighbor-support vector machines hybrid classificationmodelsrdquo Journal of Applied Sciences vol 10 no 17 pp 1841ndash18582010

[13] A Subramanya and J Bilmes ldquoSoft-supervised learning for textclassificationrdquo in Proceedings of the Conference on EmpiricalMethods in Natural Language Processing pp 1090ndash1099 Octo-ber 2008

[14] A Subramanya and J Bilmes ldquoEntropic graph regularization innon-parametric semi-supervised classificationrdquo in Proceedingsof the 23rd Annual Conference on Neural Information ProcessingSystems (NIPS rsquo09) pp 1803ndash1811 December 2009

[15] T Joachims ldquoText categorizationwith support vectormachineslearningwithmany relevant featuresrdquo inConference onMachineLearning (ECML rsquo98) pp 137ndash142 Springer Berlin Germany1998

[16] P Hart and T Cover ldquoNearest neighbor pattern classificationrdquoIEEE Transactions on Information Theory vol 13 no 1 pp 21ndash27 1967

[17] P Raghavan C D Manning and H Schuetze An Introductionto Information Retrieval Cambridge University Press Cam-bridge UK 2009

[18] A Zhavoronkov and C R Cantor ldquoMethods for structuringscientific knowledge frommany areas related to aging researchrdquoPLoS ONE vol 6 no 7 Article ID e22597 2011

[19] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLIBLINEAR a library for large linear classificationrdquo Journal ofMachine Learning Research vol 9 pp 1871ndash1874 2008

[20] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[21] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[22] MHall E Frank G Holmes B P fahringer P Reutemann andI HWitten ldquoThe weka data mining software an updaterdquo ACMSIGKDD Explorations Newsletter vol 11 pp 10ndash18 2009

[23] International aging research portfolio httpagingportfolioorg

[24] K Frantzi S Ananiadou andHMima ldquoAutomatic recognitionof multiword termsrdquo International Journal of Digital Librariesvol 3 pp 117ndash132 2000

[25] S E Robertson S Walker S Jones et al ldquoOkapi at trec-3rdquo inProceedings of the 3rd Text Retrieval Conference pp 109ndash126Gaitherburg Md USA 1994

[26] F Sebastiani ldquoText categorizationrdquo Text Mining and Its Applica-tions vol 34 pp 109ndash129 2005

[27] S Godbole and S Sarawagi ldquoDiscriminativemethods formulti-labeled classificationrdquo in Proceedings of the 8th Pacific-AsiaConference (PAKDD rsquo04) pp 22ndash30 May 2004

[28] S J Swamidass C-A Azencott K Daily and P Baldi ldquoACROCstronger than ROC measuring visualizing and optimizingearly retrievalrdquo Bioinformatics vol 26 no 10 Article ID btq140pp 1348ndash1356 2010

[29] Random forests leo breiman and adele cutler 2013 httpwwwstatberkeleyedusimbreimanRandomForestscc manualhtm

[30] A Elisseeff and J Weston Advances in Neural InformationProcessing Systems vol 14 2001

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 5

provided a 3 increase by the 1198651-measure compared to thegeneral ldquobag-of-wordsrdquo model We assigned feature weightsaccording to the TFIDF rule in the BM25 formulation [25 26]and then normalized vectors representing the documents inthe Euclidean metric of 119899-dimensional space

422 Yeast Dataset Yeast dataset [27] is a biomedical datasetof Yeast genes divided into 14 different functional classesEach instance in the dataset is a gene represented by a vectorwhose features are the microarray expression levels undervarious conditions We used it to reveal if our methods aresuitable for the classification of genetic information as well asfor textual data

Let us describe the methods for an incomplete datasetmodeling Since the dataset is well annotated andwidely usedthe objects (genes) have complete sets of category labels Byrandom deletion of labels from documents wemade amodelof the incomplete sets of labels in the training set Parameter119901 was introduced as the fraction of deleted labels

We deleted labels using the following conditions

(1) for each class 119901 represents the fraction of deletedlabelsWe keep the distribution of labels by categoriesafter modeling the incomplete sets of labels

(2) the number of objects in the training set remains thesame At least one label is preserved after the labeldeletion process

No preprocessing step was necessary because the data issupplied already prepared as a matrix of numbers [27]

43 Performance Measurements The following characteris-tics were used to evaluate and compare different classificationalgorithms

(i) Microaveraged precision recall and1198651-measure [27](ii) CROC curves and their AUC values computed for

selected categories [28] CROC curve is a modifi-cation of a ROC curve where 119909 axis is rescaledas 119909new(119909) We used a standard exponent scaling119909new(119909) = (1 minus 119890

minus120572119909)(1 minus 119890

minus120572) with 120572 = 7

44 Experimental Setup Theprocedures for selecting impor-tant parameters of the algorithms outlined are described next

441 Parameters for SVM

AgingPortfolio Dataset The following SVM parameters weretuned for each decision rule

(i) cost parameter 119862 controls a trade-off between maxi-mization of the separation margin and minimizationof the total error [15]

(ii) parameter 119887119897 that plays the role of a classification thre-shold in the decision rule

We performed a parameter tuning by using a slidingcontrol method with 5-fold cross-validation according to

the following strategyThe 119862-parameter was varied on a gridfollowed by 119887119897-parameter (for every value of 119862) tuning forevery category A set PC = (119862 119887119897)119897isin119871 of parameter pairswas considered optimal if it maximized the 1198651-measure withaveraging over documentsWhile119862 has the same value for allcategories the 119887119897 threshold parameter was tuned (for a givenvalue of 119862) for each class label 119897 isin 119871

Yeast Dataset In experiments with the Yeast dataset theselection of SVM parameters was not performed (ie 119862 = 1119887119897 = 0 for all values of 119897 isin 119871)

442 Parameters for RF The 30 solution trees were used tobuild the Random Forest The number of inputs to considerwhile splitting a tree node is the square root of featuresrsquonumber The procedure was done according to the [29] LeoBreiman who developed the Random Forest Algorithm

443 AgingPortfolio Dataset Parameters 119896 and119879were tunedon a grid as follows We prepared a validation set 119863devof 183 documents as in the case of the test set The per-formance metrics for SVM classifiers trained on modifieddocument sets were then evaluated on 119863dev A combinationof parameters was considered optimal if it maximized the 1198651-measure Parameter 120583 of the SoftSL algorithm was tuned ona grid keeping 119896 fixed at its optimal value A fixed categoryassignment threshold of 119879 = 0005 is used for the SoftSLtraining set modification algorithm We used ] = 0 sinceall documents in the experiments contained some categorylabels and regularization was unnecessary

444 Yeast Dataset Themethod for selecting the parametersfor the Yeast dataset is the same as in the AgingPortfolio119863devwas composed of 300 (20 of 1 500) randomly selected genesfor training The SoftSL training set modification algorithmwas not used for this dataset

45 A Comparison of Methods

451 AgingPortfolio We evaluated the general performancebased on a total of 62 categories that contained at least 30documents in the training set and at least 7 documents inthe test set The results for precision recall and 1198651-measureare presented in Table 1 It is evident that a parameter tuningsignificantly boosts both precision and recall

Also all of our training set modification methods paylower precision for higher recall values If we consider 1198651measure as a general quality function such trade-offmay lookquiet reasonable especially for add+WkNNmethod

The average numbers of training set categories per doc-ument and documents per category are listed in Table 2 Aswe can see the SoftSL approach alters the training set moresignificantly As a result a larger number of relevancy tagsare added This is consistent with higher recall and lowerprecision values of add+SoftSL and del+SoftSL as comparedto WkNN-based methods in Table 1

Figure 2 compares CROC curves for representative cate-gories of AgingPortfolio dataset computed for SVM without

6 Computational and Mathematical Methods in Medicine

Table 1 Microaveraged precision recall and 1198651-measure (1198651)obtained on AgingPortfilio dataset with different classificationmethods

Method Precision Recall 1198651

SVM with fixed parameters 08649 01983 02977SVM with parameter tuning 07727 03302 04159SVM del+WkNN 05538 04452 04439SVM add+WkNN 04664 05684 04707SVM del+SoftSL 02132 06914 03259SVM add+SoftSL 03850 05639 04576

Table 2 Average number of categories per document and docu-ments per category in AgingPortfolio training set before and aftermodification

Modification method Categories per doc Docs in categoryNo modification 44 4509Add+WkNN 1515 15514Add+SoftSL 166 16887

training set modifications SVM with del+WkNN modifi-cation and SVM with add+WkNN modification We cannotice that SVM classification with incorporate training setmodification outperforms simple SVM classification

AUC values calculated for the del+WkNN curves aregenerally only slightly lower and in some cases even exceedthe corresponding values for add+WkNNA similar situationcan be seen in Figure 3 where CROC curves are comparedwith SVM del+SoftSL and add+SoftSL

CROC curves for add+WkNN and add+SoftSL SVMclassifiers are compared in Figure 4 It is difficult to determinea ldquowinner hererdquo In most of the cases the results are prettyequivalent Sometimes add+WkNN looks slightly worse thanadd+SoftSL and sometimes add+WkNN has a good advan-tage against add+SoftSL

Additional data relevant to the algorithm comparison ispresented in Tables 3 4 5 and 6 There are precision recalland 1198651 measure for different categories taken with differentmethods These results are more relief it can be seen thatadd+WkNN outperforms the other methods

Some values of the metrics of the Random Forest Clas-sification experiments are provided in Table 7 The resultsin Table 7 show that the modification of the training setsimproves the classification performance in this case as well

452 Yeast Dataset The Comparison of the ExperimentalResults The dataset is made of 2 417 examples Each objectis related to one or more of the 14 labels (1st FunCat level)with 42 labels per example in average The standard method[30] is used to separate the objects into training and test setsso that the first kind of sets contains 1500 examples and thesecond contains 917

The method for modeling the incomplete dataset and thecomparison is described above in Section 422 We created6 different training sets by deleting a varying fraction (119901) ofdocument-class pairs The concrete document-class pairs fordeletion were selected randomly

Table 3: Microaveraged results for the category “experimental techniques, in vivo methods” (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.6 | 0.12 | 0.2
With del+WkNN | 0.7879 | 0.26 | 0.391
With add+WkNN | 0.66 | 0.33 | 0.44
With del+SoftSL | 0.2140 | 0.64 | 0.3208
With add+SoftSL | 0.4653 | 0.47 | 0.4677

Table 4: Microaveraged results for the category “cancer and related diseases, malignant neoplasms, including in situ” (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.4444 | 0.4 | 0.4211
With del+WkNN | 0.1277 | 0.9 | 0.2236
With add+WkNN | 0.24 | 0.9 | 0.3789
With del+SoftSL | 0.0549 | 1.0 | 0.1040
With add+SoftSL | 0.1032 | 0.975 | 0.1866

Table 5: Microaveraged results for the category “aging mechanisms by anatomy, cell level” (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.5167 | 0.3827 | 0.4397
With del+WkNN | 0.5946 | 0.2716 | 0.3729
With add+WkNN | 0.4123 | 0.5802 | 0.4821
With del+SoftSL | 0.5342 | 0.4815 | 0.5065
With add+SoftSL | 0.4182 | 0.5679 | 0.4817

Table 6: Microaveraged results for the category “aging mechanisms by anatomy, cell level, cellular substructures” (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.52 | 0.2167 | 0.3059
With del+WkNN | 1.0 | 0.0167 | 0.0328
With add+WkNN | 0.4493 | 0.5167 | 0.4806
With del+SoftSL | 0.75 | 0.15 | 0.25
With add+SoftSL | 0.3667 | 0.55 | 0.44

Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification and different training set modifications.

Method | Precision | Recall | F1
RF only | 0.4738 | 0.2033 | 0.2467
With del+WkNN | 0.4852 | 0.2507 | 0.2870
With add+WkNN | 0.3058 | 0.4255 | 0.3194

We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, unmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.

Figure 2: CROC curves (recall versus transformed false positive rate) for different categories of AgingPortfolio (SVM classification with WkNN modification); each panel also shows the random-selection baseline. (a) “Experimental techniques, in vivo methods”: SVM only, AUC = 0.3919; SVM add+WkNN, AUC = 0.4323; SVM del+WkNN, AUC = 0.4181. (b) “Cancer and related diseases, malignant neoplasms (including in situ)”: SVM only, AUC = 0.6830; SVM add+WkNN, AUC = 0.7484; SVM del+WkNN, AUC = 0.7559. (c) “Aging mechanisms by anatomy, cell level”: SVM only, AUC = 0.4843; SVM add+WkNN, AUC = 0.5340; SVM del+WkNN, AUC = 0.5238. (d) “Aging mechanisms by anatomy, cell level, cellular substructures”: SVM only, AUC = 0.4892; SVM add+WkNN, AUC = 0.5603; SVM del+WkNN, AUC = 0.5460.

The results of SVM classification with the add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1 measure in comparison with the raw SVM results (Table 8).

A notable fact: add+WkNN slightly reduced precision at low p, but in the worst cases, p = 0.3 and p = 0.4, precision even rose. In any case, the significant improvement in recall at all values of p is a good trade-off. Recall also improved significantly when the RF algorithm was used together with this method (Tables 10 and 11).
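The optimal k and T values reported in Tables 9 and 11 are acquired via grid search. A minimal sketch of such a search, assuming user-supplied callables restore(k, t) (a label-restoration routine such as WkNN) and evaluate_f1 (train a classifier on the restored labels and return its microaveraged F1 on a validation split); the grids shown are illustrative, though consistent with the values appearing in the tables:

```python
def grid_search_wknn(restore, evaluate_f1,
                     k_grid=(5, 10, 15),
                     t_grid=(0.05, 0.10, 0.15, 0.25, 0.30)):
    """Exhaustive search for the WkNN parameters k and T.

    restore(k, t)       -> training label sets restored with WkNN(k, T)
    evaluate_f1(labels) -> microaveraged F1 of a classifier trained on
                           the restored labels (e.g., on a validation split)
    """
    best_k, best_t, best_f1 = None, None, -1.0
    for k in k_grid:
        for t in t_grid:
            f1 = evaluate_f1(restore(k, t))
            if f1 > best_f1:
                best_k, best_t, best_f1 = k, t, f1
    return best_k, best_t, best_f1
```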

5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when, for any document in the training set, some correct labels may be omitted).

Figure 3: CROC curves (recall versus transformed false positive rate) for different categories of AgingPortfolio (SVM classification with SoftSL modification); each panel also shows the random-selection baseline. (a) “Experimental techniques, in vivo methods”: SVM only, AUC = 0.3904; SVM add+SoftSL, AUC = 0.4747; SVM del+SoftSL, AUC = 0.4498. (b) “Cancer and related diseases, malignant neoplasms (including in situ)”: SVM only, AUC = 0.6818; SVM add+SoftSL, AUC = 0.6374; SVM del+SoftSL, AUC = 0.7018. (c) “Aging mechanisms by anatomy, cell level”: SVM only, AUC = 0.4847; SVM add+SoftSL, AUC = 0.5447; SVM del+SoftSL, AUC = 0.5423. (d) “Aging mechanisms by anatomy, cell level, cellular substructures”: SVM only, AUC = 0.4900; SVM add+SoftSL, AUC = 0.5757; SVM del+SoftSL, AUC = 0.5587.

The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in PU training, some of the dataset items are considered fully labeled, while the other items are not labeled at all.

To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both of these approaches are intended to restore the missing document labels using similarity measurements between each given document and other documents with similar labels.
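As a concrete illustration of the add+WkNN idea, here is a minimal sketch assuming cosine similarity between document vectors, a neighbor count k, and an acceptance threshold T on the normalized weighted vote; the exact weighting scheme used in the paper may differ:

```python
import numpy as np

def wknn_restore(vectors, label_sets, k=10, threshold=0.3):
    """Add to each document every label whose similarity-weighted share of
    the votes among its k nearest neighbors reaches the threshold T."""
    x = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    x = x / norms
    sim = x @ x.T                        # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)       # a document is not its own neighbor
    restored = [set(labels) for labels in label_sets]
    for i in range(len(restored)):
        neighbors = np.argsort(sim[i])[-k:]   # k most similar documents
        total = float(sim[i, neighbors].sum())
        votes = {}
        for j in neighbors:
            for label in label_sets[j]:
                votes[label] = votes.get(label, 0.0) + float(sim[i, j])
        for label, weight in votes.items():
            if total > 0 and weight / total >= threshold:
                restored[i].add(label)
    return restored
```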

We trained both SVM and RF on several incompletely labeled datasets, with and without pretraining label restoration. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than SoftSL: it yields a better F1-measure and, not least, it is simpler to implement.

Figure 4: Comparison of the WkNN and SoftSL modifications under SVM classification: CROC curves (recall versus transformed false positive rate, with a random-selection baseline) for different categories of AgingPortfolio. (a) “Experimental techniques, in vivo methods”: add+SoftSL, AUC = 0.4752; add+WkNN, AUC = 0.4311. (b) “Cancer and related diseases, malignant neoplasms (including in situ)”: add+SoftSL, AUC = 0.6380; add+WkNN, AUC = 0.7375. (c) “Aging mechanisms by anatomy, cell level”: add+SoftSL, AUC = 0.5454; add+WkNN, AUC = 0.5348. (d) “Aging mechanisms by anatomy, cell level, cellular substructures”: add+SoftSL, AUC = 0.5757; add+WkNN, AUC = 0.5541.


Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.

One more problem is the evaluation of classification results on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is additional manual post factum validation: any document classification result should be reviewed by experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.


Table 8: Microaveraged results for SVM trained on the “incompletely labeled” Yeast dataset (with different fractions of deleted labels p).

p parameter | Precision | Recall | F1
0 | 0.7176 | 0.5707 | 0.6358
0.1 | 0.7337 | 0.5233 | 0.6109
0.2 | 0.7354 | 0.4056 | 0.5229
0.3 | 0.6260 | 0.2442 | 0.3513
0.4 | 0.3544 | 0.1191 | 0.1783
0.5 | 0 | 0 | 0
0.6 | 0 | 0 | 0

Table 9: Microaveraged results for SVM trained on the “incompletely labeled” Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p parameter | Optimal k | Optimal T | Precision | Recall | F1
0 | — | — | — | — | —
0.1 | 10 | 0.3 | 0.6582 | 0.6847 | 0.6712
0.2 | 10 | 0.25 | 0.6525 | 0.6811 | 0.6665
0.3 | 10 | 0.15 | 0.6357 | 0.7137 | 0.6725
0.4 | 10 | 0.1 | 0.6604 | 0.669 | 0.6648
0.5 | 10 | 0.05 | 0.6225 | 0.7259 | 0.6702
0.6 | 10 | 0.05 | 0.6248 | 0.7261 | 0.6716

Table 10: Microaveraged results for RF trained on the “incompletely labeled” Yeast dataset (with different fractions of deleted labels p).

p parameter | Precision | Recall | F1
0 | 0.6340 | 0.5087 | 0.5315
0.1 | 0.6081 | 0.4613 | 0.4959
0.2 | 0.5693 | 0.3648 | 0.4133
0.3 | 0.5788 | 0.3240 | 0.3873
0.4 | 0.5068 | 0.2471 | 0.3094
0.5 | 0.5194 | 0.2431 | 0.3104
0.6 | 0.4354 | 0.1621 | 0.2224

Table 11: Microaveraged results for RF trained on the “incompletely labeled” Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p parameter | Optimal k | Optimal T | Precision | Recall | F1
0 | — | — | — | — | —
0.1 | 10 | 0.3 | 0.5940 | 0.7120 | 0.6224
0.2 | 10 | 0.25 | 0.5626 | 0.7462 | 0.6146
0.3 | 10 | 0.15 | 0.5282 | 0.7992 | 0.6095
0.4 | 10 | 0.10 | 0.5223 | 0.7648 | 0.5940
0.5 | 10 | 0.05 | 0.4672 | 0.8388 | 0.5726
0.6 | 15 | 0.05 | 0.4547 | 0.8695 | 0.5707


Another way to evaluate the classification results and performance is to artificially “deplete” a completely labeled dataset, as we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.

Moreover, the experimental results presented in Section 4.5.2 demonstrate a notable effect: when we artificially made an incompletely labeled training set and then applied our label restoration techniques to it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.

Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.

Conflict of Interests

The authors declare that there is no conflict of interests.

Acknowledgments

The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom Inc. and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.

References

[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, “A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification,” Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.

[2] S. C. Li, J. Liu, and X. Luo, “Iterative reweighted noninteger norm regularizing SVM for gene expression data classification,” Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.

[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, “Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy,” Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.

[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.

[5] G. Forman, “An extensive empirical study of feature selection metrics for text classification,” Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.

[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.

[7] E. Leopold and J. Kindermann, “Text categorization with support vector machines: how to represent texts in input space?” Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.

[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.

[9] A. Esuli and F. Sebastiani, “Training data cleaning for text classification,” in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR ’09), pp. 29–41, Cambridge, UK, 2009.

[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, “Capabilities of outlier detection schemes in large datasets, framework and methodologies,” Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.

[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, “Methods of recognition based on the function of rival similarity,” Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.

[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, “A review of nearest neighbor-support vector machines hybrid classification models,” Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.

[13] A. Subramanya and J. Bilmes, “Soft-supervised learning for text classification,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.

[14] A. Subramanya and J. Bilmes, “Entropic graph regularization in non-parametric semi-supervised classification,” in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS ’09), pp. 1803–1811, December 2009.

[15] T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” in European Conference on Machine Learning (ECML ’98), pp. 137–142, Springer, Berlin, Germany, 1998.

[16] P. Hart and T. Cover, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.

[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.

[18] A. Zhavoronkov and C. R. Cantor, “Methods for structuring scientific knowledge from many areas related to aging research,” PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.

[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: a library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[20] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[21] R. Díaz-Uriarte and S. Alvarez de Andres, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, vol. 7, article 3, 2006.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.

[23] International Aging Research Portfolio, http://agingportfolio.org.

[24] K. Frantzi, S. Ananiadou, and H. Mima, “Automatic recognition of multiword terms,” International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.

[25] S. E. Robertson, S. Walker, S. Jones et al., “Okapi at TREC-3,” in Proceedings of the 3rd Text Retrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.

[26] F. Sebastiani, “Text categorization,” in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.

[27] S. Godbole and S. Sarawagi, “Discriminative methods for multi-labeled classification,” in Proceedings of the 8th Pacific-Asia Conference (PAKDD ’04), pp. 22–30, May 2004.

[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, “A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval,” Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348–1356, 2010.

[29] L. Breiman and A. Cutler, Random Forests, 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.

[30] A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” in Advances in Neural Information Processing Systems, vol. 14, 2001.

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

6 Computational and Mathematical Methods in Medicine

Table 1 Microaveraged precision recall and 1198651-measure (1198651)obtained on AgingPortfilio dataset with different classificationmethods

Method Precision Recall 1198651

SVM with fixed parameters 08649 01983 02977SVM with parameter tuning 07727 03302 04159SVM del+WkNN 05538 04452 04439SVM add+WkNN 04664 05684 04707SVM del+SoftSL 02132 06914 03259SVM add+SoftSL 03850 05639 04576

Table 2 Average number of categories per document and docu-ments per category in AgingPortfolio training set before and aftermodification

Modification method Categories per doc Docs in categoryNo modification 44 4509Add+WkNN 1515 15514Add+SoftSL 166 16887

training set modifications SVM with del+WkNN modifi-cation and SVM with add+WkNN modification We cannotice that SVM classification with incorporate training setmodification outperforms simple SVM classification

AUC values calculated for the del+WkNN curves aregenerally only slightly lower and in some cases even exceedthe corresponding values for add+WkNNA similar situationcan be seen in Figure 3 where CROC curves are comparedwith SVM del+SoftSL and add+SoftSL

CROC curves for add+WkNN and add+SoftSL SVMclassifiers are compared in Figure 4 It is difficult to determinea ldquowinner hererdquo In most of the cases the results are prettyequivalent Sometimes add+WkNN looks slightly worse thanadd+SoftSL and sometimes add+WkNN has a good advan-tage against add+SoftSL

Additional data relevant to the algorithm comparison ispresented in Tables 3 4 5 and 6 There are precision recalland 1198651 measure for different categories taken with differentmethods These results are more relief it can be seen thatadd+WkNN outperforms the other methods

Some values of the metrics of the Random Forest Clas-sification experiments are provided in Table 7 The resultsin Table 7 show that the modification of the training setsimproves the classification performance in this case as well

452 Yeast Dataset The Comparison of the ExperimentalResults The dataset is made of 2 417 examples Each objectis related to one or more of the 14 labels (1st FunCat level)with 42 labels per example in average The standard method[30] is used to separate the objects into training and test setsso that the first kind of sets contains 1500 examples and thesecond contains 917

The method for modeling the incomplete dataset and thecomparison is described above in Section 422 We created6 different training sets by deleting a varying fraction (119901) ofdocument-class pairs The concrete document-class pairs fordeletion were selected randomly

Table 3 Microaveraged results for category ldquoexperimental tech-niques in vivo methodsrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 06 012 02With del+WkNN 07879 026 0391With add+WkNN 066 033 044With del+SoftSL 02140 064 03208With add+SoftSL 04653 047 04677

Table 4 Microaveraged results for category ldquocancer and relateddiseases malignant neoplasms including in siturdquo (AgingPortfiliodataset)

Method Precision Recall 1198651

SVM only 04444 04 04211with del+WkNN 01277 09 02236with add+WkNN 024 09 03789with del+SoftSL 00549 10 01040with add+SoftSL 01032 0975 01866

Table 5 Microaveraged results for category ldquoaging mechanisms byanatomy cell levelrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 05167 03827 04397With del+WkNN 05946 02716 03729With add+WkNN 04123 05802 04821With del+SoftSL 05342 04815 05065With add+SoftSL 04182 05679 04817

Table 6 Microaveraged results for category ldquoaging mechanisms byanatomy cell level cellular substructuresrdquo (AgingPortfilio dataset)

Method Precision Recall 1198651

SVM only 052 02167 03059With del+WkNN 10 00167 00328With add+WkNN 04493 05167 04806With del+SoftSL 075 015 025With add+SoftSL 03667 055 044

Table 7 Microaveraged results for AgingPortfolio dataset obtainedwith Random Forest Classification with different training set modi-fications

Method Precision Recall 1198651

RF only 04738 02033 02467With del+WkNN 04852 02507 02870With add+WkNN 03058 04255 03194

Weproceeded some classification experiments before andafter modifying the training set To compare the methods wealso included the classification results obtained by the SVMor RF on the original nonmodified training set with the com-plete set of labels (119901 = 0) Here 119901 isin 01 02 03 04 05 06

is used

Computational and Mathematical Methods in Medicine 7

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03919

SVM add+WkNN AUC = 04323

SVM delRandom selection

+WkNN AUC = 04181

(a) ldquoExperimental techniques in vivo methodsrdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

100806040200

Transformed false positive rate

SVM only AUC = 06830

SVM add+WkNN AUC = 07484

SVM delRandom selection

+WkNN AUC = 07559

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

0806040200

Transformed false positive rate

SVM only AUC = 04843

SVM add+WkNN AUC = 05340

SVM delRandom selection

+WkNN AUC = 05238

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

100806040200

Transformed false positive rate

SVM only AUC = 04892

SVM add+WkNN AUC = 05603

SVM delRandom selection

+WkNN AUC = 05460

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cate-gory (WkNN)

Figure 2 CROC curves for different categories of AgingPortfoliio (SVM classification with WkNN analysis)

The results of SVM classification with add+WkNNtraining set modification presented in Table 9 show thatthis modification significantly improves the 1198651 measure incomparison with raw SVM results (Table 8)

A Notable Fact Add+WkNN slightly reduced the precisionon low 119901 but in the worst cases with 119901 = 03 and 119901 = 04 theprecision even rose up However the significant improve ofrecall in all cases is a good trade-off Recall also significantly

improved when the RF algorithmwas used in addition to thismethod (Tables 10 and 11)

5 Discussion

Our experiments have shown that the direct applicationof SVM or RF gives unsatisfactory results for incompletelylabeled datasets (ie when for each document in our

8 Computational and Mathematical Methods in Medicine

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03904

SVM add+SoftSL AUC = 04747

SVM delRandom selection

+SoftSL AUC = 04498

(a) ldquoExperimental techniques in vivo methodsrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 06818

SVM add+SoftSL AUC = 06374

SVM delRandom selection

+SoftSL AUC = 07018

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04847

SVM add+SoftSL AUC = 05447

SVM delRandom selection

+SoftSL AUC = 05423

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04900

SVM add+SoftSL AUC = 05757

SVM delRandom selection

+SoftSL AUC = 05587

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cat-egory (SoftSL)

Figure 3 CROC curves for different categories of AgingPortfoliio (SVM classification with SoftSL analysis)

training set some correct labels may be omitted) The caseof incompletely labeled dataset strikingly differs from thePU-learning (learning with only positive and unlabeled data)httpciteseerxistpsueduviewdocsummarydoi=1011919914 httpdlacmorgcitationcfmid=1401920 approachin case of PU training some of dataset items are consideredfully labeled and the other items are not labeled at all

To overcome the problem we proposed two differentprocedures for training set modification WkNN and SoftSL

Both of these approaches are intended to restore the missingdocument labels using different similarity measurementsbetween each given document and other documents withsimilar labels

We trained both SVM and RF on several incompletelylabeled datasets with pretraining label restoration and with-out it According to our experimental results the labelrestorationmethods were able to improve the performance ofboth SVM and RF In our opinion WkNN works better than

Computational and Mathematical Methods in Medicine 9

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC =

=

04752

SVM add+WkNN AUC 04311

(a) ldquoExperimental techniques in vivo methodsrdquo category

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 06380

= 07375SVM add+WkNN AUC

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category

=

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 05454

SVM add+WkNN AUC 05348

(c) ldquoAging mechanisms by anatomy cell levelrdquo category

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 05757

= 05541SVM add+WkNN AUC

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquocategory

Figure 4 Comparison of WkNN and SoftSL analysis with SVM classification CROC curves for different categories of AgingPortfolio

SoftSL it has a better 1198651-measure than SoftSL and at last itis simpler to implement

Furthermore the comparison of CROC curves for thedifferent methods demonstrated that the classifiers performslightly worse for some categories and better for others Thispattern appears for classifiers trained on document sets whereelements identified as relevant are removed from the nega-tive examples These observations can be attributed to bettertuning of the classification threshold as additional relevant

documents are added This is a particularly important aspectfor categories containing a small number of documentswhereadditional information about a given category allows betterselection of the classification threshold

One more problem is the evaluation of the incompletelylabeled dataset classification results and performance sincethe labels in the test set are incomplete as well One way toovercome this problem is to perform the additional manualpost factum validation any document classification result

10 Computational and Mathematical Methods in Medicine

Table 8 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)

119901 parameter Precision Recall 1198651

0 07176 05707 0635801 07337 05233 0610902 07354 04056 0522903 06260 02442 0351304 03544 01191 0178305 0 0 006 0 0 0

Table 9 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search

119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651

0 mdash mdash mdash mdash mdash01 10 03 06582 06847 0671202 10 025 06525 06811 0666503 10 015 06357 07137 0672504 10 01 06604 0669 0664805 10 005 06225 07259 0670206 10 005 06248 07261 06716

Table 10 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)

119901 parameter Precision Recall 1198651

0 06340 05087 0531501 06081 04613 0495902 05693 03648 0413303 05788 03240 0387304 05068 02471 0309405 05194 02431 0310406 04354 01621 02224

Table 11 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search

119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651

0 mdash mdash mdash mdash mdash01 10 03 05940 07120 0622402 10 025 05626 07462 0614603 10 015 05282 07992 0609504 10 010 05223 07648 0594005 10 005 04672 08388 0572606 15 005 04547 08695 05707

should be reviewed by the experts in order to reveal if it wasassigned any odd labels Otherwise the observed results areguaranteed to be lower than the real ones

Another way to evaluate the classification results andperformance is to artificially ldquodepleterdquo a completely-labeleddataset We did it with the Yeast dataset Our experimentswith the modification methods applied to artificially par-tially delabeled Yeast biological dataset confirmed that ourapproach significantly improves classification performance ofSVM on incompletely labeled datasets

Moreover the experimental results presented inSection 452 prove a notable aspect When we artificiallymade an incompletely labeled training set and then usedour label restoration techniques on it the 1198651 measure forSVM classification was even greater then for the originalcompletely-labeled set

Hence we are confident that combining the WkNNtraining set modification procedure with the SVM or RFalgorithms will be practically useful to scientists and analystswhen addressing the problem of incompletely labeled train-ing sets

Conflict of Interests

The authors declare that there is no conflict of interests

Acknowledgments

The authors thank Dr Charles R Cantor (Chief ScientificOfficer at Sequenom Inc and Professor at Boston University)for his contribution to this work The authors would like tothank the UMA Foundation for its help in preparation ofthe paper They would like to thank the reviewers for manyconstructive and meaningful comments and suggestions thathelped improve the paper and laid the foundation for furtherresearch

References

[1] S M Hosseini M J Abdi and M Rezghi ldquoA novel weightedsupport vector machine based on particle swarm optimizationfor gene selection and tumor classificationrdquo Computational andMathematicalMethods inMedicine vol 2012 Article ID 3206987 pages 2012

[2] S C Li J Liu and X Luo ldquoIterative reweighted nonintegernorm regularizing svm for gene expression data classificationrdquoComputational and Mathematical Methods in Medicine vol2013 Article ID 768404 10 pages 2013

[3] Z Gao Y Su A Liu T Hao and Z Yang ldquoNonnegative mixed-norm convex optimization for mitotic cell detection in phasecontrast microscopyrdquo Computational and Mathematical Meth-ods in Medicine vol 2013 Article ID 176272 10 pages 2013

[4] I IKatakis G Tsoumakas and I VlahavasMining Multi-LabelData in Data Mining and Knowledge Discovery HandbookSpringer US 2010

[5] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquo Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003

[6] G Yizhang Methods for Pattern Classification New Advancesin Machine Learning 2010

[7] E Leopold and J Kindermann ldquoText categorization with sup-port vector machines How to represent texts in input spacerdquoMachine Learning vol 46 no 1ndash3 pp 423ndash444 2002

Computational and Mathematical Methods in Medicine 11

[8] International Aging Research Portfolio httpwwwagingport-folioorgwikidokuphpid=datapage

[9] A Esuli and F Sebastiani ldquoTraining data cleaning for textclassificationrdquo inProceedings of the 2nd International Conferenceon the Theory of Information Retrieval (ICTIR rsquo09) pp 29ndash41Cambridge UK 2009

[10] J Tang Z Chen A W Fu and D W Cheung ldquoCapabilitiesof outlier detection schemes in large datasets framework andmethodologiesrdquoKnowledge and Information Systems vol 11 no1 pp 45ndash84 2007

[11] N G Zagoruiko I A Borisova V V Dyubanov and O AKutnenko ldquoMethods of recognition based on the function ofrival similarityrdquo Pattern Recognition and Image Analysis vol 18no 1 pp 1ndash6 2008

[12] L H Lee C H Wan T F Yong and H M Kok ldquoA review ofnearest neighbor-support vector machines hybrid classificationmodelsrdquo Journal of Applied Sciences vol 10 no 17 pp 1841ndash18582010

[13] A Subramanya and J Bilmes ldquoSoft-supervised learning for textclassificationrdquo in Proceedings of the Conference on EmpiricalMethods in Natural Language Processing pp 1090ndash1099 Octo-ber 2008

[14] A Subramanya and J Bilmes ldquoEntropic graph regularization innon-parametric semi-supervised classificationrdquo in Proceedingsof the 23rd Annual Conference on Neural Information ProcessingSystems (NIPS rsquo09) pp 1803ndash1811 December 2009

[15] T Joachims ldquoText categorizationwith support vectormachineslearningwithmany relevant featuresrdquo inConference onMachineLearning (ECML rsquo98) pp 137ndash142 Springer Berlin Germany1998

[16] P Hart and T Cover ldquoNearest neighbor pattern classificationrdquoIEEE Transactions on Information Theory vol 13 no 1 pp 21ndash27 1967

[17] P Raghavan C D Manning and H Schuetze An Introductionto Information Retrieval Cambridge University Press Cam-bridge UK 2009

[18] A Zhavoronkov and C R Cantor ldquoMethods for structuringscientific knowledge frommany areas related to aging researchrdquoPLoS ONE vol 6 no 7 Article ID e22597 2011

[19] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLIBLINEAR a library for large linear classificationrdquo Journal ofMachine Learning Research vol 9 pp 1871ndash1874 2008

[20] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[21] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[22] MHall E Frank G Holmes B P fahringer P Reutemann andI HWitten ldquoThe weka data mining software an updaterdquo ACMSIGKDD Explorations Newsletter vol 11 pp 10ndash18 2009

[23] International aging research portfolio httpagingportfolioorg

[24] K Frantzi S Ananiadou andHMima ldquoAutomatic recognitionof multiword termsrdquo International Journal of Digital Librariesvol 3 pp 117ndash132 2000

[25] S E Robertson S Walker S Jones et al ldquoOkapi at trec-3rdquo inProceedings of the 3rd Text Retrieval Conference pp 109ndash126Gaitherburg Md USA 1994

[26] F Sebastiani ldquoText categorizationrdquo Text Mining and Its Applica-tions vol 34 pp 109ndash129 2005

[27] S Godbole and S Sarawagi ldquoDiscriminativemethods formulti-labeled classificationrdquo in Proceedings of the 8th Pacific-AsiaConference (PAKDD rsquo04) pp 22ndash30 May 2004

[28] S J Swamidass C-A Azencott K Daily and P Baldi ldquoACROCstronger than ROC measuring visualizing and optimizingearly retrievalrdquo Bioinformatics vol 26 no 10 Article ID btq140pp 1348ndash1356 2010

[29] Random forests leo breiman and adele cutler 2013 httpwwwstatberkeleyedusimbreimanRandomForestscc manualhtm

[30] A Elisseeff and J Weston Advances in Neural InformationProcessing Systems vol 14 2001

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Computational and Mathematical Methods in Medicine 7

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03919

SVM add+WkNN AUC = 04323

SVM delRandom selection

+WkNN AUC = 04181

(a) ldquoExperimental techniques in vivo methodsrdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

100806040200

Transformed false positive rate

SVM only AUC = 06830

SVM add+WkNN AUC = 07484

SVM delRandom selection

+WkNN AUC = 07559

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

0806040200

Transformed false positive rate

SVM only AUC = 04843

SVM add+WkNN AUC = 05340

SVM delRandom selection

+WkNN AUC = 05238

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (WkNN)

10

08

06

04

02

00

Reca

ll

100806040200

Transformed false positive rate

SVM only AUC = 04892

SVM add+WkNN AUC = 05603

SVM delRandom selection

+WkNN AUC = 05460

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cate-gory (WkNN)

Figure 2 CROC curves for different categories of AgingPortfoliio (SVM classification with WkNN analysis)

The results of SVM classification with add+WkNNtraining set modification presented in Table 9 show thatthis modification significantly improves the 1198651 measure incomparison with raw SVM results (Table 8)

A Notable Fact Add+WkNN slightly reduced the precisionon low 119901 but in the worst cases with 119901 = 03 and 119901 = 04 theprecision even rose up However the significant improve ofrecall in all cases is a good trade-off Recall also significantly

improved when the RF algorithmwas used in addition to thismethod (Tables 10 and 11)

5 Discussion

Our experiments have shown that the direct applicationof SVM or RF gives unsatisfactory results for incompletelylabeled datasets (ie when for each document in our

8 Computational and Mathematical Methods in Medicine

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03904

SVM add+SoftSL AUC = 04747

SVM delRandom selection

+SoftSL AUC = 04498

(a) ldquoExperimental techniques in vivo methodsrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 06818

SVM add+SoftSL AUC = 06374

SVM delRandom selection

+SoftSL AUC = 07018

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04847

SVM add+SoftSL AUC = 05447

SVM delRandom selection

+SoftSL AUC = 05423

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04900

SVM add+SoftSL AUC = 05757

SVM delRandom selection

+SoftSL AUC = 05587

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cat-egory (SoftSL)

Figure 3 CROC curves for different categories of AgingPortfoliio (SVM classification with SoftSL analysis)

training set some correct labels may be omitted) The caseof incompletely labeled dataset strikingly differs from thePU-learning (learning with only positive and unlabeled data)httpciteseerxistpsueduviewdocsummarydoi=1011919914 httpdlacmorgcitationcfmid=1401920 approachin case of PU training some of dataset items are consideredfully labeled and the other items are not labeled at all

To overcome the problem we proposed two differentprocedures for training set modification WkNN and SoftSL

Both of these approaches are intended to restore the missingdocument labels using different similarity measurementsbetween each given document and other documents withsimilar labels

We trained both SVM and RF on several incompletelylabeled datasets with pretraining label restoration and with-out it According to our experimental results the labelrestorationmethods were able to improve the performance ofboth SVM and RF In our opinion WkNN works better than

Computational and Mathematical Methods in Medicine 9

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC =

=

04752

SVM add+WkNN AUC 04311

(a) ldquoExperimental techniques in vivo methodsrdquo category

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 06380

= 07375SVM add+WkNN AUC

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category

=

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 05454

SVM add+WkNN AUC 05348

(c) ldquoAging mechanisms by anatomy cell levelrdquo category

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM addRandom selection

+SoftSL AUC = 05757

= 05541SVM add+WkNN AUC

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquocategory

Figure 4 Comparison of WkNN and SoftSL analysis with SVM classification CROC curves for different categories of AgingPortfolio

SoftSL it has a better 1198651-measure than SoftSL and at last itis simpler to implement

Furthermore the comparison of CROC curves for thedifferent methods demonstrated that the classifiers performslightly worse for some categories and better for others Thispattern appears for classifiers trained on document sets whereelements identified as relevant are removed from the nega-tive examples These observations can be attributed to bettertuning of the classification threshold as additional relevant

documents are added This is a particularly important aspectfor categories containing a small number of documentswhereadditional information about a given category allows betterselection of the classification threshold

One more problem is the evaluation of the incompletelylabeled dataset classification results and performance sincethe labels in the test set are incomplete as well One way toovercome this problem is to perform the additional manualpost factum validation any document classification result

10 Computational and Mathematical Methods in Medicine

Table 8 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)

119901 parameter Precision Recall 1198651

0 07176 05707 0635801 07337 05233 0610902 07354 04056 0522903 06260 02442 0351304 03544 01191 0178305 0 0 006 0 0 0

Table 9 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search

119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651

0 mdash mdash mdash mdash mdash01 10 03 06582 06847 0671202 10 025 06525 06811 0666503 10 015 06357 07137 0672504 10 01 06604 0669 0664805 10 005 06225 07259 0670206 10 005 06248 07261 06716

Table 10 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)

119901 parameter Precision Recall 1198651

0 06340 05087 0531501 06081 04613 0495902 05693 03648 0413303 05788 03240 0387304 05068 02471 0309405 05194 02431 0310406 04354 01621 02224

Table 11 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search

119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651

0 mdash mdash mdash mdash mdash01 10 03 05940 07120 0622402 10 025 05626 07462 0614603 10 015 05282 07992 0609504 10 010 05223 07648 0594005 10 005 04672 08388 0572606 15 005 04547 08695 05707

should be reviewed by the experts in order to reveal if it wasassigned any odd labels Otherwise the observed results areguaranteed to be lower than the real ones

Another way to evaluate the classification results andperformance is to artificially ldquodepleterdquo a completely-labeleddataset We did it with the Yeast dataset Our experimentswith the modification methods applied to artificially par-tially delabeled Yeast biological dataset confirmed that ourapproach significantly improves classification performance ofSVM on incompletely labeled datasets

Moreover the experimental results presented inSection 452 prove a notable aspect When we artificiallymade an incompletely labeled training set and then usedour label restoration techniques on it the 1198651 measure forSVM classification was even greater then for the originalcompletely-labeled set

Hence we are confident that combining the WkNNtraining set modification procedure with the SVM or RFalgorithms will be practically useful to scientists and analystswhen addressing the problem of incompletely labeled train-ing sets

Conflict of Interests

The authors declare that there is no conflict of interests

Acknowledgments

The authors thank Dr Charles R Cantor (Chief ScientificOfficer at Sequenom Inc and Professor at Boston University)for his contribution to this work The authors would like tothank the UMA Foundation for its help in preparation ofthe paper They would like to thank the reviewers for manyconstructive and meaningful comments and suggestions thathelped improve the paper and laid the foundation for furtherresearch

References

[1] S M Hosseini M J Abdi and M Rezghi ldquoA novel weightedsupport vector machine based on particle swarm optimizationfor gene selection and tumor classificationrdquo Computational andMathematicalMethods inMedicine vol 2012 Article ID 3206987 pages 2012

[2] S C Li J Liu and X Luo ldquoIterative reweighted nonintegernorm regularizing svm for gene expression data classificationrdquoComputational and Mathematical Methods in Medicine vol2013 Article ID 768404 10 pages 2013

[3] Z Gao Y Su A Liu T Hao and Z Yang ldquoNonnegative mixed-norm convex optimization for mitotic cell detection in phasecontrast microscopyrdquo Computational and Mathematical Meth-ods in Medicine vol 2013 Article ID 176272 10 pages 2013

[4] I IKatakis G Tsoumakas and I VlahavasMining Multi-LabelData in Data Mining and Knowledge Discovery HandbookSpringer US 2010

[5] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquo Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003

[6] G Yizhang Methods for Pattern Classification New Advancesin Machine Learning 2010

[7] E Leopold and J Kindermann ldquoText categorization with sup-port vector machines How to represent texts in input spacerdquoMachine Learning vol 46 no 1ndash3 pp 423ndash444 2002

Computational and Mathematical Methods in Medicine 11

[8] International Aging Research Portfolio httpwwwagingport-folioorgwikidokuphpid=datapage

[9] A Esuli and F Sebastiani ldquoTraining data cleaning for textclassificationrdquo inProceedings of the 2nd International Conferenceon the Theory of Information Retrieval (ICTIR rsquo09) pp 29ndash41Cambridge UK 2009

[10] J Tang Z Chen A W Fu and D W Cheung ldquoCapabilitiesof outlier detection schemes in large datasets framework andmethodologiesrdquoKnowledge and Information Systems vol 11 no1 pp 45ndash84 2007

[11] N G Zagoruiko I A Borisova V V Dyubanov and O AKutnenko ldquoMethods of recognition based on the function ofrival similarityrdquo Pattern Recognition and Image Analysis vol 18no 1 pp 1ndash6 2008

[12] L H Lee C H Wan T F Yong and H M Kok ldquoA review ofnearest neighbor-support vector machines hybrid classificationmodelsrdquo Journal of Applied Sciences vol 10 no 17 pp 1841ndash18582010

[13] A Subramanya and J Bilmes ldquoSoft-supervised learning for textclassificationrdquo in Proceedings of the Conference on EmpiricalMethods in Natural Language Processing pp 1090ndash1099 Octo-ber 2008

[14] A Subramanya and J Bilmes ldquoEntropic graph regularization innon-parametric semi-supervised classificationrdquo in Proceedingsof the 23rd Annual Conference on Neural Information ProcessingSystems (NIPS rsquo09) pp 1803ndash1811 December 2009

[15] T Joachims ldquoText categorizationwith support vectormachineslearningwithmany relevant featuresrdquo inConference onMachineLearning (ECML rsquo98) pp 137ndash142 Springer Berlin Germany1998

[16] P Hart and T Cover ldquoNearest neighbor pattern classificationrdquoIEEE Transactions on Information Theory vol 13 no 1 pp 21ndash27 1967

[17] P Raghavan C D Manning and H Schuetze An Introductionto Information Retrieval Cambridge University Press Cam-bridge UK 2009

[18] A Zhavoronkov and C R Cantor ldquoMethods for structuringscientific knowledge frommany areas related to aging researchrdquoPLoS ONE vol 6 no 7 Article ID e22597 2011

[19] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLIBLINEAR a library for large linear classificationrdquo Journal ofMachine Learning Research vol 9 pp 1871ndash1874 2008

[20] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[21] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[22] MHall E Frank G Holmes B P fahringer P Reutemann andI HWitten ldquoThe weka data mining software an updaterdquo ACMSIGKDD Explorations Newsletter vol 11 pp 10ndash18 2009

[23] International aging research portfolio httpagingportfolioorg

[24] K Frantzi S Ananiadou andHMima ldquoAutomatic recognitionof multiword termsrdquo International Journal of Digital Librariesvol 3 pp 117ndash132 2000

[25] S E Robertson S Walker S Jones et al ldquoOkapi at trec-3rdquo inProceedings of the 3rd Text Retrieval Conference pp 109ndash126Gaitherburg Md USA 1994

[26] F Sebastiani ldquoText categorizationrdquo Text Mining and Its Applica-tions vol 34 pp 109ndash129 2005

[27] S Godbole and S Sarawagi ldquoDiscriminativemethods formulti-labeled classificationrdquo in Proceedings of the 8th Pacific-AsiaConference (PAKDD rsquo04) pp 22ndash30 May 2004

[28] S J Swamidass C-A Azencott K Daily and P Baldi ldquoACROCstronger than ROC measuring visualizing and optimizingearly retrievalrdquo Bioinformatics vol 26 no 10 Article ID btq140pp 1348ndash1356 2010

[29] Random forests leo breiman and adele cutler 2013 httpwwwstatberkeleyedusimbreimanRandomForestscc manualhtm

[30] A Elisseeff and J Weston Advances in Neural InformationProcessing Systems vol 14 2001

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

8 Computational and Mathematical Methods in Medicine

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 03904

SVM add+SoftSL AUC = 04747

SVM delRandom selection

+SoftSL AUC = 04498

(a) ldquoExperimental techniques in vivo methodsrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 06818

SVM add+SoftSL AUC = 06374

SVM delRandom selection

+SoftSL AUC = 07018

(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04847

SVM add+SoftSL AUC = 05447

SVM delRandom selection

+SoftSL AUC = 05423

(c) ldquoAging mechanisms by anatomy cell levelrdquo category (SoftSL)

10

10

08

08

06

06

04

04

02

02

00

00

Reca

ll

Transformed false positive rate

SVM only AUC = 04900

SVM add+SoftSL AUC = 05757

SVM delRandom selection

+SoftSL AUC = 05587

(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cat-egory (SoftSL)

Figure 3 CROC curves for different categories of AgingPortfoliio (SVM classification with SoftSL analysis)

training set some correct labels may be omitted) The caseof incompletely labeled dataset strikingly differs from thePU-learning (learning with only positive and unlabeled data)httpciteseerxistpsueduviewdocsummarydoi=1011919914 httpdlacmorgcitationcfmid=1401920 approachin case of PU training some of dataset items are consideredfully labeled and the other items are not labeled at all

To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL.

Both of these approaches are intended to restore the missing document labels using different similarity measurements between each given document and other documents with similar labels.
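To make the "add"-type WkNN modification concrete, the following is a minimal sketch, not the authors' exact implementation: the labels of a document's k nearest neighbors are aggregated with similarity weights, and any label whose normalized weighted vote exceeds a threshold T is added to the document's label set. The parameters k and T correspond to those tuned in Tables 9 and 11; the cosine similarity and the vote normalization are assumptions made here for illustration.

import numpy as np

def wknn_restore_labels(X, Y, k=10, T=0.3):
    """Add-type WkNN label restoration (illustrative sketch).

    X : (n_docs, n_features) float array of document vectors.
    Y : (n_docs, n_labels) boolean array of possibly incomplete labels.
    Returns a copy of Y in which a label is switched on whenever the
    similarity-weighted vote of the k nearest neighbors exceeds T.
    """
    # Pairwise cosine similarity (an assumed similarity measure).
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # a document does not vote for itself

    Y_new = Y.copy()
    for i in range(X.shape[0]):
        nn = np.argsort(S[i])[-k:]        # k most similar documents
        w = np.clip(S[i, nn], 0.0, None)  # non-negative similarity weights
        if w.sum() == 0.0:
            continue
        vote = (w @ Y[nn]) / w.sum()      # weighted label vote in [0, 1]
        Y_new[i] |= vote > T              # "add" strategy: only add labels
    return Y_new

Because labels are only ever added, a clean training item can never lose a correct label; the risk shifted to T is that of adding spurious labels when T is set too low.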

We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF.
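As a usage sketch of the overall pipeline: restoration simply precedes the usual one-vs-rest training. The paper used LIBLINEAR [19] and Breiman's RF [20]; scikit-learn stand-ins and toy random data are substituted here so the sketch is self-contained.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.random((200, 50))        # toy document-feature matrix
Y_train = rng.random((200, 5)) < 0.2   # toy incomplete boolean label matrix

# Restore labels first (wknn_restore_labels from the sketch above),
# then train one binary classifier per label in the usual one-vs-rest way.
Y_restored = wknn_restore_labels(X_train, Y_train, k=10, T=0.3)

svm = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_restored)
rf = OneVsRestClassifier(RandomForestClassifier()).fit(X_train, Y_restored)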


[Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification; CROC curves for different categories of AgingPortfolio. Each panel plots recall against the transformed false positive rate. (a) "Experimental techniques, in vivo methods": SVM, add, random selection+SoftSL, AUC = 0.4752; SVM, add+WkNN, AUC = 0.4311. (b) "Cancer and related diseases, malignant neoplasms (including in situ)": SVM, add, random selection+SoftSL, AUC = 0.6380; SVM, add+WkNN, AUC = 0.7375. (c) "Aging mechanisms by anatomy, cell level": SVM, add, random selection+SoftSL, AUC = 0.5454; SVM, add+WkNN, AUC = 0.5348. (d) "Aging mechanism by anatomy, cell level, cellular substructures": SVM, add, random selection+SoftSL, AUC = 0.5757; SVM, add+WkNN, AUC = 0.5541.]

In our opinion, WkNN works better than SoftSL: it achieves a better F1-measure, and it is simpler to implement.

Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.
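The "transformed false positive rate" on the horizontal axes of Figures 3 and 4 is the CROC magnification of [28]: the FPR axis is rescaled by a concave function so that the early-retrieval region of the curve is expanded. A minimal sketch follows, assuming the exponential transform described in the CROC paper; the value alpha = 7.0 is an assumed example, not a parameter reported in this article.

import numpy as np

def croc_transform(fpr, alpha=7.0):
    """Exponential CROC magnification of the FPR axis ([28]); larger
    alpha expands the early-retrieval region more strongly."""
    fpr = np.asarray(fpr, dtype=float)
    return (1.0 - np.exp(-alpha * fpr)) / (1.0 - np.exp(-alpha))

def croc_auc(fpr, tpr, alpha=7.0):
    """Area under the recall-versus-transformed-FPR curve (trapezoid rule)."""
    return np.trapz(tpr, croc_transform(fpr, alpha))

fpr = np.linspace(0.0, 1.0, 101)
tpr = np.sqrt(fpr)            # a toy ROC curve
print(croc_auc(fpr, tpr))     # CROC AUC of the toy curve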

One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result should be reviewed by experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.


Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p      Precision   Recall   F1
0      0.7176      0.5707   0.6358
0.1    0.7337      0.5233   0.6109
0.2    0.7354      0.4056   0.5229
0.3    0.6260      0.2442   0.3513
0.4    0.3544      0.1191   0.1783
0.5    0           0        0
0.6    0           0        0

Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p      Optimal k   Optimal T   Precision   Recall   F1
0      —           —           —           —        —
0.1    10          0.3         0.6582      0.6847   0.6712
0.2    10          0.25        0.6525      0.6811   0.6665
0.3    10          0.15        0.6357      0.7137   0.6725
0.4    10          0.1         0.6604      0.6690   0.6648
0.5    10          0.05        0.6225      0.7259   0.6702
0.6    10          0.05        0.6248      0.7261   0.6716

Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p      Precision   Recall   F1
0      0.6340      0.5087   0.5315
0.1    0.6081      0.4613   0.4959
0.2    0.5693      0.3648   0.4133
0.3    0.5788      0.3240   0.3873
0.4    0.5068      0.2471   0.3094
0.5    0.5194      0.2431   0.3104
0.6    0.4354      0.1621   0.2224

Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p      Optimal k   Optimal T   Precision   Recall   F1
0      —           —           —           —        —
0.1    10          0.3         0.5940      0.7120   0.6224
0.2    10          0.25        0.5626      0.7462   0.6146
0.3    10          0.15        0.5282      0.7992   0.6095
0.4    10          0.10        0.5223      0.7648   0.5940
0.5    10          0.05        0.4672      0.8388   0.5726
0.6    15          0.05        0.4547      0.8695   0.5707
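For reference, the microaveraged scores reported in Tables 8–11 pool the per-label contingency counts over all $L$ labels before forming the ratios:

$$\mathrm{Precision}_{\mu} = \frac{\sum_{j=1}^{L} \mathrm{TP}_j}{\sum_{j=1}^{L} (\mathrm{TP}_j + \mathrm{FP}_j)}, \qquad \mathrm{Recall}_{\mu} = \frac{\sum_{j=1}^{L} \mathrm{TP}_j}{\sum_{j=1}^{L} (\mathrm{TP}_j + \mathrm{FN}_j)}, \qquad F_1 = \frac{2\,\mathrm{Precision}_{\mu}\,\mathrm{Recall}_{\mu}}{\mathrm{Precision}_{\mu} + \mathrm{Recall}_{\mu}},$$

where $\mathrm{TP}_j$, $\mathrm{FP}_j$, and $\mathrm{FN}_j$ are the true positive, false positive, and false negative counts for label $j$.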


Another way to evaluate classification results and performance is to artificially "deplete" a completely labeled dataset, which is what we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
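A minimal sketch of this depletion protocol follows, as an illustration rather than the authors' exact code: each positive label is deleted independently with probability p, the labels are then restored with add+WkNN, and the (k, T) pair is chosen by grid search on microaveraged F1 against the full label matrix. The helper names, the grid ranges, and the tuning criterion are assumptions; they merely mirror the setup of Tables 9 and 11. It reuses wknn_restore_labels and the toy X_train, Y_train from the sketches above.

import numpy as np
from sklearn.metrics import f1_score

def deplete_labels(Y, p, rng):
    """Delete each positive label independently with probability p."""
    return Y & (rng.random(Y.shape) >= p)

def grid_search_wknn(X, Y_depleted, Y_full,
                     ks=(5, 10, 15), Ts=(0.05, 0.1, 0.15, 0.25, 0.3)):
    """Choose (k, T) by grid search on microaveraged F1 of the restored
    label matrix against the full one (an illustrative criterion)."""
    best = (None, None, -1.0)
    for k in ks:
        for T in Ts:
            Y_restored = wknn_restore_labels(X, Y_depleted, k=k, T=T)
            f1 = f1_score(Y_full, Y_restored, average="micro")
            if f1 > best[2]:
                best = (k, T, f1)
    return best

rng = np.random.default_rng(0)
Y_depleted = deplete_labels(Y_train, p=0.3, rng=rng)
print(grid_search_wknn(X_train, Y_depleted, Y_train))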

Moreover, the experimental results presented in Section 4.5.2 highlight a notable aspect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.

Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.

Conflict of Interests

The authors declare that there is no conflict of interests

Acknowledgments

The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom Inc. and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.

References

[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.

[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.

[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.

[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.

[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.

[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.

[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.


[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.

[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.

[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets, framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.

[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.

[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.

[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.

[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.

[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.

[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.

[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.

[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.

[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.

[23] International Aging Research Portfolio, http://agingportfolio.org.

[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.

[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.

[26] F. Sebastiani, "Text categorization," Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.

[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.

[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348–1356, 2010.

[29] L. Breiman and A. Cutler, Random Forests manual, 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.

[30] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems, vol. 14, 2001.
