Research Article: On Multilabel Classification Methods of Incompletely Labeled Biomedical Text Data
Anton Kolesov,1,2 Dmitry Kamyshenkov,2 Maria Litovchenko,1,2,3 Elena Smekalova,4 Alexey Golovizin,2 and Alex Zhavoronkov1,2,3

1 Center for Pediatric Hematology, Oncology and Immunology, Moscow 117997, Russia
2 Moscow Institute of Physics and Technology, Moscow 117303, Russia
3 The Biogerontology Research Foundation, Reading W1J 5NE, UK
4 Chemistry Department, Moscow State University, Moscow 119991, Russia
Correspondence should be addressed to Alex Zhavoronkov: zhavoronkov@biogerontology.org

Received 9 September 2013; Revised 8 December 2013; Accepted 12 December 2013; Published 23 January 2014

Academic Editor: Dejing Dou

Copyright © 2014 Anton Kolesov et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multilabel classification is often hindered by incompletely labeled training datasets: for some items of such a dataset (or even for all of them), some labels may be omitted. In this case we cannot know whether any item is labeled fully and correctly. When we train a classifier directly on an incompletely labeled dataset, it performs poorly. To overcome the problem, we added an extra step, training set modification, before training a classifier. In this paper we try two algorithms for training set modification: weighted k-nearest neighbors (WkNN) and soft supervised learning (SoftSL). Both of these approaches are based on similarity measurements between data vectors. We performed the experiments on AgingPortfolio (a text dataset) and then rechecked them on Yeast (nontext genetic data). We tried SVM and RF classifiers on the original datasets and then on the modified ones. For each dataset, our experiments demonstrated that both classification algorithms performed considerably better when preceded by the training set modification step.
1 Background and Significance
Multilabel classification with supervised machine learning is a widespread problem in data analysis. However, very often we have to perform multilabel classification when we are not guaranteed that our training set itself is perfectly preclassified. This is especially relevant in the case of national biomedical grants with ambiguous classification schemes: a particular grant may belong to several classes, or may be miscategorized under a keyword-based classification scheme.
An interesting illustration is the project titled "Levels of Literacy of Men with Prostate Cancer." This project may be classified by an algorithm as "prostate cancer," "cancer biomarkers," or "cancer education," whereas a researcher would consider it appropriately in relation to literacy. This kind of context makes the generation of training sets more complicated and costly: many experts need to collaborate extensively in selecting the full set of document categories from the large number available for classification. Since such collaboration seldom happens, we end up assigning an incomplete set of categories to the training set documents.
When a document that is relevant to a particular class A does not bear its label, it turns into a negative instance of class A during the learning process. As a consequence, the decision rules are distorted and the classification performance degrades.
With the increase in the amount of textual information in the biomedical sphere, such problems become recurrent and need our attention. For example, about half a million new records are added to PubMed each year, and thousands of grant-funded research initiatives are conducted annually around the world. Grant application abstracts are usually made public, and the IARP project adds over 250 thousand new projects each year.
Hindawi Publishing Corporation, Computational and Mathematical Methods in Medicine, Volume 2014, Article ID 781807, 11 pages, http://dx.doi.org/10.1155/2014/781807
[Figure 1: Decision rules before and after missing label restoration. (a) Before label restoration; (b) after label restoration.]
In addition to classifying publication abstracts and grant databases, the methods described in this paper may be applied to other classification tasks in the biomedical sciences as well.
2 Objective
In this article we address the problem of classification when, for each training object, some proper labels may be omitted. In order to understand the properties of an incompletely labeled training set and its impact on learning outcomes, let us consider an artificial example.
Figure 1(a) shows the initial training set for the multilabel classification task with 3 classes of points on the plane. In our work we used Support Vector Machine (SVM) classification, a popular classical classification method with a broad range of applications, ranging from text to tumor selection [1], gene expression classification [2], and mitotic cell modeling [3] problems.
With the Binary Relevance [4] approach based on a linear SVM, we can obtain the decision rules for the classes of crosses, circles, and triangles. Please note that this is an example of an error-free classification. Let us assume that object a really belongs to "crosses" and "circles" and object b belongs to "crosses" and "triangles." But in real life the training set is often incompletely labeled: Figure 1(a) shows such a situation, where object a is labeled only as a "circle" (the "cross" is missed) and object b is labeled only as a "triangle" (the "circle" is missed).
In Figure 1(b) the missing tags are added (bold lines), and the new decision rules for the classes of crosses and circles after recovering the lost tags are depicted. This example shows that in the case of an incompletely labeled dataset a decision rule may be quite distorted, with a negative effect on classification performance.
In order to reduce the negative impact of incompletely labeled datasets, we propose a special approach based on training set modification that reduces contradictions. After applying the algorithm to a real collection of data, the results of the Support Vector Machine (SVM) and Random Forest (RF) classification schemes improved. Here, RF is known to be one of the most effective machine learning techniques [5–7].
To address the incompleteness of training sets, in this paper we describe a new strategy for constructing classification algorithms. On the one hand, the performance of this strategy is evaluated using data collections from the AgingPortfolio resource available on the Web [8]. On the other hand, its effectiveness is confirmed by applying it to the Yeast dataset described below.
Several methods for training set modification aimed at improving performance have been proposed, including data cleaning [9], outlier detection [10], reference object selection [11], and hybrid classification algorithms [12]. To date, the ability of these approaches to handle real text classification has not been sufficiently studied. Furthermore, none of these methods of training set modification is suitable for solving classification problems with an incompletely labeled training set.
3 Methods
The proposed algorithms are based on the following three assumptions about the input data:
(1) A large number of training set objects are assumed to have an incomplete set of labels. By definition, a complete set of labels is a set which leads to a perfect consensus among experts regarding the impossibility of further adding or removing a label from a document in the data collection.
(2) Experts are not expected to make an error in assigning category labels to documents. That is to say, the training set generation may involve errors of type 1 only (checking hypotheses of the type "object d belongs to category label cl").
(3) The compactness hypothesis is assumed to hold. This means similar objects are likely to belong to the same categories, forming compact subsets in the object space. The solution of a classification problem under these assumptions requires an algorithm that treats document relevancy on the basis of data geometry.
We developed alternative approaches because the existing algorithms for training set modification were not designed to work under these assumptions. We used the following two algorithms in our experiments:
(1) the method based on a recent soft supervised learning approach [13] (labeled "SoftSL");
(2) the weighted k-nearest neighbour classifier algorithm (labeled "WkNN" [14–16]).
These algorithms use the nearest neighbour set of a document, which is in line with our third assumption.
The first step in the modification of the training set involves the generation of a set PC of document-category relevancy pairs overlooked by the experts:
\[
\mathrm{PC} = \{(d, \mathrm{cl}) \mid \psi(d, \mathrm{cl}) = 1\}, \tag{1}
\]

where d is a document, cl is a class label (category), and ψ is the function of our training set modification algorithm (WkNN or SoftSL):

\[
\psi(d, \mathrm{cl}) = \begin{cases} 1, & \text{if our algorithm places document } d \text{ into class cl}, \\ 0, & \text{otherwise}. \end{cases} \tag{2}
\]
Then two possible outcomes are considered:

(1) complete inclusion of PC into the training set (option denoted "add");
(2) exclusion of document d from the negative examples of the category cl for all relevancy pairs (d, cl) ∈ PC (option denoted "del").
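As a sketch of the two options (in Python, with hypothetical names; the paper supplies no reference implementation), the "add" and "del" modifications can be expressed over a label dictionary and a predicate ψ:

```python
def modify_training_set(labels, psi, docs, all_classes, mode="add"):
    """Apply the 'add' or 'del' training-set modification for pairs found by psi.

    labels: dict doc_id -> set of expert-assigned class labels
    psi:    function (doc, cl) -> bool, True if the algorithm places doc in cl
    Returns (new_labels, excluded): 'excluded' holds the (doc, cl) pairs that
    must not be used as negative examples when mode == "del".
    """
    # PC: document-category pairs possibly overlooked by the experts (Eq. (1))
    pc = {(d, cl) for d in docs for cl in all_classes
          if cl not in labels[d] and psi(d, cl)}
    new_labels = {d: set(ls) for d, ls in labels.items()}
    excluded = set()
    if mode == "add":          # include PC into the training set
        for d, cl in pc:
            new_labels[d].add(cl)
    else:                      # "del": drop d from the negatives of cl
        excluded = pc
    return new_labels, excluded
```

In "del" mode the labels themselves are untouched; the returned pairs are simply skipped when forming negative examples for each per-category classifier.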
The modified training set will still contain the objects which, according to the algorithm ψ, do not belong to the set labeled by the experts. The next step is to find the missing labels in the documents of the training set.
3.1 SoftSL Algorithm for Finding Missing Labels. In this section we outline the application of a new graph algorithm for soft supervised learning, called SoftSL [13]. Each document is represented by a vertex within a weighted undirected graph, and the framework minimizes the weighted Kullback-Leibler divergence between distributions that encode the class membership probabilities of each vertex.
The advantages of this graph algorithm include direct applicability to the multilabel categorization problem as well as improved performance compared to alternatives [14]. The main idea of the SoftSL algorithm is the following.
Let D = D_l ∪ D_u be a set of labeled and unlabeled objects, with

\[
D_l = \{(x_i, y_i)\}_{i=1}^{l}, \qquad D_u = \{x_i\}_{i=l+1}^{n}, \tag{3}
\]

where x_i is the input vector representing the object to be categorized and y_i is the category label.
Let G = (V, E) be a weighted undirected graph. Here V = {1, ..., n}, where n is the cardinality of D, E = V × V, and w_ij is the weight of the edge linking objects i and j.
The weight of the edge is defined as

\[
w_{ij} = \begin{cases} \operatorname{sim}(x_i, x_j), & \text{if } j \in K(i), \\ 0, & \text{if } j \notin K(i). \end{cases} \tag{4}
\]
Here sim(x_i, x_j) is the measure of similarity between the i-th and j-th objects (e.g., the cosine measure), and K(i) is the set of k nearest neighbours of object x_i.
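The weight matrix of Eq. (4) can be sketched in NumPy with cosine similarity (a minimal dense implementation; a sparse matrix would be used at real scale):

```python
import numpy as np

def knn_similarity_graph(X, k):
    """Build the kNN weight matrix w_ij of Eq. (4): cosine similarity to the
    k nearest neighbours of each object, zero elsewhere.

    X: (n, d) array of row vectors, one per object.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)     # unit-normalize rows
    sim = Xn @ Xn.T                          # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)           # an object is not its own neighbour
    w = np.zeros_like(sim)
    for i in range(len(X)):
        nbrs = np.argsort(sim[i])[-k:]       # K(i): the k most similar objects
        w[i, nbrs] = sim[i, nbrs]
    return w
```

Note that K(i) is not symmetric in general, so w_ij may be nonzero while w_ji is zero; the objective in Eq. (5) sums over j ∈ K(i) and therefore accepts this directed structure.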
Each object is associated with a set of probabilities p_i = (p_i^t), t = 1, ..., m, of belonging to each of the m classes L = {cl_t}, t = 1, ..., m. From the information in D_l we determine the probabilities r_i = (r_i^t), t = 1, ..., m, for i = 1, ..., l, that the documents x_1, ..., x_l belong to each of the m classes; thus r_i^t > 0 if (d_i, cl_t) ∈ D_l. Each labeled object therefore has a known set of probabilities r_i assigned by the experts. Our intention is to minimize the misalignment function C_1(p) over sets of probabilities:
\[
C_1(p) = \sum_{i=1}^{l} D_{\mathrm{KL}}(r_i \,\|\, p_i)
+ \mu \sum_{i=1}^{n} \sum_{j \in K(i)} w_{ij}\, D_{\mathrm{KL}}(p_i \,\|\, p_j)
- \nu \sum_{i=1}^{n} H(p_i),
\]
\[
D_{\mathrm{KL}}(p_i \,\|\, p_j) \stackrel{\text{def}}{=} \sum_{t=1}^{m} p_i^t \log \frac{p_i^t}{p_j^t},
\qquad
H(p_i) \stackrel{\text{def}}{=} -\sum_{t=1}^{m} p_i^t \log p_i^t, \tag{5}
\]
where D_KL(p_i ‖ p_j) denotes the Kullback-Leibler divergence and H(p_i) denotes entropy.
μ and ν are the parameters of the algorithm, defining the contribution of each term to C_1(p). The meanings of all terms are listed below.
(1) The first term in the expression of C_1 shows how close the generated probabilities are to the ones assigned by the experts.
(2) The second term accounts for the graph geometry andguarantees that the objects close to one another on thegraph will have similar probability distributions overclasses
(3) The third term serves as a regularizer: where the other terms are not decisive, it favors a regular, uniform probability distribution over classes.
Numerically, the problem is solved using Alternating Minimization (AM) [13]. Note that D_u is empty in the case of fully labeled data. The minimization of the objective, min_p C_1(p), yields a set of probabilities p_i = (p_i^1, ..., p_i^m) for each document d_i ∈ D. We introduce a threshold T ∈ [0, 1] to assign additional categories relevant to each document: if p_i^j ⩾ T, then d_i ∈ cl_j.
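As a sanity check of the objective, C_1(p) from Eq. (5) can be evaluated directly (a sketch using natural logarithms and a small epsilon for numerical safety; the AM updates themselves are in [13] and are not reproduced here):

```python
import numpy as np

def softsl_objective(p, r, w, labeled, mu, nu):
    """Evaluate the SoftSL objective C1(p) of Eq. (5).

    p: (n, m) class-membership probabilities for all n objects
    r: (l, m) expert-assigned distributions for the labeled objects
    w: (n, n) kNN weight matrix (w[i, j] > 0 iff j is in K(i))
    labeled: indices into p of the l labeled objects
    """
    eps = 1e-12
    def kl(a, b):   # Kullback-Leibler divergence D_KL(a || b)
        return float(np.sum(a * np.log((a + eps) / (b + eps))))
    fit = sum(kl(r[t], p[i]) for t, i in enumerate(labeled))
    smooth = sum(w[i, j] * kl(p[i], p[j])
                 for i in range(len(p)) for j in np.nonzero(w[i])[0])
    entropy = -float(np.sum(p * np.log(p + eps)))
    return fit + mu * smooth - nu * entropy
```

After minimization, each document d_i receives the categories cl_j with p_i^j ⩾ T, as described above.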
3.2 Weighted kNN Algorithm for Finding Missing Labels. In this section we briefly describe the weighted k-nearest neighbour algorithm [17], which is capable of directly solving the multilabel categorization problem.
Let ρ(d, d′) be a distance function between the documents d and d′. The function which assigns document d to class label cl ∈ L is then defined as

\[
S(d, \mathrm{cl}) = \frac{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')\, I(d', \mathrm{cl})}{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')},
\qquad
I(d', \mathrm{cl}) = \begin{cases} 1, & \text{if } d' \in \mathrm{cl}, \\ 0, & \text{otherwise}. \end{cases} \tag{6}
\]
Here kNN(d) is the set of k nearest neighbours of document d in the training set.
We introduce a threshold T such that if S(d, cl) ⩾ T, then d ∈ cl. The algorithm computes S(d, cl) for all possible (d, cl) combinations; every combination with S(d, cl) ⩾ T is considered a missing label and used to modify the training set.
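The score of Eq. (6) for one document can be sketched as follows (note that, as printed, the formula weights each neighbour by the distance ρ itself; we follow the formula verbatim here, though a similarity or inverse-distance weight is the more common kNN convention):

```python
def wknn_scores(dist_row, neighbor_labels, all_classes):
    """Compute S(d, cl) of Eq. (6) for one document d.

    dist_row:        distances rho(d, d') to the k nearest neighbours of d
    neighbor_labels: list of label sets, one per neighbour (same order)
    Returns a dict cl -> S(d, cl); pairs with S(d, cl) >= T then become
    candidate missing labels.
    """
    denom = float(sum(dist_row))
    scores = {}
    for cl in all_classes:
        # numerator: sum of weights over neighbours carrying label cl
        num = sum(rho for rho, labs in zip(dist_row, neighbor_labels)
                  if cl in labs)
        scores[cl] = num / denom if denom > 0 else 0.0
    return scores
```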
3.3 Support Vector Machine. We use the linear Support Vector Machine as the classification algorithm in this case. Since SVM is mainly a binary classifier, the Binary Relevance approach is chosen to address multilabel problems. This method implies training a separate decision rule w_l x + b_l > 0 for every category l ∈ L. More details are available in our previous work on methods for structuring scientific knowledge [18]. As the implementation of SVM, we used the Weka binding of the LIBLINEAR library [19].
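The Binary Relevance scheme can be sketched as one linear SVM per category (here using scikit-learn's LinearSVC, which also wraps LIBLINEAR, as a stand-in for the Weka binding used in the paper; class and parameter names are ours):

```python
import numpy as np
from sklearn.svm import LinearSVC

class BinaryRelevanceSVM:
    """Binary Relevance: one linear decision rule w_l . x + b_l > 0 per category."""

    def __init__(self, categories, C=1.0):
        self.categories = categories
        self.models = {l: LinearSVC(C=C) for l in categories}

    def fit(self, X, label_sets):
        # Train each category's binary classifier: documents not carrying
        # label l become negative examples for l.
        for l in self.categories:
            y = np.array([1 if l in labs else 0 for labs in label_sets])
            self.models[l].fit(X, y)
        return self

    def predict(self, X):
        out = [set() for _ in range(len(X))]
        for l in self.categories:
            for i, yhat in enumerate(self.models[l].predict(X)):
                if yhat == 1:
                    out[i].add(l)
        return out
```

This "documents without label l are negatives" step is exactly where incomplete labeling hurts, and what the "add"/"del" modifications above are designed to repair.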
3.4 Random Forest. Random Forest is an ensemble machine learning method which combines tree predictors. In this combination, each tree depends on the values of a random vector sampled independently, and all trees in the forest have the same distribution. More details about this method can be found in [20, 21]. In our study we used the implementation of Random Forest from Weka [22].
4 Experimental Results
In this section we describe how we performed the text classification experiments. We applied the classification algorithms to the initial (unmodified) training sets as well as to the training sets modified with the "add" or "del" methods.
We shall first discuss the scheme of training set transformation and its usefulness. Then we shall present the process of data generation. Finally, we shall consider the performance measures used in the experiments, the experimental setting, and the results of the parameter estimation and final validation.
4.1 Training Set Modification. The training set modification step is described in detail in Section 3 (Methods). However, it is important to note that in both cases documents that do not belong to the set PC according to the relevance algorithm ψ (i.e., are too far from documents labeled with the given category) are still retained in the training set. The reason for this choice is our assumption that experts do not make mistakes of type II (giving a document a spurious label); only the omission of proper labels is expected.
The omission of the relevance pair (d, A) in the training set makes document d move into the set of negative examples when learning a classifier for the class A. This alters the decision rule and negatively affects performance. The proposed set modification scheme is designed to avoid such problems during the training of the classifier.
4.2 Datasets and Data Preprocessing
4.2.1 AgingPortfolio Dataset. The first experiment was carried out using data from the AgingPortfolio information resource [18, 23]. The AgingPortfolio system includes a database of projects related to aging and funded by the National Institutes of Health (NIH) and the European Commission (EC, CORDIS). This database currently contains more than one million projects. Each of its records, written in English, displays information related to the author's name, title, a brief description of the motivation and research objectives, the name of the organization, and the funding period of the project. Some projects contain additional keywords, with an average description of 100 words. In this experiment we used only the title, brief description, and tag fields.
A taxonomy containing 335 categories with 6 hierarchical levels was used for document classification. Detailed information about the taxonomy is available on the International Aging Research Portfolio Web site [23]. Biomedical experts manually put the category labels on the document training and test sets. In the process, they used a special procedure for labeling the document test set: two sets of categories carefully selected by different experts were assigned to each document of the test set, and then a combination of these categories was used to achieve a more complete category labeling. The training set was created with little control by different participants, such as users of the AgingPortfolio resource. A visual inspection suggests that the training set contained a significant number of projects with incomplete sets of category labels. The same conclusion is reached by comparing the average number of categories per project: this average is 4.4 in the sample set, compared to 9.79 in the more thoroughly designed test set. The total number of projects was 3,246 for the training set, 183 for the development set, and 1,000 for the test set.
Throughout our study we used the vector model of text representation. The list of keywords and their combinations from the TerMine [24] system (National Centre for Text Mining, NaCTeM) provided the terms used in our study. The method used in this system combines linguistic and statistical information about candidate terms.
We then analyzed and processed the set of keyword combinations. Whenever short keyword combinations were present within longer ones, the latter were split into the shorter ones. The algorithms and the code used for the keyword combination decomposition are available from the AgingPortfolio Web site [8]. According to the results of our previous experiments, the new vectorization method
provided a 3% increase in the F1-measure compared to the general "bag-of-words" model. We assigned feature weights according to the TF-IDF rule in the BM25 formulation [25, 26] and then normalized the vectors representing the documents in the Euclidean metric of the n-dimensional space.
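A minimal sketch of this weighting step follows (the BM25 constants k1 and b, and the idf flooring, are our assumptions; the exact formulation of [25, 26] may differ in detail):

```python
import numpy as np

def bm25_vectorize(term_counts, k1=1.2, b=0.75):
    """TF-IDF weights in a BM25 formulation, followed by L2 normalization.

    term_counts: (N, V) array of raw term frequencies, one row per document.
    """
    N, V = term_counts.shape
    doc_len = term_counts.sum(axis=1, keepdims=True)
    avgdl = max(float(doc_len.mean()), 1e-12)
    df = (term_counts > 0).sum(axis=0)
    # idf with a +1 inside the log so it stays positive for common terms
    idf = np.log((N - df + 0.5) / (df + 0.5) + 1.0)
    # BM25 term-frequency saturation with document-length normalization
    tf = term_counts * (k1 + 1) / (term_counts + k1 * (1 - b + b * doc_len / avgdl))
    X = tf * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)   # unit Euclidean norm per document
```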
4.2.2 Yeast Dataset. The Yeast dataset [27] is a biomedical dataset of yeast genes divided into 14 different functional classes. Each instance in the dataset is a gene represented by a vector whose features are the microarray expression levels under various conditions. We used it to find out whether our methods are suitable for the classification of genetic information as well as textual data.
Let us describe the method for modeling an incomplete dataset. Since the dataset is well annotated and widely used, the objects (genes) have complete sets of category labels. By randomly deleting labels from objects, we modeled incomplete sets of labels in the training set. The parameter p was introduced as the fraction of deleted labels.
We deleted labels under the following conditions:

(1) for each class, p represents the fraction of deleted labels, so the distribution of labels over categories is preserved after modeling the incomplete sets of labels;
(2) the number of objects in the training set remains the same: at least one label is preserved for each object after the label deletion process.
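The deletion procedure under these two conditions can be sketched as follows (function and parameter names are ours; the paper does not publish its exact sampling code):

```python
import random

def delete_labels(labels, p, seed=0):
    """Model an incompletely labeled training set: per class, remove a
    fraction p of its label occurrences, keeping at least one label per object.

    labels: dict obj_id -> set of class labels; returns a modified copy.
    """
    rng = random.Random(seed)
    new = {o: set(ls) for o, ls in labels.items()}
    classes = sorted({cl for ls in labels.values() for cl in ls})
    for cl in classes:
        carriers = [o for o in new if cl in new[o]]
        rng.shuffle(carriers)
        quota = int(round(p * len(carriers)))   # condition (1): per-class fraction
        removed = 0
        for o in carriers:
            if removed >= quota:
                break
            if len(new[o]) > 1:                 # condition (2): keep >= 1 label
                new[o].discard(cl)
                removed += 1
    return new
```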
No preprocessing step was necessary because the data issupplied already prepared as a matrix of numbers [27]
4.3 Performance Measurements. The following characteristics were used to evaluate and compare the different classification algorithms:

(i) microaveraged precision, recall, and F1-measure [27];
(ii) CROC curves and their AUC values, computed for selected categories [28]. A CROC curve is a modification of a ROC curve in which the x axis is rescaled as x_new(x). We used the standard exponential scaling x_new(x) = (1 − e^(−αx))/(1 − e^(−α)) with α = 7.
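The axis transformation is a one-liner; a sketch:

```python
import numpy as np

def croc_transform(x, alpha=7.0):
    """Exponential x-axis rescaling that turns a ROC curve into a CROC curve:
    x_new = (1 - exp(-alpha * x)) / (1 - exp(-alpha)), with alpha = 7 here."""
    x = np.asarray(x, dtype=float)
    return (1.0 - np.exp(-alpha * x)) / (1.0 - np.exp(-alpha))
```

The AUC is then computed (e.g., by the trapezoid rule) over the transformed false positive rates; the steep rescaling magnifies the practically important low-false-positive region of the curve.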
4.4 Experimental Setup. The procedures for selecting the important parameters of the algorithms are described next.
4.4.1 Parameters for SVM
AgingPortfolio Dataset. The following SVM parameters were tuned for each decision rule:

(i) the cost parameter C, which controls the trade-off between maximization of the separation margin and minimization of the total error [15];
(ii) the parameter b_l, which plays the role of a classification threshold in the decision rule.
We performed parameter tuning by a sliding control method with 5-fold cross-validation, according to the following strategy. The C-parameter was varied on a grid, followed by b_l-parameter tuning (for every value of C) for every category. A set {(C, b_l)}, l ∈ L, of parameter pairs was considered optimal if it maximized the F1-measure with averaging over documents. While C has the same value for all categories, the b_l threshold parameter was tuned (for a given value of C) for each class label l ∈ L.
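The inner, per-category stage of this strategy can be sketched as a threshold search maximizing F1 (a hypothetical helper; since the decision rule is w_l x + b_l > 0, the tuned offset corresponds to −b_l, and the grid values are ours):

```python
import numpy as np

def tune_threshold(scores, y_true, grid):
    """For one category, pick the decision threshold maximizing the F1-measure.

    scores: real-valued SVM outputs w_l . x for the validation documents
    y_true: 0/1 relevance of each validation document to category l
    grid:   candidate threshold values to try
    """
    best_b, best_f1 = grid[0], -1.0
    for b in grid:
        pred = (np.asarray(scores) > b).astype(int)
        tp = int(np.sum(pred * y_true))
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(int(np.sum(y_true)), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        if f1 > best_f1:
            best_b, best_f1 = b, f1
    return best_b, best_f1
```

The outer loop over the C grid simply repeats this search within each cross-validation fold and keeps the (C, {b_l}) combination with the best document-averaged F1.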
Yeast Dataset. In the experiments with the Yeast dataset, no selection of SVM parameters was performed (i.e., C = 1, b_l = 0 for all l ∈ L).
4.4.2 Parameters for RF. Thirty decision trees were used to build the Random Forest. The number of inputs considered while splitting a tree node is the square root of the number of features. The procedure follows [29] by Leo Breiman, who developed the Random Forest algorithm.
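This configuration translates directly into, for example, scikit-learn (standing in here for the Weka implementation [22] actually used in the study; the random seed is ours):

```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest configured as in the paper: 30 trees, and sqrt(#features)
# candidate inputs evaluated at each node split.
rf = RandomForestClassifier(n_estimators=30, max_features="sqrt", random_state=0)
```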
4.4.3 AgingPortfolio Dataset. Parameters k and T were tuned on a grid as follows. We prepared a validation set D_dev of 183 documents, built in the same way as the test set. The performance metrics for SVM classifiers trained on modified document sets were then evaluated on D_dev. A combination of parameters was considered optimal if it maximized the F1-measure. The parameter μ of the SoftSL algorithm was tuned on a grid, keeping k fixed at its optimal value. A fixed category assignment threshold of T = 0.005 was used for the SoftSL training set modification algorithm. We used ν = 0, since all documents in the experiments contained some category labels and regularization was unnecessary.
4.4.4 Yeast Dataset. The method for selecting the parameters for the Yeast dataset is the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) randomly selected genes from the training set. The SoftSL training set modification algorithm was not used for this dataset.
4.5 A Comparison of Methods
4.5.1 AgingPortfolio. We evaluated the general performance on a total of 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.
All of our training set modification methods trade lower precision for higher recall. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.
The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly, so a larger number of relevancy tags are added. This is consistent with the higher recall and lower precision values of add+SoftSL and del+SoftSL, as compared to the WkNN-based methods in Table 1.
Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without
Table 1: Microaveraged precision, recall, and F1-measure (F1) obtained on the AgingPortfolio dataset with different classification methods.

| Method | Precision | Recall | F1 |
| SVM with fixed parameters | 0.8649 | 0.1983 | 0.2977 |
| SVM with parameter tuning | 0.7727 | 0.3302 | 0.4159 |
| SVM del+WkNN | 0.5538 | 0.4452 | 0.4439 |
| SVM add+WkNN | 0.4664 | 0.5684 | 0.4707 |
| SVM del+SoftSL | 0.2132 | 0.6914 | 0.3259 |
| SVM add+SoftSL | 0.3850 | 0.5639 | 0.4576 |
Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

| Modification method | Categories per doc | Docs per category |
| No modification | 4.4 | 45.09 |
| Add+WkNN | 15.15 | 155.14 |
| Add+SoftSL | 16.6 | 168.87 |
training set modifications, SVM with del+WkNN modification, and SVM with add+WkNN modification. We can notice that SVM classification with incorporated training set modification outperforms plain SVM classification.
The AUC values calculated for the del+WkNN curves are generally only slightly lower than, and in some cases even exceed, the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.
CROC curves for the add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most cases the results are roughly equivalent; sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.
Additional data relevant to the algorithm comparison is presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for different categories obtained with the different methods. These per-category results show more distinctly that add+WkNN outperforms the other methods.
Some metrics from the Random Forest classification experiments are provided in Table 7. The results in Table 7 show that the modification of the training sets improves classification performance in this case as well.
4.5.2 Yeast Dataset: The Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is related to one or more of the 14 labels (1st FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into training and test sets, so that the training set contains 1,500 examples and the test set contains 917.
The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created 6 different training sets by deleting a varying fraction (p) of document-class pairs. The concrete document-class pairs for deletion were selected randomly.
Table 3: Microaveraged results for the category "experimental techniques: in vivo methods" (AgingPortfolio dataset).

| Method | Precision | Recall | F1 |
| SVM only | 0.6 | 0.12 | 0.2 |
| With del+WkNN | 0.7879 | 0.26 | 0.391 |
| With add+WkNN | 0.66 | 0.33 | 0.44 |
| With del+SoftSL | 0.2140 | 0.64 | 0.3208 |
| With add+SoftSL | 0.4653 | 0.47 | 0.4677 |
Table 4: Microaveraged results for the category "cancer and related diseases: malignant neoplasms, including in situ" (AgingPortfolio dataset).

| Method | Precision | Recall | F1 |
| SVM only | 0.4444 | 0.4 | 0.4211 |
| With del+WkNN | 0.1277 | 0.9 | 0.2236 |
| With add+WkNN | 0.24 | 0.9 | 0.3789 |
| With del+SoftSL | 0.0549 | 1.0 | 0.1040 |
| With add+SoftSL | 0.1032 | 0.975 | 0.1866 |
Table 5: Microaveraged results for the category "aging mechanisms by anatomy: cell level" (AgingPortfolio dataset).

| Method | Precision | Recall | F1 |
| SVM only | 0.5167 | 0.3827 | 0.4397 |
| With del+WkNN | 0.5946 | 0.2716 | 0.3729 |
| With add+WkNN | 0.4123 | 0.5802 | 0.4821 |
| With del+SoftSL | 0.5342 | 0.4815 | 0.5065 |
| With add+SoftSL | 0.4182 | 0.5679 | 0.4817 |
Table 6: Microaveraged results for the category "aging mechanisms by anatomy: cell level, cellular substructures" (AgingPortfolio dataset).

| Method | Precision | Recall | F1 |
| SVM only | 0.52 | 0.2167 | 0.3059 |
| With del+WkNN | 1.0 | 0.0167 | 0.0328 |
| With add+WkNN | 0.4493 | 0.5167 | 0.4806 |
| With del+SoftSL | 0.75 | 0.15 | 0.25 |
| With add+SoftSL | 0.3667 | 0.55 | 0.44 |
Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification with different training set modifications.

| Method | Precision | Recall | F1 |
| RF only | 0.4738 | 0.2033 | 0.2467 |
| With del+WkNN | 0.4852 | 0.2507 | 0.2870 |
| With add+WkNN | 0.3058 | 0.4255 | 0.3194 |
We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.
[Figure 2: CROC curves (recall versus transformed false positive rate) for different categories of AgingPortfolio (SVM classification with WkNN analysis); each panel also shows a random-selection baseline.
(a) "Experimental techniques: in vivo methods" (WkNN): SVM only, AUC = 0.3919; SVM add+WkNN, AUC = 0.4323; SVM del+WkNN, AUC = 0.4181.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)" (WkNN): SVM only, AUC = 0.6830; SVM add+WkNN, AUC = 0.7484; SVM del+WkNN, AUC = 0.7559.
(c) "Aging mechanisms by anatomy: cell level" (WkNN): SVM only, AUC = 0.4843; SVM add+WkNN, AUC = 0.5340; SVM del+WkNN, AUC = 0.5238.
(d) "Aging mechanism by anatomy: cell level, cellular substructures" (WkNN): SVM only, AUC = 0.4892; SVM add+WkNN, AUC = 0.5603; SVM del+WkNN, AUC = 0.5460.]
The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1-measure in comparison with the raw SVM results (Table 8).
A notable fact: add+WkNN slightly reduced the precision at low p, but in the worst cases, with p = 0.3 and p = 0.4, the precision even rose. In any case, the significant improvement of recall in all cases is a good trade-off. Recall also significantly improved when the RF algorithm was used in addition to this method (Tables 10 and 11).
5 Discussion
Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when for each document in our
[Figure 3: CROC curves (recall versus transformed false positive rate) for different categories of AgingPortfolio (SVM classification with SoftSL analysis); each panel also shows a random-selection baseline.
(a) "Experimental techniques: in vivo methods" (SoftSL): SVM only, AUC = 0.3904; SVM add+SoftSL, AUC = 0.4747; SVM del+SoftSL, AUC = 0.4498.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)" (SoftSL): SVM only, AUC = 0.6818; SVM add+SoftSL, AUC = 0.6374; SVM del+SoftSL, AUC = 0.7018.
(c) "Aging mechanisms by anatomy: cell level" (SoftSL): SVM only, AUC = 0.4847; SVM add+SoftSL, AUC = 0.5447; SVM del+SoftSL, AUC = 0.5423.
(d) "Aging mechanism by anatomy: cell level, cellular substructures" (SoftSL): SVM only, AUC = 0.4900; SVM add+SoftSL, AUC = 0.5757; SVM del+SoftSL, AUC = 0.5587.]
training set some correct labels may be omitted). The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in the PU setting, some of the dataset items are considered fully labeled, while the other items are not labeled at all.
To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL.
Both of these approaches are intended to restore the missing document labels using different similarity measurements between each given document and other documents with similar labels.
We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than
[Figure 4 shows four CROC panels (recall versus transformed false positive rate), each with a random-selection baseline. Panel AUC values:
(a) "Experimental techniques: in vivo methods": SVM add+SoftSL AUC = 0.4752; SVM add+WkNN AUC = 0.4311.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM add+SoftSL AUC = 0.6380; SVM add+WkNN AUC = 0.7375.
(c) "Aging mechanisms by anatomy: cell level": SVM add+SoftSL AUC = 0.5454; SVM add+WkNN AUC = 0.5348.
(d) "Aging mechanism by anatomy: cell level, cellular substructures": SVM add+SoftSL AUC = 0.5757; SVM add+WkNN AUC = 0.5541.]
Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification: CROC curves for different categories of AgingPortfolio.
SoftSL: it achieves a better F1-measure than SoftSL and, in addition, is simpler to implement.
Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.
One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result
Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision  Recall  F1
0.0   0.7176     0.5707  0.6358
0.1   0.7337     0.5233  0.6109
0.2   0.7354     0.4056  0.5229
0.3   0.6260     0.2442  0.3513
0.4   0.3544     0.1191  0.1783
0.5   0          0       0
0.6   0          0       0
Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k  Optimal T  Precision  Recall  F1
0.0   —          —          —          —       —
0.1   10         0.30       0.6582     0.6847  0.6712
0.2   10         0.25       0.6525     0.6811  0.6665
0.3   10         0.15       0.6357     0.7137  0.6725
0.4   10         0.10       0.6604     0.6690  0.6648
0.5   10         0.05       0.6225     0.7259  0.6702
0.6   10         0.05       0.6248     0.7261  0.6716
Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision  Recall  F1
0.0   0.6340     0.5087  0.5315
0.1   0.6081     0.4613  0.4959
0.2   0.5693     0.3648  0.4133
0.3   0.5788     0.3240  0.3873
0.4   0.5068     0.2471  0.3094
0.5   0.5194     0.2431  0.3104
0.6   0.4354     0.1621  0.2224
Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k  Optimal T  Precision  Recall  F1
0.0   —          —          —          —       —
0.1   10         0.30       0.5940     0.7120  0.6224
0.2   10         0.25       0.5626     0.7462  0.6146
0.3   10         0.15       0.5282     0.7992  0.6095
0.4   10         0.10       0.5223     0.7648  0.5940
0.5   10         0.05       0.4672     0.8388  0.5726
0.6   15         0.05       0.4547     0.8695  0.5707
should be reviewed by experts in order to reveal whether the document was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.
Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset, as we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
Moreover, the experimental results presented in Section 4.5.2 reveal a notable aspect: when we artificially made an incompletely labeled training set and then applied our label restoration techniques to it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.
Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts addressing the problem of incompletely labeled training sets.
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom Inc. and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.
References
[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[6] G. Yizhang, "Methods for pattern classification," in New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text Retrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, pp. 1348–1356, 2010.
[29] L. Breiman and A. Cutler, "Random Forests," 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems, vol. 14, 2001.
[Figure 1 shows two schematic panels of decision rules on the plane: (a) before label restoration and (b) after label restoration.]
Figure 1: Decision rules before and after missing label restoration.
In addition to classifying publication abstracts and grant databases, the methods described in this paper may be applied to other classification tasks in the biomedical sciences as well.
2. Objective
In this article we address the problem of classification when, for each training object, some proper labels may be omitted. In order to understand the properties of an incompletely labeled training set and its impact on learning outcomes, let us consider an artificial example.
Figure 1(a) shows the initial training set for the multilabel classification task with 3 classes of points on the plane. In our work we used Support Vector Machine (SVM) classification, a popular classical classification method with a broad range of applications, from gene selection and tumor classification [1] and gene expression data classification [2] to mitotic cell detection [3].
With the Binary Relevance [4] approach based on a linear SVM, we can obtain the decision rules for the classes of crosses, circles, and triangles. Please note that this is an example of an error-free classification. Let us assume that object a really belongs to "crosses" and "circles" and that object b really belongs to "circles" and "triangles". But in real life the training set is often incompletely labeled: Figure 1(a) shows such a situation, where object a is labeled only as a "circle" (the "cross" is missed) and object b is labeled only as a "triangle" (the "circle" is missed).
In Figure 1(b), the missing tags and bold lines are added, depicting the new decision rules for the classes of crosses and circles after recovering the lost tags. This example shows that, in the case of an incompletely labeled dataset, a decision rule may be quite distorted, which has a negative effect on classification performance.
In order to reduce the negative impact of incompletely labeled datasets, we proposed a special approach based on training set modification that reduces contradictions. After applying the algorithm to a real collection of data, the results of the Support Vector Machine (SVM) and Random Forest (RF) classification schemes improved. Here, RF is known to be one of the most effective machine learning techniques [5–7].
To address the incompleteness of training sets, in this paper we describe a new strategy for constructing classification algorithms. On the one hand, the performance of this strategy is evaluated using data collections from the AgingPortfolio resource available on the Web [8]. On the other hand, its effectiveness is confirmed by applying it to the Yeast dataset described below.
Several methods for training set modification have been proposed to improve performance, such as data cleaning [9], outlier detection [10], reference object selection [11], and hybrid classification algorithms [12]. To date, the ability of these approaches to improve real text classification has not been sufficiently studied. Furthermore, none of these methods of training set modification is suitable for solving classification problems with an incompletely labeled training set.
3. Methods
The proposed algorithms are based on the following three assumptions about the classifier input data.
(1) A large number of training set objects are assumed to have an incomplete set of labels. By definition, a complete set of labels is a set which leads to a perfect consensus among experts regarding the impossibility of further adding or removing a label from a document in the data collection.
(2) Experts are not expected to make an error in assigning category labels to documents. That is to say, the training set generation may involve errors of type I only (when checking hypotheses of the form "object d belongs to category label cl").
(3) The compactness hypothesis is assumed to hold. This means similar objects are likely to belong to the same categories, forming compact subsets in the object space. The solution of a classification problem under these assumptions requires an algorithm that treats document relevancy on the basis of data geometry.
We developed alternative approaches because the existing algorithms for training set modification were not designed to work under these assumptions. We used the following two algorithms in our experiments:
(1) the method based on a recent soft supervised learning approach [13] (labeled "SoftSL");
(2) the weighted k-nearest neighbour classifier algorithm (labeled "WkNN") [14–16].
These algorithms use the nearest neighbour set of a document, which is in line with our third assumption.
The first step in the modification of the training set involves the generation of a set PC of document-category relevancy pairs overlooked by the experts:
$$\mathrm{PC} = \{(d, \mathrm{cl}) \mid \psi(d, \mathrm{cl}) = 1\}, \quad (1)$$
where $d$ is a document, cl is a class label (category), and $\psi$ is the function of our training set modification algorithm (WkNN or SoftSL):
$$\psi(d, \mathrm{cl}) = \begin{cases} 1, & \text{if our algorithm places document } d \text{ into class } \mathrm{cl}, \\ 0, & \text{otherwise}. \end{cases} \quad (2)$$
Then two possible outcomes are considered:
(1) complete inclusion of PC into the training set (this option is denoted as "add");
(2) exclusion of document $d$ from the negative examples of the category cl for all relevancy pairs $(d, \mathrm{cl}) \in \mathrm{PC}$ (this option is denoted as "del").
The modified training set will still contain the objects which, according to the algorithm $\psi$, do not belong to the categories labeled by the expert; this makes it possible to find further missing labels in the documents of the training set.
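The "add" and "del" options can be sketched in code. This is an illustrative sketch, not the authors' implementation: the label dictionaries and the predicate `psi`, standing in for the WkNN or SoftSL relevance test, are assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

def modify_training_set(
    labels: Dict[str, Set[str]],      # document id -> expert-assigned labels
    categories: List[str],            # all class labels L
    psi: Callable[[str, str], bool],  # psi(d, cl): WkNN or SoftSL relevance test
    mode: str = "add",                # "add" or "del"
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (positives, excluded) after the 'add' or 'del' modification."""
    # PC = {(d, cl) | psi(d, cl) = 1}, restricted to pairs the experts missed
    pc = {(d, cl) for d in labels for cl in categories
          if cl not in labels[d] and psi(d, cl)}
    positives = {d: set(ls) for d, ls in labels.items()}
    excluded = {d: set() for d in labels}
    for d, cl in pc:
        if mode == "add":
            positives[d].add(cl)   # include (d, cl) as a positive example
        else:
            excluded[d].add(cl)    # "del": drop d from cl's negative examples
    return positives, excluded
```

With the toy labels from Figure 1 (document d1 labeled only "circle"), a ψ that recovers the pair (d1, "cross") adds "cross" to d1's positives under "add", while under "del" the pair is merely excluded from the negative examples.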
3.1. SoftSL Algorithm for Finding Missing Labels. In this section we outline the application of a graph-based algorithm for soft-supervised learning, called SoftSL [13]. Each document is represented by a vertex within a weighted undirected graph, and the framework minimizes the weighted Kullback-Leibler divergence between distributions that encode the class membership probabilities of each vertex.
The advantages of this graph algorithm include direct applicability to the multilabel categorization problem as well as improved performance compared to alternatives [14]. The main idea of the SoftSL algorithm is the following.
Let $D = D_l \cup D_u$ be a set of labeled and unlabeled objects, with
$$D_l = \{(x_i, y_i)\}_{i=1}^{l}, \qquad D_u = \{x_i\}_{i=l+1}^{n}, \quad (3)$$
where $x_i$ is the input vector representing the object to be categorized and $y_i$ is the category label.
Let $G = (V, E)$ be a weighted undirected graph. Here $V = \{1, \ldots, n\}$, where $n$ is the cardinality of $D$, $E = V \times V$, and $\omega_{ij}$ is the weight of the edge linking objects $i$ and $j$. The weight of the edge is defined as
$$w_{ij} = \begin{cases} \mathrm{sim}(x_i, x_j), & \text{if } j \in K(i), \\ 0, & \text{if } j \notin K(i). \end{cases} \quad (4)$$
Here $\mathrm{sim}(x_i, x_j)$ is the measure of similarity between the $i$th and $j$th objects (e.g., the cosine measure), and $K(i)$ is the set of $k$ nearest neighbours of object $x_i$.
Each object is associated with a set of probabilities $p_i = (p_i^t)_{t=1}^{m}$ of belonging to each of the $m$ classes $L = \{\mathrm{cl}_i\}_{i=1}^{m}$. According to the information in $D_l$, we determine the probabilities $r_i = (r_i^t)_{t=1}^{m}$, $i = 1, \ldots, l$, that the documents $\{x_i\}_{i=1}^{l}$ belong to each of the $m$ classes; thus $r_i^t > 0$ if $(d_i, \mathrm{cl}_t) \in D_l$. Each labeled object therefore has a known set of probabilities $r_i$ assigned by the experts. Our intention is to minimize the misalignment function $C_1(p)$ over the sets of probabilities:
$$C_1(p) = \sum_{i=1}^{l} D_{\mathrm{KL}}(r_i \,\|\, p_i) + \mu \sum_{i=1}^{n} \sum_{j \in K(i)} \omega_{ij}\, D_{\mathrm{KL}}(p_i \,\|\, p_j) - \nu \sum_{i=1}^{n} H(p_i),$$
$$D_{\mathrm{KL}}(p_i \,\|\, p_j) \overset{\mathrm{def}}{=} \sum_{t=1}^{m} p_i^t \log \frac{p_i^t}{p_j^t}, \qquad H(p_i) \overset{\mathrm{def}}{=} -\sum_{t=1}^{m} p_i^t \log p_i^t, \quad (5)$$
where $D_{\mathrm{KL}}(p_i \,\|\, p_j)$ denotes the Kullback-Leibler distance and $H(p_i)$ denotes entropy. The parameters $\mu$ and $\nu$ of the algorithm define the contribution of each term to $C_1(p)$. The meanings of the terms are as follows.
(1) The first term in the expression of $C_1$ shows how close the generated probabilities are to the ones assigned by the experts.
(2) The second term accounts for the graph geometry and guarantees that objects close to one another on the graph will have similar probability distributions over classes.
(3) The third term is included for the case when the other terms in the expression are not decisive; its purpose is to produce a regular and uniform probability distribution over classes.
Numerically, the problem is solved using Alternating Minimization (AM) [13]. Note that $D_u$ is empty when there is no unlabeled data. The minimization of the objective, $\min_p C_1(p)$, yields the set of probabilities $p_i = (p_i^1, \ldots, p_i^m)$ for each document $d_i \in D$. We introduce a threshold $T \in [0, 1]$ to assign additional categories relevant to each document: if $p_i^j \geq T$, then $d_i \in \mathrm{cl}_j$.
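The objective $C_1(p)$ in equation (5) can be evaluated directly from its three terms. The sketch below is illustrative only: it computes the objective for given distributions but does not perform the Alternating Minimization step of [13], and all function and variable names are ours.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) of discrete distributions."""
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0.0)

def entropy(p):
    """Shannon entropy H(p) of a discrete distribution."""
    return -sum(pt * math.log(pt) for pt in p if pt > 0.0)

def c1_objective(r, p, neighbours, weights, mu, nu):
    """
    Objective C1(p) from equation (5), term by term.
    r: expert label distributions for the l labeled objects (listed first)
    p: estimated distributions for all n objects
    neighbours[i]: indices of the k nearest neighbours K(i) of object i
    weights[(i, j)]: edge weight omega_ij; mu, nu: trade-off parameters
    """
    fit = sum(kl(r[i], p[i]) for i in range(len(r)))            # fidelity term
    smooth = sum(weights[(i, j)] * kl(p[i], p[j])               # graph term
                 for i in range(len(p)) for j in neighbours[i])
    reg = sum(entropy(pi) for pi in p)                          # entropy term
    return fit + mu * smooth - nu * reg
```

Driving the graph term to zero forces neighbouring documents toward identical class distributions, which is exactly how unlabeled or partially labeled documents inherit labels from similar ones.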
3.2. Weighted kNN Algorithm for Finding Missing Labels. In this section we briefly describe the weighted k-nearest neighbour algorithm [17], which is capable of directly solving the multilabel categorization problem.
Let $\rho(d, d')$ be a distance function between the documents $d$ and $d'$. The function which scores the assignment of document $d$ to class label $\mathrm{cl} \in L$ is then defined as
$$S(d, \mathrm{cl}) = \frac{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')\, I(d', \mathrm{cl})}{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')}, \qquad I(d', \mathrm{cl}) = \begin{cases} 1, & \text{if } d' \in \mathrm{cl}, \\ 0, & \text{otherwise}. \end{cases} \quad (6)$$
Here $\mathrm{kNN}(d)$ is the set of $k$ nearest neighbours of document $d$ in the training set.
We introduce a threshold $T$ such that if $S(d, \mathrm{cl}) \geq T$, then $d \in \mathrm{cl}$. The algorithm computes $S(d, \mathrm{cl})$ for all possible $(d, \mathrm{cl})$ combinations; every combination with $S(d, \mathrm{cl}) \geq T$ is considered a missing label and is used to modify the training set.
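A minimal sketch of the WkNN restoration step follows equation (6) literally. Note that (6) as printed weights each neighbour by the raw distance ρ, which the sketch reproduces; an inverse-distance weight would be the more common choice. All names and the toy distance function are our own.

```python
def wknn_score(d, cl, labels, distance, k):
    """S(d, cl) from equation (6): distance-weighted vote of the k nearest
    neighbours of document d for class label cl.
    labels: dict mapping each training document to its set of labels."""
    neighbours = sorted((x for x in labels if x != d),
                        key=lambda x: distance(d, x))[:k]
    denom = sum(distance(d, x) for x in neighbours)
    if denom == 0.0:
        return 0.0
    num = sum(distance(d, x) for x in neighbours if cl in labels[x])
    return num / denom

def restore_labels(labels, distance, k, t):
    """Add every pair (d, cl) with S(d, cl) >= T as a recovered label."""
    cats = set().union(*labels.values())
    restored = {d: set(ls) for d, ls in labels.items()}
    for d in labels:
        for cl in cats:
            if cl not in labels[d] and wknn_score(d, cl, labels, distance, k) >= t:
                restored[d].add(cl)
    return restored
```

For three points on a line where both neighbours of document c carry label "X", the score S(c, "X") is 1.0 and the label is restored for any threshold T ≤ 1.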
3.3. Support Vector Machine. We use the linear Support Vector Machine as the classification algorithm. Since SVM is mainly a binary classifier, the Binary Relevance approach is chosen to address multilabel problems. This method implies training a separate decision rule $w_l x + b_l > 0$ for every category $l \in L$. More details are available in our previous work on methods for structuring scientific knowledge [18]. As the implementation of SVM, we used the Weka binding of the LIBLINEAR library [19].
3.4. Random Forest. Random Forest is an ensemble machine learning method which combines tree predictors. In this combination, each tree depends on the values of a random vector sampled independently, and all trees in the forest have the same distribution. More details about this method can be found in [20, 21]. In our study we used the implementation of Random Forest from Weka [22].
4. Experimental Results
In this section we describe how we performed the text classification experiments. We applied the classification algorithms to the initial (unmodified) training sets as well as to the training sets modified with the "add" or "del" methods.
We first discuss the scheme of training set transformation and its usefulness. Then we present the process of data generation. Finally, we consider the performance measures used in the experiments, the experimental setting, and the results of the parameter estimation and final validation.
4.1. Training Set Modification. The training set modification step is described in detail in Section 3 (Methods). However, it is important to note that, in both cases, documents that do not belong to the set PC according to the relevance algorithm $\psi$ (i.e., that are too far from documents labeled with the given category) are still retained in the training set. The reason for this choice is that we assume experts do not make mistakes of type II (giving a document an odd label); only the absence of a proper label is expected.
The omission of the relevance pair $(d, A)$ in the training set makes document $d$ move into the set of negative examples when learning a classifier for the class $A$. This distorts the decision rule and negatively affects performance. The proposed set modification scheme is designed to avoid such problems during the training of the classifier.
4.2. Datasets and Data Preprocessing
4.2.1. AgingPortfolio Dataset. The first experiment was carried out using data from the AgingPortfolio information resource [18, 23]. The AgingPortfolio system includes a database of projects related to aging funded by the National Institutes of Health (NIH) and the European Commission (EC, CORDIS). This database currently contains more than one million projects. Each of its records, written in English, displays information related to the author's name, the title, a brief description of the motivation and research objectives, the name of the organization, and the funding period of the project. Some projects contain additional keywords, with an average description of 100 words. In this experiment we used only the title, the brief description, and the tag fields.
A taxonomy containing 335 categories with 6 hierarchical levels was used for document classification. Detailed information about the taxonomy is available on the International Aging Research Portfolio Web site [23]. Biomedical experts manually put the category labels on the document training and test sets. In the process, they used a special procedure for labeling the document test set: two sets of categories carefully selected by different experts were assigned to each document of the test set, and a combination of these categories was then used to achieve a more complete category labeling. The training set was created with little control by various participants, such as users of the AgingPortfolio resource. A visual inspection suggests that the training set contained a significant number of projects with incomplete sets of category labels. The same conclusion is reached by comparing the average number of categories per project: this average is 4.4 in the sample set, compared to 9.79 in the more thoroughly designed test set. The total number of projects was 3,246 for the training set, 183 for the development set, and 1,000 for the test set.
Throughout our study we used the vector model of text representation. The list of keywords and their combinations from the TerMine system [24] (National Centre for Text Mining, NaCTeM) provided the terms used in our study. The method used in this system combines linguistic and statistical information about candidate terms.
We then analysed and processed the set of keyword combinations: whenever short keyword combinations were present within longer ones, the latter were split into the shorter ones. The algorithms and the code used for the keyword combination decomposition are available from the AgingPortfolio Web site [8]. According to the results of our previous experiments, the new vectorization method
provided a 3% increase in the F1-measure compared to the general "bag-of-words" model. We assigned feature weights according to the TF-IDF rule in the BM25 formulation [25, 26] and then normalized the vectors representing the documents in the Euclidean metric of the n-dimensional space.
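The weighting step can be illustrated as follows. This is a generic BM25-style TF-IDF sketch with Euclidean (L2) normalization, not the authors' code; the constants k1 = 2.0 and b = 0.75 and the exact BM25 variant are assumptions.

```python
import math

def bm25_vectorize(docs, k1=2.0, b=0.75):
    """BM25-style TF-IDF weighting followed by L2 normalization.
    docs: list of token lists. Returns one {term: weight} dict per document.
    k1 and b are the usual BM25 constants (values assumed here)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n           # average document length
    df = {}                                         # document frequencies
    for d in docs:
        for term in set(d):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for d in docs:
        tf = {}
        for term in d:
            tf[term] = tf.get(term, 0) + 1
        vec = {}
        for term, f in tf.items():
            idf = math.log(n / df[term])
            # BM25 term-frequency saturation with length normalization
            norm_tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
            vec[term] = norm_tf * idf
        length = math.sqrt(sum(w * w for w in vec.values()))
        if length > 0:
            vec = {t: w / length for t, w in vec.items()}   # Euclidean norm
        vectors.append(vec)
    return vectors
```

A term occurring in every document gets idf = 0 and so drops out of the normalized vector, which mirrors the usual behavior of TF-IDF on stopword-like terms.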
4.2.2. Yeast Dataset. The Yeast dataset [27] is a biomedical dataset of Yeast genes divided into 14 different functional classes. Each instance in the dataset is a gene represented by a vector whose features are the microarray expression levels under various conditions. We used it to check whether our methods are suitable for the classification of genetic information as well as textual data.
Let us describe the method for modeling an incomplete dataset. Since the dataset is well annotated and widely used, the objects (genes) have complete sets of category labels. By randomly deleting labels from objects, we built a model of the incomplete sets of labels in the training set; the parameter p was introduced as the fraction of deleted labels.
We deleted labels under the following conditions:
(1) for each class, p represents the fraction of deleted labels, so the distribution of labels over categories is preserved after modeling the incomplete sets of labels;
(2) the number of objects in the training set remains the same: at least one label is preserved for each object during the label deletion process.
No preprocessing step was necessary because the data is supplied already prepared as a matrix of numbers [27].
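The label deletion procedure under conditions (1) and (2) might be sketched as follows; the per-class quota logic and the order of processing are our assumptions for the example.

```python
import random

def delete_labels(labels, p, seed=0):
    """Model an incompletely labeled training set: for each class, delete a
    fraction p of its labels at random (condition 1), but never remove the
    last label of an object (condition 2).
    labels: dict object id -> set of class labels. Returns a modified copy."""
    rng = random.Random(seed)
    out = {o: set(ls) for o, ls in labels.items()}
    classes = set().union(*labels.values())
    for cl in classes:
        members = [o for o in labels if cl in labels[o]]
        rng.shuffle(members)
        quota = int(p * len(members))   # labels of this class to delete
        for o in members:
            if quota == 0:
                break
            if len(out[o]) > 1:         # keep at least one label per object
                out[o].discard(cl)
                quota -= 1
    return out
```

Even at p = 1.0 every object retains at least one label, so the number of objects in the training set is unchanged.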
4.3. Performance Measurements. The following characteristics were used to evaluate and compare the different classification algorithms:
(i) microaveraged precision, recall, and F1-measure [27];
(ii) CROC curves and their AUC values computed for selected categories [28]. A CROC curve is a modification of a ROC curve in which the $x$ axis is rescaled as $x_{\mathrm{new}}(x)$. We used the standard exponential scaling $x_{\mathrm{new}}(x) = (1 - e^{-\alpha x})/(1 - e^{-\alpha})$ with $\alpha = 7$.
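The exponential rescaling and the resulting CROC AUC can be sketched as follows. The transform is the one defined above (and in [28]); integrating the transformed curve with the trapezoid rule is our choice for the example.

```python
import math

def croc_transform(x, alpha=7.0):
    """Exponential x-axis rescaling used for CROC curves:
    x_new = (1 - exp(-alpha * x)) / (1 - exp(-alpha)), with alpha = 7."""
    return (1.0 - math.exp(-alpha * x)) / (1.0 - math.exp(-alpha))

def croc_auc(roc_points, alpha=7.0):
    """Trapezoidal AUC after transforming the false-positive-rate axis.
    roc_points: sorted list of (fpr, tpr) pairs running from (0,0) to (1,1)."""
    pts = [(croc_transform(fpr, alpha), tpr) for fpr, tpr in roc_points]
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between points
    return auc
```

Because the transform expands the low-FPR region, a classifier that retrieves relevant documents early gains more CROC AUC than it would ROC AUC, which is exactly the "early retrieval" emphasis of [28].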
4.4. Experimental Setup. The procedures for selecting the important parameters of the algorithms outlined above are described next.
4.4.1. Parameters for SVM
AgingPortfolio Dataset. The following SVM parameters were tuned for each decision rule:
(i) the cost parameter $C$, which controls the trade-off between maximization of the separation margin and minimization of the total error [15];
(ii) the parameter $b_l$, which plays the role of a classification threshold in the decision rule.
We performed parameter tuning using a sliding control method with 5-fold cross-validation according to the following strategy: the $C$-parameter was varied on a grid, followed by tuning of the $b_l$-parameter (for every value of $C$) for every category. A set $\{(C, b_l)\}_{l \in L}$ of parameter pairs was considered optimal if it maximized the F1-measure with averaging over documents. While $C$ has the same value for all categories, the threshold parameter $b_l$ was tuned (for a given value of $C$) for each class label $l \in L$.
Yeast Dataset. In the experiments with the Yeast dataset, no selection of SVM parameters was performed (i.e., C = 1 and b_l = 0 for all l ∈ L).
4.4.2. Parameters for RF. Thirty decision trees were used to build the Random Forest. The number of inputs considered when splitting a tree node is the square root of the number of features. This procedure follows [29] by Leo Breiman, who developed the Random Forest algorithm.
4.4.3. AgingPortfolio Dataset. Parameters k and T were tuned on a grid as follows. We prepared a validation set D_dev of 183 documents in the same way as the test set. The performance metrics for SVM classifiers trained on modified document sets were then evaluated on D_dev. A combination of parameters was considered optimal if it maximized the F1-measure. The parameter μ of the SoftSL algorithm was tuned on a grid while keeping k fixed at its optimal value. A fixed category assignment threshold of T = 0.005 was used for the SoftSL training set modification algorithm. We used ν = 0, since all documents in the experiments contained some category labels and regularization was unnecessary.
4.4.4. Yeast Dataset. The method for selecting the parameters for the Yeast dataset is the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) randomly selected training genes. The SoftSL training set modification algorithm was not used for this dataset.
4.5. A Comparison of Methods
4.5.1. AgingPortfolio. We evaluated the overall performance on a total of 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.
Also, all of our training set modification methods trade lower precision for higher recall. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.
The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly; as a result, a larger number of relevancy tags are added. This is consistent with the higher recall and lower precision values of add+SoftSL and del+SoftSL, as compared to the WkNN-based methods, in Table 1.
Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without
6 Computational and Mathematical Methods in Medicine
Table 1: Microaveraged precision, recall, and F1-measure (F1) obtained on the AgingPortfolio dataset with different classification methods.

Method | Precision | Recall | F1
SVM with fixed parameters | 0.8649 | 0.1983 | 0.2977
SVM with parameter tuning | 0.7727 | 0.3302 | 0.4159
SVM, del+WkNN | 0.5538 | 0.4452 | 0.4439
SVM, add+WkNN | 0.4664 | 0.5684 | 0.4707
SVM, del+SoftSL | 0.2132 | 0.6914 | 0.3259
SVM, add+SoftSL | 0.3850 | 0.5639 | 0.4576

Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

Modification method | Categories per doc | Docs per category
No modification | 4.4 | 45.09
Add+WkNN | 15.15 | 155.14
Add+SoftSL | 16.6 | 168.87
training set modification, SVM with del+WkNN modification, and SVM with add+WkNN modification. We can notice that SVM classification with the incorporated training set modification outperforms plain SVM classification.
AUC values calculated for the del+WkNN curves are generally only slightly lower than, and in some cases even exceed, the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.
CROC curves for the add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most cases the results are roughly equivalent. Sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.
Additional data relevant to the algorithm comparison is presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for different categories obtained with the different methods. These results are more illustrative: it can be seen that add+WkNN outperforms the other methods.
The metrics from the Random Forest classification experiments are provided in Table 7. The results in Table 7 show that the modification of the training sets improves the classification performance in this case as well.
4.5.2. Yeast Dataset: A Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is related to one or more of the 14 labels (first FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into training and test sets, so that the training set contains 1,500 examples and the test set contains 917.
The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created 6 different training sets by deleting a varying fraction (p) of document-class pairs. The concrete document-class pairs for deletion were selected randomly.
Table 3: Microaveraged results for category "experimental techniques: in vivo methods" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.6 | 0.12 | 0.2
With del+WkNN | 0.7879 | 0.26 | 0.391
With add+WkNN | 0.66 | 0.33 | 0.44
With del+SoftSL | 0.2140 | 0.64 | 0.3208
With add+SoftSL | 0.4653 | 0.47 | 0.4677

Table 4: Microaveraged results for category "cancer and related diseases: malignant neoplasms, including in situ" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.4444 | 0.4 | 0.4211
With del+WkNN | 0.1277 | 0.9 | 0.2236
With add+WkNN | 0.24 | 0.9 | 0.3789
With del+SoftSL | 0.0549 | 1.0 | 0.1040
With add+SoftSL | 0.1032 | 0.975 | 0.1866

Table 5: Microaveraged results for category "aging mechanisms by anatomy: cell level" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.5167 | 0.3827 | 0.4397
With del+WkNN | 0.5946 | 0.2716 | 0.3729
With add+WkNN | 0.4123 | 0.5802 | 0.4821
With del+SoftSL | 0.5342 | 0.4815 | 0.5065
With add+SoftSL | 0.4182 | 0.5679 | 0.4817

Table 6: Microaveraged results for category "aging mechanisms by anatomy: cell level, cellular substructures" (AgingPortfolio dataset).

Method | Precision | Recall | F1
SVM only | 0.52 | 0.2167 | 0.3059
With del+WkNN | 1.0 | 0.0167 | 0.0328
With add+WkNN | 0.4493 | 0.5167 | 0.4806
With del+SoftSL | 0.75 | 0.15 | 0.25
With add+SoftSL | 0.3667 | 0.55 | 0.44

Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification with different training set modifications.

Method | Precision | Recall | F1
RF only | 0.4738 | 0.2033 | 0.2467
With del+WkNN | 0.4852 | 0.2507 | 0.2870
With add+WkNN | 0.3058 | 0.4255 | 0.3194
We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.
Figure 2: CROC curves (transformed false positive rate versus recall) for different categories of AgingPortfolio (SVM classification with WkNN analysis); "Random selection" denotes the random-baseline curve. AUC values:
(a) "Experimental techniques: in vivo methods" (WkNN): SVM only 0.3919; SVM add+WkNN 0.4323; SVM del+WkNN 0.4181.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)" (WkNN): SVM only 0.6830; SVM add+WkNN 0.7484; SVM del+WkNN 0.7559.
(c) "Aging mechanisms by anatomy: cell level" (WkNN): SVM only 0.4843; SVM add+WkNN 0.5340; SVM del+WkNN 0.5238.
(d) "Aging mechanisms by anatomy: cell level, cellular substructures" (WkNN): SVM only 0.4892; SVM add+WkNN 0.5603; SVM del+WkNN 0.5460.
The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1-measure in comparison with the raw SVM results (Table 8).
A Notable Fact. Add+WkNN slightly reduced the precision at low p, but in the worst cases, at p = 0.3 and p = 0.4, the precision even rose. In any case, the significant improvement of recall in all cases is a good trade-off. Recall also significantly improved when the RF algorithm was used in addition to this method (Tables 10 and 11).
5. Discussion
Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when for each document in our
Figure 3: CROC curves (transformed false positive rate versus recall) for different categories of AgingPortfolio (SVM classification with SoftSL analysis); "Random selection" denotes the random-baseline curve. AUC values:
(a) "Experimental techniques: in vivo methods" (SoftSL): SVM only 0.3904; SVM add+SoftSL 0.4747; SVM del+SoftSL 0.4498.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)" (SoftSL): SVM only 0.6818; SVM add+SoftSL 0.6374; SVM del+SoftSL 0.7018.
(c) "Aging mechanisms by anatomy: cell level" (SoftSL): SVM only 0.4847; SVM add+SoftSL 0.5447; SVM del+SoftSL 0.5423.
(d) "Aging mechanisms by anatomy: cell level, cellular substructures" (SoftSL): SVM only 0.4900; SVM add+SoftSL 0.5757; SVM del+SoftSL 0.5587.
training set some correct labels may be omitted). The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in PU training, some of the dataset items are considered fully labeled, and the other items are not labeled at all.
To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both of these approaches are intended to restore the missing document labels using different similarity measurements between each given document and other documents with similar labels.
We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than
Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification: CROC curves (transformed false positive rate versus recall) for different categories of AgingPortfolio; "Random selection" denotes the random-baseline curve. AUC values:
(a) "Experimental techniques: in vivo methods": SVM add+SoftSL 0.4752; SVM add+WkNN 0.4311.
(b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM add+SoftSL 0.6380; SVM add+WkNN 0.7375.
(c) "Aging mechanisms by anatomy: cell level": SVM add+SoftSL 0.5454; SVM add+WkNN 0.5348.
(d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM add+SoftSL 0.5757; SVM add+WkNN 0.5541.
SoftSL: it has a better F1-measure than SoftSL, and it is also simpler to implement.
Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.
One more problem is the evaluation of the classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result
Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p parameter | Precision | Recall | F1
0 | 0.7176 | 0.5707 | 0.6358
0.1 | 0.7337 | 0.5233 | 0.6109
0.2 | 0.7354 | 0.4056 | 0.5229
0.3 | 0.6260 | 0.2442 | 0.3513
0.4 | 0.3544 | 0.1191 | 0.1783
0.5 | 0 | 0 | 0
0.6 | 0 | 0 | 0

Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p parameter | Optimal k | Optimal T | Precision | Recall | F1
0 | — | — | — | — | —
0.1 | 10 | 0.3 | 0.6582 | 0.6847 | 0.6712
0.2 | 10 | 0.25 | 0.6525 | 0.6811 | 0.6665
0.3 | 10 | 0.15 | 0.6357 | 0.7137 | 0.6725
0.4 | 10 | 0.1 | 0.6604 | 0.669 | 0.6648
0.5 | 10 | 0.05 | 0.6225 | 0.7259 | 0.6702
0.6 | 10 | 0.05 | 0.6248 | 0.7261 | 0.6716

Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p parameter | Precision | Recall | F1
0 | 0.6340 | 0.5087 | 0.5315
0.1 | 0.6081 | 0.4613 | 0.4959
0.2 | 0.5693 | 0.3648 | 0.4133
0.3 | 0.5788 | 0.3240 | 0.3873
0.4 | 0.5068 | 0.2471 | 0.3094
0.5 | 0.5194 | 0.2431 | 0.3104
0.6 | 0.4354 | 0.1621 | 0.2224

Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p parameter | Optimal k | Optimal T | Precision | Recall | F1
0 | — | — | — | — | —
0.1 | 10 | 0.3 | 0.5940 | 0.7120 | 0.6224
0.2 | 10 | 0.25 | 0.5626 | 0.7462 | 0.6146
0.3 | 10 | 0.15 | 0.5282 | 0.7992 | 0.6095
0.4 | 10 | 0.10 | 0.5223 | 0.7648 | 0.5940
0.5 | 10 | 0.05 | 0.4672 | 0.8388 | 0.5726
0.6 | 15 | 0.05 | 0.4547 | 0.8695 | 0.5707
should be reviewed by experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.
Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset. We did this with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
Moreover, the experimental results presented in Section 4.5.2 demonstrate a notable effect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1-measure for SVM classification was even greater than for the original, completely labeled set.
Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr Charles R Cantor (Chief ScientificOfficer at Sequenom Inc and Professor at Boston University)for his contribution to this work The authors would like tothank the UMA Foundation for its help in preparation ofthe paper They would like to thank the reviewers for manyconstructive and meaningful comments and suggestions thathelped improve the paper and laid the foundation for furtherresearch
References
[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348–1356, 2010.
[29] Random Forests, Leo Breiman and Adele Cutler, 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.
We developed alternative approaches because the existing algorithms for training set modification were not designed to work with these assumptions. We used the following two algorithms in our experiments:
(1) a method based on a recent soft-supervised learning approach [13] (labeled "SoftSL");
(2) the weighted k-nearest neighbour classifier algorithm (labeled "WkNN") [14–16].
These algorithms use the nearest neighbour set of a document, which is in line with our third assumption.
The first step in the modification of the training set involves the generation of a set PC of document-category relevancy pairs overlooked by the experts:

\[
\mathrm{PC} = \{ (d, \mathrm{cl}) \mid \psi(d, \mathrm{cl}) = 1 \},
\tag{1}
\]

where d is a document, cl is a class label (category), and ψ is the function computed by our training set modification algorithm (WkNN or SoftSL):

\[
\psi(d, \mathrm{cl}) =
\begin{cases}
1, & \text{if our algorithm places document } d \text{ into class } \mathrm{cl},\\
0, & \text{otherwise}.
\end{cases}
\tag{2}
\]
Then two possible outcomes are considered:
(1) complete inclusion of PC into the training set (this option is denoted "add");
(2) exclusion of document d from the negative examples of the category cl for all relevancy pairs (d, cl) ∈ PC (this option is denoted "del").
The modified training set will still contain objects which, according to the algorithm ψ, do not belong to the set labeled by the experts. It is thus possible to find further missing labels in the documents of the training set.
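The "add" and "del" options can be sketched on a simple binary-relevance data layout (dicts mapping each class to its positive and negative document sets; this layout is an assumption for illustration, not the authors' implementation):

```python
def apply_pc(pos, neg, pc, mode):
    """Apply the predicted relevancy pairs PC to a training set.
    mode="add": insert (d, cl) as a positive example of cl;
    mode="del": only remove d from the negative examples of cl.
    In both modes d stops being a negative example of cl."""
    for d, cl in pc:
        if mode == "add":
            pos.setdefault(cl, set()).add(d)
        neg.get(cl, set()).discard(d)
    return pos, neg
```

Documents not appearing in PC are left untouched, which matches the scheme above.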
3.1. SoftSL Algorithm for Finding Missing Labels. In this section we outline the application of a new graph algorithm for soft-supervised learning, called SoftSL [13]. Each document is represented by a vertex within a weighted undirected graph, and the framework minimizes the weighted Kullback-Leibler divergence between distributions that encode the class membership probabilities of each vertex.
The advantages of this graph algorithm include direct applicability to the multilabel categorization problem as well as improved performance compared to alternatives [14]. The main idea of the SoftSL algorithm is the following.
Let D = D_l ∪ D_u be a set of labeled and unlabeled objects, with

\[
D_l = \{(x_i, y_i)\}_{i=1}^{l}, \qquad D_u = \{x_i\}_{i=l+1}^{n},
\tag{3}
\]

where x_i is the input vector representing an object to be categorized and y_i is its category label.

Let G = (V, E) be a weighted undirected graph. Here V = {1, ..., n}, where n is the cardinality of D, E ⊆ V × V, and ω_ij is the weight of the edge linking objects i and j. The weight of the edge is defined as

\[
w_{ij} =
\begin{cases}
\mathrm{sim}(x_i, x_j), & \text{if } j \in K(i),\\
0, & \text{if } j \notin K(i),
\end{cases}
\tag{4}
\]

where sim(x_i, x_j) is the measure of similarity between the i-th and j-th objects (e.g., the cosine measure) and K(i) is the set of k nearest neighbours of object x_i.
Each object is associated with a set of probabilities p_i = (p_i^t), t = 1, ..., m, of belonging to each of the m classes L = {cl_t}, t = 1, ..., m. From the information in D_l we determined the probabilities r_i = (r_i^t), i = 1, ..., l, that the documents x_1, ..., x_l belong to each of the m classes; thus r_i^t > 0 if (d_i, cl_t) ∈ D_l. Each labeled object therefore has a known set of probabilities r_i assigned by the experts. Our intention is to minimize the misalignment function C_1(p) over the sets of probabilities:

\[
C_1(p) = \sum_{i=1}^{l} D_{\mathrm{KL}}(r_i \,\|\, p_i)
+ \mu \sum_{i=1}^{n} \sum_{j \in K(i)} \omega_{ij} D_{\mathrm{KL}}(p_i \,\|\, p_j)
- \nu \sum_{i=1}^{n} H(p_i),
\tag{5}
\]

where

\[
D_{\mathrm{KL}}(p_i \,\|\, p_j) \stackrel{\text{def}}{=} \sum_{t=1}^{m} p_i^t \log \frac{p_i^t}{p_j^t},
\qquad
H(p_i) \stackrel{\text{def}}{=} -\sum_{t=1}^{m} p_i^t \log p_i^t,
\]

with D_KL the Kullback-Leibler divergence and H the entropy.
μ and ν are the parameters of the algorithm, defining the contribution of each term to C_1(p). The meanings of the terms are as follows:
(1) the first term in the expression for C_1 shows how close the generated probabilities are to the ones assigned by the experts;
(2) the second term accounts for the graph geometry and guarantees that objects close to one another on the graph will have similar probability distributions over classes;
(3) the third term is included in case the other terms in the expression are not contradictory; its purpose is to produce a regular and uniform probability distribution over classes.
Numerically, the problem is solved using Alternating Minimization (AM) [13]. Note that D_u is empty when there are no unlabeled data. The minimization of the objective, min_p C_1(p), yields a set of probabilities p_i = (p_i^1, ..., p_i^m) for each document d_i ∈ D. We introduce a threshold T ∈ [0, 1] to assign additional categories relevant to each document: if p_i^j ⩾ T, then d_i ∈ cl_j.
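For concreteness, the objective C_1(p) of Eq. (5) can be evaluated as follows. This is a toy sketch on explicit probability lists; the graph layout (weight dict and neighbour lists) is an assumption, and the AM optimization itself is omitted.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def c1(p, r, w, neighbors, mu, nu):
    """SoftSL objective of Eq. (5): fit to the expert labels r (given for the
    first len(r) objects), plus graph smoothness, minus entropy regularizer."""
    fit = sum(kl(r[i], p[i]) for i in range(len(r)))
    smooth = sum(w[(i, j)] * kl(p[i], p[j])
                 for i in range(len(p)) for j in neighbors[i])
    return fit + mu * smooth - nu * sum(entropy(pi) for pi in p)
```

An AM solver would alternate updates of auxiliary and primary distributions until this value stops decreasing.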
3.2. Weighted kNN Algorithm for Finding Missing Labels. In this section we briefly describe the weighted k-nearest neighbour algorithm [17], which can directly solve the multilabel categorization problem.

Let ρ(d, d′) be a distance function between documents d and d′. The function that scores the assignment of document d to class label cl ∈ L is then defined as

\[
S(d, \mathrm{cl}) = \frac{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')\, I(d', \mathrm{cl})}{\sum_{d' \in \mathrm{kNN}(d)} \rho(d, d')},
\qquad
I(d', \mathrm{cl}) =
\begin{cases}
1, & \text{if } d' \in \mathrm{cl},\\
0, & \text{otherwise},
\end{cases}
\tag{6}
\]

where kNN(d) is the set of k nearest neighbours of document d in the training set.

We introduce a threshold T such that if S(d, cl) ⩾ T, then d ∈ cl. The algorithm computes S(d, cl) for all possible (d, cl) combinations; every combination with S(d, cl) ⩾ T is considered a missing label and is used to modify the training set.
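A direct reading of Eq. (6) gives the following sketch. Note that the equation weights each vote by the distance ρ itself, which this sketch follows literally; in practice a similarity weight may be intended. The data layout (neighbour list and label dict) is an assumption.

```python
def wknn_scores(d, knn, rho, labels_of):
    """S(d, cl) from Eq. (6): weighted vote of d's k nearest neighbours."""
    denom = sum(rho(d, n) for n in knn)
    scores = {}
    for n in knn:
        for cl in labels_of[n]:
            scores[cl] = scores.get(cl, 0.0) + rho(d, n)
    return {cl: s / denom for cl, s in scores.items()}

def restore_labels(d, knn, rho, labels_of, T):
    """Every (d, cl) with S(d, cl) >= T is treated as a missing label."""
    return {cl for cl, s in wknn_scores(d, knn, rho, labels_of).items() if s >= T}
```

With equal neighbour weights, S(d, cl) reduces to the fraction of neighbours carrying the label cl.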
3.3. Support Vector Machine. We use the linear Support Vector Machine as the classification algorithm in this case. Since SVM is fundamentally a binary classifier, the Binary Relevance approach is chosen to address multilabel problems. This method implies training a separate decision rule w_l x + b_l > 0 for every category l ∈ L. More details are available in our previous work on methods for structuring scientific knowledge [18]. As the implementation of SVM, we used the Weka binding of the LIBLINEAR library [19].
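Binary Relevance reduces the multilabel task to |L| independent binary problems; a minimal sketch with a pluggable binary learner (the trainer interface here is an assumption for illustration; the paper itself uses linear SVMs from LIBLINEAR as the per-label learners):

```python
def br_fit(docs, label_sets, labels, fit_binary):
    """Binary Relevance: train one binary decision rule per label l in L.
    fit_binary(docs, y) must return a predicate doc -> bool."""
    return {l: fit_binary(docs, [l in s for s in label_sets]) for l in labels}

def br_predict(models, doc):
    """A document receives every label whose binary rule fires."""
    return {l for l, m in models.items() if m(doc)}
```

Swapping fit_binary for a linear-SVM trainer (and thresholding its score at b_l) recovers the setup described above.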
3.4. Random Forest. Random Forest is an ensemble machine learning method which combines tree predictors. In this combination, each tree depends on the values of a random vector sampled independently, and all trees in the forest have the same distribution. More details about this method can be found in [20, 21]. In our study we used the implementation of Random Forest from Weka [22].
4. Experimental Results
In this section we describe how we performed the text classification experiments. We applied the classification algorithms to the initial (unmodified) training sets as well as to the training sets modified with the "add" or "del" methods.
We shall first discuss the scheme of training set transformation and its usefulness. Then we shall present the process of data generation. Finally, we shall consider the performance measures used in the experiments, the experimental setting, and the results of the parameter estimation and final validation.
4.1. Training Set Modification. The training set modification step is described in detail in Section 3 (Methods). However, it is important to note that, in both cases, documents that do not belong to the set PC according to the relevance algorithm ψ (i.e., are too far from documents labeled with the given category) are still retained in the training set. The reason for this choice is our assumption that experts do not make mistakes of type II (giving a document an odd label); only the absence of a proper label is expected to occur.
The omission of the relevance pair (d, A) in the training set makes document d move into the set of negative examples when learning a classifier for class A. This alters the decision rule and negatively affects performance. The proposed set modification scheme is designed to avoid such problems during the training of the classifier.
4.2. Datasets and Data Preprocessing

4.2.1. AgingPortfolio Dataset. The first experiment was carried out using data from the AgingPortfolio information resource [18, 23]. The AgingPortfolio system includes a database of projects related to aging and is funded by the National Institutes of Health (NIH) and the European Commission (EC, CORDIS). This database currently contains more than one million projects. Each of its records, written in English, displays information related to the author's name, the title, a brief description of the motivation and research objectives, the name of the organization, and the funding period of the project. Some projects contain additional keywords, with an average description of 100 words. In this experiment we used only the title, brief description, and tag fields.
A taxonomy containing 335 categories across 6 hierarchical levels was used for document classification. Detailed information about the taxonomy is available on the International Aging Research Portfolio Web site [23]. Biomedical experts manually assigned the category labels to the training and test sets of documents. A special procedure was used for labeling the test set: two sets of categories, carefully selected by different experts, were assigned to each document of the test set, and a combination of these categories was used to achieve a more complete category labeling. The training set, by contrast, was created with little control by various participants, such as users of the AgingPortfolio resource. A visual inspection suggests that the training set contains a significant number of projects with incomplete sets of category labels. The same conclusion follows from comparing the average number of categories per project: 4.4 in the training set compared to 9.79 in the more thoroughly labeled test set. The total number of projects was 3,246 for the training set, 183 for the development set, and 1,000 for the test set.
Throughout our study we used the vector model of text representation. The terms used in our study were the keywords and keyword combinations provided by the TerMine system [24] (National Centre for Text Mining, NaCTeM). The method used in this system combines linguistic and statistical information about candidate terms.
We then analyzed and processed the set of keyword combinations: whenever short keyword combinations were present within longer ones, the latter were split into the shorter ones. The algorithms and the code used for this keyword combination decomposition are available from the AgingPortfolio Web site [8]. According to the results of our previous experiments, the new vectorization method
Computational and Mathematical Methods in Medicine 5
provided a 3% increase in the F1-measure compared to the general "bag-of-words" model. We assigned feature weights according to the TF-IDF rule in the BM25 formulation [25, 26] and then normalized the vectors representing the documents to unit length in the Euclidean metric of the n-dimensional feature space.
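As an illustration, a BM25-style TF-IDF weighting with subsequent Euclidean (L2) normalization can be sketched as follows (a minimal sketch, not the authors' code; the parameter values k1 = 1.2 and b = 0.75 are common BM25 defaults, not values reported in the paper):

```python
import math

def bm25_weights(tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """BM25-style TF-IDF weights for one document.
    tf: {term: raw frequency in the document};
    df: {term: document frequency in the corpus}."""
    w = {}
    for t, f in tf.items():
        # smoothed IDF (the +1 inside the log avoids negative weights)
        idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        # BM25 term-frequency saturation with document-length normalization
        w[t] = idf * f * (k1 + 1.0) / (f + k1 * (1.0 - b + b * doc_len / avg_len))
    return w

def l2_normalize(w):
    """Normalize the document vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}
```

A rare term with a high in-document frequency ends up with a larger weight than a common term, and the final unit-length vectors make Euclidean distances between documents comparable.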
4.2.2. Yeast Dataset. The Yeast dataset [27] is a biomedical dataset of Yeast genes divided into 14 different functional classes. Each instance in the dataset is a gene represented by a vector whose features are the microarray expression levels under various conditions. We used it to determine whether our methods are suitable for the classification of genetic information as well as textual data.
Let us describe how we modeled an incompletely labeled dataset. Since the dataset is well annotated and widely used, the objects (genes) have complete sets of category labels. By randomly deleting labels from objects we modeled incomplete sets of labels in the training set. The parameter p was introduced as the fraction of deleted labels.
We deleted labels subject to the following conditions:

(1) for each class, p is the fraction of its deleted labels, so the distribution of labels over categories is preserved after modeling the incomplete label sets;

(2) the number of objects in the training set remains the same: at least one label is preserved for each object during the label deletion process.
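The two conditions above can be sketched as follows (hypothetical code, not the authors' implementation; the per-class deleted fraction is approximate, because condition (2) can block some deletions):

```python
import random

def delete_labels(label_sets, p, seed=0):
    """Artificially 'deplete' a fully labeled training set: per class, delete
    roughly a fraction p of its labels, never leaving an item with no labels."""
    rng = random.Random(seed)
    out = [set(ls) for ls in label_sets]
    # group item indices by class
    by_class = {}
    for i, ls in enumerate(label_sets):
        for c in ls:
            by_class.setdefault(c, []).append(i)
    for c, items in by_class.items():
        rng.shuffle(items)
        deleted = 0
        for i in items:
            if deleted >= int(p * len(items)):
                break
            if len(out[i]) > 1:  # condition (2): keep at least one label
                out[i].discard(c)
                deleted += 1
    return out
```

Applying this with several values of p yields the family of increasingly "depleted" training sets used in the Yeast experiments.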
No preprocessing step was necessary because the data issupplied already prepared as a matrix of numbers [27]
4.3. Performance Measurements. The following characteristics were used to evaluate and compare the different classification algorithms:
(i) microaveraged precision, recall, and F1-measure [27];

(ii) CROC curves and their AUC values computed for selected categories [28]. A CROC curve is a modification of a ROC curve in which the x axis is rescaled as x_new(x). We used the standard exponential scaling x_new(x) = (1 − e^(−αx)) / (1 − e^(−α)) with α = 7.
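The CROC rescaling and the resulting AUC can be computed as follows (a small sketch; the trapezoidal integration is an assumption, since the paper does not state how the AUC values were computed):

```python
import math

def croc_x(x, alpha=7.0):
    """Exponential rescaling of the false positive rate used for CROC curves:
    x_new(x) = (1 - exp(-alpha * x)) / (1 - exp(-alpha))."""
    return (1.0 - math.exp(-alpha * x)) / (1.0 - math.exp(-alpha))

def croc_auc(fpr, tpr, alpha=7.0):
    """Trapezoidal AUC of a ROC curve after rescaling the x axis."""
    xs = [croc_x(x, alpha) for x in fpr]
    return sum((xs[i + 1] - xs[i]) * (tpr[i + 1] + tpr[i]) / 2.0
               for i in range(len(xs) - 1))
```

The rescaling magnifies the early-retrieval region: with α = 7, a false positive rate of 0.1 is mapped to roughly half of the transformed axis, so differences among the best-ranked documents dominate the AUC.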
4.4. Experimental Setup. The procedures for selecting the important parameters of the algorithms are described next.
4.4.1. Parameters for SVM
AgingPortfolio Dataset. The following SVM parameters were tuned for each decision rule:

(i) the cost parameter C, which controls the trade-off between maximizing the separation margin and minimizing the total error [15];

(ii) the parameter b_l, which plays the role of a classification threshold in the decision rule.
We performed parameter tuning using a sliding control method with 5-fold cross-validation according to the following strategy. The C parameter was varied on a grid, followed by tuning of the b_l parameter (for every value of C) for every category. A set PC = {(C, b_l)}_{l∈L} of parameter pairs was considered optimal if it maximized the F1-measure with averaging over documents. While C has the same value for all categories, the threshold parameter b_l was tuned (for a given value of C) separately for each class label l ∈ L.
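The per-class threshold tuning can be sketched as follows (illustrative code, not the authors' implementation; for simplicity it maximizes per-class F1 on precomputed SVM decision values, whereas the paper averages F1 over documents):

```python
import numpy as np

def f1(y_true, y_pred):
    """F1-measure for boolean vectors."""
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    return 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_thresholds(scores, y_true, grid=np.linspace(-2.0, 2.0, 41)):
    """For each class l, pick the threshold b_l maximizing F1 on held-out data.
    scores: (n_docs, n_classes) SVM decision values; y_true: boolean matrix."""
    return [max(grid, key=lambda b: f1(y_true[:, l], scores[:, l] > b))
            for l in range(scores.shape[1])]
```

In the full scheme, this inner threshold search would be repeated for every candidate C on the grid inside each cross-validation fold.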
Yeast Dataset. In the experiments with the Yeast dataset, no selection of SVM parameters was performed (i.e., C = 1 and b_l = 0 for all values of l ∈ L).
4.4.2. Parameters for RF. The Random Forest was built with 30 decision trees. The number of features considered when splitting a tree node was the square root of the total number of features, following the recommendations [29] of Leo Breiman, who developed the Random Forest algorithm.
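In scikit-learn terms this configuration corresponds to the following (an equivalent setup shown for illustration only; the authors used the Weka implementation, and the data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data: 60 samples, 16 features, a simple linear concept
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 30 trees; max_features="sqrt" considers sqrt(16) = 4 features per split
rf = RandomForestClassifier(n_estimators=30, max_features="sqrt",
                            random_state=0).fit(X, y)
pred = rf.predict(X)
```

The "sqrt of the feature count per split" rule decorrelates the trees, which is what makes averaging over the ensemble reduce variance.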
4.4.3. AgingPortfolio Dataset. Parameters k and T were tuned on a grid as follows. We prepared a validation set D_dev of 183 documents in the same way as the test set. The performance metrics for SVM classifiers trained on the modified document sets were then evaluated on D_dev. A combination of parameters was considered optimal if it maximized the F1-measure. The parameter μ of the SoftSL algorithm was tuned on a grid with k fixed at its optimal value. A fixed category assignment threshold of T = 0.005 was used for the SoftSL training set modification algorithm. We used ν = 0, since all documents in the experiments contained some category labels and regularization was unnecessary.
4.4.4. Yeast Dataset. The method for selecting the parameters for the Yeast dataset was the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) genes randomly selected from the training set. The SoftSL training set modification algorithm was not used for this dataset.
4.5. A Comparison of Methods
4.5.1. AgingPortfolio. We evaluated the general performance on a total of 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.
Also, all of our training set modification methods trade lower precision for higher recall. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.
The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly; as a result, a larger number of relevance tags are added. This is consistent with the higher recall and lower precision values of add+SoftSL and del+SoftSL as compared to the WkNN-based methods in Table 1.
Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without
Table 1: Microaveraged precision, recall, and F1-measure (F1) obtained on the AgingPortfolio dataset with different classification methods.

Method                      Precision  Recall  F1
SVM with fixed parameters   0.8649     0.1983  0.2977
SVM with parameter tuning   0.7727     0.3302  0.4159
SVM del+WkNN                0.5538     0.4452  0.4439
SVM add+WkNN                0.4664     0.5684  0.4707
SVM del+SoftSL              0.2132     0.6914  0.3259
SVM add+SoftSL              0.3850     0.5639  0.4576
Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

Modification method  Categories per doc  Docs in category
No modification      4.4                 45.09
Add+WkNN             15.15               155.14
Add+SoftSL           16.6                168.87
training set modifications, for SVM with the del+WkNN modification, and for SVM with the add+WkNN modification. We can notice that SVM classification with incorporated training set modification outperforms plain SVM classification.
AUC values calculated for the del+WkNN curves are generally only slightly lower than, and in some cases even exceed, the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.
CROC curves for the add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most cases the results are roughly equivalent; sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.
Additional data relevant to the algorithm comparison are presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for different categories obtained with the different methods. These results show the differences in sharper relief: it can be seen that add+WkNN outperforms the other methods.
Selected metrics from the Random Forest classification experiments are provided in Table 7. The results in Table 7 show that the modification of the training sets improves the classification performance in this case as well.
4.5.2. Yeast Dataset: A Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is associated with one or more of the 14 labels (the first FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into a training set of 1,500 examples and a test set of 917 examples.
The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created 6 different training sets by deleting a varying fraction p of document-class pairs. The specific document-class pairs for deletion were selected randomly.
Table 3: Microaveraged results for the category "experimental techniques: in vivo methods" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.6        0.12    0.2
With del+WkNN    0.7879     0.26    0.391
With add+WkNN    0.66       0.33    0.44
With del+SoftSL  0.2140     0.64    0.3208
With add+SoftSL  0.4653     0.47    0.4677
Table 4: Microaveraged results for the category "cancer and related diseases: malignant neoplasms including in situ" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.4444     0.4     0.4211
With del+WkNN    0.1277     0.9     0.2236
With add+WkNN    0.24       0.9     0.3789
With del+SoftSL  0.0549     1.0     0.1040
With add+SoftSL  0.1032     0.975   0.1866
Table 5: Microaveraged results for the category "aging mechanisms by anatomy: cell level" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.5167     0.3827  0.4397
With del+WkNN    0.5946     0.2716  0.3729
With add+WkNN    0.4123     0.5802  0.4821
With del+SoftSL  0.5342     0.4815  0.5065
With add+SoftSL  0.4182     0.5679  0.4817
Table 6: Microaveraged results for the category "aging mechanisms by anatomy: cell level, cellular substructures" (AgingPortfolio dataset).

Method           Precision  Recall  F1
SVM only         0.52       0.2167  0.3059
With del+WkNN    1.0        0.0167  0.0328
With add+WkNN    0.4493     0.5167  0.4806
With del+SoftSL  0.75       0.15    0.25
With add+SoftSL  0.3667     0.55    0.44
Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification with different training set modifications.

Method         Precision  Recall  F1
RF only        0.4738     0.2033  0.2467
With del+WkNN  0.4852     0.2507  0.2870
With add+WkNN  0.3058     0.4255  0.3194
We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.
[Figure 2: CROC curves for different categories of AgingPortfolio (SVM classification with WkNN analysis). Each panel plots recall against the transformed false positive rate, with a random-selection baseline. (a) "Experimental techniques: in vivo methods": SVM only, AUC = 0.3919; SVM add+WkNN, AUC = 0.4323; SVM del+WkNN, AUC = 0.4181. (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM only, AUC = 0.6830; SVM add+WkNN, AUC = 0.7484; SVM del+WkNN, AUC = 0.7559. (c) "Aging mechanisms by anatomy: cell level": SVM only, AUC = 0.4843; SVM add+WkNN, AUC = 0.5340; SVM del+WkNN, AUC = 0.5238. (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM only, AUC = 0.4892; SVM add+WkNN, AUC = 0.5603; SVM del+WkNN, AUC = 0.5460.]
The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1-measure in comparison with the raw SVM results (Table 8).
A notable fact: add+WkNN slightly reduced the precision at low p, but in the worst cases, with p = 0.3 and p = 0.4, the precision even rose. The significant improvement of recall in all cases is a good trade-off. Recall also improved significantly when the RF algorithm was combined with this method (Tables 10 and 11).
5. Discussion
Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when for each document in our
[Figure 3: CROC curves for different categories of AgingPortfolio (SVM classification with SoftSL analysis). Each panel plots recall against the transformed false positive rate, with a random-selection baseline. (a) "Experimental techniques: in vivo methods": SVM only, AUC = 0.3904; SVM add+SoftSL, AUC = 0.4747; SVM del+SoftSL, AUC = 0.4498. (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM only, AUC = 0.6818; SVM add+SoftSL, AUC = 0.6374; SVM del+SoftSL, AUC = 0.7018. (c) "Aging mechanisms by anatomy: cell level": SVM only, AUC = 0.4847; SVM add+SoftSL, AUC = 0.5447; SVM del+SoftSL, AUC = 0.5423. (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM only, AUC = 0.4900; SVM add+SoftSL, AUC = 0.5757; SVM del+SoftSL, AUC = 0.5587.]
training set some correct labels may be omitted). The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see, e.g., http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in the PU setting, some of the dataset items are considered fully labeled and the other items are not labeled at all.
To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both approaches are intended to restore the missing document labels using similarity measurements between each given document and other documents with similar labels.
We trained both SVM and RF on several incompletely labeled datasets, with and without pretraining label restoration. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than
[Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification: CROC curves for different categories of AgingPortfolio. Each panel plots recall against the transformed false positive rate, with a random-selection baseline. (a) "Experimental techniques: in vivo methods": SVM add+SoftSL, AUC = 0.4752; SVM add+WkNN, AUC = 0.4311. (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM add+SoftSL, AUC = 0.6380; SVM add+WkNN, AUC = 0.7375. (c) "Aging mechanisms by anatomy: cell level": SVM add+SoftSL, AUC = 0.5454; SVM add+WkNN, AUC = 0.5348. (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM add+SoftSL, AUC = 0.5757; SVM add+WkNN, AUC = 0.5541.]
SoftSL: it has a better F1-measure than SoftSL, and it is also simpler to implement.
Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets in which elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.
One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result
Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p    Precision  Recall  F1
0    0.7176     0.5707  0.6358
0.1  0.7337     0.5233  0.6109
0.2  0.7354     0.4056  0.5229
0.3  0.6260     0.2442  0.3513
0.4  0.3544     0.1191  0.1783
0.5  0          0       0
0.6  0          0       0
Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p    Optimal k  Optimal T  Precision  Recall  F1
0    —          —          —          —       —
0.1  10         0.3        0.6582     0.6847  0.6712
0.2  10         0.25       0.6525     0.6811  0.6665
0.3  10         0.15       0.6357     0.7137  0.6725
0.4  10         0.1        0.6604     0.669   0.6648
0.5  10         0.05       0.6225     0.7259  0.6702
0.6  10         0.05       0.6248     0.7261  0.6716
Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p    Precision  Recall  F1
0    0.6340     0.5087  0.5315
0.1  0.6081     0.4613  0.4959
0.2  0.5693     0.3648  0.4133
0.3  0.5788     0.3240  0.3873
0.4  0.5068     0.2471  0.3094
0.5  0.5194     0.2431  0.3104
0.6  0.4354     0.1621  0.2224
Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p    Optimal k  Optimal T  Precision  Recall  F1
0    —          —          —          —       —
0.1  10         0.3        0.5940     0.7120  0.6224
0.2  10         0.25       0.5626     0.7462  0.6146
0.3  10         0.15       0.5282     0.7992  0.6095
0.4  10         0.10       0.5223     0.7648  0.5940
0.5  10         0.05       0.4672     0.8388  0.5726
0.6  15         0.05       0.4547     0.8695  0.5707
should be reviewed by experts in order to reveal whether it was assigned any extraneous labels. Otherwise the observed scores are guaranteed to be lower than the real ones.
Another way to evaluate classification results and performance is to artificially "deplete" a completely labeled dataset, which is what we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
Moreover, the experimental results presented in Section 4.5.2 reveal a notable effect: when we artificially created an incompletely labeled training set and then applied our label restoration techniques to it, the F1-measure for SVM classification was even greater than for the original, completely labeled set.
Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom, Inc., and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.
References
[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '04), pp. 22–30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348–1356, 2010.
[29] L. Breiman and A. Cutler, "Random Forests" (manual), 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.
3.2. Weighted kNN Algorithm for Finding Missing Labels. In this section we briefly describe the weighted k-nearest neighbour algorithm [17], which is capable of directly solving the multilabel categorization problem.
Let ρ(d, d′) be a distance function between the documents d and d′. The function which assigns document d to class label cl ∈ L is then defined as

S(d, cl) = [ Σ_{d′ ∈ kNN(d)} ρ(d, d′) I(d′, cl) ] / [ Σ_{d′ ∈ kNN(d)} ρ(d, d′) ],

I(d′, cl) = 1 if d′ ∈ cl, and 0 otherwise. (6)

Here kNN(d) is the set of k nearest neighbours of document d in the training set.

We introduce a threshold T such that if S(d, cl) ⩾ T, then d ∈ cl. The algorithm computes S(d, cl) for all possible (d, cl) combinations; every combination with S(d, cl) ⩾ T is considered a missing label and is used to modify the training set.
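The scoring and label-restoration steps can be sketched as follows (a minimal sketch following Eq. (6) literally, where the weights are the printed distance values ρ(d, d′); a similarity-based weighting is a common alternative):

```python
def wknn_scores(d, train, k, dist):
    """S(d, cl) from Eq. (6): a rho-weighted vote among the k nearest
    neighbours of d. train is a list of (vector, label_set) pairs."""
    neigh = sorted(train, key=lambda item: dist(d, item[0]))[:k]
    denom = sum(dist(d, v) for v, _ in neigh) or 1.0
    scores = {}
    for v, labels in neigh:
        for cl in labels:
            scores[cl] = scores.get(cl, 0.0) + dist(d, v)
    return {cl: s / denom for cl, s in scores.items()}

def restore_labels(d, current, train, k, T, dist):
    """Add every label cl with S(d, cl) >= T to the document's label set."""
    return current | {cl for cl, s in wknn_scores(d, train, k, dist).items()
                      if s >= T}
```

Running `restore_labels` over every document in the training set yields the "add"-modified training set on which the final classifier is trained.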
3.3. Support Vector Machine. We use the linear Support Vector Machine as the classification algorithm in this case. Since SVM is fundamentally a binary classifier, the Binary Relevance approach is chosen to address multilabel problems. This method implies training a separate decision rule w_l · x + b_l > 0 for every category l ∈ L. More details are available in our previous work on methods for structuring scientific knowledge [18]. As the implementation of SVM, we used the Weka binding of the LIBLINEAR library [19].
3.4. Random Forest. Random Forest is an ensemble machine learning method that combines tree predictors. In this combination, each tree depends on the values of a random vector sampled independently, and all trees in the forest have the same distribution. More details about this method can be found in [20, 21]. In our study we used the implementation of Random Forest from Weka [22].
4 Experimental Results
In this section we describe how did we perform text classifi-cation experiments We applied the classification algorithmsto initial (unmodified) training sets as well as to the trainingsets modified with ldquoaddrdquo or ldquodelrdquo methods
We shall first discuss the scheme of training set transfor-mation and its usefulness Then we shall present the processof data generation Finally we shall consider the performancemeasures used in the experiments experimental setting andthe results of the parameter estimation and final validation
41 Training Set Modification Training set modification stepis described in detail in Section 3 (methods) However it isimportant to notice that in both cases documents that donot belong to set PC according to the relevance algorithm 120595
(too far from documents labeled to the given category) arestill retained in the training set The reason for this choice is
that we assume that experts are not supposed to make anymistake of type II (when they give a document an odd label)only the absence of proper label is supposed to encounter
The omission of the relevance pair (119889 119860) in the trainingset makes document 119889move into the set of negative examplesfor learning a classifier for the class 119860 This problem altersthe decision rule and negatively affects performance Theproposed set modification scheme is designed to avoid suchproblems during the training session of the classifier
42 Datasets and Data Preprocessing
421 AgingPortfolio Dataset The first experiment was car-ried out using data from the AgingPortfolio informationresource [18 23] The AgingPortfolio system includes adatabase of projects related to aging and is funded bythe National Institutes of Health (NIH) and the EuropeanCommission (ECCORDIS)This database currently containsmore than one million projects Each of its records writtenin English displays information related to the authorrsquos nametitle a brief description of the motivation and researchobjectives the name of the organization and the fundingperiod of the project Some projects contain additionalkeywords with an average description in 100 words In thisexperiment we used only the title a brief description andtag fields
The taxonomy used for document classification contains 335 categories with 6 hierarchical levels. Detailed information about the taxonomy is available on the International Aging Research Portfolio Web site [23]. Biomedical experts manually put the category labels on the document training and test sets. For the test set they used a special labeling procedure: two sets of categories, carefully selected by different experts, were assigned to each document of the test set, and a combination of these categories was then used to achieve a more complete category labeling. The training set, in contrast, was created by different participants, such as the AgingPortfolio resource users, with little control. A visual inspection suggests that the training set contained a significant number of projects with incomplete sets of category labels. The same conclusion is reached by comparing the average number of categories per project: this average is 4.4 in the training set compared to 9.79 in the more thoroughly designed test set. The total number of projects was 3,246 for the training set, 183 for the development set, and 1,000 for the test set.
Throughout our study we used the vector model of text representation. The list of keywords and their combinations from the TerMine [24] system (National Centre for Text Mining, NaCTeM) provided the terms used in our study. The method used in this system combines linguistic and statistical information about candidate terms.
We then analyzed and processed the set of keyword combinations: whenever short keyword combinations were present in longer ones, the latter were split into the shorter ones. The algorithms and the code used for the keyword combination decomposition are available from the AgingPortfolio Web site [8]. According to the results of our previous experiments, the new vectorization method
Computational and Mathematical Methods in Medicine 5
provided a 3% increase in the F1-measure compared to the general "bag-of-words" model. We assigned feature weights according to the TF-IDF rule in the BM25 formulation [25, 26] and then normalized the vectors representing the documents in the Euclidean metric of n-dimensional space.
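The weighting scheme described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline; the helper name `bm25_tfidf` and the parameter defaults k1 = 1.2, b = 0.75 are our assumptions, since the text only states that TF-IDF weights in the BM25 formulation were used and that the resulting document vectors were normalized in the Euclidean metric.

```python
import numpy as np

def bm25_tfidf(counts, k1=1.2, b=0.75):
    """BM25-style term weighting for a document-term count matrix.

    counts: (n_docs, n_terms) array of raw term frequencies.
    Returns L2-normalized document vectors.
    """
    n_docs = counts.shape[0]
    doc_len = counts.sum(axis=1, keepdims=True)            # document lengths
    avg_len = doc_len.mean()
    df = (counts > 0).sum(axis=0)                          # document frequency per term
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)   # BM25 idf, kept positive
    tf = counts * (k1 + 1) / (counts + k1 * (1 - b + b * doc_len / avg_len))
    w = tf * idf
    norms = np.linalg.norm(w, axis=1, keepdims=True)       # Euclidean normalization
    norms[norms == 0] = 1.0
    return w / norms
```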
4.2.2. Yeast Dataset. The Yeast dataset [27] is a biomedical dataset of yeast genes divided into 14 different functional classes. Each instance in the dataset is a gene represented by a vector whose features are the microarray expression levels under various conditions. We used it to reveal whether our methods are suitable for the classification of genetic information as well as textual data.
Let us describe the method for modeling an incomplete dataset. Since the dataset is well annotated and widely used, the objects (genes) have complete sets of category labels. By random deletion of labels from the objects, we modeled incomplete sets of labels in the training set. The parameter p was introduced as the fraction of deleted labels.

We deleted labels under the following conditions:

(1) for each class, p represents the fraction of deleted labels; the distribution of labels by categories is thus preserved after modeling the incomplete sets of labels;

(2) the number of objects in the training set remains the same: at least one label is preserved for each object after the label deletion process.
No preprocessing step was necessary because the data is supplied already prepared as a matrix of numbers [27].
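The label-deletion procedure under the two conditions above can be sketched as follows. This is an illustrative reconstruction, not the authors' code (the function name `delete_labels` is ours): a fraction p of each class's labels is removed, skipping any object whose last remaining label would otherwise be deleted.

```python
import random

def delete_labels(labels, p, seed=0):
    """Simulate incomplete labeling: remove a fraction p of the labels of
    each class, never removing an object's last remaining label.

    labels: list of sets of class ids (one set per object).
    Returns a new list of (possibly reduced) label sets.
    """
    rng = random.Random(seed)
    out = [set(s) for s in labels]
    classes = sorted({c for s in labels for c in s})
    for c in classes:
        members = [i for i, s in enumerate(out) if c in s]
        rng.shuffle(members)
        quota = int(p * len(members))       # labels of class c to delete
        for i in members:
            if quota == 0:
                break
            if len(out[i]) > 1:             # condition (2): keep at least one label
                out[i].discard(c)
                quota -= 1
    return out
```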
4.3. Performance Measurements. The following characteristics were used to evaluate and compare the different classification algorithms:

(i) microaveraged precision, recall, and F1-measure [27];

(ii) CROC curves and their AUC values, computed for selected categories [28]. A CROC curve is a modification of a ROC curve in which the x axis is rescaled as x_new(x). We used the standard exponential scaling x_new(x) = (1 − e^(−αx))/(1 − e^(−α)) with α = 7.
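The CROC axis rescaling and the microaveraged metrics can be written out directly. The sketch below is ours; the function names `croc_x` and `micro_prf` are assumptions, and only the formula x_new(x) = (1 − e^(−αx))/(1 − e^(−α)) with α = 7 comes from the text.

```python
import numpy as np

ALPHA = 7.0  # exaggeration parameter used in the paper

def croc_x(x, alpha=ALPHA):
    """Exponential CROC rescaling of the false positive rate axis."""
    x = np.asarray(x, dtype=float)
    return (1.0 - np.exp(-alpha * x)) / (1.0 - np.exp(-alpha))

def micro_prf(y_true, y_pred):
    """Microaveraged precision, recall, and F1 for 0/1 indicator matrices."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```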
4.4. Experimental Setup. The procedures for selecting the important parameters of the algorithms outlined above are described next.
4.4.1. Parameters for SVM

AgingPortfolio Dataset. The following SVM parameters were tuned for each decision rule:

(i) the cost parameter C, which controls the trade-off between maximization of the separation margin and minimization of the total error [15];

(ii) the parameter b_l, which plays the role of a classification threshold in the decision rule.
We performed parameter tuning by a sliding control method with 5-fold cross-validation according to the following strategy. The C parameter was varied on a grid, followed by tuning of the b_l parameter (for every value of C) for every category. A set PC = {(C, b_l)}, l ∈ L, of parameter pairs was considered optimal if it maximized the F1-measure with averaging over documents. While C has the same value for all categories, the b_l threshold parameter was tuned (for a given value of C) for each class label l ∈ L.
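The inner step of this strategy, picking a per-class threshold b_l on a grid so as to maximize the F1-measure, can be sketched as below. This is an illustrative simplification (it maximizes per-class F1 rather than the document-averaged F1 used in the paper); `tune_thresholds` operates on precomputed SVM decision scores and is not the authors' implementation.

```python
import numpy as np

def tune_thresholds(scores, y_true, grid):
    """For each class l pick the b_l from `grid` that maximizes the
    F1-measure of the rule: predict label l iff score >= b_l.

    scores, y_true: (n_docs, n_classes) arrays; y_true is 0/1.
    Returns the chosen threshold for each class.
    """
    n_classes = scores.shape[1]
    best = np.zeros(n_classes)
    for l in range(n_classes):
        best_f1 = -1.0
        for b in grid:
            pred = (scores[:, l] >= b).astype(int)
            tp = np.sum((pred == 1) & (y_true[:, l] == 1))
            fp = np.sum((pred == 1) & (y_true[:, l] == 0))
            fn = np.sum((pred == 0) & (y_true[:, l] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom else 0.0
            if f1 > best_f1:
                best_f1, best[l] = f1, b
    return best
```

In the full procedure this tuning would sit inside an outer grid over C, with the pair (C, {b_l}) scored by cross-validated F1.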
Yeast Dataset. In the experiments with the Yeast dataset no selection of SVM parameters was performed (i.e., C = 1 and b_l = 0 for all l ∈ L).
4.4.2. Parameters for RF. Thirty decision trees were used to build the Random Forest. The number of inputs considered while splitting a tree node is the square root of the number of features. This configuration follows the recommendations of Leo Breiman, who developed the Random Forest algorithm [29].
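The stated configuration can be expressed, for instance, with scikit-learn (an assumption on our part; the paper does not name the implementation, and [29] points to Breiman's original code):

```python
from sklearn.ensemble import RandomForestClassifier

def make_forest(n_trees=30):
    """Random Forest configured as in the paper: 30 trees, and
    sqrt(n_features) candidate features considered at each split."""
    return RandomForestClassifier(
        n_estimators=n_trees,   # 30 decision trees
        max_features="sqrt",    # square root of the number of features per split
        random_state=0,
    )
```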
4.4.3. AgingPortfolio Dataset. Parameters k and T were tuned on a grid as follows. We prepared a validation set D_dev of 183 documents, in the same way as the test set. The performance metrics for SVM classifiers trained on modified document sets were then evaluated on D_dev. A combination of parameters was considered optimal if it maximized the F1-measure. The parameter μ of the SoftSL algorithm was tuned on a grid keeping k fixed at its optimal value. A fixed category assignment threshold of T = 0.005 was used for the SoftSL training set modification algorithm. We used ν = 0, since all documents in the experiments contained some category labels and regularization was unnecessary.
4.4.4. Yeast Dataset. The method for selecting the parameters for the Yeast dataset is the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) genes randomly selected from the training set. The SoftSL training set modification algorithm was not used for this dataset.
4.5. A Comparison of Methods
4.5.1. AgingPortfolio. We evaluated the general performance on a total of 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.

Also, all of our training set modification methods pay with lower precision for higher recall values. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.

The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly; as a result, a larger number of relevancy tags are added. This is consistent with the higher recall and lower precision values of add+SoftSL and del+SoftSL as compared to the WkNN-based methods in Table 1.
Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without
Table 1: Microaveraged precision, recall, and F1-measure (F1) obtained on the AgingPortfolio dataset with different classification methods.

Method                       Precision   Recall   F1
SVM with fixed parameters    0.8649      0.1983   0.2977
SVM with parameter tuning    0.7727      0.3302   0.4159
SVM, del+WkNN                0.5538      0.4452   0.4439
SVM, add+WkNN                0.4664      0.5684   0.4707
SVM, del+SoftSL              0.2132      0.6914   0.3259
SVM, add+SoftSL              0.3850      0.5639   0.4576
Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

Modification method   Categories per doc   Docs per category
No modification       4.4                  45.09
Add+WkNN              15.15                155.14
Add+SoftSL            16.6                 168.87
training set modifications, SVM with the del+WkNN modification, and SVM with the add+WkNN modification. We can notice that SVM classification with the incorporated training set modification outperforms plain SVM classification.

AUC values calculated for the del+WkNN curves are generally only slightly lower than, and in some cases even exceed, the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.

CROC curves for the add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most of the cases the results are nearly equivalent; sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.

Additional data relevant to the algorithm comparison is presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for different categories obtained with the different methods. These results show the differences in greater relief: it can be seen that add+WkNN outperforms the other methods.

Some metric values for the Random Forest classification experiments are provided in Table 7. The results in Table 7 show that the modification of the training sets improves the classification performance in this case as well.
4.5.2. Yeast Dataset: A Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is related to one or more of the 14 labels (1st FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into training and test sets, so that the training set contains 1,500 examples and the test set contains 917.

The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created 6 different training sets by deleting a varying fraction (p) of document-class pairs. The concrete document-class pairs for deletion were selected randomly.
Table 3: Microaveraged results for category "experimental techniques: in vivo methods" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.6         0.12     0.2
With del+WkNN     0.7879      0.26     0.391
With add+WkNN     0.66        0.33     0.44
With del+SoftSL   0.2140      0.64     0.3208
With add+SoftSL   0.4653      0.47     0.4677
Table 4: Microaveraged results for category "cancer and related diseases: malignant neoplasms, including in situ" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.4444      0.4      0.4211
With del+WkNN     0.1277      0.9      0.2236
With add+WkNN     0.24        0.9      0.3789
With del+SoftSL   0.0549      1.0      0.1040
With add+SoftSL   0.1032      0.975    0.1866
Table 5: Microaveraged results for category "aging mechanisms by anatomy: cell level" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.5167      0.3827   0.4397
With del+WkNN     0.5946      0.2716   0.3729
With add+WkNN     0.4123      0.5802   0.4821
With del+SoftSL   0.5342      0.4815   0.5065
With add+SoftSL   0.4182      0.5679   0.4817
Table 6: Microaveraged results for category "aging mechanisms by anatomy: cell level, cellular substructures" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.52        0.2167   0.3059
With del+WkNN     1.0         0.0167   0.0328
With add+WkNN     0.4493      0.5167   0.4806
With del+SoftSL   0.75        0.15     0.25
With add+SoftSL   0.3667      0.55     0.44
Table 7: Microaveraged results for the AgingPortfolio dataset obtained with Random Forest classification with different training set modifications.

Method            Precision   Recall   F1
RF only           0.4738      0.2033   0.2467
With del+WkNN     0.4852      0.2507   0.2870
With add+WkNN     0.3058      0.4255   0.3194
We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by the SVM or RF on the original, nonmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.
[Figure 2: CROC curves for different categories of AgingPortfolio (SVM classification with WkNN analysis). Axes: transformed false positive rate versus recall; a random-selection baseline is shown in each panel. AUC values: (a) "Experimental techniques: in vivo methods": SVM only 0.3919, SVM add+WkNN 0.4323, SVM del+WkNN 0.4181; (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM only 0.6830, SVM add+WkNN 0.7484, SVM del+WkNN 0.7559; (c) "Aging mechanisms by anatomy: cell level": SVM only 0.4843, SVM add+WkNN 0.5340, SVM del+WkNN 0.5238; (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM only 0.4892, SVM add+WkNN 0.5603, SVM del+WkNN 0.5460.]
The results of SVM classification with the add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1 measure in comparison with the raw SVM results (Table 8).

A notable fact: add+WkNN slightly reduced the precision for low p, but in the worst cases, with p = 0.3 and p = 0.4, the precision even rose. In any case, the significant improvement of recall in all cases makes this a good trade-off. Recall also significantly improved when the RF algorithm was used with this method (Tables 10 and 11).
5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when for each document in our
[Figure 3: CROC curves for different categories of AgingPortfolio (SVM classification with SoftSL analysis). Axes: transformed false positive rate versus recall; a random-selection baseline is shown in each panel. AUC values: (a) "Experimental techniques: in vivo methods": SVM only 0.3904, SVM add+SoftSL 0.4747, SVM del+SoftSL 0.4498; (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM only 0.6818, SVM add+SoftSL 0.6374, SVM del+SoftSL 0.7018; (c) "Aging mechanisms by anatomy: cell level": SVM only 0.4847, SVM add+SoftSL 0.5447, SVM del+SoftSL 0.5423; (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM only 0.4900, SVM add+SoftSL 0.5757, SVM del+SoftSL 0.5587.]
training set some correct labels may be omitted). The case of an incompletely labeled dataset strikingly differs from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in the case of PU training, some of the dataset items are considered fully labeled and the other items are not labeled at all.
To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both of these approaches are intended to restore the missing document labels using different similarity measurements between each given document and other documents with similar labels.
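A minimal sketch of the "add" variant with WkNN label restoration, as we understand it from this description: each document receives similarity-weighted label votes from its k nearest neighbours, and labels whose normalized vote exceeds a threshold T are added. The function name `wknn_add` and the exact vote normalization are our assumptions, not the authors' implementation.

```python
import numpy as np

def wknn_add(X, labels, k=10, T=0.3):
    """"Add" training set modification via weighted k-NN label restoration.

    X: (n, d) matrix of L2-normalised document vectors (so X @ X.T gives
       cosine similarities).
    labels: list of sets of class ids, possibly incomplete.
    A label is added to document i when its k nearest neighbours vote for
    it with normalized weight >= T.
    """
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)         # a document does not vote for itself
    restored = [set(s) for s in labels]
    for i in range(len(labels)):
        nn = np.argsort(sims[i])[-k:]       # k nearest neighbours
        weights = sims[i, nn]
        total = weights.sum()
        votes = {}
        for j, w in zip(nn, weights):
            for c in labels[j]:
                votes[c] = votes.get(c, 0.0) + w
        for c, v in votes.items():
            if total > 0 and v / total >= T:
                restored[i].add(c)          # restore a presumably missing label
    return restored
```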
We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than
[Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification; CROC curves for different categories of AgingPortfolio. Axes: transformed false positive rate versus recall; a random-selection baseline is shown in each panel. AUC values: (a) "Experimental techniques: in vivo methods": SVM add+SoftSL 0.4752, SVM add+WkNN 0.4311; (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM add+SoftSL 0.6380, SVM add+WkNN 0.7375; (c) "Aging mechanisms by anatomy: cell level": SVM add+SoftSL 0.5454, SVM add+WkNN 0.5348; (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM add+SoftSL 0.5757, SVM add+WkNN 0.5541.]
SoftSL: it yields a higher F1-measure, and it is also simpler to implement.
Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.

One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result
Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision   Recall   F1
0     0.7176      0.5707   0.6358
0.1   0.7337      0.5233   0.6109
0.2   0.7354      0.4056   0.5229
0.3   0.6260      0.2442   0.3513
0.4   0.3544      0.1191   0.1783
0.5   0           0        0
0.6   0           0        0
Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k   Optimal T   Precision   Recall   F1
0     —           —           —           —        —
0.1   10          0.3         0.6582      0.6847   0.6712
0.2   10          0.25        0.6525      0.6811   0.6665
0.3   10          0.15        0.6357      0.7137   0.6725
0.4   10          0.1         0.6604      0.669    0.6648
0.5   10          0.05        0.6225      0.7259   0.6702
0.6   10          0.05        0.6248      0.7261   0.6716
Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision   Recall   F1
0     0.6340      0.5087   0.5315
0.1   0.6081      0.4613   0.4959
0.2   0.5693      0.3648   0.4133
0.3   0.5788      0.3240   0.3873
0.4   0.5068      0.2471   0.3094
0.5   0.5194      0.2431   0.3104
0.6   0.4354      0.1621   0.2224
Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k   Optimal T   Precision   Recall   F1
0     —           —           —           —        —
0.1   10          0.3         0.5940      0.7120   0.6224
0.2   10          0.25        0.5626      0.7462   0.6146
0.3   10          0.15        0.5282      0.7992   0.6095
0.4   10          0.10        0.5223      0.7648   0.5940
0.5   10          0.05        0.4672      0.8388   0.5726
0.6   15          0.05        0.4547      0.8695   0.5707
should be reviewed by the experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.
Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset. We did this with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.

Moreover, the experimental results presented in Section 4.5.2 reveal a notable aspect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.

Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom Inc. and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.
References
[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.

[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.

[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.

[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.

[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.

[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.

[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.

[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.

[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.

[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.

[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.

[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.

[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.

[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.

[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.

[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.

[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.

[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.

[23] International Aging Research Portfolio, http://agingportfolio.org.

[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.

[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.

[26] F. Sebastiani, "Text categorization," Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.

[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.

[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348–1356, 2010.

[29] L. Breiman and A. Cutler, Random Forests Manual, 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.

[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.
Computational and Mathematical Methods in Medicine 5
provided a 3 increase by the 1198651-measure compared to thegeneral ldquobag-of-wordsrdquo model We assigned feature weightsaccording to the TFIDF rule in the BM25 formulation [25 26]and then normalized vectors representing the documents inthe Euclidean metric of 119899-dimensional space
422 Yeast Dataset Yeast dataset [27] is a biomedical datasetof Yeast genes divided into 14 different functional classesEach instance in the dataset is a gene represented by a vectorwhose features are the microarray expression levels undervarious conditions We used it to reveal if our methods aresuitable for the classification of genetic information as well asfor textual data
Let us describe the methods for an incomplete datasetmodeling Since the dataset is well annotated andwidely usedthe objects (genes) have complete sets of category labels Byrandom deletion of labels from documents wemade amodelof the incomplete sets of labels in the training set Parameter119901 was introduced as the fraction of deleted labels
We deleted labels using the following conditions
(1) for each class 119901 represents the fraction of deletedlabelsWe keep the distribution of labels by categoriesafter modeling the incomplete sets of labels
(2) the number of objects in the training set remains thesame At least one label is preserved after the labeldeletion process
No preprocessing step was necessary because the data issupplied already prepared as a matrix of numbers [27]
43 Performance Measurements The following characteris-tics were used to evaluate and compare different classificationalgorithms
(i) Microaveraged precision recall and1198651-measure [27](ii) CROC curves and their AUC values computed for
selected categories [28] CROC curve is a modifi-cation of a ROC curve where 119909 axis is rescaledas 119909new(119909) We used a standard exponent scaling119909new(119909) = (1 minus 119890
minus120572119909)(1 minus 119890
minus120572) with 120572 = 7
44 Experimental Setup Theprocedures for selecting impor-tant parameters of the algorithms outlined are described next
441 Parameters for SVM
AgingPortfolio Dataset The following SVM parameters weretuned for each decision rule
(i) cost parameter 119862 controls a trade-off between maxi-mization of the separation margin and minimizationof the total error [15]
(ii) parameter 119887119897 that plays the role of a classification thre-shold in the decision rule
We performed a parameter tuning by using a slidingcontrol method with 5-fold cross-validation according to
the following strategyThe 119862-parameter was varied on a gridfollowed by 119887119897-parameter (for every value of 119862) tuning forevery category A set PC = (119862 119887119897)119897isin119871 of parameter pairswas considered optimal if it maximized the 1198651-measure withaveraging over documentsWhile119862 has the same value for allcategories the 119887119897 threshold parameter was tuned (for a givenvalue of 119862) for each class label 119897 isin 119871
Yeast Dataset In experiments with the Yeast dataset theselection of SVM parameters was not performed (ie 119862 = 1119887119897 = 0 for all values of 119897 isin 119871)
442 Parameters for RF The 30 solution trees were used tobuild the Random Forest The number of inputs to considerwhile splitting a tree node is the square root of featuresrsquonumber The procedure was done according to the [29] LeoBreiman who developed the Random Forest Algorithm
443 AgingPortfolio Dataset Parameters 119896 and119879were tunedon a grid as follows We prepared a validation set 119863devof 183 documents as in the case of the test set The per-formance metrics for SVM classifiers trained on modifieddocument sets were then evaluated on 119863dev A combinationof parameters was considered optimal if it maximized the 1198651-measure Parameter 120583 of the SoftSL algorithm was tuned ona grid keeping 119896 fixed at its optimal value A fixed categoryassignment threshold of 119879 = 0005 is used for the SoftSLtraining set modification algorithm We used ] = 0 sinceall documents in the experiments contained some categorylabels and regularization was unnecessary
4.4.4. Yeast Dataset. The method for selecting the parameters for the Yeast dataset is the same as for AgingPortfolio. D_dev was composed of 300 (20% of 1,500) randomly selected training genes. The SoftSL training set modification algorithm was not used for this dataset.
4.5. A Comparison of Methods
4.5.1. AgingPortfolio. We evaluated the overall performance on the 62 categories that contained at least 30 documents in the training set and at least 7 documents in the test set. The results for precision, recall, and F1-measure are presented in Table 1. It is evident that parameter tuning significantly boosts both precision and recall.
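The micro-averaged metrics reported throughout the tables can be computed over all document-label pairs as follows (a sketch under our assumptions; Y_true and Y_pred are document-by-label indicator matrices):

```python
import numpy as np

def micro_prf(Y_true, Y_pred):
    """Micro-averaged precision, recall, and F1 over all
    document-label pairs of two indicator matrices."""
    tp = np.logical_and(Y_true == 1, Y_pred == 1).sum()  # correct labels
    fp = np.logical_and(Y_true == 0, Y_pred == 1).sum()  # spurious labels
    fn = np.logical_and(Y_true == 1, Y_pred == 0).sum()  # missed labels
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Micro-averaging pools the counts before dividing, so frequent categories dominate the score; this differs from the document-averaged F1 used for parameter tuning.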
Also, all of our training set modification methods trade lower precision for higher recall. If we consider the F1-measure as a general quality function, such a trade-off looks quite reasonable, especially for the add+WkNN method.
The average numbers of training set categories per document and documents per category are listed in Table 2. As we can see, the SoftSL approach alters the training set more significantly; as a result, a larger number of relevance tags are added. This is consistent with the higher recall and lower precision of add+SoftSL and del+SoftSL, as compared to the WkNN-based methods in Table 1.
6 Computational and Mathematical Methods in Medicine

Table 1: Micro-averaged precision, recall, and F1-measure obtained on the AgingPortfolio dataset with different classification methods.

Method                       Precision   Recall   F1
SVM with fixed parameters    0.8649      0.1983   0.2977
SVM with parameter tuning    0.7727      0.3302   0.4159
SVM del+WkNN                 0.5538      0.4452   0.4439
SVM add+WkNN                 0.4664      0.5684   0.4707
SVM del+SoftSL               0.2132      0.6914   0.3259
SVM add+SoftSL               0.3850      0.5639   0.4576

Table 2: Average number of categories per document and documents per category in the AgingPortfolio training set before and after modification.

Modification method   Categories per doc   Docs per category
No modification       4.4                  45.09
Add+WkNN              15.15                155.14
Add+SoftSL            16.6                 168.87

Figure 2 compares CROC curves for representative categories of the AgingPortfolio dataset, computed for SVM without training set modification, SVM with del+WkNN modification, and SVM with add+WkNN modification. We can notice that SVM classification with an incorporated training set modification step outperforms plain SVM classification.
AUC values calculated for the del+WkNN curves are generally only slightly lower than, and in some cases even exceed, the corresponding values for add+WkNN. A similar situation can be seen in Figure 3, where CROC curves are compared for SVM with del+SoftSL and add+SoftSL.
CROC curves for add+WkNN and add+SoftSL SVM classifiers are compared in Figure 4. It is difficult to determine a "winner" here: in most cases the results are nearly equivalent; sometimes add+WkNN looks slightly worse than add+SoftSL, and sometimes add+WkNN has a clear advantage over add+SoftSL.
Additional data relevant to the algorithm comparison are presented in Tables 3, 4, 5, and 6, which give precision, recall, and F1-measure for individual categories under the different methods. These results are more revealing: it can be seen that add+WkNN outperforms the other methods.
Some values of the metrics from the Random Forest classification experiments are provided in Table 7. The results in Table 7 show that the modification of the training sets improves the classification performance in this case as well.
4.5.2. Yeast Dataset: The Comparison of the Experimental Results. The dataset consists of 2,417 examples. Each object is associated with one or more of the 14 labels (the first FunCat level), with 4.2 labels per example on average. The standard method [30] is used to separate the objects into training and test sets, such that the training set contains 1,500 examples and the test set contains 917.
The method for modeling the incomplete dataset and the comparison is described above in Section 4.2.2. We created 6 different training sets by deleting a varying fraction (p) of document-class pairs; the specific document-class pairs for deletion were selected at random.
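This depletion step can be sketched as follows (a re-implementation under our assumptions, not the authors' code; the function name is ours):

```python
import numpy as np

def delete_label_fraction(Y, p, seed=0):
    """Simulate incomplete labeling: remove a randomly chosen fraction p
    of the positive document-class pairs from the indicator matrix Y."""
    rng = np.random.default_rng(seed)
    pairs = np.argwhere(Y == 1)              # all positive (doc, class) pairs
    n_del = int(round(p * len(pairs)))
    drop = pairs[rng.choice(len(pairs), size=n_del, replace=False)]
    Y_incomplete = Y.copy()
    Y_incomplete[drop[:, 0], drop[:, 1]] = 0  # delete the chosen labels
    return Y_incomplete
```

Because sampling is over document-class pairs rather than whole documents, most documents keep at least part of their labels, which is exactly the incompletely labeled setting studied here (as opposed to PU learning).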
Table 3: Micro-averaged results for category "experimental techniques in vivo methods" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.6         0.12     0.2
With del+WkNN     0.7879      0.26     0.391
With add+WkNN     0.66        0.33     0.44
With del+SoftSL   0.2140      0.64     0.3208
With add+SoftSL   0.4653      0.47     0.4677

Table 4: Micro-averaged results for category "cancer and related diseases malignant neoplasms including in situ" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.4444      0.4      0.4211
With del+WkNN     0.1277      0.9      0.2236
With add+WkNN     0.24        0.9      0.3789
With del+SoftSL   0.0549      1.0      0.1040
With add+SoftSL   0.1032      0.975    0.1866

Table 5: Micro-averaged results for category "aging mechanisms by anatomy cell level" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.5167      0.3827   0.4397
With del+WkNN     0.5946      0.2716   0.3729
With add+WkNN     0.4123      0.5802   0.4821
With del+SoftSL   0.5342      0.4815   0.5065
With add+SoftSL   0.4182      0.5679   0.4817

Table 6: Micro-averaged results for category "aging mechanisms by anatomy cell level cellular substructures" (AgingPortfolio dataset).

Method            Precision   Recall   F1
SVM only          0.52        0.2167   0.3059
With del+WkNN     1.0         0.0167   0.0328
With add+WkNN     0.4493      0.5167   0.4806
With del+SoftSL   0.75        0.15     0.25
With add+SoftSL   0.3667      0.55     0.44

Table 7: Micro-averaged results for the AgingPortfolio dataset obtained with Random Forest classification under different training set modifications.

Method            Precision   Recall   F1
RF only           0.4738      0.2033   0.2467
With del+WkNN     0.4852      0.2507   0.2870
With add+WkNN     0.3058      0.4255   0.3194
We performed classification experiments before and after modifying the training set. To compare the methods, we also included the classification results obtained by SVM or RF on the original, unmodified training set with the complete set of labels (p = 0). Here p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} is used.
Figure 2: CROC curves for different categories of AgingPortfolio (SVM classification with WkNN analysis). Each panel plots recall against the transformed false positive rate, with a random-selection baseline. (a) "Experimental techniques in vivo methods" category (WkNN): SVM only, AUC = 0.3919; SVM add+WkNN, AUC = 0.4323; SVM del+WkNN, AUC = 0.4181. (b) "Cancer and related diseases malignant neoplasms (including in situ)" category (WkNN): SVM only, AUC = 0.6830; SVM add+WkNN, AUC = 0.7484; SVM del+WkNN, AUC = 0.7559. (c) "Aging mechanisms by anatomy cell level" category (WkNN): SVM only, AUC = 0.4843; SVM add+WkNN, AUC = 0.5340; SVM del+WkNN, AUC = 0.5238. (d) "Aging mechanisms by anatomy cell level cellular substructures" category (WkNN): SVM only, AUC = 0.4892; SVM add+WkNN, AUC = 0.5603; SVM del+WkNN, AUC = 0.5460.
The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1-measure in comparison with the raw SVM results (Table 8).
A notable fact: add+WkNN slightly reduced precision at low p, but in the worst cases, with p = 0.3 and p = 0.4, precision even rose. In any case, the significant improvement of recall at every p is a good trade-off. Recall also improved significantly when the RF algorithm was used together with this method (Tables 10 and 11).
Figure 3: CROC curves for different categories of AgingPortfolio (SVM classification with SoftSL analysis). Each panel plots recall against the transformed false positive rate, with a random-selection baseline. (a) "Experimental techniques in vivo methods" category (SoftSL): SVM only, AUC = 0.3904; SVM add+SoftSL, AUC = 0.4747; SVM del+SoftSL, AUC = 0.4498. (b) "Cancer and related diseases malignant neoplasms (including in situ)" category (SoftSL): SVM only, AUC = 0.6818; SVM add+SoftSL, AUC = 0.6374; SVM del+SoftSL, AUC = 0.7018. (c) "Aging mechanisms by anatomy cell level" category (SoftSL): SVM only, AUC = 0.4847; SVM add+SoftSL, AUC = 0.5447; SVM del+SoftSL, AUC = 0.5423. (d) "Aging mechanisms by anatomy cell level cellular substructures" category (SoftSL): SVM only, AUC = 0.4900; SVM add+SoftSL, AUC = 0.5757; SVM del+SoftSL, AUC = 0.5587.

5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when, for each document in our training set, some correct labels may be omitted). The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in PU learning, some of the dataset items are considered fully labeled, and the other items are not labeled at all.
To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both of these approaches are intended to restore missing document labels using similarity measurements between each given document and other documents with similar labels.
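As an illustration only (the actual WkNN procedure is specified earlier in the paper), an "add"-mode restoration step of this kind might look like the following sketch. The cosine-similarity choice, the voting rule, and all names here are our assumptions:

```python
import numpy as np

def restore_labels_wknn(X, Y, k=10, T=0.3):
    """Hypothetical WkNN-style label restoration ("add" mode): give an item
    every label whose similarity-weighted vote among its k nearest
    neighbours reaches threshold T. X: feature matrix with nonzero rows;
    Y: integer indicator matrix of the (possibly incomplete) labels."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = unit @ unit.T                      # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)           # exclude each item itself
    Y_new = Y.copy()
    for i in range(X.shape[0]):
        nn = np.argsort(S[i])[-k:]         # indices of k nearest neighbours
        w = S[i, nn]
        vote = (w @ Y[nn]) / w.sum()       # weighted label frequency
        Y_new[i] |= (vote >= T).astype(Y.dtype)  # add, never remove, labels
    return Y_new
```

Labels already present are kept, so the procedure can only enrich an incompletely labeled set; the "del" variants discussed above would instead remove restored items from the negative examples.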
We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than SoftSL: it has a better F1-measure than SoftSL and, in addition, it is simpler to implement.

Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification: CROC curves for different categories of AgingPortfolio. Each panel plots recall against the transformed false positive rate, with a random-selection baseline. (a) "Experimental techniques in vivo methods" category: SVM add+SoftSL, AUC = 0.4752; SVM add+WkNN, AUC = 0.4311. (b) "Cancer and related diseases malignant neoplasms (including in situ)" category: SVM add+SoftSL, AUC = 0.6380; SVM add+WkNN, AUC = 0.7375. (c) "Aging mechanisms by anatomy cell level" category: SVM add+SoftSL, AUC = 0.5454; SVM add+WkNN, AUC = 0.5348. (d) "Aging mechanisms by anatomy cell level cellular substructures" category: SVM add+SoftSL, AUC = 0.5757; SVM add+WkNN, AUC = 0.5541.
Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.
Table 8: Micro-averaged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision   Recall   F1
0     0.7176      0.5707   0.6358
0.1   0.7337      0.5233   0.6109
0.2   0.7354      0.4056   0.5229
0.3   0.6260      0.2442   0.3513
0.4   0.3544      0.1191   0.1783
0.5   0           0        0
0.6   0           0        0

Table 9: Micro-averaged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k   Optimal T   Precision   Recall   F1
0     —           —           —           —        —
0.1   10          0.3         0.6582      0.6847   0.6712
0.2   10          0.25        0.6525      0.6811   0.6665
0.3   10          0.15        0.6357      0.7137   0.6725
0.4   10          0.1         0.6604      0.669    0.6648
0.5   10          0.05        0.6225      0.7259   0.6702
0.6   10          0.05        0.6248      0.7261   0.6716

Table 10: Micro-averaged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision   Recall   F1
0     0.6340      0.5087   0.5315
0.1   0.6081      0.4613   0.4959
0.2   0.5693      0.3648   0.4133
0.3   0.5788      0.3240   0.3873
0.4   0.5068      0.2471   0.3094
0.5   0.5194      0.2431   0.3104
0.6   0.4354      0.1621   0.2224

Table 11: Micro-averaged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p), with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k   Optimal T   Precision   Recall   F1
0     —           —           —           —        —
0.1   10          0.3         0.5940      0.7120   0.6224
0.2   10          0.25        0.5626      0.7462   0.6146
0.3   10          0.15        0.5282      0.7992   0.6095
0.4   10          0.10        0.5223      0.7648   0.5940
0.5   10          0.05        0.4672      0.8388   0.5726
0.6   15          0.05        0.4547      0.8695   0.5707

One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: every document classification result should be reviewed by experts in order to reveal whether any odd labels were assigned. Otherwise, the observed results are guaranteed to be lower than the real ones.
Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset, as we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
Moreover, the experimental results presented in Section 4.5.2 demonstrate a notable effect: when we artificially created an incompletely labeled training set and then applied our label restoration techniques to it, the F1-measure for SVM classification was even greater than for the original, completely labeled set.
Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts addressing the problem of incompletely labeled training sets.
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom, Inc., and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.
References
[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[6] G. Yizhang, "Methods for pattern classification," in New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117–132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text REtrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, pp. 1348–1356, 2010.
[29] L. Breiman and A. Cutler, "Random Forests," 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems, vol. 14, 2001.
Moreover the experimental results presented inSection 452 prove a notable aspect When we artificiallymade an incompletely labeled training set and then usedour label restoration techniques on it the 1198651 measure forSVM classification was even greater then for the originalcompletely-labeled set
Hence we are confident that combining the WkNNtraining set modification procedure with the SVM or RFalgorithms will be practically useful to scientists and analystswhen addressing the problem of incompletely labeled train-ing sets
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr Charles R Cantor (Chief ScientificOfficer at Sequenom Inc and Professor at Boston University)for his contribution to this work The authors would like tothank the UMA Foundation for its help in preparation ofthe paper They would like to thank the reviewers for manyconstructive and meaningful comments and suggestions thathelped improve the paper and laid the foundation for furtherresearch
References
[1] S M Hosseini M J Abdi and M Rezghi ldquoA novel weightedsupport vector machine based on particle swarm optimizationfor gene selection and tumor classificationrdquo Computational andMathematicalMethods inMedicine vol 2012 Article ID 3206987 pages 2012
[2] S C Li J Liu and X Luo ldquoIterative reweighted nonintegernorm regularizing svm for gene expression data classificationrdquoComputational and Mathematical Methods in Medicine vol2013 Article ID 768404 10 pages 2013
[3] Z Gao Y Su A Liu T Hao and Z Yang ldquoNonnegative mixed-norm convex optimization for mitotic cell detection in phasecontrast microscopyrdquo Computational and Mathematical Meth-ods in Medicine vol 2013 Article ID 176272 10 pages 2013
[4] I IKatakis G Tsoumakas and I VlahavasMining Multi-LabelData in Data Mining and Knowledge Discovery HandbookSpringer US 2010
[5] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquo Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003
[6] G Yizhang Methods for Pattern Classification New Advancesin Machine Learning 2010
[7] E Leopold and J Kindermann ldquoText categorization with sup-port vector machines How to represent texts in input spacerdquoMachine Learning vol 46 no 1ndash3 pp 423ndash444 2002
Computational and Mathematical Methods in Medicine 11
[8] International Aging Research Portfolio httpwwwagingport-folioorgwikidokuphpid=datapage
[9] A Esuli and F Sebastiani ldquoTraining data cleaning for textclassificationrdquo inProceedings of the 2nd International Conferenceon the Theory of Information Retrieval (ICTIR rsquo09) pp 29ndash41Cambridge UK 2009
[10] J Tang Z Chen A W Fu and D W Cheung ldquoCapabilitiesof outlier detection schemes in large datasets framework andmethodologiesrdquoKnowledge and Information Systems vol 11 no1 pp 45ndash84 2007
[11] N G Zagoruiko I A Borisova V V Dyubanov and O AKutnenko ldquoMethods of recognition based on the function ofrival similarityrdquo Pattern Recognition and Image Analysis vol 18no 1 pp 1ndash6 2008
[12] L H Lee C H Wan T F Yong and H M Kok ldquoA review ofnearest neighbor-support vector machines hybrid classificationmodelsrdquo Journal of Applied Sciences vol 10 no 17 pp 1841ndash18582010
[13] A Subramanya and J Bilmes ldquoSoft-supervised learning for textclassificationrdquo in Proceedings of the Conference on EmpiricalMethods in Natural Language Processing pp 1090ndash1099 Octo-ber 2008
[14] A Subramanya and J Bilmes ldquoEntropic graph regularization innon-parametric semi-supervised classificationrdquo in Proceedingsof the 23rd Annual Conference on Neural Information ProcessingSystems (NIPS rsquo09) pp 1803ndash1811 December 2009
[15] T Joachims ldquoText categorizationwith support vectormachineslearningwithmany relevant featuresrdquo inConference onMachineLearning (ECML rsquo98) pp 137ndash142 Springer Berlin Germany1998
[16] P Hart and T Cover ldquoNearest neighbor pattern classificationrdquoIEEE Transactions on Information Theory vol 13 no 1 pp 21ndash27 1967
[17] P Raghavan C D Manning and H Schuetze An Introductionto Information Retrieval Cambridge University Press Cam-bridge UK 2009
[18] A Zhavoronkov and C R Cantor ldquoMethods for structuringscientific knowledge frommany areas related to aging researchrdquoPLoS ONE vol 6 no 7 Article ID e22597 2011
[19] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLIBLINEAR a library for large linear classificationrdquo Journal ofMachine Learning Research vol 9 pp 1871ndash1874 2008
[20] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001
[21] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[22] MHall E Frank G Holmes B P fahringer P Reutemann andI HWitten ldquoThe weka data mining software an updaterdquo ACMSIGKDD Explorations Newsletter vol 11 pp 10ndash18 2009
[23] International aging research portfolio httpagingportfolioorg
[24] K Frantzi S Ananiadou andHMima ldquoAutomatic recognitionof multiword termsrdquo International Journal of Digital Librariesvol 3 pp 117ndash132 2000
[25] S E Robertson S Walker S Jones et al ldquoOkapi at trec-3rdquo inProceedings of the 3rd Text Retrieval Conference pp 109ndash126Gaitherburg Md USA 1994
[26] F Sebastiani ldquoText categorizationrdquo Text Mining and Its Applica-tions vol 34 pp 109ndash129 2005
[27] S Godbole and S Sarawagi ldquoDiscriminativemethods formulti-labeled classificationrdquo in Proceedings of the 8th Pacific-AsiaConference (PAKDD rsquo04) pp 22ndash30 May 2004
[28] S J Swamidass C-A Azencott K Daily and P Baldi ldquoACROCstronger than ROC measuring visualizing and optimizingearly retrievalrdquo Bioinformatics vol 26 no 10 Article ID btq140pp 1348ndash1356 2010
[29] Random forests leo breiman and adele cutler 2013 httpwwwstatberkeleyedusimbreimanRandomForestscc manualhtm
[30] A Elisseeff and J Weston Advances in Neural InformationProcessing Systems vol 14 2001
Computational and Mathematical Methods in Medicine 7
[Figure 2: CROC curves for different categories of AgingPortfolio (SVM classification with WkNN analysis). Each panel plots recall against the transformed false positive rate for four runs: SVM only, SVM add+WkNN, SVM del+WkNN, and random selection. (a) "Experimental techniques: in vivo methods": SVM only AUC = 0.3919, SVM add+WkNN AUC = 0.4323, SVM del+WkNN AUC = 0.4181. (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM only AUC = 0.6830, SVM add+WkNN AUC = 0.7484, SVM del+WkNN AUC = 0.7559. (c) "Aging mechanisms by anatomy: cell level": SVM only AUC = 0.4843, SVM add+WkNN AUC = 0.5340, SVM del+WkNN AUC = 0.5238. (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM only AUC = 0.4892, SVM add+WkNN AUC = 0.5603, SVM del+WkNN AUC = 0.5460.]
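The "transformed false positive rate" on the x-axis refers to the CROC construction of Swamidass et al. [28], which magnifies the early-retrieval region of a ROC curve by rescaling the false-positive axis. A sketch of the exponential transform (the magnification parameter alpha = 7.0 below is an illustrative default, not a setting reported in this paper):

```python
import math

def croc_x(fpr, alpha=7.0):
    """Exponential CROC transform: maps a false positive rate in [0, 1] to
    the magnified x-axis, expanding the early-retrieval region."""
    return (1.0 - math.exp(-alpha * fpr)) / (1.0 - math.exp(-alpha))
```

With alpha = 7, a false positive rate of 0.1 maps to roughly 0.50, so the low-FPR region that matters for retrieval occupies half of the plot.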
The results of SVM classification with add+WkNN training set modification, presented in Table 9, show that this modification significantly improves the F1 measure in comparison with the raw SVM results (Table 8).

A notable fact: add+WkNN slightly reduced precision for low values of p, but in the worst cases, p = 0.3 and p = 0.4, precision even rose. In any case, the significant improvement of recall in all cases is a good trade-off. Recall also improved significantly when the RF algorithm was used in addition to this method (Tables 10 and 11).
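The optimal k and T values reported in Tables 9 and 11 are obtained via grid search. A minimal sketch, assuming the caller supplies an evaluation function that returns microaveraged F1 on a validation split (the candidate grids below are illustrative, chosen to cover the values the tables report):

```python
def grid_search(evaluate_f1, ks=(5, 10, 15), ts=(0.05, 0.1, 0.15, 0.25, 0.3)):
    """Exhaustively try every (k, T) pair and keep the one with the best
    validation score, as returned by the caller-supplied evaluate_f1(k, T)."""
    best_score, best_k, best_t = max(
        (evaluate_f1(k, t), k, t) for k in ks for t in ts
    )
    return best_k, best_t
```

Since the grid is small (a handful of k values times a handful of thresholds), the exhaustive search is cheap relative to retraining the downstream classifier.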
5. Discussion

Our experiments have shown that the direct application of SVM or RF gives unsatisfactory results for incompletely labeled datasets (i.e., when for each document in our
[Figure 3: CROC curves for different categories of AgingPortfolio (SVM classification with SoftSL analysis). Each panel plots recall against the transformed false positive rate for four runs: SVM only, SVM add+SoftSL, SVM del+SoftSL, and random selection. (a) "Experimental techniques: in vivo methods": SVM only AUC = 0.3904, SVM add+SoftSL AUC = 0.4747, SVM del+SoftSL AUC = 0.4498. (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM only AUC = 0.6818, SVM add+SoftSL AUC = 0.6374, SVM del+SoftSL AUC = 0.7018. (c) "Aging mechanisms by anatomy: cell level": SVM only AUC = 0.4847, SVM add+SoftSL AUC = 0.5447, SVM del+SoftSL AUC = 0.5423. (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM only AUC = 0.4900, SVM add+SoftSL AUC = 0.5757, SVM del+SoftSL AUC = 0.5587.]
training set some correct labels may be omitted). The case of an incompletely labeled dataset differs strikingly from the PU-learning approach (learning with only positive and unlabeled data; see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9914 and http://dl.acm.org/citation.cfm?id=1401920): in the PU setting, some of the dataset items are considered fully labeled and the other items are not labeled at all.
To overcome the problem, we proposed two different procedures for training set modification: WkNN and SoftSL. Both of these approaches are intended to restore the missing document labels using different similarity measurements between each given document and other documents with similar labels.
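As a rough illustration of the restoration step (this is our sketch, not the authors' exact implementation; cosine similarity over sparse feature vectors and the similarity-weighted voting rule are assumptions, kept consistent with the neighborhood size k and threshold T reported in Tables 9 and 11):

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors given as {feature: weight} dicts.
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def restore_labels(docs, labels, k=10, T=0.3):
    """WkNN-style restoration sketch: for each document, add every label whose
    similarity-weighted vote among its k nearest neighbors reaches threshold T."""
    restored = []
    for i, d in enumerate(docs):
        neighbors = sorted(
            ((cosine(d, other), j) for j, other in enumerate(docs) if j != i),
            reverse=True,
        )[:k]
        total = sum(sim for sim, _ in neighbors) or 1.0
        votes = {}
        for sim, j in neighbors:
            for lab in labels[j]:
                votes[lab] = votes.get(lab, 0.0) + sim
        added = {lab for lab, v in votes.items() if v / total >= T}
        restored.append(set(labels[i]) | added)  # only add labels, never delete
    return restored
```

With k = 10 and T = 0.3 this mirrors the best parameters found for p = 0.1 in Table 9; lower thresholds restore labels more aggressively, which matches the trend toward small T as more labels are deleted.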
We trained both SVM and RF on several incompletely labeled datasets, with pretraining label restoration and without it. According to our experimental results, the label restoration methods were able to improve the performance of both SVM and RF. In our opinion, WkNN works better than
[Figure 4: Comparison of WkNN and SoftSL analysis with SVM classification: CROC curves for different categories of AgingPortfolio. Each panel plots recall against the transformed false positive rate for SVM add+SoftSL, SVM add+WkNN, and random selection. (a) "Experimental techniques: in vivo methods": SVM add+SoftSL AUC = 0.4752, SVM add+WkNN AUC = 0.4311. (b) "Cancer and related diseases: malignant neoplasms (including in situ)": SVM add+SoftSL AUC = 0.6380, SVM add+WkNN AUC = 0.7375. (c) "Aging mechanisms by anatomy: cell level": SVM add+SoftSL AUC = 0.5454, SVM add+WkNN AUC = 0.5348. (d) "Aging mechanisms by anatomy: cell level, cellular substructures": SVM add+SoftSL AUC = 0.5757, SVM add+WkNN AUC = 0.5541.]
SoftSL: it achieves a better F1-measure and, finally, it is simpler to implement.
Furthermore, the comparison of CROC curves for the different methods demonstrated that the classifiers perform slightly worse for some categories and better for others. This pattern appears for classifiers trained on document sets where elements identified as relevant are removed from the negative examples. These observations can be attributed to better tuning of the classification threshold as additional relevant documents are added. This is a particularly important aspect for categories containing a small number of documents, where additional information about a given category allows better selection of the classification threshold.
One more problem is the evaluation of classification results and performance on an incompletely labeled dataset, since the labels in the test set are incomplete as well. One way to overcome this problem is to perform additional manual post factum validation: any document classification result
Table 8: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision  Recall   F1
0.0   0.7176     0.5707   0.6358
0.1   0.7337     0.5233   0.6109
0.2   0.7354     0.4056   0.5229
0.3   0.6260     0.2442   0.3513
0.4   0.3544     0.1191   0.1783
0.5   0          0        0
0.6   0          0        0

Table 9: Microaveraged results for SVM trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k  Optimal T  Precision  Recall   F1
0.0   -          -          -          -        -
0.1   10         0.3        0.6582     0.6847   0.6712
0.2   10         0.25       0.6525     0.6811   0.6665
0.3   10         0.15       0.6357     0.7137   0.6725
0.4   10         0.1        0.6604     0.6690   0.6648
0.5   10         0.05       0.6225     0.7259   0.6702
0.6   10         0.05       0.6248     0.7261   0.6716

Table 10: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p).

p     Precision  Recall   F1
0.0   0.6340     0.5087   0.5315
0.1   0.6081     0.4613   0.4959
0.2   0.5693     0.3648   0.4133
0.3   0.5788     0.3240   0.3873
0.4   0.5068     0.2471   0.3094
0.5   0.5194     0.2431   0.3104
0.6   0.4354     0.1621   0.2224

Table 11: Microaveraged results for RF trained on the "incompletely labeled" Yeast dataset (with different fractions of deleted labels p) with add+WkNN label restoration. Optimal WkNN parameters k and T are acquired via grid search.

p     Optimal k  Optimal T  Precision  Recall   F1
0.0   -          -          -          -        -
0.1   10         0.3        0.5940     0.7120   0.6224
0.2   10         0.25       0.5626     0.7462   0.6146
0.3   10         0.15       0.5282     0.7992   0.6095
0.4   10         0.10       0.5223     0.7648   0.5940
0.5   10         0.05       0.4672     0.8388   0.5726
0.6   15         0.05       0.4547     0.8695   0.5707
should be reviewed by experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.
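The scores in Tables 8-11 are microaveraged: true positives, false positives, and false negatives are pooled over all labels before the ratios are computed. A sketch of the computation for multilabel predictions:

```python
def micro_prf(true_sets, pred_sets):
    """Microaveraged precision, recall, and F1 for multilabel predictions;
    both arguments are lists of label sets, one per item."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

This convention also explains the 0/0/0 rows in Table 8: when the classifier stops predicting any labels at high p, all three pooled ratios collapse to zero.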
Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset, which is what we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
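Such artificial depletion can be sketched as follows (the assumption here, ours rather than a stated detail of the paper, is that each item-label assignment is deleted independently with probability p, matching the "fraction of deleted labels p" of Tables 8-11):

```python
import random

def delabel(label_sets, p, seed=0):
    """Delete each (item, label) assignment independently with probability p,
    simulating an incompletely labeled training set."""
    rng = random.Random(seed)  # fixed seed keeps the depletion reproducible
    return [{lab for lab in labs if rng.random() >= p} for labs in label_sets]
```

Because the original, complete label sets are kept aside, the depleted copy gives a ground truth against which label restoration and downstream classification can be scored.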
Moreover, the experimental results presented in Section 4.5.2 demonstrate a notable aspect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.
Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.
Conflict of Interests

The authors declare that there is no conflict of interests.

Acknowledgments

The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom, Inc., and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.
References

[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1-3, pp. 423-444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29-41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45-84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1-6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841-1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090-1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803-1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in European Conference on Machine Learning (ECML '98), pp. 137-142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[21] R. Diaz-Uriarte and S. Alvarez de Andres, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10-18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal of Digital Libraries, vol. 3, pp. 117-132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text Retrieval Conference, pp. 109-126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," Text Mining and Its Applications, vol. 34, pp. 109-129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22-30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348-1356, 2010.
[29] L. Breiman and A. Cutler, "Random Forests," 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.
8 Computational and Mathematical Methods in Medicine
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM only AUC = 03904
SVM add+SoftSL AUC = 04747
SVM delRandom selection
+SoftSL AUC = 04498
(a) ldquoExperimental techniques in vivo methodsrdquo category (SoftSL)
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM only AUC = 06818
SVM add+SoftSL AUC = 06374
SVM delRandom selection
+SoftSL AUC = 07018
(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category (SoftSL)
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM only AUC = 04847
SVM add+SoftSL AUC = 05447
SVM delRandom selection
+SoftSL AUC = 05423
(c) ldquoAging mechanisms by anatomy cell levelrdquo category (SoftSL)
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM only AUC = 04900
SVM add+SoftSL AUC = 05757
SVM delRandom selection
+SoftSL AUC = 05587
(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquo cat-egory (SoftSL)
Figure 3 CROC curves for different categories of AgingPortfoliio (SVM classification with SoftSL analysis)
training set some correct labels may be omitted) The caseof incompletely labeled dataset strikingly differs from thePU-learning (learning with only positive and unlabeled data)httpciteseerxistpsueduviewdocsummarydoi=1011919914 httpdlacmorgcitationcfmid=1401920 approachin case of PU training some of dataset items are consideredfully labeled and the other items are not labeled at all
To overcome the problem we proposed two differentprocedures for training set modification WkNN and SoftSL
Both of these approaches are intended to restore the missingdocument labels using different similarity measurementsbetween each given document and other documents withsimilar labels
We trained both SVM and RF on several incompletelylabeled datasets with pretraining label restoration and with-out it According to our experimental results the labelrestorationmethods were able to improve the performance ofboth SVM and RF In our opinion WkNN works better than
Computational and Mathematical Methods in Medicine 9
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM addRandom selection
+SoftSL AUC =
=
04752
SVM add+WkNN AUC 04311
(a) ldquoExperimental techniques in vivo methodsrdquo category
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM addRandom selection
+SoftSL AUC = 06380
= 07375SVM add+WkNN AUC
(b) ldquoCancer and related diseases malignant neoplasms (including insitu)rdquo category
=
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM addRandom selection
+SoftSL AUC = 05454
SVM add+WkNN AUC 05348
(c) ldquoAging mechanisms by anatomy cell levelrdquo category
10
10
08
08
06
06
04
04
02
02
00
00
Reca
ll
Transformed false positive rate
SVM addRandom selection
+SoftSL AUC = 05757
= 05541SVM add+WkNN AUC
(d) ldquoAging mechanism by anatomy cell level cellular substructuresrdquocategory
Figure 4 Comparison of WkNN and SoftSL analysis with SVM classification CROC curves for different categories of AgingPortfolio
SoftSL it has a better 1198651-measure than SoftSL and at last itis simpler to implement
Furthermore the comparison of CROC curves for thedifferent methods demonstrated that the classifiers performslightly worse for some categories and better for others Thispattern appears for classifiers trained on document sets whereelements identified as relevant are removed from the nega-tive examples These observations can be attributed to bettertuning of the classification threshold as additional relevant
documents are added This is a particularly important aspectfor categories containing a small number of documentswhereadditional information about a given category allows betterselection of the classification threshold
One more problem is the evaluation of the incompletelylabeled dataset classification results and performance sincethe labels in the test set are incomplete as well One way toovercome this problem is to perform the additional manualpost factum validation any document classification result
10 Computational and Mathematical Methods in Medicine
Table 8 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)
119901 parameter Precision Recall 1198651
0 07176 05707 0635801 07337 05233 0610902 07354 04056 0522903 06260 02442 0351304 03544 01191 0178305 0 0 006 0 0 0
Table 9 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search
119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651
0 mdash mdash mdash mdash mdash01 10 03 06582 06847 0671202 10 025 06525 06811 0666503 10 015 06357 07137 0672504 10 01 06604 0669 0664805 10 005 06225 07259 0670206 10 005 06248 07261 06716
Table 10 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)
119901 parameter Precision Recall 1198651
0 06340 05087 0531501 06081 04613 0495902 05693 03648 0413303 05788 03240 0387304 05068 02471 0309405 05194 02431 0310406 04354 01621 02224
Table 11 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search
119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651
0 mdash mdash mdash mdash mdash01 10 03 05940 07120 0622402 10 025 05626 07462 0614603 10 015 05282 07992 0609504 10 010 05223 07648 0594005 10 005 04672 08388 0572606 15 005 04547 08695 05707
should be reviewed by experts in order to reveal whether it was assigned any odd labels. Otherwise, the observed results are guaranteed to be lower than the real ones.
Another way to evaluate the classification results and performance is to artificially "deplete" a completely labeled dataset, which is what we did with the Yeast dataset. Our experiments with the modification methods applied to the artificially, partially delabeled Yeast biological dataset confirmed that our approach significantly improves the classification performance of SVM on incompletely labeled datasets.
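The depletion step admits a compact sketch. This is a minimal illustration, not the exact procedure used in the experiments; the function name, the use of Python sets as per-item label sets, and the fixed seed are our own choices. Each positive label is independently deleted with probability p, and no labels are ever added:

```python
import random

def deplete_labels(label_sets, p, seed=0):
    """Simulate incomplete labeling: independently delete each positive
    label with probability p; labels are only removed, never added."""
    rng = random.Random(seed)
    return [{label for label in labels if rng.random() >= p}
            for labels in label_sets]

# Toy example in the spirit of the Yeast experiment (labels are class names).
complete = [{"c1", "c3"}, {"c2"}, {"c1", "c2", "c4"}]
depleted = deplete_labels(complete, p=0.5)
# Depletion can only shrink each item's label set.
assert all(d <= full for d, full in zip(depleted, complete))
```

Running the depleted labels through training and then evaluating against the untouched complete labels is what makes this setup a controlled test of label restoration.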
Moreover, the experimental results presented in Section 4.5.2 reveal a notable aspect: when we artificially made an incompletely labeled training set and then used our label restoration techniques on it, the F1 measure for SVM classification was even greater than for the original, completely labeled set.
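For reference, the microaveraged precision, recall, and F1 reported in Tables 8–11 pool the true-positive, false-positive, and false-negative counts over all (item, label) pairs before forming the ratios. A minimal sketch (the function name is ours):

```python
def micro_prf(true_sets, pred_sets):
    """Microaveraged precision/recall/F1 for multilabel predictions:
    counts are pooled over all (item, label) pairs, then the ratios taken."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two items: {"a","b"} predicted as {"a"}, {"c"} predicted as {"c","d"}
# pools to tp = 2, fp = 1, fn = 1, so all three metrics equal 2/3.
p, r, f1 = micro_prf([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}])
```

Microaveraging weights every label decision equally, so frequent categories dominate; this matches how the tables above collapse all Yeast categories into one score per setting.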
Hence, we are confident that combining the WkNN training set modification procedure with the SVM or RF algorithms will be practically useful to scientists and analysts when addressing the problem of incompletely labeled training sets.
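The add+WkNN restoration idea itself can be sketched compactly. The specifics below are illustrative assumptions rather than the paper's exact scheme: we use cosine similarity, let each of the k nearest neighbors vote for its labels with its similarity as the weight, normalize by the total neighbor weight, and add any label whose normalized vote exceeds the threshold T. Matching the "add" strategy, labels are only added, never removed:

```python
import math
from collections import defaultdict

def restore_labels(vectors, label_sets, k, T):
    """add+WkNN sketch (illustrative): neighbors vote for their labels
    with similarity weights; labels above threshold T are ADDED."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    restored = []
    for i, u in enumerate(vectors):
        # k most similar other items
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(vectors)
                       if j != i), reverse=True)[:k]
        total = sum(s for s, _ in sims) or 1.0
        votes = defaultdict(float)
        for s, j in sims:
            for label in label_sets[j]:
                votes[label] += s
        added = {l for l, w in votes.items() if w / total > T}
        restored.append(set(label_sets[i]) | added)  # add only, never remove
    return restored

# Item 1 has no labels but is nearly identical to item 0, so it inherits "A".
vectors = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0)]
labels = [{"A"}, set(), {"B"}]
restored = restore_labels(vectors, labels, k=1, T=0.5)  # restored[1] == {"A"}
```

The grid search over k and T reported in Tables 9 and 11 corresponds to sweeping the two parameters of such a procedure and keeping the pair with the best microaveraged F1.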
Conflict of Interests
The authors declare that there is no conflict of interests.
Acknowledgments
The authors thank Dr. Charles R. Cantor (Chief Scientific Officer at Sequenom Inc. and Professor at Boston University) for his contribution to this work. The authors would like to thank the UMA Foundation for its help in the preparation of the paper. They would also like to thank the reviewers for many constructive and meaningful comments and suggestions that helped improve the paper and laid the foundation for further research.
References
[1] S. M. Hosseini, M. J. Abdi, and M. Rezghi, "A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
[2] S. C. Li, J. Liu, and X. Luo, "Iterative reweighted noninteger norm regularizing SVM for gene expression data classification," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 768404, 10 pages, 2013.
[3] Z. Gao, Y. Su, A. Liu, T. Hao, and Z. Yang, "Nonnegative mixed-norm convex optimization for mitotic cell detection in phase contrast microscopy," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 176272, 10 pages, 2013.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Mining multi-label data," in Data Mining and Knowledge Discovery Handbook, Springer US, 2010.
[5] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[6] G. Yizhang, Methods for Pattern Classification, New Advances in Machine Learning, 2010.
[7] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[8] International Aging Research Portfolio, http://www.agingportfolio.org/wiki/doku.php?id=datapage.
[9] A. Esuli and F. Sebastiani, "Training data cleaning for text classification," in Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR '09), pp. 29–41, Cambridge, UK, 2009.
[10] J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, "Capabilities of outlier detection schemes in large datasets: framework and methodologies," Knowledge and Information Systems, vol. 11, no. 1, pp. 45–84, 2007.
[11] N. G. Zagoruiko, I. A. Borisova, V. V. Dyubanov, and O. A. Kutnenko, "Methods of recognition based on the function of rival similarity," Pattern Recognition and Image Analysis, vol. 18, no. 1, pp. 1–6, 2008.
[12] L. H. Lee, C. H. Wan, T. F. Yong, and H. M. Kok, "A review of nearest neighbor-support vector machines hybrid classification models," Journal of Applied Sciences, vol. 10, no. 17, pp. 1841–1858, 2010.
[13] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1090–1099, October 2008.
[14] A. Subramanya and J. Bilmes, "Entropic graph regularization in non-parametric semi-supervised classification," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), pp. 1803–1811, December 2009.
[15] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in European Conference on Machine Learning (ECML '98), pp. 137–142, Springer, Berlin, Germany, 1998.
[16] P. Hart and T. Cover, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[17] P. Raghavan, C. D. Manning, and H. Schuetze, An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2009.
[18] A. Zhavoronkov and C. R. Cantor, "Methods for structuring scientific knowledge from many areas related to aging research," PLoS ONE, vol. 6, no. 7, Article ID e22597, 2011.
[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: a library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, 2009.
[23] International Aging Research Portfolio, http://agingportfolio.org.
[24] K. Frantzi, S. Ananiadou, and H. Mima, "Automatic recognition of multiword terms," International Journal on Digital Libraries, vol. 3, pp. 117–132, 2000.
[25] S. E. Robertson, S. Walker, S. Jones, et al., "Okapi at TREC-3," in Proceedings of the 3rd Text Retrieval Conference, pp. 109–126, Gaithersburg, Md, USA, 1994.
[26] F. Sebastiani, "Text categorization," in Text Mining and Its Applications, vol. 34, pp. 109–129, 2005.
[27] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proceedings of the 8th Pacific-Asia Conference (PAKDD '04), pp. 22–30, May 2004.
[28] S. J. Swamidass, C.-A. Azencott, K. Daily, and P. Baldi, "A CROC stronger than ROC: measuring, visualizing, and optimizing early retrieval," Bioinformatics, vol. 26, no. 10, Article ID btq140, pp. 1348–1356, 2010.
[29] L. Breiman and A. Cutler, "Random Forests," 2013, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm.
[30] A. Elisseeff and J. Weston, Advances in Neural Information Processing Systems, vol. 14, 2001.
10 Computational and Mathematical Methods in Medicine
Table 8 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)
119901 parameter Precision Recall 1198651
0 07176 05707 0635801 07337 05233 0610902 07354 04056 0522903 06260 02442 0351304 03544 01191 0178305 0 0 006 0 0 0
Table 9 Microaveraged results for SVM trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search
119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651
0 mdash mdash mdash mdash mdash01 10 03 06582 06847 0671202 10 025 06525 06811 0666503 10 015 06357 07137 0672504 10 01 06604 0669 0664805 10 005 06225 07259 0670206 10 005 06248 07261 06716
Table 10 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)
119901 parameter Precision Recall 1198651
0 06340 05087 0531501 06081 04613 0495902 05693 03648 0413303 05788 03240 0387304 05068 02471 0309405 05194 02431 0310406 04354 01621 02224
Table 11 Microaveraged results for RF trained on ldquoincompletelylabeledrdquo Yeast dataset (with different fraction of deleted labels 119901)with add+WkNN label restoration Optimal WkNN parameters 119896and 119879 are acquired via grid search
119901 parameter Optimal 119896 Optimal 119879 Precision Recall 1198651
0 mdash mdash mdash mdash mdash01 10 03 05940 07120 0622402 10 025 05626 07462 0614603 10 015 05282 07992 0609504 10 010 05223 07648 0594005 10 005 04672 08388 0572606 15 005 04547 08695 05707
should be reviewed by the experts in order to reveal if it wasassigned any odd labels Otherwise the observed results areguaranteed to be lower than the real ones
Another way to evaluate the classification results andperformance is to artificially ldquodepleterdquo a completely-labeleddataset We did it with the Yeast dataset Our experimentswith the modification methods applied to artificially par-tially delabeled Yeast biological dataset confirmed that ourapproach significantly improves classification performance ofSVM on incompletely labeled datasets
Moreover the experimental results presented inSection 452 prove a notable aspect When we artificiallymade an incompletely labeled training set and then usedour label restoration techniques on it the 1198651 measure forSVM classification was even greater then for the originalcompletely-labeled set
Hence we are confident that combining the WkNNtraining set modification procedure with the SVM or RFalgorithms will be practically useful to scientists and analystswhen addressing the problem of incompletely labeled train-ing sets
Conflict of Interests
The authors declare that there is no conflict of interests
Acknowledgments
The authors thank Dr Charles R Cantor (Chief ScientificOfficer at Sequenom Inc and Professor at Boston University)for his contribution to this work The authors would like tothank the UMA Foundation for its help in preparation ofthe paper They would like to thank the reviewers for manyconstructive and meaningful comments and suggestions thathelped improve the paper and laid the foundation for furtherresearch
References
[1] S M Hosseini M J Abdi and M Rezghi ldquoA novel weightedsupport vector machine based on particle swarm optimizationfor gene selection and tumor classificationrdquo Computational andMathematicalMethods inMedicine vol 2012 Article ID 3206987 pages 2012
[2] S C Li J Liu and X Luo ldquoIterative reweighted nonintegernorm regularizing svm for gene expression data classificationrdquoComputational and Mathematical Methods in Medicine vol2013 Article ID 768404 10 pages 2013
[3] Z Gao Y Su A Liu T Hao and Z Yang ldquoNonnegative mixed-norm convex optimization for mitotic cell detection in phasecontrast microscopyrdquo Computational and Mathematical Meth-ods in Medicine vol 2013 Article ID 176272 10 pages 2013
[4] I IKatakis G Tsoumakas and I VlahavasMining Multi-LabelData in Data Mining and Knowledge Discovery HandbookSpringer US 2010
[5] G Forman ldquoAn extensive empirical study of feature selectionmetrics for text classificationrdquo Journal of Machine LearningResearch vol 3 pp 1289ndash1305 2003
[6] G Yizhang Methods for Pattern Classification New Advancesin Machine Learning 2010
[7] E Leopold and J Kindermann ldquoText categorization with sup-port vector machines How to represent texts in input spacerdquoMachine Learning vol 46 no 1ndash3 pp 423ndash444 2002
Computational and Mathematical Methods in Medicine 11
[8] International Aging Research Portfolio httpwwwagingport-folioorgwikidokuphpid=datapage
[9] A Esuli and F Sebastiani ldquoTraining data cleaning for textclassificationrdquo inProceedings of the 2nd International Conferenceon the Theory of Information Retrieval (ICTIR rsquo09) pp 29ndash41Cambridge UK 2009
[10] J Tang Z Chen A W Fu and D W Cheung ldquoCapabilitiesof outlier detection schemes in large datasets framework andmethodologiesrdquoKnowledge and Information Systems vol 11 no1 pp 45ndash84 2007
[11] N G Zagoruiko I A Borisova V V Dyubanov and O AKutnenko ldquoMethods of recognition based on the function ofrival similarityrdquo Pattern Recognition and Image Analysis vol 18no 1 pp 1ndash6 2008
[12] L H Lee C H Wan T F Yong and H M Kok ldquoA review ofnearest neighbor-support vector machines hybrid classificationmodelsrdquo Journal of Applied Sciences vol 10 no 17 pp 1841ndash18582010
[13] A Subramanya and J Bilmes ldquoSoft-supervised learning for textclassificationrdquo in Proceedings of the Conference on EmpiricalMethods in Natural Language Processing pp 1090ndash1099 Octo-ber 2008
[14] A Subramanya and J Bilmes ldquoEntropic graph regularization innon-parametric semi-supervised classificationrdquo in Proceedingsof the 23rd Annual Conference on Neural Information ProcessingSystems (NIPS rsquo09) pp 1803ndash1811 December 2009
[15] T Joachims ldquoText categorizationwith support vectormachineslearningwithmany relevant featuresrdquo inConference onMachineLearning (ECML rsquo98) pp 137ndash142 Springer Berlin Germany1998
[16] P Hart and T Cover ldquoNearest neighbor pattern classificationrdquoIEEE Transactions on Information Theory vol 13 no 1 pp 21ndash27 1967
[17] P Raghavan C D Manning and H Schuetze An Introductionto Information Retrieval Cambridge University Press Cam-bridge UK 2009
[18] A Zhavoronkov and C R Cantor ldquoMethods for structuringscientific knowledge frommany areas related to aging researchrdquoPLoS ONE vol 6 no 7 Article ID e22597 2011
[19] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-J LinldquoLIBLINEAR a library for large linear classificationrdquo Journal ofMachine Learning Research vol 9 pp 1871ndash1874 2008
[20] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001
[21] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[22] MHall E Frank G Holmes B P fahringer P Reutemann andI HWitten ldquoThe weka data mining software an updaterdquo ACMSIGKDD Explorations Newsletter vol 11 pp 10ndash18 2009
[23] International aging research portfolio httpagingportfolioorg
[24] K Frantzi S Ananiadou andHMima ldquoAutomatic recognitionof multiword termsrdquo International Journal of Digital Librariesvol 3 pp 117ndash132 2000
[25] S E Robertson S Walker S Jones et al ldquoOkapi at trec-3rdquo inProceedings of the 3rd Text Retrieval Conference pp 109ndash126Gaitherburg Md USA 1994
[26] F Sebastiani ldquoText categorizationrdquo Text Mining and Its Applica-tions vol 34 pp 109ndash129 2005
[27] S Godbole and S Sarawagi ldquoDiscriminativemethods formulti-labeled classificationrdquo in Proceedings of the 8th Pacific-AsiaConference (PAKDD rsquo04) pp 22ndash30 May 2004
[28] S J Swamidass C-A Azencott K Daily and P Baldi ldquoACROCstronger than ROC measuring visualizing and optimizingearly retrievalrdquo Bioinformatics vol 26 no 10 Article ID btq140pp 1348ndash1356 2010
[29] Random forests leo breiman and adele cutler 2013 httpwwwstatberkeleyedusimbreimanRandomForestscc manualhtm
[30] A Elisseeff and J Weston Advances in Neural InformationProcessing Systems vol 14 2001
Submit your manuscripts athttpwwwhindawicom
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MEDIATORSINFLAMMATION
of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Behavioural Neurology
EndocrinologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Disease Markers
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
OncologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Oxidative Medicine and Cellular Longevity
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
PPAR Research
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
ObesityJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational and Mathematical Methods in Medicine
OphthalmologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Diabetes ResearchJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Research and TreatmentAIDS
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Gastroenterology Research and Practice
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Parkinsonrsquos Disease
Evidence-Based Complementary and Alternative Medicine
Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom
Computational and Mathematical Methods in Medicine 11