research article a novel sample selection strategy for...

Research ArticleA Novel Sample Selection Strategy for Imbalanced Data ofBiomedical Event Extraction with Joint Scoring Mechanism

Yang Lu12 Xiaolei Ma12 Yinan Lu1 Yuxin Zhou1 and Zhili Pei3

1College of Computer Science and Technology Jilin University Changchun Jilin 130000 China2Library Inner Mongolia University for Nationalities Tongliao Inner Mongolia 028000 China3College of Computer Science and Technology Inner Mongolia University for Nationalities Tongliao Inner Mongolia 028000 China

Correspondence should be addressed to Yinan Lu luynjlueducn

Received 30 June 2016 Revised 4 October 2016 Accepted 31 October 2016

Academic Editor Seiya Imoto

Copyright copy 2016 Yang Lu et al This is an open access article distributed under the Creative Commons Attribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Biomedical event extraction is an important and difficult task in bioinformatics With the rapid growth of biomedical literature theextraction of complex events from unstructured text has attracted more attention However the annotated biomedical corpus ishighly imbalanced which affects the performance of the classification algorithms In this study a sample selection algorithm basedon sequential pattern is proposed to filter negative samples in the training phase Considering the joint information between thetrigger and argument of multiargument events we extract triplets of multiargument events directly using a support vector machineclassifier A joint scoring mechanism which is based on sentence similarity and importance of trigger in the training data is usedto correct the predicted results Experimental results indicate that the proposed method can extract events efficiently

1 Introduction

With the rapid growth of the amount of unstructuredor semistructured biomedical literature researchers needconsiderable time and effort to read and obtain relevantscientific knowledge Event extraction from biomedical textis the task of extracting the semantic and role information ofbiological events which are often complex structures suchas the relationship between the disease and the drug [1]the relationship between the disease and the gene [2] theinteraction between drugs [3] and the interaction betweenproteins [4 5] Automatic extraction of biomedical eventscan be applied to many biomedical applications Thereforebiomedical textmining technology is useful for people to findbiological information more accurately and effectively

The official BioNLP challenges have been held for severalyears since 2009 [6ndash8] The BioNLP shared task (BioNLP-ST) [9] aims to extract fine-grained biomolecular eventsIt includes a number of subtasks such as GENIA EventExtraction (GE) Cancer Genetics (CG) Pathway Curation(PC) and Gene Regulation Ontology (GRO) Increasingattention has been given to the task of event extraction

where the major task is GE in BioNLP-ST and it aimsto extract structured events from biomedical text such asevent types triggers and parameters An event is definedby GE using a formula including an event trigger and oneor several arguments Nine types of events were definedin BioNLP-ST GENIA Event Extraction 2011 (GErsquo11) andextended to fourteen types of events in BioNLP-ST GENIAEvent Extraction 2013 (GErsquo13)Due to scarce samples of newlydefined event types for good training the study presented inthis paper is still based on the nine types defined in GErsquo11

Table 1 shows the event types which can be divided intothree categories the simple event class (SVT) Binding eventclass (BIND) and regulation event class (REG) where thereare five simple events including Gene expression Transcrip-tion Protein catabolism Localization and PhosphorylationEach event has only one argument that is to say one themeThemes in the Binding event comprise up to two argumentsTheREG event class includes Regulation Positive regulationandNegative regulationThey are complex because they havetwo arguments a theme and an optional cause Figure 1shows an example of an event where ldquoIRF-4rdquo and ldquoIFN-alphardquo are proteins ldquoexpressionrdquo and ldquoinducedrdquo are triggers

Hindawi Publishing CorporationComputational and Mathematical Methods in MedicineVolume 2016 Article ID 7536494 11 pageshttpdxdoiorg10115520167536494

2 Computational and Mathematical Methods in Medicine

Theme E1 CauseProtein

Protein

Since IRF-4 ExpressionGene-expression

Theme

IFN-alpha therapy demethylation and successful

in cell lines and CML can be induced by

E2Positive_regulation

Figure 1 Structured representation of biomedical event

Table 1 Class event types and their arguments for the GE task

Event class Event type Primaryargument

Secondaryargument

SVT

Gene expression Theme(P)Transcription Theme(P)

Localization Theme(P) AtLocToLoc

Protein catabolism Theme(P)Phosphorylation Theme(P) Site

BIND Binding Theme(P)+ Site+

REG

Regulation Theme(PEv)Cause(PEv) Site Csite

Positive regulation Theme(PEv)Cause(PEv) Site Csite

Negative regulation Theme(PEv)Cause(PEv) Site Csite

P is protein Ev is event

and two events can be expressed as E1 Gene expressionldquoexpressionrdquo Theme ldquoIRF-4rdquo and E2 Positive regulationldquoinducedrdquoTheme E1 Cause ldquoIFN-alphardquoWe aim to extractthese event structures from the text automatically

Pattern-based methods are used in biomedical relationextraction [10 11] but are less used in biomedical event extrac-tion These methods mainly extract the relations betweenentities by manually defined patterns and automaticallylearned patterns from the training data set Rule-basedmethods [12ndash15] and machine learning-based methods [16ndash18] are the main methods in an event extraction task Rule-based methods are similar to the pattern-based methodswhich manually define syntax rules and learn new rules fromthe training data Machine learning-based methods regardthe extraction task as a classification problem The problemof highly unbalanced training data sets in biomedical eventextraction is seldom addressed by most systems The solu-tions with support vector machines (SVMs) usually use thesimple class weighting strategy [19ndash21] Other approachessuch as active learning [22 23] and semisupervised learning[24 25] solve this problem by increasing the positive samplesize In this study a sample selection method based ona sequential pattern is proposed to solve the problem ofimbalanced data in classification and a joint scoring mech-anism based on sentence semantic similarity and importanceof triggers is introduced to correct further false positivepredictions

The paper is organized as follows related work is pre-sented in Section 2 Our work the sequence pattern-basedsample selection algorithm detection of multiargumentevents and the joint scoring mechanism are presented inSection 3 Section 4 describes the experiment results in GErsquo11and in GErsquo13 test sets Finally a conclusion is presented inSection 5

2 Related Work

Since the organizers of the BioNLP-ST held the first com-petition on the fine-grained information extraction task ofbiomedical events in 2009 a variety of methods have beenproposed to solve the task At present the event extractionsystems are mainly divided into two types rule-based eventextraction systems andmachine learning-based event extrac-tion systems The overview papers of BioNLP-ST 2011 and2013 [7 8] show that the results of machine learning-basedmethods are better than the results of rule-based methods

Rule-based event extraction systems [26ndash29] are basedon sentence structure grammatical relation and semanticrelation which make it more flexible However the resultsobtained by those methods have high precision and lowrecall which are noticeable in simple event extraction Toimprove recall rule-based event extraction systems are forcedto relax constraints in the automatic access of learning rules

The system based on machine learning is generallydivided into three groups The first group is the pipelinemodel [30ndash32] which has an event extraction process thatcan be divided into three steps The first step predicts thetriggerThe second step is the edge detection and assignmentof arguments based on the first step The final step is theevent element detection The pipeline model in the eventextraction task has achieved excellent results such as thechampion of GErsquo09 [30] (Turku) and the champion of GErsquo13[32] (EVEX) Zhou et al [33] proposed a novel method basedon the pipeline model for event trigger identification Theyembed the knowledge learned from a large text corpus intoword features using neural language modeling Experimentalresults show that the 119865-score of event trigger identificationimproves by 25 compared with the approach proposed in[34] Campos et al [35] optimized the feature set and trainingarguments for each event type but only predicted the events inthe GErsquo09 test sets A linear SVM with ldquoone-versus-the-restrdquomulticlass strategy is used to solve multiclass and multilabelclassification problems based on an imbalanced data set ateach stage Although the performance of the pipeline model

Computational and Mathematical Methods in Medicine 3

Sequential patternsset1


Pair extraction(trigger argument)

Unlabeled candidate pair select

Unlabeled candidate triplet select

Classification(SVM1)

Classification(SVM2)Training corpus

Testing corpus

Prediction1

Prediction2

(2) Correct based on score

Preprocessing

Final prediction

Triplet extraction(Trigger argument argument2)

(Multiargument events)

Postprocessing(1) Multiargument events integrate

Figure 2 The framework of the proposed method

is excellent its time complexity is high and each step is carriedout based on the last step which make its performancedependent on the first step of trigger detection Thus if anerror occurs at the first step it will propagate to the next stepthus causing a cascade of errors

The second group is called the joint model [16 17] whichovercomes the problem mentioned previously McClosky etal [36] used the dual-decomposition method for detectingtriggers and arguments and extracted the events using thedependence analysis method Li et al [37] integrated rich fea-tures and word embedding based on the dual-decompositionto extract the biomedical event However the optimal stateof this joint model requires considering the combination ofeach token including the unlikely token in the search spacemaking its calculation too complicated

The third group is called the pairwise model [38 39]which is a combination of the pipeline and joint models thatdirectly extracts trigger and argument instead of detecting thetrigger and edge Considering the relevance of the triggersand arguments the accuracy of the pairwise model is higherthan that of the pipeline model and it is faster than thejoint model in execution time because of the application of asmall amount of inference However the pairwise model stilluses SVM with the ldquoone-versus-the-restrdquo multiclass strategyto solve multiclass and multilabel classification problemswithout dealing with the problem of data imbalance

3 Methods

This section presents the major steps in the proposed systemThe system is based on the pairwise structure in the pair-wise model The event extraction process is summarized inFigure 2 First the sequential patterns are generated from thetraining data after text preprocessing The unlabeled samplepairs in the generation of candidate pair (trigger argument)will be selected based on the sequence pattern Then they

will be trained together with the labeled samples Secondthe triplets are extracted directly for multiargument eventsand then the predicted results between multiargument andsingle argument events will be integrated Finally the jointscoring mechanism is applied in the postprocessing and thepredicted results are optimized

31 Text Preprocessing Text preprocessing is the first stepin natural language processing (NLP) In the preprocessingstage nonstandard symbols will be removed by NLP toolsWeuse nltk (nltkorg) to split thewords and sentences andusethe CharniakndashJohnson parser with McCloskyrsquos biomedicalparsing model (McClosky et al [36]) to analyze the depen-dency path After the sentences and words are split and thefull dependence path is obtained we use the four featuresrsquo setof the TEES [30] system

Token features base stem character 119899-grams (119899 =1 2 3) POS-tag and spelling featuresSentence features the number of candidate entitiesbag-of-wordsSentence dependency features dependency chainsfeatures the shortest dependency path featuresExternal resource features Wordnet hypernyms

32 Sample Selection Based on Sequential Pattern Sequen-tial pattern mining is one of the most important researchsubjects in the field of data mining It aims to find frequentsubsequences or sequential events that satisfy the minimumsupport There are many efficient sequential pattern miningalgorithms that are widely used

Given a sequence database 119878 which is a set of differentsequences let 119878 = 1199041 1199042 119904119899 where each sequence 119904119909 is anordered list of items and 119904119909 = 1199091 1199092 119909119901 where 119909119894 is anitem and 119901 is the number of items The length of 119904119909 sequence


Table 2 Part of frequent sequential patterns

ID Sequence database Frequentsequential pattern

1199041 ⟨amod prep to prep in nn⟩

⟨amod nn⟩⟨prep to nn⟩

1199042 ⟨prep to nn prep in⟩1199043 ⟨amod nn⟩1199044 ⟨prep to dobj prep in nn⟩1199045 ⟨dobj amod prep to nn⟩1199046 ⟨prep to nn⟩

is 119901 Let sequences 1199041 = 1198861 1198862 119886119894 and s2 = b1 b2 bjbe the two sequences in 119878 where 119886119894 and 119887119895 are items If thereexist some integers 1 le 1198981 lt 1198982 lt sdot sdot sdot lt 119898119899 le 119895 make1198861 sube 1198871198981 1198862 sube 1198871198982 119886119899 sube 119887119898119899 then sequence 1199041 is calleda subsequence of 1199042 or 1199042 contains 1199041 which is denoted as1199041 sube 1199042 The support of the sequence 1199041 is the number ofsequences in the sequence database S containing s1 denotedas 119878119906119901119901119900119903119905(1199041) Given a minimum support thresholdminsupif the support 1199041 is no less than 119898119894119899119904119906119901 in 119878 sequence 1199041 iscalled a frequent sequential pattern in 119878 which is denotedas 1198651199041 1198651199041 = 1199041 | 119878119906119901119901119900119903119905(1199041) ge 119898119894119899119904119906119901 1199041 sube 1199042 1199042 isin119878 In this study the sequence patterns are generated toselect samples in combination with the PrefixSpan algorithm[40]The PrefixSpan algorithm uses the principle ldquodivide andconquerrdquo by generating a prefix pattern and then connects itwith the suffix pattern to obtain the sequential patterns thusavoiding generating candidate sequences

321 Extraction of Sequential Patterns in Texts A sequencedatabase 119863119878 is constructed We denote 119862119878 = 119888119894 119894 =1 2 119899 as the set of candidate triggers which come fromthe trigger dictionary and 119860119878 = 119886119895 119895 = 1 2 119898as the set of candidate arguments which come from thetraining corpusThe set of pair (trigger argument) is denotedas 119875119878 = (119888119894 119886119895) | (119888119894 119886119895) isin 119862119878 times 119860119878 119888119894 = 119886119895The dependency path between the labeled candidate pairs(119888119894 119886119895) from the training data is extracted and it consistsof typed dependency sequence For example the sequence1199041 = 119899119904119906119887119895 119901119903119890119901 119887119910 119899119899 is the dependency path between thelabeled candidate pair (119888119894 119886119895) (the dependency path refers totyped dependency sequence from 119888119894 to 119886119895) The dependencypaths from all labeled candidate pairs make up the sequencedatabase 119863119878 where 1199041 is one of the sequences Table 2 showspart of the sequences in 119863119878 and frequent subsequences Thesequence 1199043 is shown as a subsequence of 1199041 and 1199045 therefore119878119906119901119901119900119903119905(1199043) is 3 in DS If we set 119898119894119899119904119906119901 = 3 we obtain s3 asa frequent sequential pattern in DS

We select each unlabeled candidate pair (ci aj) based onthe frequent sequential patternsTheoutput frequent patternsset is denoted as 119871119878 and the typed dependency sequence ofthe pair (119888119894 119886119895) is denoted as 119871 (119888119894 119886119895) If 119871 (119888119894 119886119895) contains enoughnumber of sequences in 119871119878 where the number is denotedas 119865(119888119894 119886119895) Θ is a threshold if 119865(119888119894 119886119895) gt Θ then the pair isselected This makes selecting a threshold for selecting theunlabeled candidate pair We select the suitable thresholdwith respect to the performance on the development set and

(1) Sample119872 documents(2) Generate sequence database DS(3) Initialize119863 = 0(4) for each 119889 isin 119872(5) LS = SeqPatten(lt gt 0 119863119878)(6) Initialize parameter threshold and set of sentences 119878119878 isin 119889(7) for each 119904 isin 119878(8) Initialize candidate entities 119862119878 = 119888119894 andcandidate arguments 119860119878 = 119886119895 and then generate119875119878 = (119888119894 119886119895) | (119888119894 119886119895) isin 119862119878 times 119860119878 119888119894 = 119886119895(9) for each (119888119894 119886119895) isin 119875119904(10) if 119865(119888119894 119886119895) gt Θ 119865(119888119894 119886119895) using Eq (1)(11) Update119863 larr 119863(119888119894 119886119895)(12) end for(13) end for(14) end for(15) return (119863)

Algorithm 1 Sample filter

discuss threshold in more detail in the experiment section(Section 411) The formula is as follows

119865(119888119894 119886119895) = sum119904119894 119904119894 sube 119871 (119888119894 119886119895) 119904119894 isin 119871119878 (1)

For example let sequences 120572 = 119901119903119890119901 119900119891 119899119899 120573 = 119899119899and 120574 = 119899119904119906119887119895 119901119903119890119901 119900119891 119899119899 be the three frequent sequencesin LS and sequence 120574 is the typed dependency sequence ofthe candidate pair (119888119894 119886119895) The sequences 120572 and 120573 are thesubsequences of 120574 Set threshold as 2 and obtain 119865(119888119894 119886119895) gt2 where the candidate pair (119888119894 119886119895) is selected Algorithm 1summarizes sample selection based on sequential patternalgorithm

33Detection ofMultiargument Events BINDandREGeventclasses are more complex because of the involvement ofprimary and secondary arguments In the primary argu-ments some are unary and others can be related to twoarguments In this study only the primary arguments (theme(proteinevent) cause (proteinevent) and theme (protein)+) are taken into account To better solve the multiargumentevents which can be represented as a triplet (trigger argu-ment argument2) we propose a method that extracts tripletrelations directly

For the single argument events the pairs (trigger argu-ment) are extracted directly For multiargument eventsthey are usually detected based on single argument eventsextraction Then the second arguments are assigned andreclassified to predict This approach will result in cascadingerrors Considering the Binding multiargument event of thepairwise model [32] as an example the detect process mainlyincluded two phases (1) detected pairs For example thereare two pairs (119888119894 119886119895) and (119888119894 119886119896) that are extracted from thesame sentence with the same trigger 119888119894 and labeled as Bindingtype (2) Based on the previous step evaluate the potentialtriplet using a dedicated classifier For example the triplet


middot middot middot



middot middot middot middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middotmiddot middot middot middot middot middot

⟨s⟩⟨s⟩ w1

Semantic layer

Max-pooling layer

Convolutional layer

Word hashing layer

Word sequence

Max-pooling

Figure 3 The architecture of the C-DSSM

(119888119894 119886119895 119886119896) is evaluated as a potential Binding event Here 119888119894is a trigger labeled previously 119886119895 and 119886119896 are proteins labeledpreviously in pairs The result of the first step affects thesecond step If pair (119888119894 119886119895) or pair (119888119894 119886119896) is not labeled triplet(119888119894 119886119895 119886119896) will not be detected too Therefore for the eventsthat include two arguments the solution is to extract tripletrelations directly This method uses a single dictionary andclassifier for multiargument events The detail is as follows

(1) Generate dictionary for BIND event class and REGevent class from the training data

(2) Select candidate triplets based on the sequentialpattern

(3) Train the model with an SVM classifier(4) Predict the set of triplets (119888119894 119886119895 119886119896) | 119895 = 119896 119888119894 isin

119862119878 119886119894 isin 119860119878 119886119896 isin 119860119878 after training the model withSVM classifier

Here 119862119878 = 119888119894 is the set of candidate entities and 119860119878 =119886119895 is the set of candidate arguments in a sentence 119878 where119886119895 is labeled proteins and candidate entities from the trainingdata

For the Binding event if the triplet (119888119894 119886119895 119886119896) is predictedas true the single argument of events 119890119894119895 = (119861119894119899119889119894119899119892 (119888119894 119886119895))and 119890119894119896 = (119861119894119899119889119894119899119892 (119888119894 119886119896)) predicted will be removed inthe step of integrating the prediction results of the singleargument and multiargument of a Binding event 119886119896 will beoutput as the cause argument for the REG event class

If the triplet (119888119894 119886119895 119886119896) is irrelevant to the pair (119888119894 119886119898) fromthe same sentence it is output directly Compared to pairwisemodel this approach considers joint information amongthe triplet (trigger argument argument2) from the start Itperforms better in the multiargument events extraction

34 Joint Scoring Mechanism Due to the introduction of thesequential pattern method to balance the training data therecall performance is significantly improved Meanwhile tocorrect the false positive examples a joint scoringmechanismis proposed for the predicted resultsThe scoring mechanismconsiders two aspects of sentences similarity and the impor-tance of trigger where those less than the threshold will befalse positive examples

Sentence similarity is widely used in the field of onlinesearch and question answering system It is an importantresearch subject in the field of NLP Here we use thetool sentence2vec based on convolutional deep structuredsemantic models (C-DSSM) [41 42] to calculate the semanticrelevance score

Latent semantic analysis (LSA) is a better-knownmethodfor index and retrievalThere are many newmethods extend-ing from LSA and C-DSSM is one of them It combines deeplearning with LSA and extends C-DSSM is mainly used inweb search where it maps the query and the documents to acommon semantic space through a nonlinear projectionThismodel uses a typical convolutional neural network (CNN)architecture to rank the relevant documents The C-DSSMmodel is mainly divided into two stages

(1)Map theword vectors to their corresponding semanticconcept vectors Here there are three hidden layers in thearchitecture of CNN The first layer is word hashing andit is mainly based on the method of letter 119899-gram Theword hashing method reduces the dimension of the bag-of-words term vectors After the word hashing layer it hasa convolutional layer that extracts local contextual featuresFurthermore it uses max-pooling technology to integratelocal feature vectors into global feature vectors A high-levelsemantic feature vector is received at the final semantic layerThe learning of CNN has been effectively improved Figure 3describes the architecture of the C-DSSM

119909 is denoted as the input term vector and y is the outputvector 119897119894 119894 = 1 119873 minus 1 are the intermediate hiddenlayers119882119894 is the 119894th weight matrix and 119887119894 is the 119894th bias termTherefore the problem becomes

1198971 = 1198821119909

119897119894 = 119891 (119882119894119897119894minus1 + 119887119894) 119894 = 2 119873 minus 1

119910 = 119891 (119882119873119897119873minus1 + 119887119873) (2)

(2) Calculate the relevance score between the documentand query By calculating the cosine similarity of the semantic


Table 3 Statistics on the GE data sets

Data set Papers Abstracts EventsTraining Devel Test Training Devel Test Training Devel Test

GErsquo13 10 10 14 0 0 0 2817 3199 3348GErsquo11 5 5 4 800 150 260 10310 4690 5301Training is training data Devel is development data and Test is test data

concept vector of ⟨query document⟩ the score is obtainedand is measured as

119877 (119876119863) = cosine (119910119876 119910119863) =119910119879119876119910119863100381710038171003817100381711991011987610038171003817100381710038171003817100381710038171003817119910119863

1003817100381710038171003817 (3)

The process to calculate the joint score for each predictedresult (119905119910119901 (119905119894 119886119895 119886119896)) is described as follows

Step 1 Calculate the similarity between the sentence 1199041015840 wherethe predicted result is located and all the related sentencesin 119889 Denote 119889 = 1199041 1199042 119904119899 as the set of sentences thatcontain the same trigger with 1199041015840 and obtain the maximumvalue

119878119894119898 (1199041015840 119889) =

max1le119894le119899

119877 (1199041015840 119904119894) 119904119894 isin 119889 119889 = 00 otherwise

(4)

Step 2 Compute the importance of the trigger 119875119877 = (119905119910119901(119905119894 119886119895 119886119896)) 119905119910119901 isin 119890V119890119899119905119879119910119901 119886119895 = 0

1198751 =119891 (119905119905119910119901119894 )

sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

1198752 =sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

sum119905119894isin119863119891 (119905119894)

119875119905119894 =11990811198751 + 119908211987521199081 + 1199082

(5)

where1198751 and1198752 are the importance of trigger in training data119891(119905119905119910119901119894 ) refers to the number of trigger 119905119894 in the event type 1199051199101199011199081 is the number of trigger 119905119894 belonging to 119905119910119901 in the predictedresult set 119875119877 1199082 is the number of trigger 119905119894 in the predictedresult set 119875119877 and 119890V119890119899119905119879119910119901 is the event type described inTable 1

Step 3 Combine 119875119905119894 and 119878119894119898(119905119894 119886119895 119886119896) to score the predictedresults The calculation formula is given as follows

119878119888119900119903119890 (119905119894 119886119895 119886119896) = (1 minus 120590) 119875119905119894 + 120590119878119894119898 (1199041015840 119889)

(119905119894 119886119895 119886119896) isin 1199041015840(6)

where 120590 represents a weight The sentence similarity compu-tation is based on the semantic analysis which can correctthe false positive example very well Therefore weight 120590 informula (6) will be given a higher value

Step 4 Given a threshold 120575 if 119878119888119900119903119890(119905119894 119886119895 119886119896) lt 120575 theexample is considered as negative

4 Experiments

41 Experimental Setup The experiments on GErsquo11 andGErsquo13 corpus are conducted Nine types of events weredefined in GErsquo11 and extended to fourteen types of eventsin GErsquo13 The study presented in this paper is still based onthe nine types defined in GErsquo11 The data sets of GErsquo11 andGErsquo13 are different No abstracts were included in GErsquo13and the number of papers in GErsquo13 is more than that of thepapers in GErsquo11 Table 3 shows the statistics on the differentdata sets We merge GErsquo11 and GErsquo13 training data anddevelopment data as the final training dataThe final trainingdata which eliminate duplicate papers contain 16375 eventsAll parameters of our system have been optimized onthe development set The approximate spanapproximaterecursive assessment is reported using the online toolprovided by the shared task organizers Our method ismainly divided into three steps sample selection based onthe sequential pattern for imbalanced data pairs and tripletsextraction formultiargument events and integration betweenthem and a joint scoring mechanism based on sentencesemantic similarity and trigger importance

411 Filter Imbalanced Data In the selection stage forsequential patterns sample we optimize the parameters ofthe sequence pattern on the GErsquo11 development set where adifferent sequence pattern of minimum support and thresh-old results in different 119865-score We merge GErsquo11 and GErsquo13training data as training data of optimized parameters Weaim to improve the recall through sequential pattern sampleselection in the event extraction and then to improve theprecision of each event while maintaining the recall thusimproving the final 119865-score

Table 4 shows the ratio of the positive and negativesamples after different minimum support and thresholdselection by the sequential pattern on the parameter trainingdata Here we use the number of events as the number ofsamples The ratio of positive samples and negative samplesis 1 13163 in the annotated corpus Reducing the negativesamples too much or too little will lead to offset datathus affecting the classifier performance which is not ouroriginal intention Therefore we choose to reduce about50 of the negative samples by setting a minimum supportand threshold (119898119894119899119904119906119901 Θ) = (119898119894119899119904119906119901Θ) | 119898119894119899119904119906119901 isin4 5 Θ isin 2 3 Figure 4(a) shows the 119865-score of the foursequences in the GErsquo11 development set when the minimumsupport is 4 and the threshold is 2 the119865-score of each event issignificantly higher than that of the other sequences Table 5shows the ratio of the positive and negative samples afterdifferent minimum support and threshold selection by the


70

60

50

40

30

20

10

0

F-s

core

Event typeALLREGBindingSVT

Minsup 4 threshold 2Minsup 4 threshold 3


(a)

Minsub 4 threshold 2

90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

Figure 4 (a) The 119865-score of four sequences on the GErsquo11 development data (b) Comparison of the 119865-score of the sequence and original foreach event

Table 4 The ratio of the positive and negative samples on the training data

119883-119884 3-3 4-1 4-2 4-3 5-1 5-2 5-3 6-1 OriginalPositive samples 11419Negative samples 60742 77829 70901 66809 84910 74851 69180 85247 150308P N 1 5319 1 6816 1 6209 1 5851 1 7436 1 6555 1 6058 1 7465 1 13163119883 is minimum support (minsup) 119884 is threshold Θ and P N is the ratio of the positive and negative samples

sequential pattern on the final training data From Tables 4and 5 the ratio of the positive and negative samples is verycloseTherefore wewill use the sequencewhere theminimizesupport is 4 and the threshold is 2 on GErsquo11 and GErsquo13 testset Figure 4(b) shows that almost the 119865-score of every eventis less than the original model which is pairwise model onthe GErsquo11 development set after the sample selection basedon the sequential pattern Given that we reduce the negativesamples resulting in high recall and lower prediction wepropose a joint scoringmechanism to improve the predictionperformance

412 The Integration of Multiargument Events The resultsin Table 6 show that the recall and 119865-score have beensignificantly improved by extracting directly the triplets ofthe Binding events The REG event class includes nestedevents therefore the extraction of the multiargument hashigh complexity We only study the Binding events of themultiargument events in this paper This method whichextracts the triplets directly for themultiargument events willnot cause cascading errors Therefore it is effective to extractthe triplets of the events

42 Results and Discussion

421 Result on BioNLP-ST GENIA 2011 We evaluate theperformance of the method and compare it with the resultsof other systems Table 7 shows the results of the methodusing the official GErsquo11 online assessment tool Given thatGErsquo11 corpus contains abstract and full text we evaluate theperformance on whole abstract and full text The results ofabstract and the full text as well as the whole results arereported to illustrate that the method is outstanding in clas-sifying events Table 7 shows that the 119865-score of the full textis higher than the 119865-score of the abstract which is 8132 and7144 in the simple event respectively However the 119865-scoreof the abstract is higher than the 119865-score of the full text inBIND event class which is 5417 and 4593 respectively The119865-score of abstract is also higher than the 119865-score of the fulltext in REG event class which is 4136 and 4110 respectivelyThe total 119865-score of the full text is higher than that of theabstract which is 5423 and 5364 respectively Table 7 alsoillustrates that the method performs well on full text

Table 8 shows the comparison results of the proposedmethod with other GErsquo11 systems The results for FAUST


Table 5 The ratio of the positive and negative samples on the final training data


Table 6 Results withwithout triplets of the Binding events

Binding events Without triplets With triplets119877 119875 119865 119877 119875 119865

GErsquo11 4257 4976 4588 4766 5652 5171GErsquo13 3003 3135 3067 3934 4411 4159Performance is shown in recall (119877) precision (119875) and 119865-score (119865)

Table 7 Results of the proposed method on GErsquo11 test set

Event class Event type Whole Abstract Full text119877 119875 119865 119877 119875 119865 119877 119875 119865

SVT

Gene expression 7555 8019 7780 7244 7771 7498 8357 8635 8494Transcription 5287 6866 5974 5474 7075 6173 4595 6071 5231

Protein catabolism 6667 7692 7143 6429 8182 7200 10000 5000 6667Phosphorylation 8000 8810 8385 7704 8814 8221 8800 8800 8800Localization 3560 8608 5037 3276 9500 4872 6471 5789 6111

Total 6860 8034 7401 6497 7934 7144 7974 8297 8132BIND Binding 4766 5652 5171 4957 5972 5417 4306 4921 4593

REG

Regulation 3247 4753 3858 3196 5138 3941 3404 3902 3636Positive regulation 4103 4235 4168 4110 4064 4087 4087 4653 4352Negative regulation 3818 4638 4188 4037 4857 4409 3385 4194 3746

Total 3897 4388 4128 3932 4362 4136 3820 4446 4110ALL total 5035 5779 5381 4997 5790 5364 5129 5752 5423Performance is shown in recall (119877) precision (119875) and 119865-score (119865)

UMass UTurkuMSR-NLP and STSSmodels are reproducedfrom [7 24]Our approach in full text obtains the best119865-scoreof 5423 This score is higher than the best extraction systemsuch as FAUST of GErsquo11 (156 points) the STSS and UTurku(about 35 points)The performance in precision and recall offull text is also better than that of other systems However theprecision and recall of SVT and REG event classes are slightlylower than FAUST and UMass in the abstract but they arehigher in Binding events However the whole 119865-score isslightly lower than that of the FAUST and UMass and higherthan the UTurku STSS and MSR-NLP However the recallhas achieved the highest score which is mainly due to thesequential pattern sample selection of the unbalanced data

422 Result on BioNLP-ST GENIA 2013 The pipelineapproaches are the best methods performing on the GErsquo13where EVEX is the official winner We train the model onthe training set and development set and evaluate it on thetest set using the official GErsquo13 online assessment tool Table 9shows the evaluation results GErsquo13 test data does not containabstracts therefore we evaluate the performance on fullpapers only Table 10 shows the comparison results of our

method with other GErsquo13 systems including TEES 21 andEVEX because they belong to the pipeline model We addBioSEM system in the table which is a rule-based system andhas achieved the best results in the Binding eventsThe resultsfor TEES 21 EVEX and BioSEM are reproduced from [8]Table 10 indicates that ourmethod is significantly higher thanthe other systems in the recall rate Our recall rate is 4865and TEES 21 EVEX and BioSEM are 4660 4587 and 4284respectively The 119865-score of our system is 5217 while the 119865-scores of TEES 21 EVEX and BioSEM are 5100 5124 and5094 respectively Although the total 119865-score of our systemis better than the systems previously mentioned it does notachieve the desired effect on Binding events This may be theresult of a not ideal extracting pair in the Binding simpleevent resulting in poor integration of the pairs and triplets inBinding events after the triplets extraction In general theseresults clearly demonstrate the effectiveness of the proposedmethod

5 Conclusions

In this study a new event extraction system was introducedComparing our system with other event extraction systems


Table 8 Comparison with other systems on GErsquo11 test set

System Event classSVT BIND REG ALL

FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984

STSSW mdash mdash mdash mdashA 649776657033 452449844743 404148834422 500659335430F 781875637688 375031583429 349944693925 483153435074

OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423

Evaluation results (recallprecision119865-score) in whole data set (W) abstracts only (A) and full papers only (F)


Event class Event type 119877 119875 119865

SVT

Gene expression 8288 7991 8136Transcription 5248 6543 5824

Protein catabolism 6429 5000 5625Phosphorylation 8125 7602 7855Localization 3131 8378 4559

Total 7412 7756 7580BIND Binding 3934 4411 4159

REG

Regulation 2361 3908 2944Positive regulation 3956 4671 4284Negative regulation 3973 4624 4274

Total 3724 4574 4105ALL total 4865 5624 5217Performance is shown in recall (119877) precision (119875) and 119865-score (119865)

we obtained some positive results First we proposed anew method of sample selection based on the sequentialpattern to balance the data set which has played an impor-tant role in the process of biomedical event extractionSecond taking into account the relevance of the triggerand argument of multiargument events the system extractsthe pair (trigger argument) and triplet (trigger argumentargument2) at the same time The integration of the pair andtriplet improves the performance of multiargument eventsprediction which improves the119865-score aswell Finally a jointscoring mechanism based on C-DSSM and the importanceof the trigger is proposed to correct the predictions Ingeneral samples selection based on the sequential patternachieved the desired effectiveness and it was combinedwith the joint scoring mechanism to further improve the

performance of the system The performance of this methodwas evaluated with extensive experiments Although ourmethod is a supervised learning method we provide a newidea on constructing a good predictive model because highrecall can be used in disease genes Although numerousefforts are made the extraction of complex events is still ahuge challenge In the future we will further optimize thejoint scoring mechanism and integrate external resourcesinto biomedical event extraction through semisupervised orunsupervised approach

Competing Interests

The authors declare that they have no competing interests




TEES 21 745277737609 423444344332 330844783805 466056325100EVEX 738277737572 411444774288 324147163841 458758035124BioSEM 700985607707 474552324976 281949063580 428462835094Ours 741277567580 393444114159 372445744105 486556245217Performance is shown in recall (119877) precision (119875) and 119865-score (119865)

Acknowledgments

This work was supported by the National Natural Sci-ence Foundation of China (no 61373067) and Science andTechnology Development Program of Jilin Province (no20140101195JC)

References

[1] Q-X Chen H-H Hu Q Zhou A-M Yu and S Zeng ldquoAnoverview of ABC and SLC drug transporter gene regulationrdquoCurrent Drug Metabolism vol 14 no 2 pp 253ndash264 2013

[2] C Gilissen A Hoischen H G Brunner and J A VeltmanldquoDisease gene identification strategies for exome sequencingrdquoEuropean Journal of HumanGenetics vol 20 no 5 pp 490ndash4972012

[3] I Segura-Bedmar P Martınez and C de Pablo-SanchezldquoExtracting drug-drug interactions from biomedical textsrdquoBMC Bioinformatics vol 11 supplement 5 article P9 2010

[4] M Krallinger F Leitner C Rodriguez-Penagos and A Valen-cia ldquoOverview of the protein-protein interaction annotationextraction task of BioCreative IIrdquo Genome Biology vol 9supplement 2 article S4 2008

[5] J Hakenberg C Plake L Royer H Strobelt U Leser andM Schroeder ldquoGene mention normalization and interactionextraction with context models and sentence motifsrdquo GenomeBiology vol 9 supplement 2 p S14 2008

[6] J-D Kim T Ohta S Pyysalo Y Kano and J Tsujii ldquoOverviewof BioNLPrsquo09 shared task on event extractionrdquo in Proceedingsof the Workshop on Current Trends in Biomedical Natural Lan-guage Processing Shared Task (BioNLP rsquo09) pp 1ndash9 Associationfor Computational Linguistics 2009

[7] J-D Kim Y Wang T Takagi and A Yonezawa ldquoOverview ofGenia event task in BioNLP Shared Task 2011rdquo in Proceedingsof the BioNLP Shared Task Workshop pp 7ndash15 Association forComputational Linguistics 2011

[8] J-D Kim Y Wang and Y Yasunori ldquoThe genia event extrac-tion shared task 2013 edition-overviewrdquo in Proceedings ofthe BioNLP Shared Task Workshop pp 5ndash18 Association forComputational Linguistics Sofia Bulgaria August 2013

[9] J-D Kim S Pyysalo T Ohta R Bossy N Nguyen and J TsujiildquoOverview of BioNLP shared task 2011rdquo in Proceedings of theBioNLP Shared Task Workshop (BioNLP Shared Task rsquo11) pp 1ndash6 Association for Computational Linguistics 2011

[10] M N Wass G Fuentes C Pons F Pazos and A ValencialdquoTowards the prediction of protein interaction partners usingphysical dockingrdquoMolecular Systems Biology vol 7 article 4692011

[11] X-W Chen and J C Jeong ldquoSequence-based prediction ofprotein interaction sites with an integrative methodrdquo Bioinfor-matics vol 25 no 5 pp 585ndash591 2009

[12] Q-C Bui and P Sloot ldquoExtracting biological events fromtext using simple syntactic patternsrdquo in Proceedings of theBioNLP SharedTask 2011Workshop pp 143ndash146 Association forComputational Linguistics Portland Ore USA June 2011

[13] K B Cohen K Verspoor H L Johnson et al ldquoHigh-precisionbiological event extraction with a concept recognizerrdquo inProceedings of the Workshop on Current Trends in BiomedicalNatural Language Processing Shared Task (BioNLP rsquo09) pp 50ndash58 Association for Computational Linguistics 2009

[14] K Kaljurand G Schneider and F Rinaldi ldquoUZurich in theBioNLP 2009 shared taskrdquo in Proceedings of the Workshopon Current Trends in Biomedical Natural Language ProcessingShared Task (BioNLP rsquo09) pp 28ndash36 Association for Compu-tational Linguistics 2009

[15] H Kilicoglu and S Bergler ldquoAdapting a general semanticinterpretation approach to biological event extractionrdquo in Pro-ceedings of the BioNLP Shared Task 2011 Workshop pp 173ndash182Association for Computational Linguistics 2011

[16] J Bjorne J Heimonen F Ginter A Airola T Pahikkalaand T Salakoski ldquoExtracting complex biological events withrich graph-based feature setsrdquo in Proceedings of the Workshopon Current Trends in Biomedical Natural Language ProcessingShared Task (BioNLP rsquo09) pp 10ndash18 Association for Computa-tional Linguistics 2009

[17] M Miwa R Saeligtre J-D Kim and J Tsujii ldquoEvent extractionwith complex event classification using rich featuresrdquo Journal ofBioinformatics and Computational Biology vol 8 no 1 pp 131ndash146 2010

[18] E Buyko E Faessler J Wermter and U Hahn ldquoEvent extrac-tion from trimmed dependency graphsrdquo in Proceedings of theWorkshop on Current Trends in Biomedical Natural LanguageProcessing Shared Task (BioNLP rsquo09) pp 19ndash27 Association forComputational Linguistics 2009

[19] K Veropoulos C Campbell and N Cristianini ldquoControllingthe sensitivity of support vector machinesrdquo in Proceedings of theInternational Joint Conference on AI pp 55ndash60 1999

[20] R Akbani S Kwek and N Japkowicz ldquoApplying supportvector machines to imbalanced datasetsrdquo inMachine LearningECML 2004 15th European Conference on Machine LearningPisa Italy September 20ndash24 2004 Proceedings vol 3201 ofLecture Notes in Computer Science pp 39ndash50 Springer BerlinGermany 2004

[21] Y Tang Y-Q Zhang and N V Chawla ldquoSVMs modeling forhighly imbalanced classificationrdquo IEEETransactions on SystemsMan and Cybernetics Part B Cybernetics vol 39 no 1 pp 281ndash288 2009


[22] S Tong and D Koller ldquoSupport vector machine active learningwith applications to text classificationrdquo The Journal of MachineLearning Research vol 2 pp 45ndash66 2002

[23] J Li J M Bioucas-Dias and A Plaza ldquoSpectralndashspatial classifi-cation of hyperspectral data using loopy belief propagation andactive learningrdquo IEEE Transactions on Geoscience and RemoteSensing vol 51 no 2 pp 844ndash856 2013

[24] T Munkhdalai O Namsrai and K Ryu ldquoSelf-training insignificance space of support vectors for imbalanced biomedicalevent datardquo BMC Bioinformatics vol 16 supplement 7 p S62015

[25] A Criminisi J Shotton and E Konukoglu ldquoDecision forestsa unified framework for classification regression densityestimation manifold learning and semi-supervised learningrdquoFoundations and Trends in Computer Graphics and Vision 723vol 7 no 2-3 pp 81ndash227 2012

[26] A Vlachos P Buttery DO Seaghdha andT Briscoe ldquoBiomed-ical event extraction without training datardquo in Proceedingsof the Workshop on Current Trends in Biomedical NaturalLanguage Processing Shared Task pp 37ndash40 Association forComputational Linguistics Boulder Colo USA June 2009

[27] J Hakenberg I Solt D Tikk et al ldquoMolecular event extractionfrom link grammar parse treesrdquo in Proceedings of the Workshopon Current Trends in Biomedical Natural Language ProcessingShared Task (BioNLP rsquo09) pp 86ndash94 Association for Compu-tational Linguistics 2009

[28] W J Hou and B Ceesay ldquoEvent extraction for gene regulationnetwork using syntactic and semantic approachesrdquo in CurrentApproaches in Applied Artificial Intelligence M Ali Y S KwonC-H Lee J Kim and Y Kim Eds vol 9101 of Lecture Notesin Computer Science pp 559ndash570 Springer Berlin Germany2015

[29] X Q Pham M Q Le and B Q Ho ldquoA hybrid approachfor biomedical event extractionrdquo in Proceedings of the BioNLPShared TaskWorkshop Association for Computational Linguis-tics 2013

[30] J Bjorne F Ginter and T Salakoski ldquoUniversity of Turkuin the BioNLPrsquo11 shared taskrdquo BMC Bioinformatics vol 13supplement 11 article S4 2012

[31] J Bjorne and T Salakoski ldquoTEES 21 automated annotationscheme learning in the BioNLP 2013 shared taskrdquo in Proceedingsof the BioNLP Shared Task Workshop 2013

[32] K Hakala S Van Landeghem T Salakosk Y Van de Peerand F Ginter ldquoEVEX in STrsquo13 application of a large-scale textmining resource to event extraction and network constructionrdquoinProceedings of the BioNLP Shared TaskWorkshop Associationfor Computational Linguistics 2013

[33] D Zhou D Zhong and Y He ldquoEvent trigger identificationfor biomedical events extraction using domain knowledgerdquoBioinformatics vol 30 no 11 pp 1587ndash1594 2014

[34] S Pyysalo T Ohta M Miwa H-C Cho J Tsujii and S Ana-niadou ldquoEvent extraction across multiple levels of biologicalorganizationrdquo Bioinformatics vol 28 no 18 pp i575ndashi581 2012

[35] D Campos Q-C Bui S Matos and J L Oliveira ldquoTrigNERautomatically optimized biomedical event trigger recognitionon scientific documentsrdquo Source Code for Biology andMedicinevol 9 no 1 article 1 2014

[36] D McClosky M Surdeanu and C D Manning ldquoEventextraction as dependency parsingrdquo in Proceedings of the 49thAnnualMeeting of the Association for Computational LinguisticsHuman Language Technologies (HLT rsquo11) vol 1 pp 1626ndash1635Stroudsburg Pa USA June 2011

[37] L Li S Liu M Qin Y Wang and D Huang ldquoExtractingbiomedical event with dual decomposition integrating wordembeddingsrdquo IEEEACM Transactions on Computational Biol-ogy and Bioinformatics vol 13 no 4 pp 669ndash677 2016

[38] A Ozgur and D R Radev ldquoSupervised classification forextracting biomedical eventsrdquo in Proceedings of the Workshopon Current Trends in Biomedical Natural Language ProcessingShared Task Association for Computational Linguistics 2009

[39] X Liu A Bordes and Y Grandvalet ldquoBiomedical event extrac-tion by multi-class classification of pairs of text entitiesrdquo inProceedings of the BioNLP Shared Task Workshop 2013

[40] J Pei J Han BMortazavi-Asl et al ldquoMining sequential patternsefficiently by prefix-projected pattern growthrdquo in Proceedingsof the 17th International Conference on Data Engineering (ICDErsquo01) pp 215ndash224 Heidelberg Germany 2001

[41] P-S Huang X He J Gao L Deng A Acero and L HeckldquoLearning deep structured semantic models for web searchusing clickthrough datardquo in Proceedings of the 22nd ACM Inter-national Conference on Information amp Knowledge Management(CIKM rsquo13) pp 2333ndash2338 ACM San Francisco Calif USANovember 2013

[42] Y Shen X He J Gao L Deng and G Mesnil ldquoLearningsemantic representations using convolutional neural networksfor web searchrdquo in Proceedings of the the 23rd InternationalConference on World Wide Web pp 373ndash374 ACM SeoulRepublic of Korea April 2014

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


MEDIATORSINFLAMMATION

of


Behavioural Neurology

EndocrinologyInternational Journal of



Disease Markers


BioMed Research International

OncologyJournal of



Oxidative Medicine and Cellular Longevity


PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of



Computational and Mathematical Methods in Medicine

OphthalmologyJournal of


Diabetes ResearchJournal of



Research and TreatmentAIDS


Gastroenterology Research and Practice


Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom


Theme E1 CauseProtein

Protein

Since IRF-4 ExpressionGene-expression

Theme

IFN-alpha therapy demethylation and successful

in cell lines and CML can be induced by

E2Positive_regulation

Figure 1 Structured representation of biomedical event

Table 1 Class event types and their arguments for the GE task

Event class Event type Primaryargument

Secondaryargument

SVT

Gene expression Theme(P)Transcription Theme(P)

Localization Theme(P) AtLocToLoc

Protein catabolism Theme(P)Phosphorylation Theme(P) Site

BIND Binding Theme(P)+ Site+

REG

Regulation Theme(PEv)Cause(PEv) Site Csite

Positive regulation Theme(PEv)Cause(PEv) Site Csite

Negative regulation Theme(PEv)Cause(PEv) Site Csite

P is protein Ev is event

and two events can be expressed as E1 Gene expressionldquoexpressionrdquo Theme ldquoIRF-4rdquo and E2 Positive regulationldquoinducedrdquoTheme E1 Cause ldquoIFN-alphardquoWe aim to extractthese event structures from the text automatically

Pattern-based methods are used in biomedical relationextraction [10 11] but are less used in biomedical event extrac-tion These methods mainly extract the relations betweenentities by manually defined patterns and automaticallylearned patterns from the training data set Rule-basedmethods [12ndash15] and machine learning-based methods [16ndash18] are the main methods in an event extraction task Rule-based methods are similar to the pattern-based methodswhich manually define syntax rules and learn new rules fromthe training data Machine learning-based methods regardthe extraction task as a classification problem The problemof highly unbalanced training data sets in biomedical eventextraction is seldom addressed by most systems The solu-tions with support vector machines (SVMs) usually use thesimple class weighting strategy [19ndash21] Other approachessuch as active learning [22 23] and semisupervised learning[24 25] solve this problem by increasing the positive samplesize In this study a sample selection method based ona sequential pattern is proposed to solve the problem ofimbalanced data in classification and a joint scoring mech-anism based on sentence semantic similarity and importanceof triggers is introduced to correct further false positivepredictions

The paper is organized as follows related work is pre-sented in Section 2 Our work the sequence pattern-basedsample selection algorithm detection of multiargumentevents and the joint scoring mechanism are presented inSection 3 Section 4 describes the experiment results in GErsquo11and in GErsquo13 test sets Finally a conclusion is presented inSection 5

2 Related Work

Since the organizers of the BioNLP-ST held the first com-petition on the fine-grained information extraction task ofbiomedical events in 2009 a variety of methods have beenproposed to solve the task At present the event extractionsystems are mainly divided into two types rule-based eventextraction systems andmachine learning-based event extrac-tion systems The overview papers of BioNLP-ST 2011 and2013 [7 8] show that the results of machine learning-basedmethods are better than the results of rule-based methods

Rule-based event extraction systems [26ndash29] are basedon sentence structure grammatical relation and semanticrelation which make it more flexible However the resultsobtained by those methods have high precision and lowrecall which are noticeable in simple event extraction Toimprove recall rule-based event extraction systems are forcedto relax constraints in the automatic access of learning rules

The system based on machine learning is generallydivided into three groups The first group is the pipelinemodel [30ndash32] which has an event extraction process thatcan be divided into three steps The first step predicts thetriggerThe second step is the edge detection and assignmentof arguments based on the first step The final step is theevent element detection The pipeline model in the eventextraction task has achieved excellent results such as thechampion of GErsquo09 [30] (Turku) and the champion of GErsquo13[32] (EVEX) Zhou et al [33] proposed a novel method basedon the pipeline model for event trigger identification Theyembed the knowledge learned from a large text corpus intoword features using neural language modeling Experimentalresults show that the 119865-score of event trigger identificationimproves by 25 compared with the approach proposed in[34] Campos et al [35] optimized the feature set and trainingarguments for each event type but only predicted the events inthe GErsquo09 test sets A linear SVM with ldquoone-versus-the-restrdquomulticlass strategy is used to solve multiclass and multilabelclassification problems based on an imbalanced data set ateach stage Although the performance of the pipeline model









Testing corpus

Prediction1

Prediction2


Preprocessing

Final prediction








3 Methods



















119865(119888119894 119886119895) = sum119904119894 119904119894 sube 119871 (119888119894 119886119895) 119904119894 isin 119871119878 (1)










⟨s⟩⟨s⟩ w1

Semantic layer

Max-pooling layer

Convolutional layer

Word hashing layer

Word sequence

Max-pooling















1198971 = 1198821119909

119897119894 = 119891 (119882119894119897119894minus1 + 119887119894) 119894 = 2 119873 minus 1

119910 = 119891 (119882119873119897119873minus1 + 119887119873) (2)







119877 (119876119863) = cosine (119910119876 119910119863) =119910119879119876119910119863100381710038171003817100381711991011987610038171003817100381710038171003817100381710038171003817119910119863

1003817100381710038171003817 (3)



119878119894119898 (1199041015840 119889) =

max1le119894le119899

119877 (1199041015840 119904119894) 119904119894 isin 119889 119889 = 00 otherwise

(4)


1198751 =119891 (119905119905119910119901119894 )

sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

1198752 =sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

sum119905119894isin119863119891 (119905119894)

119875119905119894 =11990811198751 + 119908211987521199081 + 1199082

(5)



119878119888119900119903119890 (119905119894 119886119895 119886119896) = (1 minus 120590) 119875119905119894 + 120590119878119894119898 (1199041015840 119889)

(119905119894 119886119895 119886119896) isin 1199041015840(6)



4 Experiments





70

60

50

40

30

20

10

0

F-s

core




(a)


90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

















SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of
























Testing corpus

Prediction1

Prediction2


Preprocessing

Final prediction








3 Methods



















119865(119888119894 119886119895) = sum119904119894 119904119894 sube 119871 (119888119894 119886119895) 119904119894 isin 119871119878 (1)










⟨s⟩⟨s⟩ w1

Semantic layer

Max-pooling layer

Convolutional layer

Word hashing layer

Word sequence

Max-pooling















1198971 = 1198821119909

119897119894 = 119891 (119882119894119897119894minus1 + 119887119894) 119894 = 2 119873 minus 1

119910 = 119891 (119882119873119897119873minus1 + 119887119873) (2)







119877 (119876119863) = cosine (119910119876 119910119863) =119910119879119876119910119863100381710038171003817100381711991011987610038171003817100381710038171003817100381710038171003817119910119863

1003817100381710038171003817 (3)



119878119894119898 (1199041015840 119889) =

max1le119894le119899

119877 (1199041015840 119904119894) 119904119894 isin 119889 119889 = 00 otherwise

(4)


1198751 =119891 (119905119905119910119901119894 )

sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

1198752 =sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

sum119905119894isin119863119891 (119905119894)

119875119905119894 =11990811198751 + 119908211987521199081 + 1199082

(5)



119878119888119900119903119890 (119905119894 119886119895 119886119896) = (1 minus 120590) 119875119905119894 + 120590119878119894119898 (1199041015840 119889)

(119905119894 119886119895 119886119896) isin 1199041015840(6)



4 Experiments





70

60

50

40

30

20

10

0

F-s

core




(a)


90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

















SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of




























119865(119888119894 119886119895) = sum119904119894 119904119894 sube 119871 (119888119894 119886119895) 119904119894 isin 119871119878 (1)










⟨s⟩⟨s⟩ w1

Semantic layer

Max-pooling layer

Convolutional layer

Word hashing layer

Word sequence

Max-pooling















1198971 = 1198821119909

119897119894 = 119891 (119882119894119897119894minus1 + 119887119894) 119894 = 2 119873 minus 1

119910 = 119891 (119882119873119897119873minus1 + 119887119873) (2)







119877 (119876119863) = cosine (119910119876 119910119863) =119910119879119876119910119863100381710038171003817100381711991011987610038171003817100381710038171003817100381710038171003817119910119863

1003817100381710038171003817 (3)



119878119894119898 (1199041015840 119889) =

max1le119894le119899

119877 (1199041015840 119904119894) 119904119894 isin 119889 119889 = 00 otherwise

(4)


1198751 =119891 (119905119905119910119901119894 )

sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

1198752 =sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

sum119905119894isin119863119891 (119905119894)

119875119905119894 =11990811198751 + 119908211987521199081 + 1199082

(5)



119878119888119900119903119890 (119905119894 119886119895 119886119896) = (1 minus 120590) 119875119905119894 + 120590119878119894119898 (1199041015840 119889)

(119905119894 119886119895 119886119896) isin 1199041015840(6)



4 Experiments





70

60

50

40

30

20

10

0

F-s

core




(a)


90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

















SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of






















⟨s⟩⟨s⟩ w1

Semantic layer

Max-pooling layer

Convolutional layer

Word hashing layer

Word sequence

Max-pooling















1198971 = 1198821119909

119897119894 = 119891 (119882119894119897119894minus1 + 119887119894) 119894 = 2 119873 minus 1

119910 = 119891 (119882119873119897119873minus1 + 119887119873) (2)







119877 (119876119863) = cosine (119910119876 119910119863) =119910119879119876119910119863100381710038171003817100381711991011987610038171003817100381710038171003817100381710038171003817119910119863

1003817100381710038171003817 (3)



119878119894119898 (1199041015840 119889) =

max1le119894le119899

119877 (1199041015840 119904119894) 119904119894 isin 119889 119889 = 00 otherwise

(4)


1198751 =119891 (119905119905119910119901119894 )

sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

1198752 =sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

sum119905119894isin119863119891 (119905119894)

119875119905119894 =11990811198751 + 119908211987521199081 + 1199082

(5)



119878119888119900119903119890 (119905119894 119886119895 119886119896) = (1 minus 120590) 119875119905119894 + 120590119878119894119898 (1199041015840 119889)

(119905119894 119886119895 119886119896) isin 1199041015840(6)



4 Experiments





70

60

50

40

30

20

10

0

F-s

core




(a)


90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

















SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of





















119877 (119876119863) = cosine (119910119876 119910119863) =119910119879119876119910119863100381710038171003817100381711991011987610038171003817100381710038171003817100381710038171003817119910119863

1003817100381710038171003817 (3)



119878119894119898 (1199041015840 119889) =

max1le119894le119899

119877 (1199041015840 119904119894) 119904119894 isin 119889 119889 = 00 otherwise

(4)


1198751 =119891 (119905119905119910119901119894 )

sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

1198752 =sum119905119910119901isin119890V119890119899119905119879119910119901 119891 (119905119905119910119901119894 )

sum119905119894isin119863119891 (119905119894)

119875119905119894 =11990811198751 + 119908211987521199081 + 1199082

(5)



119878119888119900119903119890 (119905119894 119886119895 119886119896) = (1 minus 120590) 119875119905119894 + 120590119878119894119898 (1199041015840 119889)

(119905119894 119886119895 119886119896) isin 1199041015840(6)



4 Experiments





70

60

50

40

30

20

10

0

F-s

core




(a)


90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

















SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of

















70

60

50

40

30

20

10

0

F-s

core




(a)


90

80

70

60

50

40

30

20

F-s

core

Gen

e_ex

pres

sion

Tran

scrip

tion

Prot

ein_

cata

bolis

m

Phos

phor

ylat

ion

Loca

lizat

ion

Bind

ing

Regu

latio

n

Posit

ive_

regu

latio

n

Neg

ativ

e_re

gula

tion

Event type

Original(b)

















SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of
























SVT




REG






5 Conclusions





FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of



















FAUSTW 6847 80257390 442053714849 380254944494 494164755604A 661681047285 455358095105 393858184697 500067535746F 755878237688 409744704275 349948244056 479258475267

UMassW 670181407350 429756424879 375252674382 484964085520A 642180747154 435260895076 387855074551 487465945605F 755883147918 416747624444 347247514012 347247514012

UTurkuW 682276477211 429743604328 387247644272 495657655330A 649776727036 452450004750 404149014430 500659485437F 781875827698 375031763439 349944463916 483153385072

MSR-NLPW 689974307154 423640474139 366444084002 486454715150A 659974717008 432344514386 371445384085 485256475220F 78187324 7563 402832773614 355241343821 489450774984


OursW 686080347401 476656525171 389743884128 503557795381A 649779347144 495759725417 393243624136 499757905364F 797482978132 430649214593 382044464110 512957525423




SVT




REG





Competing Interests






Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of




















Acknowledgments


References

















































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of











































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of
















research article a novel sample selection strategy for...

Documents