
Nearly-Unsupervised Hashcode Representations for Relation Extraction

Sahil Garg
USC ISI
[email protected]

Aram Galstyan
USC ISI
[email protected]

Greg Ver Steeg
USC ISI
[email protected]

Guillermo Cecchi
IBM Research, NY
[email protected]

Abstract

Recently, kernelized locality sensitive hashcodes have been successfully employed as representations of natural language text, especially showing high relevance to biomedical relation extraction tasks. In this paper, we propose to optimize the hashcode representations in a nearly unsupervised manner, in which we only use data points, but not their class labels, for learning. The optimized hashcode representations are then fed to a supervised classifier following the prior work. This nearly unsupervised approach allows fine-grained optimization of each hash function, which is particularly suitable for building hashcode representations generalizing from a training set to a test set. We empirically evaluate the proposed approach for biomedical relation extraction tasks, obtaining significant accuracy improvements w.r.t. state-of-the-art supervised and semi-supervised approaches.

1 Introduction

In natural language processing, one important but highly challenging task is identifying biological entities and their relations from biomedical text, as illustrated in Fig. 1, which is relevant for real-world problems such as personalized cancer treatments (Rzhetsky, 2016; Hahn and Surdeanu, 2015; Cohen, 2015). In previous works, the task of biomedical relation extraction is formulated as binary classification of natural language structures. One of the primary challenges in solving the problem is that the number of data points annotated with class labels (in a training set) is small due to the high cost of annotations by biomedical domain experts; further, bio-text sentences in a test set can vary significantly w.r.t. the ones from a training set due to practical aspects such as high diversity of research topics, writing styles, hedging, etc. Considering such challenges for the task, many classification models based on kernel similarity functions have been proposed (Garg et al., 2016; Chang et al., 2016; Tikk et al., 2010; Miwa et al., 2009; Airola et al., 2008; Mooney and Bunescu, 2005), and recently, many neural network based classification models have also been explored (Kavuluru et al., 2017; Peng and Lu, 2017; Hsieh et al., 2017; Rao et al., 2017; Nguyen and Grishman, 2015), including ones doing adversarial learning using the knowledge of data points (excluding class labels) from a test set, or semi-supervised variational autoencoders (Rios et al., 2018; Ganin et al., 2016; Zhang and Lu, 2019; Kingma et al., 2014).

In a very recent work, a kernelized locality sensitive hashcodes based representation learning approach has been proposed, which has been shown to be the most successful in terms of accuracy and computational efficiency for the task (Garg et al., 2019). The model parameters, shared between all the hash functions, are optimized in a supervised manner, whereas an individual hash function is constructed in a randomized fashion. The authors suggest encoding thousands of (randomized) semantic features, extracted from natural language data points, into binary hashcodes, and then making classification decisions from the features using hundreds of decision trees, which is the core of their robust classification approach. Even if we extract thousands of semantic features using the hashing approach, it is difficult to ensure that the features extracted from training data points would generalize to a test set.


Figure 1: On the left, we show an abstract meaning representation (AMR) of a sentence. As per the semantics of the sentence, there is a valid biomedical relationship between the two proteins, Ras and Raf, i.e., Ras catalyzes phosphorylation of Raf; the relation corresponds to a subgraph extracted from the AMR. On the other hand, one of the many invalid biomedical relationships that one could infer is, Ras catalyzes activation of Raf, for which we show the corresponding subgraph too. A given candidate relation automatically hypothesized from the sentence is binary classified, as valid or invalid, using the subgraph as features.

While the inherent randomness in constructing hash functions from a training set can help towards generalization in the absence of a test set, there should be better alternatives if we do have the knowledge of a test set of data points, or of a subset of a training set treated as a pseudo-test set. What if we construct hash functions in an intelligent manner, exploiting the additional knowledge of unlabeled data points in a test/pseudo-test set and performing fine-grained optimization of each hash function rather than relying upon randomness, so as to extract semantic features which generalize?

Along these lines, we propose a new framework for learning hashcode representations, accomplishing two important (inter-related) extensions w.r.t. the previous work:

(a) We propose a nearly unsupervised hashcode representation learning setting, in which we use only the knowledge of which set a data point comes from, a training set or a test/pseudo-test set, along with the data point itself; the actual class labels of data points from a training set are input only to the final supervised classifier, such as a Random Forest, which takes the learned hashcodes as representation (feature) vectors of data points, along with their class labels;

(b) We introduce multiple concepts for fine-grained (discrete) optimization of hash functions, employed in our novel information-theoretic algorithm that constructs hash functions greedily, one by one. In supervised settings, fine-grained (greedy) optimization of hash functions could lead to overfitting, whereas in our proposed nearly-unsupervised framework, it allows flexibility for explicitly maximizing the generalization capabilities of hash functions.

For the task of biomedical relation extraction, we evaluate our approach on four public datasets, and obtain significant gains in F1 scores w.r.t. state-of-the-art models, including kernel-based approaches as well as ones based on semi-supervised learning of neural networks. We also show how to employ our framework for learning locality sensitive hashcode representations using neural networks.1

2 Problem Formulation & Background

In Fig. 1, we demonstrate how biomedical relations between entities are extracted from the semantic (or syntactic) parse of a sentence. As we see, the task is formulated as binary classification of natural language substructures extracted from the semantic parse. Suppose we have natural language structures, $S = \{S_i\}_{i=1}^N$, such as parse trees, shortest paths, text sentences, etc., with corresponding class labels, $y = \{y_i\}_{i=1}^N$. For the data points coming from a training set and a test set, we use the notations $S^T$ and $S^*$, respectively; the same applies for the class labels.

1 Code: https://github.com/sgarg87/nearly_unsupervised_hashcode_representations


Figure 2: (a) Hash function choices. (b) Generalizing hash function. In this figure, we illustrate how to construct hash functions which generalize across a training and a test set. Blue and red colors denote training and test data points respectively. A hash function is constructed by splitting a very small subset of data points into two parts. In Fig. 2(a), given the set of four data points selected randomly, there are many choices possible for splitting the set, corresponding to different choices of hash functions, which are denoted with dashed lines. The optimal choice of hash function is shown in Fig. 2(b) since it generalizes well, assigning the same value to many data points from both training & test sets.

In addition, we define an indicator variable, $x = \{x_i\}_{i=1}^N$, for $S$, with $x_i \in \{0, 1\}$ denoting whether a data point $S_i$ comes from a test/pseudo-test set or a training set. Our goal is to infer the class labels of the data points from the test set, $y^*$.

2.1 Hashcode Representations

As per the hashcode representation approach, $S$ is mapped to a set of locality sensitive hashcodes, $C = \{c_i\}_{i=1}^N$, using a set of $H$ binary hash functions, i.e. $c_i = h(S_i) = \{h_1(S_i), \cdots, h_H(S_i)\}$. $h_l(\cdot;\theta)$ is constructed such that it splits a set of data points, $S^R_l$, into two subsets as shown in Fig. 2(a), while choosing the set as a small random subset of size $\alpha$ from the superset $S$, i.e. $S^R_l \subset S$ s.t. $|S^R_l| = \alpha \ll N$. In this manner, we can construct a large number of hash functions, $\{h_l(\cdot;\theta)\}_{l=1}^H$, from a reference set of size $M$, $S^R = \{S^R_1 \cup \cdots \cup S^R_H\}$, $|S^R| = M \leq \alpha H \ll N$.

While, mathematically, a locality sensitive hash function can be of any form, kernelized hash functions (Garg et al., 2019, 2018; Joly and Buisson, 2011) rely upon a convolution kernel similarity function $K(S_i, S_j;\theta)$, defined for any pair of structures $S_i$ and $S_j$ with kernel parameters $\theta$ (Haussler, 1999). To construct $h_l(\cdot)$, a kernel-trick based model, such as kNN or SVM, is fit to $\{S^R_l, z_l\}$, with a randomly sampled binary vector, $z_l \in \{0, 1\}^\alpha$, that defines the split of $S^R_l$. Computing the hashcode $c_i$ for $S_i$ requires only $M$ convolution-kernel similarities of $S_i$ w.r.t. the data points in $S^R$, which makes this approach highly scalable in terms of compute cost.
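To make the construction concrete, the sketch below (in Python, with plain NumPy vectors standing in for parse structures and an RBF similarity in place of the convolution kernel) builds hash functions in this style: a small random subset $S^R_l$ is drawn, a random binary split $z_l$ is assigned, and a 1-nearest-neighbour rule over the labelled subset yields one hashcode bit for any structure. All names and parameter values are illustrative, not the released implementation.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Stand-in similarity; the paper uses convolution kernels over parse structures."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def make_hash_function(S, alpha=4, rng=None):
    """Construct one locality-sensitive hash bit from a random subset of size alpha."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(S), size=alpha, replace=False)   # reference subset S_l^R
    z = rng.integers(0, 2, size=alpha)                     # random binary split z_l
    if z.min() == z.max():                                 # avoid a degenerate (one-sided) split
        z[0] = 1 - z[0]
    ref = [S[i] for i in idx]

    def h(s):
        # 1-NN in kernel space: the bit is the split label of the most similar reference point.
        sims = [rbf_kernel(s, r) for r in ref]
        return int(z[int(np.argmax(sims))])
    return h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.normal(size=(200, 10))                             # dummy "structures" as vectors
    hash_funcs = [make_hash_function(S, alpha=4, rng=rng) for _ in range(16)]
    codes = np.array([[h(s) for h in hash_funcs] for s in S])  # N x H binary hashcodes
    print(codes.shape, codes[:2])
```

Because each bit depends only on the $\alpha$ reference points of its own function, computing a full hashcode needs at most $M = \alpha H$ kernel evaluations per data point, which is what keeps the scheme cheap.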

In the previous work (Garg et al., 2019), it is proposed to optimize all the hash functions jointly by learning only the parameters which are shared amongst all the functions, i.e. learning the kernel parameters $\theta$ and the choice of reference set, $S^R \subset S^T$. This optimization is performed in a supervised manner via maximization of the mutual information between hashcodes of data points and their class labels, using $\{S^T, y^T\}$ for training.

3 Nearly-Unsupervised Hashcode Representations

Our key insight regarding the limitation of the previous approach for supervised learning of hashcode representations is that, to avoid overfitting, learning is intentionally restricted only to the optimization of shared parameters, whereas each hash function $h_l(\cdot)$ is constructed in a randomized manner, i.e. random sub-sampling of a subset, $S^R_l \subset S$, and a random split of the subset. On the other hand, in the nearly-unsupervised hashcode learning setting we introduce next, we can have the additional knowledge of data points from a test/pseudo-test set, which can be leveraged to extend the optimization from the shared (global) parameters to fine-grained optimization of hash functions, not only to avoid overfitting but for higher generalization of hashcodes across training & test sets.


Figure 3: (a) Global sampling. (b) Local sampling from a cluster. (c) Local sampling from a high-entropy cluster. In all three figures, dark-gray lines denote hash functions optimized previously, and intersections of the lines give us 2-D cells denoting hashcodes as well as clusters; black dots represent a set of data points which are used to construct a hash function, and some of the choices to split the subset are shown as green dashed lines. The procedure of sampling the subset varies across the three figures. In Fig. 3(a), since the four data points from global sampling already have unique hashcodes (2-D cells), the newly constructed hash function (one of the green dashed lines) adds little information to their representations. On the other hand, in Fig. 3(b), a hash function constructed from a set of data points which are sampled locally from within a cluster puts some of the data points in the sampled set into two different cells (hashcodes), so adding more fine-grained information to their representations, hence more advantageous from a representation learning perspective. In Fig. 3(c), training and test data points are denoted with blue and red colors respectively, and the data points to construct a new hash function are sampled locally from within a high-entropy cluster, i.e. one containing a balanced proportion of training & test data points.


Nearly unsupervised learning settings

We propose to learn hash functions, $h(\cdot)$, using $S = \{S^T, S^*\}$, $x$, and optionally, $y^T$. Herein, $S^*$ is a test set, or a pseudo-test set that can be a random subset of the training set or a large set of unlabeled data points outside the training set.

3.1 Basic Concepts for Fine-Grained Optimization

In the prior works on kernel-similarity based locality sensitive hashing, the first step for constructing a hash function is to randomly sample a small subset of data points from a superset $S$, and in the second step, the subset is split into two parts using a kernel-trick based model (Garg et al., 2019, 2018; Joly and Buisson, 2011), serving as the hash function, as described in Sec. 2.

In the following, we introduce basic concepts for improving upon these two key aspects of constructing a hash function; later, in Sec. 3.2, these concepts are incorporated in a unified manner in our proposed information-theoretic algorithm that greedily optimizes hash functions one by one.

Informative split

In Fig. 2(a), the construction of a hash function is pictorially illustrated, showing multiple possible splits, as dotted lines, of a small set of four data points (black dots). (Note that a hash function is shown to be a linear hyperplane only for simplistic explanation of the basic concepts.) While in the previous works one of the many choices for splitting the set is chosen randomly, we propose to optimize upon this choice. Intuitively, one should choose a split of the set, corresponding to a hash function, such that it gives a balanced split for the whole set of data points, and it should also generalize across training & test sets. In reference to the figure, one simple way to analyze the generalization of a split (so the hash function) is to see if there are training as well as test data points (or pseudo-test data points) on either side of the dotted line. As per this concept, an optimal split of the set of four data points is shown in Fig. 2(b).

Referring back to Sec. 2, clearly, this is a combinatorial optimization problem, where we need to choose an optimal $z_l \in \{0, 1\}^\alpha$ for the set $S^R_l$ to construct $h_l(\cdot)$. For a small value of $\alpha$, one can either go through all the possible combinations in a brute-force manner, or use Markov chain Monte Carlo sampling.


It is interesting to note that, even though a hash function is constructed from a very small set of data points (of size $\alpha$, which is not much greater than 1), the generalization criterion, formulated in our information-theoretic objective introduced in Sec. 3.2, is computed using all the data points available for the optimization, $S$.
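The brute-force option can be sketched as follows, under the simplifying assumption that a split is scored only by how balanced it is and how well it mixes training and test points on both sides; `score_split` is a hypothetical stand-in for the full objective of Sec. 3.2.

```python
import itertools
import numpy as np

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def score_split(z, x_subset):
    """Placeholder objective: reward splits that are balanced and that mix train/test
    points on both sides (x_subset[i] = 1 if point i is from the test/pseudo-test set)."""
    z = np.asarray(z)
    balance = binary_entropy(z.mean())
    mix = 0.0
    for side in (0, 1):
        mask = z == side
        if mask.any():
            mix += mask.mean() * binary_entropy(x_subset[mask].mean())
    return balance + mix

def best_split_bruteforce(x_subset):
    """Enumerate all 2^alpha labellings of the subset and return the best-scoring one."""
    alpha = len(x_subset)
    best_z, best_val = None, -np.inf
    for z in itertools.product([0, 1], repeat=alpha):
        if len(set(z)) < 2:          # skip degenerate splits
            continue
        val = score_split(z, x_subset)
        if val > best_val:
            best_z, best_val = np.array(z), val
    return best_z, best_val

if __name__ == "__main__":
    x_subset = np.array([0, 1, 0, 1])   # train/test indicators of the 4 sampled points
    z, val = best_split_bruteforce(x_subset)
    print(z, round(val, 3))
```

For small $\alpha$ the enumeration is only $2^\alpha$ candidates, which is why brute force is feasible; MCMC sampling over $z_l$ is the natural replacement when $\alpha$ grows.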

Local sampling from a cluster

Another aspect of constructing a hash function with scope for improvement is the sampling of a small subset of data points, $S^R_l \subset S$, that is used to construct the hash function. In the prior works, the selection of such a subset is purely random, i.e. random selection of data points globally from $S$. In Fig. 3, we illustrate that it is wiser to (randomly) select data points locally from one of the clusters of the data points in $S$, rather than sampling globally from $S$. Here, we propose that a clustering of all the data points in $S$ can be obtained using the hash functions themselves, due to their locality sensitive property. While using a large number of locality sensitive hash functions gives us fine-grained representations of data points, a small subset of the hash functions, of size $\zeta$, defines a valid clustering of the data points, since data points which are similar to each other should have the same hashcodes, serving as cluster labels.

From this perspective, we can construct the first few hash functions from global sampling of data points, which we refer to as global hash functions. These global hash functions serve to provide hashcode representations as well as clusters of data points. Then, via local sampling from the clusters, we can also construct local hash functions to capture finer details of data points. As per this concept, we can learn hierarchical (multi-scale) hashcode representations of data points, capturing differences between data points from coarser (global hash functions) to finer scales (local hash functions).

Further, we suggest choosing a cluster that has a balanced proportion of training & test (pseudo-test) data points, which is desirable from the perspective of having generalized hashcode representations; see Fig. 3(c).
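A minimal sketch of this cluster-selection step, assuming the hashcodes are already available as a binary matrix: the first $\zeta$ bits of each point act as its cluster label, the entropy of the train/test indicator $x$ is computed per cluster, and the reference subset for the next (local) hash function is drawn from the highest-entropy cluster. Function names here are illustrative.

```python
from collections import defaultdict
import numpy as np

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def sample_from_high_entropy_cluster(codes, x, alpha, zeta=10, rng=None):
    """codes: (N, H) binary hashcodes; x: (N,) train/test indicator.
    Cluster label = tuple of the first zeta bits; sample alpha points from the
    cluster whose x-distribution has the highest entropy."""
    if rng is None:
        rng = np.random.default_rng()
    clusters = defaultdict(list)
    for i, c in enumerate(codes[:, :zeta]):
        clusters[tuple(c)].append(i)
    # Score each sufficiently large cluster by the entropy of x within it.
    scored = [(binary_entropy(x[idx].mean()), idx)
              for idx in clusters.values() if len(idx) >= alpha]
    if not scored:                      # fall back to global sampling
        return rng.choice(len(x), size=alpha, replace=False)
    _, best = max(scored, key=lambda t: t[0])
    return rng.choice(best, size=alpha, replace=False)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    codes = rng.integers(0, 2, size=(500, 100))
    x = rng.integers(0, 2, size=500)
    print(sample_from_high_entropy_cluster(codes, x, alpha=4, zeta=3, rng=rng))
```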

Figure 4: Dark-gray lines denote hash functions optimized previously, and gray color represents data points used for the optimization. The intersections of the lines give us 2-D cells, corresponding to hashcodes as well as clusters. For the set of four data points sampled within a cluster, there are many choices for splitting it (corresponding to hash function choices), shown with dashed lines. We propose to choose the one which also splits the sets of data points in other (neighboring) clusters in a balanced manner, i.e. having data points of a cluster on either side of the dashed line and cutting through as many clusters as possible. As per this criterion, the thick green dashed lines are superior choices w.r.t. the thin green ones.

Splitting other clusters

In reference to Fig. 4, the non-redundancy of a hash function w.r.t. the other hash functions can be characterized in terms of how well the hash function splits the clusters defined as per the other hash functions.

Next, we mathematically formalize all the concepts introduced above for fine-grained optimization of hash functions into an information-theoretic objective function.

3.2 Information-Theoretic Learning

We optimize hash functions greedily, one by one. Referring back to Sec. 2, we have the binary random variable $x$ denoting whether a data point comes from a training set or a test/pseudo-test set. In a greedy step of optimizing a hash function, the random variable $\mathbf{c}$ represents the hashcode of a data point $S$ as per the previously optimized hash functions, $h_{1:l-1}(\cdot)$. Along the same lines, $c$ denotes the binary random variable corresponding to the present hash function under optimization, $h_l(\cdot)$. We maximize the information-theoretic objective in (1) below.


Algorithm 1 Nearly-Unsupervised Hashcode Representation Learning

Require: Dataset $\{S, x\}$, and parameters $H, \alpha, \zeta$.
1: $h(\cdot) \leftarrow \{\}$, $C \leftarrow \{\}$, $f \leftarrow \{\}$
   % Greedy step for optimizing hash function $h_l(\cdot)$
2: while $|h(\cdot)| < H$ do
3:   $\alpha_l \leftarrow$ sampleSubsetSize($\alpha$)   % Sample a size for the subset of data points, $S^R_l$, used to construct a hash function.
4:   if $|h(\cdot)| < \zeta$ then
5:     $S^R_l \leftarrow$ randomSampleGlobally($S, \alpha_l$)   % Randomly sample data points, globally from $S$, for constructing a global hash function.
6:   else
7:     $S^R_l \leftarrow$ randomSampleLocally($S, C, x, \alpha_l, \zeta$)   % Sample data points randomly from a high-entropy cluster, for constructing a local hash function.
8:   end if
9:   $h_l(\cdot), f_l \leftarrow$ optimizeSplit($S^R_l, S, C, x$)   % Optimize the split of $S^R_l$.
10:  $c \leftarrow$ computeHash($S, h_l(\cdot)$)
11:  $C \leftarrow C \cup c$, $h(\cdot) \leftarrow h(\cdot) \cup h_l(\cdot)$, $f \leftarrow f \cup f_l$
12:  $h(\cdot), C, f \leftarrow$ deleteLowInfoFunc($h(\cdot), C, f$)   % Delete hash functions with lower objective values from the set.
13: end while
14: Return $h(\cdot), C$.

\[
\operatorname*{arg\,max}_{h_l(\cdot)} \; H(x, c) - I(c : \mathbf{c}) + H(x \mid \mathbf{c}); \qquad \mathbf{c} = h_{1:l-1}(S), \;\; c = h_l(S) \qquad (1)
\]

Herein, the optimization of a hash function, $h_l(\cdot)$, involves intelligent selection of $S^R_l$, an informative split of $S^R_l$, i.e. optimizing $z_l$ for $S^R_l$, and learning of the parameters $\theta$ of a (kernel or neural) model, which is fit on $\{S^R_l, z_l\}$, acting as the hash function.

In the objective function above, maximizing the first term, $H(x, c)$, i.e. the joint entropy of $x$ and $c$, corresponds to the concept of informative split described above in Sec. 3.1; see Fig. 2. This term is cheap to compute since $x$ and $c$ are both 1-dimensional binary variables.

The second term in the objective, the mutual information term, ensures minimal redundancy between hash functions. This is related to the concept of constructing a hash function such that it splits many of the existing clusters, as mentioned above in Sec. 3.1; see Fig. 4. This mutual information term can be computed using the approximation in the previous work (Garg et al., 2019).

The last quantity in the objective is $H(x \mid \mathbf{c})$, the conditional entropy of $x$ given $\mathbf{c}$. We propose to maximize this term indirectly, via informatively choosing the cluster from which to randomly select data points for constructing the hash function, such that it contains a balanced ratio of the counts of training & test data points, i.e. a cluster with high entropy on $x$, which we refer to as a high-entropy cluster. In reference to Fig. 3(c), the new clusters emerging from a split of a high-entropy cluster should have higher chances of being high-entropy clusters themselves, thus maximizing the last term indirectly. We compute the marginal entropy on $x$ for each cluster, and an explicit computation of $H(x \mid \mathbf{c})$ is not required.
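For concreteness, the sketch below estimates the quantities appearing in Eq. (1) from binary arrays: $H(x, c)$ is computed exactly from the 2x2 contingency table, the redundancy term is approximated by the maximum pairwise mutual information between the candidate bit and each previously optimized bit (a simple stand-in for the approximation of Garg et al. (2019), which may differ), and the conditional-entropy term is estimated per hashcode cell, even though, as described above, our method handles that term indirectly through the cluster choice rather than computing it explicitly.

```python
import numpy as np

def joint_entropy(a, b):
    """Exact joint entropy (in bits) of two binary arrays."""
    probs = []
    for va in (0, 1):
        for vb in (0, 1):
            p = np.mean((a == va) & (b == vb))
            if p > 0:
                probs.append(p)
    return -sum(p * np.log2(p) for p in probs)

def entropy(a):
    return joint_entropy(a, a)  # H(a, a) = H(a)

def mutual_info(a, b):
    return entropy(a) + entropy(b) - joint_entropy(a, b)

def objective(x, c_new, codes_prev):
    """Stand-in score for Eq. (1): H(x, c) - I(c : c_prev) + H(x | c_prev)."""
    h_joint = joint_entropy(x, c_new)
    redundancy = max((mutual_info(c_new, codes_prev[:, j])
                      for j in range(codes_prev.shape[1])), default=0.0)
    # H(x | previous hashcodes): weighted entropy of x inside each hashcode cell.
    cells = {}
    for i, row in enumerate(codes_prev):
        cells.setdefault(tuple(row), []).append(i)
    h_cond = sum(len(idx) / len(x) * entropy(x[idx]) for idx in cells.values())
    return h_joint - redundancy + h_cond

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x = rng.integers(0, 2, 300)
    codes_prev = rng.integers(0, 2, (300, 5))
    c_new = rng.integers(0, 2, 300)
    print(round(objective(x, c_new, codes_prev), 3))
```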

Optionally, one may extend the objective to include the term $-H(y \mid \mathbf{c}, c)$, with $y$ denoting the random variable for a class label.

The learning framework described above is summarized in Alg. 1.
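A compact sketch of the greedy loop in the spirit of Alg. 1 follows, with two deliberate simplifications: the split $z_l$ is sampled rather than optimized as in Sec. 3.1, and deleteLowInfoFunc is emulated by over-generating candidate functions and keeping the $H$ with the highest objective values. The nearest-reference rule again stands in for the kernelized hash function, and all names are illustrative.

```python
import numpy as np

def entropy(a):
    """Entropy (bits) of a binary array."""
    p = float(np.mean(a))
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def joint_entropy(a, b):
    ps = [np.mean((a == i) & (b == j)) for i in (0, 1) for j in (0, 1)]
    return -sum(p * np.log2(p) for p in ps if p > 0)

def nn_hash_bit(s, ref, z):
    """Nearest-reference rule (Euclidean stand-in for the kernel) giving one hash bit."""
    return int(z[int(np.argmin(np.sum((ref - s) ** 2, axis=1)))])

def learn_hashcodes(S, x, H=20, alpha=4, zeta=5, extra=10, seed=0):
    """Greedy construction in the spirit of Alg. 1 (simplified, see lead-in)."""
    rng = np.random.default_rng(seed)
    funcs, codes, scores = [], [], []
    for _ in range(H + extra):
        if len(funcs) < zeta:                                  # global sampling
            idx = rng.choice(len(S), alpha, replace=False)
        else:                                                  # local sampling from a high-entropy cluster
            cells = {}
            for i, row in enumerate(np.array(codes).T[:, :zeta]):
                cells.setdefault(tuple(row), []).append(i)
            candidates = [v for v in cells.values() if len(v) >= alpha] or [list(range(len(S)))]
            best_cell = max(candidates, key=lambda v: entropy(x[v]))
            idx = rng.choice(best_cell, alpha, replace=False)
        ref, z = S[idx], rng.integers(0, 2, alpha)
        if z.min() == z.max():
            z[0] ^= 1
        c = np.array([nn_hash_bit(s, ref, z) for s in S])
        funcs.append((ref, z)); codes.append(c)
        scores.append(joint_entropy(x, c))                     # crude per-function objective value
    keep = np.argsort(scores)[-H:]                             # drop low-objective functions
    return [funcs[i] for i in keep], np.array(codes).T[:, keep]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    S = rng.normal(size=(300, 8))
    x = rng.integers(0, 2, 300)   # train/test indicator
    funcs, C = learn_hashcodes(S, x)
    print(C.shape)                # (300, 20) binary hashcode matrix
```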

Besides kernelized locality sensitive hashcodes, the above framework allows neural locality sensitive hashing. One can fit any (regularized) neural model on $\{S^R_l, z_l\}$, acting as a neural locality sensitive hash function. We expect that some of the many possible choices for a split of $S^R_l$ should lead to natural semantic categorizations of the data points. For such a natural split choice, even a parameterized model can act as a good hash function without overfitting, as we observed empirically.
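As a sketch of the neural variant, the snippet below fits a tiny regularized model (a 2-layer MLP standing in for the small LSTM used in our experiments; PyTorch is assumed) on the $\alpha$ labelled reference points $\{S^R_l, z_l\}$; its thresholded output then acts as one locality sensitive hash bit.

```python
import torch
import torch.nn as nn

def fit_neural_hash_function(S_ref, z, epochs=200, lr=0.05):
    """Fit a tiny regularized neural model on {S_l^R, z_l}; the thresholded output
    is one locality sensitive hash bit. An MLP stands in for the LSTM hash model."""
    X = torch.tensor(S_ref, dtype=torch.float32)
    y = torch.tensor(z, dtype=torch.float32).unsqueeze(1)
    model = nn.Sequential(nn.Linear(X.shape[1], 8), nn.Tanh(), nn.Linear(8, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-3)  # weight decay as regularization
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

    def h(s):
        with torch.no_grad():
            return int(model(torch.tensor(s, dtype=torch.float32).unsqueeze(0)).item() > 0)
    return h

if __name__ == "__main__":
    import numpy as np
    rng = np.random.default_rng(4)
    S_ref = rng.normal(size=(4, 10))   # sampled subset S_l^R
    z = np.array([0, 1, 0, 1])         # its (optimized) split z_l
    h = fit_neural_hash_function(S_ref, z)
    print([h(s) for s in rng.normal(size=(5, 10))])
```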

In the algorithm, we also propose to delete some of the hash functions from the set of optimized ones, namely the ones which have low objective function values w.r.t. the rest. This step provides robustness against an arbitrarily bad choice of a randomly selected subset, $S^R_l$.

Our algorithm allows parallel computing, asin the previous hashcode learning approach.

4 Experiments

We demonstrate the applicability of our approach for the challenging task of biomedical relation extraction, using four public datasets.

Dataset details

For AIMed and BioInfer, cross-corpus evaluations have been performed in many previous works (Airola et al., 2008; Tikk et al., 2010; Peng and Lu, 2017; Hsieh et al., 2017; Rios et al., 2018; Garg et al., 2019). These datasets have annotations on pairs of interacting proteins (PPI) in a sentence while ignoring the interaction type. Following the previous works, for a given pair of proteins mentioned in a text sentence from a training or a test set, we obtain the corresponding undirected shortest path from a Stanford dependency parse of the sentence, which serves as a data point.
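As an illustration of how such a data point can be formed, the toy snippet below extracts the undirected shortest path between two protein mentions from a hand-built dependency graph using networkx; the edges are made up for the example and do not come from the Stanford parser.

```python
import networkx as nx

# Toy undirected dependency graph for the sentence
# "Ras activates the phosphorylation of Raf."
# In the actual pipeline these edges would come from a Stanford dependency parse.
edges = [
    ("activates", "Ras"),
    ("activates", "phosphorylation"),
    ("phosphorylation", "the"),
    ("phosphorylation", "of"),
    ("of", "Raf"),
]
g = nx.Graph(edges)

# The undirected shortest path between the two protein mentions is the data point.
path = nx.shortest_path(g, source="Ras", target="Raf")
print(path)   # ['Ras', 'activates', 'phosphorylation', 'of', 'Raf']
```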

We also use the PubMed45 and BioNLP datasets, which have been used for extensive evaluations in recent works (Garg et al., 2019, 2018; Rao et al., 2017; Garg et al., 2016). These two datasets consider a relatively more difficult task of inferring an interaction between two or more bio-entities mentioned in a sentence, along with the inference of their interaction-roles and the type of interaction from an unrestricted list. As in the previous works, we use abstract meaning representation (AMR) to obtain shortest path-based data points (Banarescu et al., 2013); the same bio-AMR parser (Pust et al., 2015) is employed as in the previous works. The PubMed45 dataset has 11 subsets, with evaluation performed for each of the subsets as a test set, leaving the rest for training (not to be confused with cross-validation). For the BioNLP dataset (Kim et al., 2009, 2011; Nedellec et al., 2013), the training set contains annotations from years 2009, 2011, and 2013, and the test set contains the development set from year 2013.

Overall, for a fair comparison of the models, we keep the same experimental setup as followed in (Garg et al., 2019) for all four datasets, so as to avoid any bias due to engineering aspects; the evaluation metrics for the relation extraction task are F1 score, precision, and recall.

Baseline methods

The most important baseline method for comparison is the recent work on supervised hashcode representations (Garg et al., 2019). Their model is called KLSH-RF, with KLSH referring to kernelized locality sensitive hashcodes and RF to Random Forest. Our approach differs from their work in the sense that our hashcode representations are nearly unsupervised, whereas their approach is purely supervised, while both approaches use a supervised RF. We refer to our model as KLSH-NU-RF. Within the nearly unsupervised learning setting, we consider the transductive setting by default, i.e. using data points from both the training and test sets. Later, we also show results for inductive settings, i.e. using a random subset of training data points as a pseudo-test set. In both scenarios, we do not use class labels for learning hashcodes, but only for training the RF.
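The final supervised step is shared with the prior work: the learned binary hashcodes are treated as feature vectors and fed, together with the training labels, to a Random Forest. A minimal sketch with scikit-learn and dummy hashcodes (the array contents are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
C_train = rng.integers(0, 2, size=(400, 100))   # hashcodes of training points (H = 100 bits)
y_train = rng.integers(0, 2, size=400)          # relation labels (valid / invalid)
C_test = rng.integers(0, 2, size=(100, 100))    # hashcodes of test points

# Class labels are used only here, for the final classifier, not for learning the hashcodes.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(C_train, y_train)
y_pred = clf.predict(C_test)
print(y_pred[:10])
```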

For the AIMed and BioInfer datasets, four neural network models based on adversarial learning had been evaluated in the prior works (Rios et al., 2018; Ganin et al., 2016), referred to as CNN-RevGrad, Bi-LSTM-RevGrad, Adv-CNN, and Adv-Bi-LSTM. Like our model KLSH-NU-RF, these four models are also learned in the transductive setting of using data points from the test set in addition to the training set. Semi-supervised variational autoencoders (SSL-VAE) have also been explored for biomedical relation extraction (Zhang and Lu, 2019), which we evaluate ourselves for all four datasets considered in this paper.

Parameter settings

We use path kernels with word vectors & kernel parameter settings as in the previous work (Garg et al., 2019). From a preliminary tuning, we set the number of hash functions, H = 100, and the number of decision trees in the Random Forest classifier, R = 100; these parameters are not sensitive, requiring minimal tuning.


Models                         (A, B)               (B, A)
SVM (Airola08)                 0.47                 0.47
SVM (Miwa09)                   0.53                 0.50
SVM (Tikk10)                   0.41 (0.67, 0.29)    0.42 (0.27, 0.87)
CNN (Nguyen15)                 0.37                 0.45
Bi-LSTM (Kavuluru17)           0.30                 0.47
CNN (Peng17)                   0.48 (0.40, 0.61)    0.50 (0.40, 0.66)
RNN (Hsieh17)                  0.49                 0.51
CNN-RevGrad (Ganin16)          0.43                 0.47
Bi-LSTM-RevGrad (Ganin16)      0.40                 0.46
Adv-CNN (Rios18)               0.54                 0.49
Adv-Bi-LSTM (Rios18)           0.57                 0.49
KLSH-kNN (Garg18)              0.51 (0.41, 0.68)    0.51 (0.38, 0.80)
KLSH-RF (Garg19)               0.57 (0.46, 0.75)    0.54 (0.37, 0.95)
SSL-VAE (Zhang19)              0.50 (0.38, 0.72)    0.46 (0.39, 0.57)
KLSH-NU-RF                     0.57 (0.44, 0.81)    0.57 (0.44, 0.81)

Table 1: Evaluation results from cross-corpus evaluation for (train, test) pairs of the datasets AIMed (A) and BioInfer (B). For each model, we report the F1 score and, if available, precision and recall scores in brackets. For the adversarial neural models by (Ganin et al., 2016), evaluation on these datasets was provided by (Rios et al., 2018).

For any other parameters which may require fine-grained tuning, we use 10% of the training data points, selected randomly, for validation. Within kernelized locality sensitive hashing, we choose between the Random Maximum Margin and Random k-Nearest Neighbors techniques, and for neural locality sensitive hashing, we use a simple 2-layer LSTM model with 8 units per layer. In our nearly unsupervised learning framework, we use subsets of the hash functions, of size 10, to obtain clusters (ζ = 10). We employ 8 cores on an i7 processor, with 32GB memory, for all the computations.

4.1 Experimental Results

In summary, our model KLSH-NU-RF significantly outperforms its purely supervised counterpart, KLSH-RF, and also the semi-supervised neural network models.

In reference to Tab. 1, we first discuss results for the evaluation setting of using the AIMed dataset as a test set and BioInfer as a training set. We observe that our model, KLSH-NU-RF, obtains an F1 score 3 pts higher than the most recent baseline, KLSH-RF.

Models                    PubMed45                      BioNLP
SVM (Garg16)              0.45 ± 0.25 (0.58, 0.43)      0.46 (0.35, 0.67)
LSTM (Rao17)              N.A.                          0.46 (0.51, 0.44)
LSTM (Garg19)             0.30 ± 0.21 (0.38, 0.28)      0.59 (0.89, 0.44)
Bi-LSTM (Garg19)          0.46 ± 0.26 (0.59, 0.43)      0.55 (0.92, 0.39)
LSTM-CNN (Garg19)         0.50 ± 0.27 (0.55, 0.50)      0.60 (0.77, 0.49)
CNN (Garg19)              0.51 ± 0.28 (0.46, 0.46)      0.60 (0.80, 0.48)
KLSH-kNN (Garg18)         0.46 ± 0.21 (0.44, 0.53)      0.60 (0.63, 0.57)
KLSH-RF (Garg19)          0.57 ± 0.25 (0.63, 0.55)      0.63 (0.78, 0.53)
SSL-VAE (Zhang19)         0.40 ± 0.16 (0.33, 0.69)      0.48 (0.43, 0.56)
KLSH-NU-RF                0.61 ± 0.23 (0.61, 0.62)      0.67 (0.73, 0.61)

Table 2: Evaluation results for the PubMed45 and BioNLP datasets. We report the F1 score (mean ± standard deviation), with mean-precision & mean-recall numbers in brackets. For BioNLP, standard deviation numbers are not provided as there is a single fixed test subset.

In comparison to the semi-supervised neural models, CNN-RevGrad, Bi-LSTM-RevGrad, Adv-CNN, Adv-Bi-LSTM, and SSL-VAE, which use the knowledge of a test set just like our model, we gain 8-11 pts in F1 score. On the other hand, when evaluating on the BioInfer dataset as a test set and AIMed as a training set, our model ties with the adversarial neural model Adv-Bi-LSTM, though outperforming the other three adversarial models and SSL-VAE by large margins in F1 score. In comparison to KLSH-RF, we retain the same F1 score, while gaining 6 pts in recall at the cost of losing 2 pts in precision.

For the PubMed45 and BioNLP datasets, there is no prior evaluation of adversarial models. Nevertheless, in Tab. 2, we see that our model significantly outperforms SSL-VAE, and it also outperforms the most relevant baseline, KLSH-RF, gaining 4 pts in F1 score for both datasets.2 Note that the high standard deviations reported for the PubMed45 dataset are due to the high diversity across the 11 test sets, while the performance of our model for a given test set is highly stable (the improvements are statistically significant, with p-value 6e-3).

2 These two datasets are particularly important for gauging the practical relevance of a model for the task of biomedical relation extraction.


Figure 5: Comparison across the three types of learning settings: supervised, transductive, and inductive.


One general trend to observe is that the proposed nearly unsupervised learning approach leads to a significantly higher recall, at the cost of a marginal drop in precision, w.r.t. its supervised baseline.

Further, note that the number of hash functions used in the prior work is 1000, whereas we use only 100 hash functions. The compute time is the same as for their model.

Our approach is easily extensible to other modeling aspects such as non-stationary kernel functions, document-level inference, joint use of semantic & syntactic parses, and ontology or database usage (Garg et al., 2019, 2016; Alicante et al., 2016), though we refrain from presenting system-level evaluations and focus only upon analyzing improvements from our principled extension of the recently proposed technique that has already been shown to be successful for the task.

Transductive vs inductive settings

In the results discussed above, hashcode representations in our models are learned in the transductive setting. For the inductive setting, we randomly select 25% of the training data points for use as a pseudo-test set instead of the test set. In Fig. 5, we observe that both the inductive and transductive settings are more favorable w.r.t. the supervised one, KLSH-RF, which is the baseline in this paper. F1 scores obtained from the inductive setting are on a par with the transductive setting. It is worth noting that, in the inductive setting, our model is trained on even less information than the baseline model KLSH-RF, yet it obtains significantly higher F1 scores.
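The inductive setting only changes how the indicator x is built; a small sketch, assuming data points are addressed by index, marks a random 25% of the training points as the pseudo-test set:

```python
import numpy as np

def pseudo_test_indicator(n_train, frac=0.25, seed=0):
    """x_i = 1 for points treated as pseudo-test, 0 otherwise (inductive setting)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_train, dtype=int)
    x[rng.choice(n_train, size=int(frac * n_train), replace=False)] = 1
    return x

print(pseudo_test_indicator(20))
```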

Figure 6: Comparison of neural hashing w.r.t. kernel hashing, and the best of the neural baselines.


Neural hashing

In Fig. 6, we show results for neural locality sensitive hashing within our proposed framework, and observe that neural hashing is a little worse than its kernel-based counterpart; however, it performs significantly better than the best of the other neural models.

5 Conclusions

We proposed a nearly-unsupervised framework for learning kernelized locality sensitive hashcode representations, a recent technique that was originally supervised and has shown state-of-the-art results for the difficult task of biomedical relation extraction. Within our proposed framework, we use the additional knowledge of test/pseudo-test data points for fine-grained optimization of hash functions, so as to obtain hashcode representations generalizing across training & test sets. Our experimental results show significant improvements in accuracy w.r.t. the supervised baseline, as well as semi-supervised neural network models, for the task of biomedical relation extraction across four public datasets.

6 Acknowledgments

This work was sponsored by the DARPA Big Mechanism program (W911NF-14-1-0364). It is our pleasure to acknowledge fruitful discussions with Shushan Arakelyan, Amit Dhurandhar, Sarik Ghazarian, Palash Goyal, Karen Hambardzumyan, Hrant Khachatrian, Kevin Knight, Daniel Marcu, Dan Moyer, Kyle Reing, and Irina Rish. We are also grateful to the anonymous reviewers for their valuable feedback.


References

Antti Airola, Sampo Pyysalo, Jari Bjorne, Tapio Pahikkala, Filip Ginter, and Tapio Salakoski. 2008. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics.

Anita Alicante, Massimo Benerecetti, Anna Corazza, and Stefano Silvestri. 2016. A distributed architecture to integrate ontological knowledge into information extraction. International Journal of Grid and Utility Computing.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the Linguistic Annotation Workshop and Interoperability with Discourse.

Yung-Chun Chang, Chun-Han Chu, Yu-Chen Su, Chien Chin Chen, and Wen-Lian Hsu. 2016. PIPE: a protein-protein interaction passage extraction module for the BioCreative challenge. Database.

Paul R Cohen. 2015. DARPA's Big Mechanism program. Physical Biology.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research.

Sahil Garg, Aram Galstyan, Ulf Hermjakob, and Daniel Marcu. 2016. Extracting biomolecular interactions using semantic parsing of biomedical text. In Proceedings of the AAAI Conference on Artificial Intelligence.

Sahil Garg, Aram Galstyan, Greg Ver Steeg, Irina Rish, Guillermo Cecchi, and Shuyang Gao. 2019. Kernelized hashcode representations for relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence.

Sahil Garg, Greg Ver Steeg, and Aram Galstyan. 2018. Stochastic learning of nonstationary kernels for natural language modeling. arXiv preprint arXiv:1801.03911.

Gus Hahn-Powell, Marco A Valenzuela-Escarcega, Thomas Hicks, and Mihai Surdeanu. 2015. A domain-independent rule-based framework for event extraction. In Proceedings of ACL-IJCNLP.

David Haussler. 1999. Convolution kernels on discrete structures. Technical report.

Yu-Lun Hsieh, Yung-Chun Chang, Nai-Wen Chang, and Wen-Lian Hsu. 2017. Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory. In Proceedings of the International Joint Conference on Natural Language Processing.

Alexis Joly and Olivier Buisson. 2011. Random maximum margin hashing. In Proceedings of the Conference on Computer Vision and Pattern Recognition.

Ramakanth Kavuluru, Anthony Rios, and Tung Tran. 2017. Extracting drug-drug interactions with word and character-level recurrent neural networks. In IEEE International Conference on Healthcare Informatics.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task.

Jin-Dong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan Nguyen, and Jun'ichi Tsujii. 2011. Overview of BioNLP shared task 2011. In Proceedings of the BioNLP Shared Task Workshop.

Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Proceedings of the Neural Information Processing Systems Conference.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Junichi Tsujii. 2009. Protein-protein interaction extraction by leveraging multiple kernels and parsers. International Journal of Medical Informatics.

Raymond J Mooney and Razvan C Bunescu. 2005. Subsequence kernels for relation extraction. In Proceedings of the Neural Information Processing Systems Conference.

Claire Nedellec, Robert Bossy, Jin-Dong Kim, Jung-Jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task Workshop.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.

Yifan Peng and Zhiyong Lu. 2017. Deep learning for extracting protein-protein interactions from biomedical literature. In Proceedings of the BioNLP Workshop.


Michael Pust, Ulf Hermjakob, Kevin Knight, Daniel Marcu, and Jonathan May. 2015. Parsing English into abstract meaning representation using syntax-based machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Sudha Rao, Daniel Marcu, Kevin Knight, and Hal Daume III. 2017. Biomedical event extraction using abstract meaning representation. In Workshop on Biomedical Natural Language Processing.

Anthony Rios, Ramakanth Kavuluru, and Zhiyong Lu. 2018. Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics.

Andrey Rzhetsky. 2016. The Big Mechanism program: Changing how science is done. In Proceedings of DAMDID/RCDL.

Domonkos Tikk, Philippe Thomas, Peter Palaga, Jorg Hakenberg, and Ulf Leser. 2010. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology.

Yijia Zhang and Zhiyong Lu. 2019. Exploring semi-supervised variational autoencoders for biomedical relation extraction. Methods.