
Unsupervised Construction of Knowledge Graphs From Text and Code

Kun Cao
Georgia Tech Research Institute

James Fairbanks
Georgia Tech Research Institute

Abstract

The scientific literature is a rich source of information for data mining with conceptual knowledge graphs; the open science movement has enriched this literature with complementary source code that implements scientific models. To exploit this new resource, we construct a knowledge graph using unsupervised learning methods to identify conceptual entities. We associate source code entities to these natural language concepts using word embedding and clustering techniques.

Practical naming conventions for methods and functions tend to reflect the concept(s) they implement. We take advantage of this specificity by presenting a novel process for joint clustering text concepts that combines word-embeddings, nonlinear dimensionality reduction, and clustering techniques to assist in understanding, organizing, and comparing software in the open science ecosystem. With our pipeline, we aim to assist scientists in building on existing models in their discipline when making novel models for new phenomena. By combining source code and conceptual information, our knowledge graph enhances corpus-wide understanding of scientific literature.

1 Introduction

The corpus of scientific literature is growing exponentially, leaving individual researchers struggling to keep up. Natural language processing (NLP) techniques can help us understand this literature. However, as science becomes more dependent on large scale software, modeling and simulation, and data analysis scripts, the knowledge contained in publications shifts from the text of the papers to the software artifacts used to produce them. With the rise of open science and the drive to share open source code that implements scientific models, we are able to conduct analysis on scientific source code for the first time.

Source code definitions are typically semantic abbreviations of the scientific concepts they implement. We demonstrate that this constrained vocabulary is sufficient to create mappings to concepts extracted from natural language by using vectorized distances of word-embeddings [13]. Subsequently, we build knowledge representations that connect the conceptual relationships from scientific texts with the procedural information embodied in open source scientific code. This paper proposes a knowledge graph framework for that knowledge representation, as well as a methodology for constructing said graph using both rule-based and unsupervised-learning techniques. Our methodology demonstrates an automated process of concept extraction using an open source textbook on epidemiological modeling and provides semantic meaning to code. These knowledge graphs can be used in future research to understand, organize, and augment scientific models.

Motivation The core motivation of this project is to support SemanticModels.jl [4], which is a system that allows scientists with limited scientific computing backgrounds to modify existing implementations that are similar to their model. The knowledge graph of reference text and code provides a method of searching for other software models that are semantically similar. Knowledge graphs are generated for each model and stored within a code base where similarity is determined through comparison of conceptual nodes. This information, along with other data provided by dynamic and static analysis, gives SemanticModels.jl the capability to detect similar models and perform model transformations.

Related Work Related work includes the generation of code comments with deep learning, which

arXiv:1908.09354v1 [cs.LG] 25 Aug 2019

makes the assumption that the transition process between source code and comments is similar to the translation process between different natural languages [9]. The Abstract Syntax Tree (AST) is used to model extracted concepts from source code to enable translation to natural language through traversal. NSEEN: Neural Semantic Embedding for Entity Normalization tackles a similar problem of constructing a knowledge graph from extracted text from various domains. They introduce a process called entity normalization, which consists of mapping entity mentions from reference texts to another set of established entities from reference sets through the use of Siamese recurrent neural networks [5].

Text can also be converted to knowledge graph entities through the use of Long Short-Term Memory networks [7, 10, 14]. Supervised learning through the use of Random Walk and an LSTM recurrent network is used to create skipgram entities that are included in a knowledge base. Input text is then assigned to these entities based on semantic similarity. This requires paired samples of text and knowledge graph entities to train the models. Leveraging unsupervised techniques, our model offers a solution that defines entities without a handcrafted ontology.

Developing domain specific software is a costly process; with a large code base it can be hard to understand how the pieces fit together and how those pieces of software relate to concepts in the application domain of interest. We take scientific software typically used in modeling applications as an interesting case because both the code and text are relatively sophisticated and contributing to the field requires deep knowledge of both software and science concepts. A Survey of Machine Learning for Big Code and Naturalness [1] demonstrates the application of bringing semantic meaning to code by utilizing the naturalness hypothesis, which argues: (1) software is a form of human communication; (2) software corpora have similar statistical properties to natural language corpora; and (3) these properties can be exploited to build better software engineering tools.

Our approach leverages code naming conventions for matching software and domain application conceptual entities into a unified knowledge graph without paired training examples. Like the semantic web, our model serves to create concept similarity relationships that involve the content of the resources rather than the bibliometric structure of the documents. Furthermore, our model offers a solution to the multidimensionality of the ontologies within scientific domains and tackles the inherent complexity of Big Code [16, 3, 1].

2 Methodology

Textual Explanations of Modeling Concepts Online or interactive textbooks are a novel creation of the open science movement. These textbooks are created when an author decides to combine text, data, figures, and code into an interactive textbook. Such books are published under a copy-left license on collaborative platforms such as github.com. This medium allows for simple text extraction as compared to alternate formats such as PDF and HTML. Variables, equations, and bibliographic references can be extracted from markdown documents with regular expressions that make text cleaning relatively easy. For our model, we leverage the expository text and scientific or mathematical variables that are referenced in the markdown files of a textbook as shown in Figure 1. We decompose sentences from the reference text into <subject, verb, object> triples to form an RDF¹ knowledge graph. Edges flow from subjects to objects and are labeled with verbs. A sample of the resulting knowledge graph is shown in Figure 2.
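The triple decomposition described above can be sketched without a live NLP pipeline by operating on a spaCy-style dependency parse; the tuple format and the helper below are our own illustration rather than code from the paper, which uses spaCy's parser directly:

```python
def extract_triples(parsed):
    """Extract <subject, verb, object> triples from a dependency parse.

    `parsed` is a list of (token, dependency_label, head_index) tuples,
    a simplified stand-in for spaCy's parser output.
    """
    triples = []
    for token, dep, head in parsed:
        if dep == "nsubj":  # token is the nominal subject of the verb at head
            verb = parsed[head][0]
            # direct objects attached to the same verb
            objs = [t for t, d, h in parsed if h == head and d in ("dobj", "obj", "attr")]
            for obj in objs:
                triples.append((token, verb, obj))
    return triples

# Hand-written parse of "susceptible contains infection" with spaCy-style labels:
parse = [("susceptible", "nsubj", 1), ("contains", "ROOT", 1), ("infection", "dobj", 1)]
print(extract_triples(parse))  # [('susceptible', 'contains', 'infection')]
```

The subject becomes a source node, the object a target node, and the verb the edge label, matching the RDF structure described above.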

Code Implementations of Models When online textbooks are written to explain a scientific, engineering, or mathematical domain, examples are often given in the form of source code. These source code files are designed for pedagogical purposes and as such are well structured. These source code examples provide a high-quality, high-fidelity corpus for building knowledge graphs of the underlying domain. This allows for unambiguous association between concepts in the reference text and variable and function names extracted from code signatures. For example, a function named sir_ode() can be associated to an SIR concept referenced in the text. Furthermore, the code files associated with scientific models, being pedagogical, do not contain complex software constructions such as mutually recursive functions or low level functions manipulating complex data structures.
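One way to exploit such naming conventions is to split code identifiers into word tokens before comparing them with text concepts. A minimal sketch, where the helper name and splitting rules (snake_case and camelCase) are our own assumptions, not the paper's code:

```python
import re

def identifier_tokens(name):
    """Split a code identifier into lowercase word tokens."""
    name = name.strip("()")                 # drop call parentheses, e.g. sir_ode()
    tokens = []
    for part in re.split(r"[_\W]+", name):  # split on underscores and symbols
        # split camelCase runs into lowercase words, acronyms, and digits
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens if t]

print(identifier_tokens("sir_ode()"))  # ['sir', 'ode']
print(identifier_tokens("SIRModel"))   # ['sir', 'model']
```

Each token can then be embedded and compared against concept embeddings from the text.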

Pre-Processing Epirecipes Cookbook [6] is our primary source of data to build the knowledge graph since it provides us with a set of epidemiological

¹Resource Description Framework


Figure 1: Modern online textbooks contain markdown files with expository text and Jupyter notebooks with code and figures. These input formats are designed for interactive instruction. Our work constructs scientific knowledge graphs for augmenting scientific reasoning from these data sources. 1a) Example of online Epirecipes Cookbook [6] textbook markdown file with 1b) corresponding source code file.

Figure 2: A small portion of the resulting knowledge graph. This portion of the knowledge graph shows the relationships between concepts in SIR modeling. The red vertices are concept nodes, with the big red vertex representing a cluster center for concepts related to "These Models", and the blue nodes are source code variable nodes. In the bottom left of the figure, you can see that infected_individuals is related to the infection concept, which is related to the An exposed infectious class concept. Additionally, the Beta variable is related to the Beta rate concept, which is related to the susceptible concept.

models implemented in Julia and contains descriptive text about the models. Equations, references, and non-alphanumeric characters, with the exception of punctuation, are stripped from the text; remaining variables are then capitalized. Subsequently, subjects, verbs, and objects within sentences form source nodes, edges, and target nodes respectively for our knowledge graph through the use of spaCy's small natural language processing model [8].
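A rough sketch of this cleaning step, assuming inline equations are $-delimited and using our own choice of which punctuation survives (the capitalization of remaining variables is omitted here):

```python
import re

def clean_text(text):
    """Strip $-delimited equations and non-alphanumeric symbols, keeping basic
    punctuation; the exact regexes are an assumption, not the paper's rules."""
    text = re.sub(r"\$[^$]*\$", " ", text)          # drop inline equations
    text = re.sub(r"[^\w\s.,;:!?'\"-]", " ", text)  # drop other symbols
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(clean_text(r"The rate $\beta$ governs infection."))  # The rate governs infection.
```

The cleaned sentences are then handed to the dependency parser for triple extraction.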

Functions and variables are then extracted from the Julia implementation corresponding to the models. Greek letter representations are translated to their respective Greek names. In future work, NLP models will be trained on large corpora of scientific texts.²
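The Greek-letter translation can be sketched with the standard library's Unicode character names; the helper name and exact behavior are our own sketch, not the paper's pipeline:

```python
import unicodedata

def greek_to_name(token):
    """Replace Greek letters in a token with their spelled-out names (beta, gamma, ...)."""
    out = []
    for ch in token:
        try:
            uname = unicodedata.name(ch)
        except ValueError:  # unnamed characters pass through unchanged
            out.append(ch)
            continue
        if uname.startswith("GREEK") and "LETTER" in uname:
            out.append(uname.rsplit("LETTER ", 1)[-1].lower())
        else:
            out.append(ch)
    return "".join(out)

print(greek_to_name("β"))      # beta
print(greek_to_name("dS/dt"))  # dS/dt (non-Greek characters untouched)
```

This makes variables like β comparable to the word "beta" in the text embeddings.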

3 Experimentation

Our knowledge graph construction is based on creating clusters using word-embedding representations of the subject and object words from the cleaned text, and then associating variables and functions to representative elements from the object clusters. Rather than defining entities manually, we extract the entities from existing literature and source code.

²https://www.ncbi.nlm.nih.gov/pubmed/30217670


Therefore, we use the density-based spatial clustering of applications with noise (DBSCAN) to determine the number of clusters/entities for our model [2].

Applying DBSCAN directly to the word-embeddings yields clusters that are too restrictive. The clusters contain text with only superficial variations in spelling and capitalization. For example, in Figure 3a, phrases like "The Model", "The Model of", and "The model in a closed population" should share a semantic relationship but are not connected by the clustering algorithm. Tuning the DBSCAN epsilon parameter reduced the number of phrases that are clustered as noise, but did not resolve the problem of combining semantically similar phrases.

To tackle this problem, we apply a UMAP transformation to the word-embeddings to reduce the dimensionality of the input space for clustering [12]. As a result, with a DBSCAN epsilon value of 0.30, the clusters in the UMAP embedded space compose phrases that are diverse at the lexical level but similar as concepts. In Figure 3b, various types of models are assigned to cluster 6, whereas types of rates are assigned to cluster 5.

However, because the function and variable names are specific, it is difficult to associate our extracted code signatures with these high-level concepts. Therefore, we use DBSCAN with a UMAP transformation only on the subject nodes and exclude the noise to create our hierarchical entities. Then, we use DBSCAN without a UMAP transformation on the objects to combine syntactically similar nodes. This captures an intuitive sense that there are more objects than subjects in the corpus. Finally, we connect objects (with noise) to subject entities based on their original <subject, verb, object> association. Subsequently, extracted variables and functions are compared to the object nodes and are connected when the similarity exceeds a fixed threshold discussed in Section 4.
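The pipeline above uses scikit-learn's DBSCAN and UMAP on real word embeddings; as a self-contained illustration of how the epsilon parameter sets cluster granularity, here is a minimal pure-Python DBSCAN on one-dimensional toy points (not the authors' implementation):

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: labels[i] is a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)          # None marks unvisited points
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1                 # provisional noise; may become a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise reachable from a core point: border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(k for k in j_neighbors if labels[k] is None)
    return labels

# 1-D toy "embeddings" with absolute difference as the distance metric:
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
print(dbscan(points, eps=0.3, min_pts=2, dist=lambda a, b: abs(a - b)))   # [0, 0, 0, 1, 1, -1]
print(dbscan(points, eps=0.05, min_pts=2, dist=lambda a, b: abs(a - b)))  # all noise
```

Shrinking eps from 0.3 to 0.05 turns every point into noise, mirroring the over-restrictive clustering reported for the raw high-dimensional embeddings.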

The resulting knowledge graph created from the Epirecipes Cookbook yielded 115 object nodes and 93 subject nodes. Depending on the threshold, the number of edges ranged from 4,000 to 13,000. In order to remove extraneous concepts, subject components with a node size of 5 or lower were removed from the knowledge graph.


Figure 3: A comparison of clusters with and without dimensionality reduction. 3a) Cluster assignments for high dimensional word-embeddings. The high dimensional space separates vectors well and each cluster contains only superficial variations on the same phrase; these clusters are at a resolution too fine for the application of knowledge graph construction. 3b) UMAP transformed embeddings resulting in semantically significant clusters. This is because the nonlinear embedding of the vectors provided by UMAP pulls phrases together in the lower dimensional space.


4 Results and Discussion

The construction of the knowledge graph is dependent on factors such as the threshold value and the input resource. In this section, we introduce methodologies for determining the threshold value through precision versus recall and measures of conductance when including an additional scientific corpus.

Internal performance measures, such as the Silhouette index [15] and other intra-cluster similarity assessments, would not provide accurate evaluation for our model given examples like Figure 3a, where the Silhouette index would be high. Other evaluation metrics, such as measuring the purity of the clusters, would also not apply since we aim to compare and group shared functionality within a single scientific class: epidemiology.

In order to assess the threshold value for our variable assignment, we crafted a set of ground truth labels that were hand-labeled by a group of peers. These labels were created from a list of object nodes that reflect the variable/function nodes that they should be connected with. Evaluation was conducted with respect to these labeled sets in terms of precision versus recall at various thresholds (see Figure 4), where Precision = tp/(tp + fp) and Recall = tp/(tp + fn).

Figure 4: Precision vs recall trade-off in this context. Knowledge graph construction often prefers high recall, low precision thresholds because false positives can be filtered out in the downstream learning or reasoning steps.

In this application, true positives are <variable/function, object> edges that exist in the knowledge graph and in the labeled set. False positives are <variable/function, object> edges that were added to the knowledge graph, but are not in the labeled set. False negatives are <variable/function, object> edges that do not exist in the knowledge graph, but are in the labeled set.

As the similarity threshold increased, the number of edges between variable and object concepts decreased. In order to determine the threshold value to use, we first found the threshold value where recall was equivalent to precision. For our model, we chose a threshold higher than the intersection point because extraneous edges can be filtered out later in a downstream processing task. We selected a threshold value of 0.7 to give a good balance between precision and recall for our knowledge graph applications.
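The edge-level precision and recall described above reduce to set operations over edge tuples; the example edges below are made up for illustration, not the paper's labeled data:

```python
def precision_recall(predicted_edges, labeled_edges):
    """Precision and recall of predicted <variable/function, object> edges
    against a hand-labeled edge set."""
    predicted, labeled = set(predicted_edges), set(labeled_edges)
    tp = len(predicted & labeled)   # edges in both the graph and the labels
    fp = len(predicted - labeled)   # edges in the graph but not the labels
    fn = len(labeled - predicted)   # labeled edges the graph missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if labeled else 0.0
    return precision, recall

# Hypothetical edge sets for illustration:
predicted = {("beta", "Beta rate"), ("foo", "bar")}
labeled = {("beta", "Beta rate"), ("gamma", "recovery rate")}
print(precision_recall(predicted, labeled))  # (0.5, 0.5)
```

Sweeping the similarity threshold regenerates `predicted` at each setting, tracing out the precision-recall curve of Figure 4.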

Figure 2 shows a snippet of the constructed knowledge graph; subject concept nodes are connected to object concept nodes extracted from our <subject, verb, object> triples from the reference text. Variable names, as a result, were connected to these object nodes by satisfying our similarity threshold parameter. The number of variable matches we obtained in our graph was greatly dependent on the quality of the language triples extracted from the reference text. For example, the variable infected_individuals would not have been built into our knowledge graph had the triple <susceptible, contains, infection> not existed. Our pipeline mimics human learning in the sense that the more concept associations are present, the more likely it is to draw connections and make semantic sense of source code.

Furthermore, Figure 5 demonstrates a knowledge graph that introduces a new corpus from the textbook Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence [11]. The new corpus produces an independent set, demonstrating that the two textbooks were relatively disjoint in terms of similarity. The construction of our graph allowed for quantitative assessments to be performed to evaluate the similarity of these two resources. Let G = (V, E) represent our knowledge graph, where Va ⊂ V represents the set of vertices produced by a corpus a and Vb ⊂ V represents the set of vertices introduced by a new corpus b. The sets Va, Vb form a partition of the vertex set V. Let Na and Nb denote the sizes of Va and Vb respectively. The fraction of the knowledge graph extracted from corpus a is calculated as Na/(Na + Nb). The conductance of any vertex partition measures the separation between these vertices in terms of path connectivity. When combining document corpora into a single knowledge graph, the


Figure 5: A portion of the knowledge graph extracted from two online textbooks: Epirecipes Cookbook [6] and Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence [11]. Note that the subgraphs corresponding to each textbook (Epirecipes in red and Statistics in green) are mostly separated, but there are some inter-textbook connections, namely <Simplistic Weather Model Model, These Models>. These textbooks both talk about modeling physical phenomena with mathematics, and so we should expect overlap.

conductance reflects how many connections between the domains of each corpus were introduced by the knowledge graph construction algorithm. The conductance of a cut S, V \ S in graph G is

φ(S) = (∑_{i∈S, j∈S̄} a_ij) / a(S)

where a_ij are the entries of the adjacency matrix for G such that

a(S) = ∑_{i∈S} ∑_{j∈V} a_ij

and S̄ = V \ S.

When combining two corpora, for example two

textbooks on the same or different topics, one can define S as the set of concepts and variables extracted from one corpus and compute the conductance φ(S). Table 1 shows the conductance φ(S) between the epidemiology concepts and the statistics concepts in a knowledge graph built by extracting from two textbooks. The results show that an increase in the similarity threshold decreases the number of variable-object relationships across disciplines. The decline in the number of edges lowers the conductance φ(S).
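Given the graph as an adjacency mapping, the conductance φ(S) defined above reduces to a cut-weight over degree-sum ratio; a small sketch on a toy graph (our own representation, not the paper's code):

```python
def conductance(adj, S):
    """phi(S) = (sum of edge weights leaving S) / (total weight incident to S).

    `adj` maps each vertex to a dict of {neighbor: weight} (an undirected
    graph stored symmetrically); `S` is one side of the cut.
    """
    S = set(S)
    cut = sum(w for u in S for v, w in adj[u].items() if v not in S)
    total = sum(w for u in S for w in adj[u].values())
    return cut / total if total else 0.0

# A path graph a - b - c - d with unit weights:
adj = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1, "d": 1}, "d": {"c": 1}}
print(conductance(adj, {"a", "b"}))  # 1 edge leaves S, degree sum 3 -> 0.333...
```

Taking S as all vertices extracted from one textbook gives the per-threshold conductance values reported in Table 1.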

Threshold  Conductance
0.20       0.0285
0.25       0.0152
0.30       0.0114
0.35       0.0090
0.40       0.0089
0.45       0.0087
0.50       0.0086
0.55       0.0085
0.60       0.0085
0.65       0.0085
0.70       0.0084
0.75       0.0084
0.80       0.0084

Table 1: The transition point between the threshold values 0.30 and 0.35 (indicated in bold) corresponds to a large gap in conductance (0.0114 to 0.0090) and illustrates a method for choosing a threshold. A lower threshold reflects a higher conductance, that is, more interdisciplinary edges in the network. While these edges are useful for discovering relationships between disparate scientific disciplines, too many of them indicate an imprecise understanding of the concepts within a single discipline.

5 Conclusion and Future Work

Semantic modeling aims to extract the scientific knowledge that resides in scientific code. With the abundance of open source code and reference texts, our model provides knowledge to reduce the complexity of large code bases and to assist scientists and developers in gaining an overview understanding of the source code. Our framework, based on unsupervised learning and word-embeddings, does not require a large volume of ontological examples and enables the extension of models with new parameters and components.

Future Work The performance of our model relies on the quality of function and variable names. Storage and placeholder variables extracted from source code diminish the accuracy of our model due to terse initialization names. In future work, we aim to handle these semantically insignificant variables through methods of classification. Furthermore, because the word-embeddings are trained on an English language corpus rather than coding syntax, we run the risk of lexical entities extracted from reference text skewing our results. But because of the high quality of the naming schemes presented in the Epirecipes Cookbook, the association between concept and code is less ambiguous.


Applications Software engineering and scientific software development are high turnover fields where people rotate onto projects and must come up to speed quickly. When new engineers join a project or new scientists add a new software method to their repertoire, they must first gain an understanding of what is already implemented. An application of our model can greatly assist in this transitional period by providing insight into existing code bases. Knowledge graphs constructed from software and documents using our methods can support semantic software engineering applications.

Additionally, artificial intelligence systems designed to augment the performance of scientists will need a deep understanding of the domain science of interest. This understanding can be constructed from knowledge graphs built by reading textbooks and code in the scientific domain. By supporting the construction of domain specific knowledge graphs, these methods can contribute to the next generation of methods for applying machine learning to aid in scientific discovery.

6 Acknowledgments

The authors thank the authors of Epirecipes Cookbook and Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence; without these sources, this work would not exist. We also thank Christine Herlihy, Kevin Kelly, and Clayton Morrison for their advice on this manuscript. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR00111990008.

References

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4):81, 2018.

[2] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

[3] Dominique Estival, Chris Nowak, and Andrew Zschorn. Towards ontology-based natural language processing. In Proceedings of the Workshop on NLP and XML (NLPXML-2004): RDF/RDFS and OWL in Language Technology, pages 59–66. Association for Computational Linguistics, 2004.

[4] James Fairbanks and other contributors. SemanticModels.jl, 2018.

[5] Shobeir Fakhraei and Jose Luis Ambite. NSEEN: Neural semantic embedding for entity normalization. arXiv preprint arXiv:1811.07514, 2018.

[6] Simon Frost, Allyson Walsh, and Jade Thompson. Epirecipes textbook.

[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[8] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. 2017.

[9] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, pages 200–210. ACM, 2018.

[10] Dimitri Kartsaklis, Mohammad Taher Pilehvar, and Nigel Collier. Mapping text to knowledge graph entities using multi-sense LSTMs. arXiv preprint arXiv:1808.07724, 2018.

[11] Hayden Klok and Yoni Nazarathy. Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence. May 2019.

[12] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

[13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[14] Dat Quoc Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. arXiv preprint arXiv:1703.08098, 2017.

[15] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[16] Antonio Toral, Monica Monachini, et al. SIMPLE-OWL: a generative lexicon ontology for NLP and the Semantic Web. In Workshop on Cooperative Construction of Linguistic Knowledge Bases (AIIA 2007), 2007.