DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Exploring declarative rule-based probabilistic frameworks for link prediction in Knowledge Graphs

XIAOXU GAO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Abstract

The knowledge graph stores factual information from the web in the form of relationships between entities. The quality of a knowledge graph is determined by its completeness and accuracy. However, most current knowledge graphs miss facts or contain incorrect information. Existing link prediction solutions suffer from poor scalability and high labor costs. This thesis proposes a declarative rule-based probabilistic framework to perform link prediction. The system incorporates a rule-mining model into hinge-loss Markov random fields to infer links. Moreover, three rule optimization strategies were developed to improve the quality of rules. Compared with previous solutions, this work dramatically reduces manual costs and provides a more tractable model. Each proposed method has been evaluated with Average Precision or F-score on NELL and Freebase15k. It turns out that the rule optimization strategy performs best. The MAP of the best model on NELL is 0.754, better than a state-of-the-art graphical model (0.306). The F-score of the best model on Freebase15k is 0.709.

Keywords: Knowledge Graph; Link Prediction; Probabilistic Soft Logic; Hinge-loss Markov Random Fields

Page 4: Exploring declarative rule-based probabilistic frameworks ...1119117/FULLTEXT01.pdf · Alexandra, Omar, Marci, Sonja, David, Benny and Hannes, for giving me suggestions and supports

Abstrakt (Swedish abstract)

A knowledge graph stores information from the web in the form of relationships between different entities. The quality of a knowledge graph is determined by how complete it is and by its accuracy. Unfortunately, many current knowledge graphs have shortcomings in the form of missing facts and incorrect information. Current solutions for predicting links between entities have problems with scalability and high labor costs. This thesis proposes a declarative rule-based probabilistic framework for performing link prediction. The system incorporates a rule-mining model into hinge-loss Markov random fields to propose links. Furthermore, three rule optimization strategies were developed to improve the quality of the rules. Compared with previous solutions, this work contributes a drastic reduction of labor costs and a more tractable model. Each method has been evaluated with precision and F-score on NELL and Freebase15k. It turns out that the rule optimization strategy performed best. The MAP of the best model on NELL is 0.754, which is better than a current state-of-the-art graphical model (0.306). The F-score of the best model on Freebase15k is 0.709.

Nyckelord (keywords): Knowledge Graph; Link Prediction; Probabilistic Soft Logic; Hinge-loss Markov Random Fields


Acknowledgements

I’m deeply grateful that I was given the opportunity to conduct my master thesis at Meltwater. During these 6 months, I met a lot of amazing people and learned a lot that I would never have learned from textbooks.

I would first like to express many thanks to Bhaskar Chakraborty, research scientist at Meltwater, for giving me this opportunity, helping me formulate the research question and guiding me in solving complex questions step by step. I am grateful that he was always kind and committed to supervising me. I would also like to thank my colleagues at Meltwater, Alexandra, Omar, Marci, Sonja, David, Benny and Hannes, for giving me suggestions and support during these months.

I want to sincerely thank my examiner Šarunas Girdzijauskas and my supervisor Vladimir Vlassov at KTH for their constructive feedback and answers during the different phases of my thesis.

Finally, I would like to thank my family and friends who encouraged me to finish my master program in Sweden. I could not have achieved this without your understanding and support.


Contents

List of Figures

1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Goal
  1.4 Purpose
  1.5 Contribution
  1.6 Methodology
  1.7 Ethics & Sustainability
  1.8 Stakeholders
  1.9 Delimitations
  1.10 Outline

2 Extended Background
  2.1 Natural Language Processing
  2.2 Knowledge Graph Construction
    2.2.1 Knowledge Collection
    2.2.2 Knowledge Fusion
    2.2.3 Entity Linking
    2.2.4 Knowledge Representation and Reasoning
  2.3 Related Work
    2.3.1 PRA
    2.3.2 TATEC

3 Rule-based Probabilistic Framework
  3.1 Statistical Relational Learning
  3.2 Rule Mining
    3.2.1 RDF-style KB
    3.2.2 Language
    3.2.3 X-World Assumption
    3.2.4 Measures
    3.2.5 AMIE+
    3.2.6 Related Work
  3.3 Probabilistic Framework
    3.3.1 MLN
    3.3.2 PSL
  3.4 Pilot Test
    3.4.1 Data
    3.4.2 Experimentation
    3.4.3 Discussion

4 Advanced Rule-based Probabilistic Framework
  4.1 Architecture
  4.2 Preprocessing
  4.3 Data Preparation
  4.4 Rule Mining Model
  4.5 Rule Optimization Model
    4.5.1 Fixed-threshold
    4.5.2 Rule-learning

5 Evaluation
  5.1 Evaluation
  5.2 NELL
    5.2.1 Data
    5.2.2 Rules
    5.2.3 Rule optimization & Inference
    5.2.4 Result analysis
  5.3 Freebase15k
    5.3.1 Data
    5.3.2 Rules
    5.3.3 Result analysis

6 Discussion and Conclusion
  6.1 Results
  6.2 Challenges
  6.3 Future Work
    6.3.1 Generating negative rules
    6.3.2 Predicting multiple relations
    6.3.3 Alternative rule optimization strategy
    6.3.4 Applying on noisy data
  6.4 Conclusion

Bibliography

A Data preparation algorithm

B AMIE+ algorithm
  B.1 Rule mining algorithm
  B.2 Rule output check
  B.3 SPARQL examples

C Rule optimization algorithm
  C.1 Forward optimization
  C.2 Backward optimization
  C.3 Incompatibility computation

D Results
  D.1 Examples of generated rules from NELL
  D.2 Results of NELL & Freebase15k


List of Figures

2.1 A graph model example
2.2 Results of two knowledge reasoning models

3.1 The mining model of AMIE[15]
3.2 An example of a ground Markov Network
3.3 A visualization of Lukasiewicz logic
3.4 The number of generated rules in Kinship
3.5 The performance of changing the number of rules in the Kinship dataset
3.6 The performance of changing the weights of rules in the Kinship dataset

4.1 An overview of the system showing all the individual components and their outputs
4.2 A path
4.3 An example of a path
4.4 Examples of cycles

5.1 A visualization of PR curve and ROC curve


Abbreviations

ADMM Alternating Direction Method of Multipliers
AI Artificial Intelligence
CWA Closed-world Assumption
ETL Extract, Transform, Load
HL-MRFs Hinge-loss Markov Random Fields
ILP Inductive Logic Programming
KEE Knowledge Engineering Environment
MLE Maximum Likelihood Estimation
MLN Markov Logic Network
MPE Most Probable Explanation
MRF Markov Random Fields
NELL Never-Ending Language Learning
NER Named Entity Recognition
NLP Natural Language Processing
OWA Open-world Assumption
PCA Partial-completeness-world Assumption
PRA Path Ranking Algorithm
PSL Probabilistic Soft Logic
SDG Sustainable Development Goals
SRL Statistical Relational Learning


Chapter 1

Introduction

There is an enormous amount of information on the web in the form of natural language, both structured and unstructured. Investigating how to make use of such a huge data repository to improve applications in different industries has been a hot research topic for a long time. Recently, many knowledge graphs have been created, such as NELL[9] and Freebase[4], aiming to extract as much high-quality information as possible from the web, convert it into machine-readable formats and store it in databases as a strong base for advanced applications. A good example is Google's Knowledge Graph[12], a knowledge base used by Google to enrich its search engine's results with structured and semantic information. It helps users resolve their queries more efficiently without navigating to other web pages and collecting answers by themselves. Knowledge graphs are a backbone of many information systems that require structured knowledge.

1.1 Background

Knowledge graphs are graph structured knowledge bases where each node represents an entity such as a person, location or organization, and each edge represents an existing relationship between entities such as friendship or competitor. To build a knowledge graph, huge amounts of facts are extracted from different sources, including structured databases like Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page) and WordNet (https://wordnet.princeton.edu/) as well as unstructured web pages. A fact is a statement expressing a dependency among entities. For example, a fact could be Stockholm is the capital city of Sweden, which has more than 910k residents. In this case, Stockholm and Sweden are entities and capital_city is a relationship. Eventually, the extracted facts are cleaned, converted into machine-readable formats and stored in databases.

However, it may be problematic to directly use these databases in applications, because existing knowledge graphs often miss facts and contain incorrect information. For example, in Freebase (https://developers.google.com/freebase/) more than 70% of people don't have a place of birth attribute. This is why knowledge reasoning is performed on knowledge graphs.


Knowledge reasoning is also called link prediction or knowledge graph completion; it aims to predict the existence of edges (i.e., relations) in the graph.

So far, many link prediction approaches have been proposed and applied in different applications. They can be divided into three categories: graph-based models, embedding-based models and rule-based models. Graph-based models predict the existence of an edge by extracting features from the observed graph. Embedding-based models assign latent features to each entity and relation, then predict an edge by computing the latent features of its entities. Rule-based models incorporate declarative rules into probabilistic frameworks to infer new edges.

Among the three categories, rule-based models have the following advantages:

1. Graph-based models and embedding-based models have low scalability on certain types of databases.

2. Rule-based models can make use of prior knowledge or expertise, which is beneficial when solving tasks in new fields.

3. Declarative rules are understandable by both machines and human beings, which makes the models more tractable.

A big challenge of current rule-based models such as Markov Logic Network (MLN)[36] and Probabilistic Soft Logic (PSL)[24] is that the needed declarative rules are fully generated by human beings, which is costly and inefficient. This thesis aims to fill this gap.

1.2 Problem Statement

This thesis solves two problems. First, it reduces the manual costs of previous rule-based models by combining a rule-mining model and PSL to automatically generate first-order logic rules from graphs.

Second, it improves the quality of the rules generated in the previous step by learning them and shrinking them to a high-quality subset. After that, this small subset of rules is applied to the hinge-loss Markov random fields (HL-MRFs) in PSL to infer new facts.

The overall research question of this thesis is:

How to efficiently incorporate declarative rules into probabilistic frameworks to do link prediction in knowledge graphs?

More specifically, the research question can be divided into three parts:

1. How to automatically generate declarative rules from knowledge graphs?

2. Which probabilistic framework should be used to infer new facts in knowledge graphs?


3. How to improve the quality of declarative rules?

1.3 Goal

The goal of this thesis is to develop a system that makes previous rule-based models more automatic and meanwhile improves the quality of rules. More specifically, the system should first automatically generate rules from knowledge bases, then improve the quality of these rules. Finally, it will use the rules to infer new facts.

1.4 Purpose

This thesis will examine the performance of different link prediction models for knowledge graphs. The focus is on rule-based models, which leverage straightforward rules and the properties of Markov random fields to infer more facts. Moreover, it points readers to open research topics in this area.

1.5 Contribution

This thesis has two contributions:

1. It is the first work to incorporate a rule-mining model into a rule-based link prediction model. With this combination, manual costs can be reduced considerably.

2. It develops several rule optimization strategies to obtain a small subset of high-quality rules.

1.6 Methodology

A range of methods is applied to answer the research questions. Together they ensure the quality of the produced results. In general, this thesis follows a scientific research methodology consisting of formulating a research question, identifying a hypothesis, making testable predictions, designing experiments, and finally determining results and assessing their validity.

A deductive approach[16] is a research methodology that designs a research strategy to test a hypothesis based on existing theories. It moves from a more general level to a more specific level. Deductive approaches are often associated with scientific research. This thesis uses a deductive approach to formulate the hypothesis from past and current research on link prediction. It is assumed that the manual costs of previous rule-based models can be reduced while the performance improves. Based on this hypothesis, a series of experiments is designed to test its validity.


Quantitative research[33] is a way of collecting data and using statistical models to observe relationships between variables. It uses numbers to formulate facts and uncover patterns. This thesis collects numerical data from open datasets and uses different models to verify whether the proposed system validates the hypothesis.

1.7 Ethics & Sustainability

This work intends to help people understand rule-based models for link prediction. The theories from other researchers' work are properly cited in the references. The data used in this thesis comes from public datasets, so there are no copyright issues. In Chapter 2, results achieved by the stakeholder are mentioned, but the detailed implementation is confidential. All the results from the experiments are valid, reliable and can be replicated.

The 17 Sustainable Development Goals (SDG, http://www.un.org/sustainabledevelopment/sustainable-development-goals/) proposed by the United Nations had been adopted by 193 countries by September 2015. The goals aim at ending poverty, hunger and inequality worldwide. Goal No. 9 (http://www.un.org/sustainabledevelopment/infrastructure-industrialization/) emphasizes the importance of innovation in technology.

"Technological progress is the foundation of efforts to achieve environmental objectives. Without technology and innovation, industrialization will not happen, and without industrialization, development will not happen."

This thesis contributes to this goal by developing a new system which fills a gap in previous research and meanwhile reduces manual costs. A complete knowledge graph can be a strong base for applications in different domains. It can be beneficial not only in IT, but also in agriculture, infrastructure, medical science, etc.

1.8 Stakeholders

This thesis was commissioned by Meltwater AB, based in Stockholm, a B2B media monitoring company developing business intelligence software to help businesses grow their brands. The vision of Meltwater is to help companies make better, more informed decisions based on insights from the outside.

Meltwater is developing a company-wise knowledge graph to provide more insights to their customers. They have applied two different link prediction models and plan to use the work of this thesis as a third link prediction method.

1.9 Delimitations

Link prediction can be categorized into two scenarios: predicting links that will appear in the future in a dynamic graph, and predicting missing edges in the currently


observed graph. This thesis is limited to performing link prediction on static graphs, which do not take the time dimension into consideration.

Besides, rule-based models are suited both for predicting new facts and for correcting wrong information in the observed graph. In this thesis, the collected data is deemed to be true, so this work does not focus on correcting wrong information.

In the experiments, one relation is predicted at a time. This thesis does not predict multiple relations, but this is considered part of the future work.

1.10 Outline

This thesis is structured as follows. Chapter 2 presents the background on knowledge graphs. Chapter 3 introduces the rule-based probabilistic framework and a pilot test. Chapter 4 shows the advanced rule-based probabilistic framework with its implementation steps. Chapter 5 evaluates the experimental results. Finally, Chapter 6 presents the discussion and conclusion of this work.


Chapter 2

Extended Background

A knowledge graph is not a simple graph but a complex system containing different significant components. This chapter covers the basic components of a knowledge graph with a focus on knowledge representation and reasoning. Moreover, two state-of-the-art approaches and their performance on Meltwater's in-house database are described.

2.1 Natural Language Processing

Natural Language Processing (NLP) is a large field of computer science and a component of Artificial Intelligence (AI) that concerns the interaction between computers and human languages. The goal of NLP is to make computers understand natural languages in order to perform useful tasks. Since people communicate almost everything in language, NLP can be applied everywhere: web search, language translation, recommendation systems, sentiment analysis, etc. Behind NLP applications there are many underlying techniques. Traditional NLP techniques include tokenization, named entity recognition and the bag-of-words model[29]. Due to the complexity and diversity of human languages, machine learning and deep learning have also been applied to challenging NLP applications, such as Google translation[42], IBM Bluemix and Apple Siri. This thesis is on the semantic side of NLP, which focuses on understanding and utilizing the relationships between words.

2.2 Knowledge Graph Construction

Knowledge graphs are graph structured knowledge bases that contain information in the form of relationships between entities. In a knowledge graph, each node is an entity and each edge refers to the relationship between two entities. Recently, a large number of knowledge graphs have been created, including NELL[9], YAGO[28], Freebase[4], DBpedia[2] and Google's Knowledge Graph[12]. These knowledge graphs provide semantically structured information that is interpretable by computers, which is deemed an important base for more intelligent applications. A good example is Google's


Knowledge Graph, which enriches the results of Google's search engine by providing semantic-search information in addition to links to web pages. For example, if a user searches for Meltwater, a traditional search engine will only return web links related to Meltwater. But with its knowledge graph, in addition to a list of links, Google can directly provide structured and detailed information about Meltwater, such as headquarters, CEO, founding year and a short summary, in a knowledge panel. From disordered web pages to structured information, knowledge graphs help users find important information about topics without navigating to other pages and gathering information by themselves. Users can even dive into a topic and explore more by following the knowledge graph.

Knowledge graphs can contain very general facts from several sources in different domains; a good representative is Google's Knowledge Graph. On the other hand, they can also be domain-specific, aiming at connecting a huge number of entities in the same domain and discovering relationships between them, such as an academic knowledge graph[22] or a social network.

This thesis focuses on more general knowledge graphs which contain information from different domains. This section describes how a knowledge graph is built.

2.2.1 Knowledge Collection

Knowledge collection is the first step in building a functional knowledge graph, where knowledge is collected from various sources and becomes interpretable by computers. The goal is to make the knowledge graph complete, accurate and of high quality. The methods of constructing a knowledge graph can be classified into four groups[34]:

• curated approaches: the knowledge graph is created manually by a closed group of experts, such as WordNet[30]. This leads to highly accurate results but is not scalable due to the expensive human effort.

• collaborative approaches: the knowledge graph is created manually by a group of volunteers, such as Freebase. It is more scalable than curated approaches but is far from complete. For example, in Freebase, more than 70% of people don't have the place of birth attribute.

• automated semi-structured approaches: the knowledge graph is automatically extracted from semi-structured text via learned rules, such as YAGO. These approaches extract information from Wikipedia, which leads to high accuracy. However, incompleteness is an issue here as well; most of these knowledge graphs cover only a small fraction of the knowledge on the web.

• automated unstructured approaches: the knowledge graph is automatically extracted from unstructured text via NLP techniques like Named Entity Recognition (NER), such as NELL. Due to its high scalability, this approach makes the knowledge graph more complete by extracting all the possible information on the web. On the other hand, it induces a lot of noisy, conflicting and ambiguous data. For example, an open information extraction system may have triples like ("Trump","born in","New York"), ("Donald Trump","place of birth","NY") and ("Trump","born in","Russia").


The system will take all three triples as correct and distinct, which is not true. Inducing wrong and duplicate information is a disadvantage of this approach.

2.2.2 Knowledge Fusion

Directly extracting information from different web sources induces a lot of noisy and conflicting information. Knowledge fusion resolves conflicting values from different sources[13] and finds the true values.

If we view the input of knowledge fusion as a two-dimensional matrix, each row represents a data item, such as the birth place of Trump. Each column represents a data source, such as CNN (http://edition.cnn.com/) or Times (https://www.thetimes.co.uk/). Each cell represents the value provided by the corresponding data source for the specific data item, such as New York or Russia. Knowledge fusion aims at finding the true cells of the matrix. Several approaches have been proposed to solve this problem. Voting[13] is a baseline approach: it counts the votes for each value and chooses the one with the highest count. Quality-based approaches[13] take the trustworthiness of the data source into consideration; a source with higher trustworthiness contributes a higher vote. Relation-based approaches take both the trustworthiness of the data sources and the relationships between the data sources into consideration, so fake news that has been copied many times does not get more votes. In addition, more advanced algorithms, such as introducing a third dimension into the matrix[13] or utilizing the relationships between the sources and their information[44], have been applied widely.
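As a toy illustration of the voting and quality-based baselines described above (a hedged sketch, not the implementation used at Meltwater; the claims, source names and trust scores are invented):

from collections import Counter

# Toy claims: (data item, source, claimed value). All values are made up for this example.
claims = [
    ("Trump/place_of_birth", "source_a", "New York"),
    ("Trump/place_of_birth", "source_b", "New York"),
    ("Trump/place_of_birth", "source_c", "Russia"),
]

# Assumed source trustworthiness for the quality-based variant.
trust = {"source_a": 0.9, "source_b": 0.8, "source_c": 0.3}

def fuse_by_voting(claims):
    """Pick, for each data item, the value claimed by the most sources."""
    votes = {}
    for item, _, value in claims:
        votes.setdefault(item, Counter())[value] += 1
    return {item: counter.most_common(1)[0][0] for item, counter in votes.items()}

def fuse_by_trust(claims, trust):
    """Quality-based variant: each source votes with its trustworthiness score."""
    votes = {}
    for item, source, value in claims:
        votes.setdefault(item, Counter())[value] += trust.get(source, 0.5)
    return {item: counter.most_common(1)[0][0] for item, counter in votes.items()}

print(fuse_by_voting(claims))        # {'Trump/place_of_birth': 'New York'}
print(fuse_by_trust(claims, trust))  # same winner here, but weighted by source quality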

2.2.3 Entity Linking

During knowledge collection, it may happen that the same entity appearing under different names is regarded as several different entities. The goal of entity linking (i.e., Named Entity Disambiguation) is to link a named entity to an instance in a knowledge base such as DBpedia. At Meltwater, a state-of-the-art algorithm called AGDISTIS[39] is used to solve the entity linking problem. It is an open source, knowledge-base-agnostic approach that combines the HITS algorithm with label expansion and string similarity measures. More details can be found in the paper[39]. Meltwater has applied AGDISTIS to solve the entity linking problem for its internal data; the micro-average F-score reaches 0.7 when disambiguating 50 company names in 1825 manually checked documents.
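AGDISTIS itself combines HITS with label expansion and string similarity; the sketch below only illustrates the string-similarity ingredient using Python's difflib and is not AGDISTIS. The entity URIs and labels are invented:

import difflib

# Tiny mock knowledge base of candidate entities (URIs and labels are made up).
kb_labels = {
    "http://example.org/Meltwater_(company)": "Meltwater",
    "http://example.org/Meltwater_Entertainment": "Meltwater Entertainment",
    "http://example.org/Stockholm": "Stockholm",
}

def link_mention(mention, kb_labels, threshold=0.6):
    """Return the KB entity whose label is most similar to the mention, or None."""
    best_uri, best_score = None, 0.0
    for uri, label in kb_labels.items():
        score = difflib.SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri if best_score >= threshold else None

print(link_mention("meltwater ab", kb_labels))   # -> http://example.org/Meltwater_(company)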

2.2.4 Knowledge Representation and Reasoning

The process described so far is called ETL (Extract, Transform, Load), a process in data warehousing that focuses on extracting data, transforming it into proper formats and finally storing it in databases. The outcome of ETL is a relatively clean and structured database which serves as a base for more advanced computation.


Knowledge representation is usually a step after ETL, where information is represented in a form that machines can utilize to solve complex tasks. According to [38], knowledge representation is the application of logic and ontology to the task of constructing computable models for some domain. Logic provides the formal structure and rules for inference. Ontology defines the kinds of things that exist in the application. Computable models can be implemented in computer programs. The field of knowledge representation is always associated with knowledge reasoning, also referred to as link prediction, because knowledge representation formalisms are useless without the ability to reason with them. The choice of representation can have a major effect on the way of reasoning and its ultimate performance.

The earliest work in knowledge representation focused on solving general problems, such as the General Problem Solver[41]. The program was intended to work as a universal problem solver, so that any problem could be expressed as a set of Horn clauses. In the mid 1980s, frame-based languages[23] were developed to describe things, problems and potential solutions. Frames are stored as ontologies of sets and subsets of the frame concepts. KL-ONE[7] is a frame language that attempts to explicitly represent conceptual information as a structured inheritance network. After that, rules started to get attention, as they are good for representing and utilizing complex logic. Such improvements and innovations across different research topics were driven by commercial ventures such as the Knowledge Engineering Environment (KEE).

Historically, knowledge representation was more involved with psychology, in terms of realistic human models, and with linguistics, in terms of representations of word senses. Currently, a more coherent interpretation is based on logic and computer science. Logic helps researchers deal with truth-preserving operations over symbolic structures. A perfect logic can not only deal with symbolic structures and operations but can also be defined formally and preserve semantic properties.

Traditionally, there are four main kinds of knowledge representation[26]:

1. Database form is a knowledge base that contains only the kind of information that can be represented in a database. The knowledge in a database is complete, so there is no need to do knowledge reasoning and inference reduces to simple calculation. An example could be a university course database which contains information about departments, courses and course IDs. If we want to know the number of courses offered by the Computer Science Department, all we have to do is count the number of tuples that appear in the course relation.

2. Logic program form is a knowledge base that has complete knowledge of the world within a given language such as PROLOG[11], but it requires more complex inference to answer questions. Knowledge reasoning can be executed in a logic program knowledge base. For example, consider a knowledge base consisting of:

parent(Bill, Mary)
parent(Bill, Sam)
parent(X, Y) ∧ female(Y) ⇒ mother(X, Y)
female(Mary)

Page 20: Exploring declarative rule-based probabilistic frameworks ...1119117/FULLTEXT01.pdf · Alexandra, Omar, Marci, Sonja, David, Benny and Hannes, for giving me suggestions and supports

10 Chapter 2. Extended Background

Only after executing the program can we know who the mother of Bill is. The power of logic programs is that users have the ability to incorporate domain-specific control knowledge. A knowledge representation system only manages a limited form of inference and leaves it to users to complete it intelligently. Programmability is the major attraction of such knowledge bases: it allows users to integrate procedural and declarative concerns. Such flexibility has motivated the development of many logic-based knowledge representation and reasoning systems.

3. Semantic networks are databases that can be represented by a labeled directed graph. The nodes are entities, and the edges are labeled with an attribute. The significance of such a graphical representation is that inference can be performed by graph-searching techniques such as breadth-first search and depth-first search. Computationally, improvements in the efficiency of graph-searching techniques will improve the performance of inference in a knowledge base of this form. Different from logic programs, the graph representation suggests inference based more directly on the structure of the knowledge base than on its logical content. For example, a path in the graph can be used to explain the connection between entities instead of using a logical representation.

4. Frames is a form of database emphasizing the structure of entities in terms of their attributes. A frame-style knowledge base contains values, restrictions and attached procedures. Values state the attributes of an entity. Restrictions state the constraints that must be satisfied by attribute values. Attached procedures provide advice on how an attribute should be used. For example, consider the statement[26]:

Computer Science students taking at least three graduate courses in departments within Engineering.

The frame language would describe it as:

(Student
 with a department is computer-science and
 with 3 enrolled-course is a
  (Graduate-Course
   with a department is an Engineering-Department))

Apart from these four forms, many new forms of knowledge representation have appeared, leading to new techniques for knowledge reasoning. Extracting latent structure that represents the properties of individual entities from the observed data has become one of the popular methods[31]. This method uses different features to describe each attribute, and the relationship between entities can be defined by their features. For example, there could be separate features for female, male, student, etc. The presence or absence of each of the features can define a person and determine the relationships between people. One kind of latent-feature model was developed for social networks[21], where real-valued vectors are used as features of the entities. The distance, inner product or weighted combination of the vectors of two entities determines the likelihood of there being a link between them.
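A minimal, hedged sketch of that last idea follows; the entities, relation and random vectors are invented, whereas a real model would learn the vectors from data:

import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Assumed toy embeddings; in a real latent-feature model these are learned, not random.
entities = {name: rng.normal(size=dim) for name in ["Penelope", "Victoria", "Arthur"]}
relations = {"mother": rng.normal(size=dim)}

def link_score(head, relation, tail):
    """Weighted-combination style score: higher means the link is more plausible."""
    return float(np.dot(entities[head] * entities[tail], relations[relation]))

for tail in ["Victoria", "Arthur"]:
    print("mother(Penelope, %s) -> %.3f" % (tail, link_score("Penelope", "mother", tail)))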


2.3 Related Work

At Meltwater, a graphical model and a latent-feature model have been implemented to perform knowledge representation and reasoning on its in-house database. In this section, the performance and challenges of these two approaches are explained.

2.3.1 PRA

Path Ranking Algorithm (PRA)[25] is a graphical model to perform link prediction. In graphical models, nodes represent entities and edges represent relations between them. The existence of an edge can be predicted based on the structure of the current graph. Figure 2.1 shows an example of the graphical model. Assume that the mother edge from Penelope to Victoria is missing. Considering the symmetric structure of the graph, it is easy to predict the existence of this edge. Formally, the link prediction problem based on graph models is defined as Definition 2.3.1.

Figure 2.1: A graph model example

Definition 2.3.1 (Graph based link prediction) Given a snapshot of a network $G = (V, E)$ at time $t$, where $V$ is the set of nodes and $E$ is the set of edges, link prediction is to predict whether there will be a link $e(u, v)$ between nodes $u$ and $v$ at a time $t' \geq t$, where $u, v \in V$ and $e(u, v) \notin E$.

PRA performs link prediction by computing feature matrices over node pairs in the graph, which then become the input to a logistic regression model. Each predicted pair is assigned a weight as a confidence number. PRA has three steps. First, it generates a connectivity matrix where each row is a node pair $(s_j, t_j)$ and each column is an edge label $r$: if an edge label appears on a path from $s_j$ to $t_j$, the corresponding cell value is non-zero. PRA performs random walks on the graph from $s_j$ to $t_j$ and records which paths connect the node pairs; frequent paths become the columns of the feature matrix. Second, PRA assigns a value to each cell of the feature matrix: the probability of reaching $t_j$ from $s_j$ in a random walk constrained to the corresponding path type. Finally, the feature matrix is given to a logistic regression classification model.
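A heavily simplified, hedged sketch of this pipeline: it uses binary path-type indicators instead of the random-walk probabilities PRA actually computes, the path types and training pairs are invented, and scikit-learn is assumed to be available.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented path types (edge-label sequences) that may connect a node pair.
path_types = ["father->husband", "mother->wife", "brother->father"]

# One row per node pair; real PRA fills cells with random-walk probabilities,
# here we only mark whether a path of each type was observed.
X = np.array([
    [1, 1, 0],   # e.g. paths observed between (Penelope, Victoria)
    [1, 0, 1],
    [0, 0, 1],
])
y = np.array([1, 1, 0])   # whether the target relation holds for each pair

clf = LogisticRegression().fit(X, y)
new_pair = np.array([[0, 1, 0]])          # feature vector of an unseen node pair
print("P(relation exists) =", clf.predict_proba(new_pair)[0, 1])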

In spite of its feasibility for solving link prediction problems, PRA still has some limitations. Since the feature space in PRA is so large, it is computationally intensive to compute the feature values in the second step. Hence, in 2015 Matt Gardner and Tom Mitchell proposed an advanced version of PRA called Subgraph Feature Extraction


(SFE)[18]. SFE extracts richer features than PRA between node pairs, which dramatically reduces computation. The richer features include PRA-style features, path-bigram features, one-sided features, vector space similarity features, etc. More details can be read in the paper[18]. It turns out that SFE outperforms PRA, improving mean precision from 0.432 to 0.528 on NELL.

In Meltwater, we used SFE with PRA-style features to perform link prediction. The PR curve is shown in Figure 2.2a.

2.3.2 TATEC

TATEC[17] is an embedding model where triples are explained via latent features of entities. In embedding models, each entity and relation is represented by latent features. The latent features of two entities can be combined; if the combination matches the latent feature of a relation, the node pair shares this relation. For example, in Figure 2.1, Penelope could be represented by latent features and Victoria could be represented by other latent features. If their combination is equal to the latent feature of the relation mother, it means that Penelope is the mother of Victoria.

Current latent feature models include tensor factorization models, matrix factorization models, bilinear models and neural tensor networks; more details can be read in the paper[34]. Among these models, TATEC is an approach that overcomes overfitting problems on rare relations and low capacity for complex relations by separately pre-training a 2-way model and a 3-way model and then combining them. The 2-way model uses binary interactions between the head and the tail, the head and the label, and the label and the tail; it uses vectors to represent relations. The 3-way model uses joint interactions between the head, the tail and the label; it uses matrices to represent relations. The goal of TATEC is to learn embeddings of entities in a low dimensional vector space and then learn a score function on the set of all possible triples. Triples with higher scores are more likely to have the target relation than ones with lower scores. The score function can be formulated as follows:

$s(h, l, t) = s_1(h, l, t) + s_2(h, l, t)$   (2.1)

$s_1(h, l, t) = \langle r^1_l \mid e^1_h \rangle + \langle r^2_l \mid e^1_t \rangle + \langle e^1_h \mid D \mid e^1_t \rangle$   (2.2)

$s_2(h, l, t) = \langle e^2_h \mid R_l \mid e^2_t \rangle$   (2.3)

where $s_1$ and $s_2$ are the score functions of the 2-way model and the 3-way model, $e^1_h$ and $e^1_t$ are the embeddings of the head and tail of the triple $(h, l, t)$, and $r^1_l$ and $r^2_l$ are vectors that depend on the relation $l$. $D$ is a diagonal matrix that does not depend on the input, and $R_l$ is a matrix of low dimension.

After pre-training the two separate models, the next step is to combine them. It has been observed that the combined model outperforms either individual model on several benchmarks. There are two approaches: fine tuning and linear combination. Fine tuning (2.4) simply sums the two models, while the linear combination (2.5) uses different learned weights $d^l_i$.

$s(h, l, t) = \langle r^1_l \mid e^1_h \rangle + \langle r^2_l \mid e^1_t \rangle + \langle e^1_h \mid D \mid e^1_t \rangle + \langle e^2_h \mid R_l \mid e^2_t \rangle$   (2.4)


$s(h, l, t) = d^l_1 \langle r^1_l \mid e^1_h \rangle + d^l_2 \langle r^2_l \mid e^1_t \rangle + d^l_3 \langle e^1_h \mid D \mid e^1_t \rangle + d^l_4 \langle e^2_h \mid R_l \mid e^2_t \rangle$   (2.5)
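A small, hedged numeric sketch of the combined score (2.4) follows; the dimensions and random embeddings are arbitrary stand-ins for parameters that TATEC would learn by pre-training the two models:

import numpy as np

rng = np.random.default_rng(1)
d1, d2 = 6, 4   # embedding sizes of the 2-way and 3-way models (chosen arbitrarily here)

# Assumed toy parameters; TATEC learns these from data.
e1_h, e1_t = rng.normal(size=d1), rng.normal(size=d1)   # 2-way entity embeddings
e2_h, e2_t = rng.normal(size=d2), rng.normal(size=d2)   # 3-way entity embeddings
r1_l, r2_l = rng.normal(size=d1), rng.normal(size=d1)   # relation vectors (2-way)
D = np.diag(rng.normal(size=d1))                        # shared diagonal matrix
R_l = rng.normal(size=(d2, d2))                         # relation matrix (3-way)

s1 = r1_l @ e1_h + r2_l @ e1_t + e1_h @ D @ e1_t        # 2-way score, eq. (2.2)
s2 = e2_h @ R_l @ e2_t                                  # 3-way score, eq. (2.3)
print("fine-tuning score s(h, l, t) =", s1 + s2)        # combined model, eq. (2.4)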

In Meltwater, we used different settings of TATEC to perform link prediction on two benchmarks: Kinships (Denham, 1971) and SVO (Jenatton, 2012). The PR curve is shown in Figure 2.2b.

Figure 2.2: Results of two knowledge reasoning models. (a) PR-curve of PRA; (b) PR-curve of the embedding model.

Challenges However, PRA and TATEC both have limitations when performing link prediction. PRA and other graphical models are well-suited for modeling local and quasi-local graph patterns, because they usually consider paths of at most four hops; they are computationally efficient if $s_j$ is connected to $t_j$ by short paths. TATEC and other embedding models, on the other hand, are well-suited for modeling global relational patterns represented by latent features; they are computationally efficient if triples can be interpreted with a small number of latent features. Besides, the model of an embedding-based approach is not tractable. Apart from this, bad scalability is a common problem for both approaches. Especially for TATEC, every new node pair leads to a completely new re-training since it captures global patterns. In Meltwater's case, PRA took more than 3 hours to train a model for one relation while TATEC took more than 20 hours.


Chapter 3

Rule-based Probabilistic Framework

The rule-based model is a branch of knowledge representation and reasoning that uses a logical language to describe entities and relationships. To better capture the structure and uncertainty in the database, rules are usually incorporated into a probabilistic framework. A common way is to use first-order logic rules as the language to describe dependencies in the database, and Markov random fields (MRFs) as the probabilistic framework to capture the uncertainty. More specifically, first-order logic rules are defined as feature functions, and the MRF assigns a probability to each possible fact by using a score function which is a weighted combination of the feature functions. This task is in the domain of Statistical Relational Learning (SRL), which focuses on learning compositional structures and handling uncertainty. As of May 2017, the rules in current rule-based probabilistic frameworks are defined by human beings. Knowledge from experts is needed when solving problems in specific domains, but in more general cases such a manual process is too costly. Therefore, this thesis proposes to insert a rule-mining model before the probabilistic framework to reduce manual costs.
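The sketch below is a schematic, hedged illustration of that idea, not PSL itself: each weighted rule acts as a feature function whose hinge-shaped distance to satisfaction penalises assignments that violate it, and the preferred truth value of a query atom minimises the weighted sum of penalties. The rules, weights and body truth values are invented, and the grid search only stands in for the convex optimization that a real HL-MRF solver performs.

# Schematic sketch: weighted rules as feature functions over soft truth values in [0, 1].

def implication_distance(body_truth, head_truth):
    """Distance to satisfaction of body => head under Lukasiewicz-style semantics."""
    return max(0.0, body_truth - head_truth)

rules = [
    # (weight, truth value of the grounded body, description) - toy numbers for illustration
    (3.0, 0.9, "livesIn(A,P) & livesIn(B,P) => couple(A,B)"),
    (1.0, 0.2, "worksWith(A,B)              => couple(A,B)"),
]

def penalty(head_truth):
    """Weighted sum of rule penalties for a candidate truth value of the head atom."""
    return sum(w * implication_distance(body, head_truth) for w, body, _ in rules)

# Pick the truth value with the smallest total penalty (grid search; PSL solves the
# corresponding convex problem exactly).
best = min((penalty(v / 100.0), v / 100.0) for v in range(101))
print("preferred truth value for couple(A,B):", best[1])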

This chapter first briefly introduces SRL. Then it introduces the principles of mining rules from knowledge bases and several state-of-the-art rule mining systems. After that, two rule-based probabilistic frameworks, Markov Logic Network (MLN) and Probabilistic Soft Logic (PSL), are introduced. At last, a pilot test is carried out to see the performance of directly combining a rule-mining model and a probabilistic framework. The goal is to see whether the generated rules fit the framework well and to provide insights for rule optimization strategies.

3.1 Statistical Relational Learning

SRL[20] is a sub-discipline of machine learning which addresses problems in relational domains where observations are missing, partially observed or noisy but dependent. SRL uses Inductive Logic Programming (ILP) languages to represent the dependencies between variables and uses probability theory to assign or predict probabilities. With ILP languages, it is easier to add expert knowledge to the network, empowering algorithms to explore new fields. A variety of SRL approaches have been developed,


such as Bayesian logic and Markov logic, with applications in social network analysis, bio-informatics, etc.[20].

3.2 Rule Mining

Rule mining aims to discover interesting relations in large databases; it identifies strong rules in databases using some measures of reliability. The first rule-mining system was introduced in 1993[1]. Its goal was to mine association rules between items in a large database of customer transactions of supermarkets, in order to support business decisions about how to design coupons and how to place merchandise on shelves. In addition to market analysis, rule-mining systems are employed today in many applications such as bioinformatics, web usage mining and Customer Relationship Management[35].

This section gives an overview of the kind of rule-mining systems used in this thesis. It first explains the style of databases and the language. Then it presents the different important assumptions that need to be considered when mining rules. After that, several ways of evaluating the quality of rules are introduced. At last, it presents several state-of-the-art rule mining systems.

3.2.1 RDF-style KB

This thesis focuses on RDF-style knowledge bases, consisting of a set of RDF triples. Each RDF triple is a fact in the form ⟨x, r, y⟩, where x is the subject, r is the relation and y is the object. There are alternative ways to represent RDF triples; here we write a fact as r(x, y). For example, capital(Stockholm, Sweden) means that Stockholm is the capital of Sweden. Subjects and objects are usually entities from the real world such as people, locations, organizations, etc. Relations specify real-world relationships such as friendship, competitor, customer, etc. Some knowledge bases have A-Box and T-Box triples, where A-Box triples contain instance data and T-Box triples contain classes, domains and ranges of relations. This thesis only uses A-Box triples.
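A minimal sketch of how such A-Box facts can be held in memory as r(x, y) triples; the entities come from the running examples and the representation is only illustrative:

# A-Box facts stored as (relation, subject, object) triples - a tiny in-memory KB.
kb = {
    ("capital", "Stockholm", "Sweden"),
    ("livesIn", "Jay", "Stockholm"),
    ("livesIn", "Marry", "Stockholm"),
}

def facts(relation, kb):
    """All (subject, object) pairs of a given relation, i.e. the instances of r(x, y)."""
    return {(x, y) for (r, x, y) in kb if r == relation}

print(facts("livesIn", kb))   # {('Jay', 'Stockholm'), ('Marry', 'Stockholm')}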

3.2.2 Language

Any rule language can be defined by two parts: the syntax of the language and the semantics of the language. The syntax is a set of rules that structure symbols and the ways to compose them. The semantics is the meaning of the language. The rules in the customer transaction case were written in the form X ⇒ I | (c, s), where c is the confidence and s is the support, for example [Men's Furnishing] ⇒ [Men's Sportswear] (54.85, 5.21). Using such an ILP language has several advantages:

1. ILP is readable and computable by machines.

2. ILP has semantic meaning and is easily understood by human beings as well.

3. ILP can incorporate expert knowledge.


Propositional logic[8] is a branch of logical languages that considers ways of joining or modifying propositions to form more complex ones. It studies the logical operators and connectives that are used to produce complex statements whose truth value depends entirely on the truth values of their sub-statements, where the value of each statement is either true or false. In propositional logic, uppercase letters represent statements, and logical signs such as → and ∨ represent truth-functional operators. For example, consider the complex statement:

Stockholm is the most important city in Sweden if and only if Stockholm is the capital of Sweden and Stockholm has a population of over 910 thousand.

This statement can first be divided into three individual sub-statements, each of which can be represented by an uppercase letter:

I: Stockholm is the most important city in Sweden.
C: Stockholm is the capital of Sweden.
P: Stockholm has a population of over 910 thousand.

Then the statement can be translated into I ↔ (C ∧ P). However, propositional logic has many limitations: it is difficult to express large domains concisely, and it does not capture important concepts about the world. This thesis selects first-order logic as the language instead. Compared to propositional logic, which assumes the world contains facts, first-order logic assumes the world contains objects, relations and functions. Quantifiers can be applied to variables, so that variables can be universally quantified. For example, the previous example can be represented in first-order logic as:

capital(Stockholm, Sweden) ∧ population_over_910(Stockholm) ⇒ important_city(Stockholm, Sweden)

First-order logic includes description logic and Horn logic. This thesis aims at generating Horn logic rules.

Definition 3.2.1 (Horn Rule) A Horn rule consists of a head and a body, where the head is a single atom and the body is a set of atoms. We denote a rule as:

$B_1 \wedge B_2 \wedge B_3 \wedge \ldots \wedge B_n \Rightarrow r(x, y)$   (3.1)

where $\{B_1, \ldots, B_n\}$ is the body and $r(x, y)$ is the head. The abbreviated version is $\vec{B} \Rightarrow r(x, y)$. An instantiation of a rule is a rule where all variables are substituted by entities.

Definition 3.2.2 (Atom) An atom is a fact that can have variables at the subject and/or object position, such as r(x, y). An atom with constants for all arguments is called a ground atom, such as r(Stockholm, Sweden).

Definition 3.2.3 (Predicate) A predicate is a relation defined by a unique identifier, such as r in r(x, y).

Definition 3.2.4 (Closed rule) A closed rule is a rule where all its variables are closed. A variable is closed if it appears at least twice in the rule.


Example 3.2.1 Here is a Horn rule: livesIn(a, v0) ∧ livesIn(b, v0) ⇒ couple(a, b). It means that if a and b live in the same place, they are a couple. livesIn and couple are predicates; livesIn(a, v0), livesIn(b, v0) and couple(a, b) are atoms. If a, b, v0 are substituted by constants, as in livesIn(Jay, Marry), an atom becomes a ground atom. This rule is also a closed rule.
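A minimal, hedged sketch of grounding this rule over a toy fact set (the people and places are invented for the example):

# Grounding the rule livesIn(a, v0) & livesIn(b, v0) => couple(a, b) over toy facts.
lives_in = {("Jay", "Stockholm"), ("Marry", "Stockholm"), ("Sam", "Uppsala")}

def predict_couples(lives_in):
    """Instantiate the rule: every pair of distinct people sharing a place is predicted as a couple."""
    predictions = set()
    for a, place_a in lives_in:
        for b, place_b in lives_in:
            if a != b and place_a == place_b:
                predictions.add(("couple", a, b))
    return predictions

print(predict_couples(lives_in))
# {('couple', 'Jay', 'Marry'), ('couple', 'Marry', 'Jay')}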

3.2.3 X-World Assumption

In the Semantic Web, there are different possibilities for handling statements that are not in the database. In order to generate rules, rule-mining systems usually make different assumptions, and the assumption made has a huge effect on the performance of the system.

Open-world Assumption The Open-world Assumption (OWA) assumes that in knowledge bases, non-existing statements are regarded as unknown: the truth value of such a statement is independent of whether or not it is known to be true.

Closed-world Assumption The Closed-world Assumption (CWA) assumes that non-existing statements in a knowledge base are regarded as false: only the statements known to be true are true.

Partial-completeness-world Assumption The Partial-completeness-world Assumption (PCA) assumes that if the database knows some r attributes of x, then it knows all r attributes of x.

In general, CWA applies when a system has complete information. For example, consider a database of air routes: for this type of database it makes no sense to make an open-world assumption, because the database contains all the information. On the other hand, OWA and PCA apply when a system has incomplete information, which is the case when new information should be discovered. For example, consider a database of a person's clinical history. If the clinical history doesn't include a particular allergy, it cannot be concluded that this person doesn't suffer from that allergy; it is an unknown fact, not a false fact. This thesis is closer to the second scenario, where databases are incomplete and there is a need to infer new information.

3.2.4 Measures

Rule mining systems should be evaluated to show the reliability of rules. It's too risky to draw conclusions from rules with too few instances. For transaction databases[1], support and confidence are used to measure performance. However, these two metrics have limitations on other types of databases. Recently, head coverage and PCA confidence have been used in many rule mining systems, but the best way of measuring still depends on the type of system.

Definition 3.2.5 (support) The support of a rule is the number of correct predictions.



Definition 3.2.6 (head coverage) The head coverage is the proportion of pairs from the head relation that are covered by the predictions of the rule:

hc(B⃗ ⇒ r(x, y)) := supp(B⃗ ⇒ r(x, y)) / #(x', y') : r(x', y')    (3.2)

Definition 3.2.7 (standard confidence) The standard confidence is the ratio of a rule's predictions that are in the KB. It takes facts that are not in the KB as false:

conf(B⃗ ⇒ r(x, y)) := supp(B⃗ ⇒ r(x, y)) / #(x, y) : ∃ z1, ..., zm : B⃗    (3.3)

Definition 3.2.8 (PCA confidence) The PCA confidence pcaconf is defined under the PCA. Its denominator is the number of facts that are known to be true plus the facts that are assumed to be false:

pcaconf(B⃗ ⇒ r(x, y)) := supp(B⃗ ⇒ r(x, y)) / #(x, y) : ∃ z1, ..., zm, y' : B⃗ ∧ r(x, y')    (3.4)
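As an illustration of Definitions 3.2.5 to 3.2.8, the following sketch computes the four measures for a toy one-atom rule livesIn(x, y) ⇒ bornIn(x, y) over an assumed toy KB; the rule, the triples and the simplified PCA handling (no existential variables in the body) are illustrative only.

    # Toy KB as (subject, predicate, object) triples; data and rule are assumptions.
    kb = {
        ("Jay",   "livesIn", "Oslo"),
        ("Marry", "livesIn", "Oslo"),
        ("Ann",   "livesIn", "Rome"),
        ("Jay",   "bornIn",  "Oslo"),
        ("Ann",   "bornIn",  "Paris"),
    }

    def pairs(predicate):
        return {(s, o) for s, p, o in kb if p == predicate}

    body, head = pairs("livesIn"), pairs("bornIn")

    support = len(body & head)                  # Definition 3.2.5: predictions found in the KB
    head_coverage = support / len(head)         # formula 3.2
    std_confidence = support / len(body)        # formula 3.3: missing facts count as false
    # PCA denominator (formula 3.4): body pairs whose subject has some known bornIn fact.
    known = {s for s, _ in head}
    pca_confidence = support / len({(s, o) for s, o in body if s in known})

    print(support, head_coverage, std_confidence, pca_confidence)
    # 1  0.5  0.33...  0.5; the initial rule weight used later in this thesis would be
    # (head_coverage + pca_confidence) / 2 = 0.5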

3.2.5 AMIE+

AMIE+ is a fast rule-mining model under the OWA scenario, which is used in this thesis to generate first-order logic rules from RDF-style knowledge bases. AMIE+ is an extended version of AMIE (Association Rule Mining under Incomplete Evidence, 2013)[15]. This part first explains the principle of AMIE and then introduces the advanced aspects of AMIE+.

AMIE (Association Rule Mining under Incomplete Evidence, 2013)[15] is a rule mining model tailored to the OWA scenario. Traditional rule mining systems such as ILP only support the CWA scenario and easily run out of memory. Figure 3.1 shows the mining model of AMIE. There are four types of facts: KBtrue is the set of facts known to be true in the knowledge base; KBfalse is the set of facts known to be false in the knowledge base; NEWtrue is the set of true facts unknown to the knowledge base; NEWfalse is the set of false facts unknown to the knowledge base. The goal is to maximize the area B and minimize the area D. To do so, the areas A and C are used to estimate the unknown areas, but the area C is also unknown, since most knowledge bases don't contain negative examples. A breakthrough of AMIE is to generate negative examples under incomplete evidence.

Figure 3.1: The mining model of AMIE[15]

AMIE generates negative examples under the PCA scenario. It means that if the database knows some r attributes of x, then it knows all r attributes of x. This is certainly true for functional relations such as birthday and capital, and for inverse-functional relations such as owns, created, etc. For other non-functional relations, the PCA is still reasonable for most knowledge bases that have only one source. Under the PCA scenario, AMIE uses PCA confidence as another metric to evaluate rules. A huge search space is a challenge for traditional rule mining systems. To address this, AMIE explores the search space by iteratively extending rules with mining operators:

1. Dangling Atom (OD) - In the added atom, one of the arguments is a variable that is shared with the rule. The other argument is a fresh variable that hasn't appeared in the rule.

2. Instantiated Atom (OI) - In the added atom, one argument is an entity, the other argument is a variable shared with the rule.

3. Closing Atom (OC) - Both arguments of the added atom are shared with the rule.

The algorithm iteratively dequeues a rule from the initial queue. If the rule is closed, the rule is output. Then, AMIE applies all operators to the rule and adds the resulting rules to the queue. The process continues until the queue is empty.
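A much simplified sketch of this refinement loop is shown below; the three operator functions and the threshold check are assumed callbacks (in AMIE they are backed by SPARQL-style counting queries), and the Rule objects are assumed to support len() and is_closed(). This is only an outline of the control flow, not AMIE's actual implementation.

    from collections import deque

    def mine_rules(head_atoms, max_len, passes_thresholds,
                   apply_dangling, apply_instantiated, apply_closing):
        # head_atoms: rules with empty bodies; the operator callbacks return the
        # refinements produced by O_D, O_I and O_C for a given rule.
        queue, output = deque(head_atoms), []
        while queue:
            rule = queue.popleft()
            if rule.is_closed() and passes_thresholds(rule):
                output.append(rule)             # closed rule above the thresholds
            if len(rule) < max_len:
                for operator in (apply_dangling, apply_instantiated, apply_closing):
                    queue.extend(operator(rule))
        return output                           # stops when the queue is empty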

Apart from the previous steps, AMIE+ also presents pruning strategies and approximations that allow the algorithm to explore the search space more efficiently. The detailed algorithm is in Appendix B: Appendix B.1 introduces the whole process of AMIE+ and Appendix B.2 explains the requirements for outputting a rule. The algorithm takes a database G, a head coverage threshold minHC, a maximum rule length maxLen and a minimum confidence threshold minConf as inputs. AMIE+ maintains a queue of rules, which initially contains only head atoms (i.e. rules with empty bodies). If a rule doesn't meet the output requirements, it is refined by a number of operators. The process is repeated until the queue is empty. The SPARQL examples are in Appendix B.3.

Rules should meet several requirements before being output. The algorithm first checks if the rule is closed. Then it calculates the PCA confidence of the rule following formula 3.4. The PCA confidence cannot be lower than the threshold minConf. Besides, it should improve on the confidence of all its parent rules. For example, the confidence of a rule (B1 ∧ ... ∧ Bn ∧ Bn+1 ⇒ H) is supposed to be higher than that of its parent rule (B1 ∧ ... ∧ Bn ⇒ H). For support and head coverage, the child rule will never have a higher value than its parent rule, so there is no need to compare them.

An advanced aspect of AMIE+ is its pruning strategies and approximations, which largely reduce the search space. Apart from minConf, the algorithm also defines other parameters such as minHC and maxLen to skip low-quality rules. However, there is a trade-off between the runtime and the number of rules: if minHC is higher and/or maxLen is lower, the algorithm will finish very quickly, but we will get fewer rules than with the opposite values. This thesis uses the default values, i.e. minHC = 0.01, maxLen = 3, minConf = 0.1. The authors of [14] developed an interactive demo showing how AMIE+ works in each step.

The output of this step is a set of first-order logic rules, each with three scores: head coverage, standard confidence and PCA confidence. In this thesis, we use the average of head coverage and PCA confidence as the initial weight for each rule. The head coverage shows how popular the head atom is and the PCA confidence shows how reliable a rule is. Hence, the average score reflects both the popularity and the reliability of a rule.

3.2.6 Related Work

RDF2Rules

RDF2Rules (2015)[40] is a rule mining approach specifically for RDF knowledge bases. It has three steps: first, RDF2Rules mines frequent predicate cycles in knowledge bases; second, RDF2Rules generates rules from the mined frequent predicate cycles; finally, a pruning technique is used to remove unnecessary rules. A frequent predicate cycle is a path whose support in the given knowledge base is not less than a threshold. Rules are generated only from frequent predicate cycles. With this pruning strategy, RDF2Rules can generate 1237 rules in 3m15s, while AMIE takes more than 2 days to generate only 75 rules. However, RDF2Rules focuses on efficiency and the number of generated rules rather than on the quality of rules.

RuDiK

RuDiK is a system for the discovery of declarative rules over knowledge bases under OWA. RuDiK is able to discover both positive rules such as "if two people have the same boss, they are colleagues" and negative rules such as "if two people are classmates, one cannot be the teacher of the other". Positive rules infer new facts in knowledge bases; negative rules are able to detect erroneous facts. Compared to other rule mining systems, RuDiK enriches the rule language by generating both positive and negative rules. The approximate rules are more robust to errors and incompleteness. Moreover, RuDiK uses disk-based algorithms, improving the efficiency when mining rules from huge knowledge bases.

Ontological Pathfinding

Ontological Pathfinding[10] is an algorithm that tackles the large-scale rule mining problem, focusing on scalability and on a series of parallelization and optimization techniques. The mining algorithm is parallelized by dividing the input knowledge base into smaller groups that run in parallel in memory. The parallel mining algorithm is implemented on cluster computing frameworks to achieve maximum utilization of the available computation resources. The output of this algorithm is the same as AMIE's, but it reduces the runtime.

3.3 Probabilistic Framework

In SRL, propositional logic and first-order logic are used to describe the intricate dependencies between variables. However, knowledge databases are usually noisy and the dependencies do not always hold. Therefore, logical languages can be incorporated into probabilistic frameworks to create models that capture both the structure and the uncertainty. MRFs are a classical class of probabilistic graphical models. An MRF is a distribution that assigns probability using a score function which is a weighted combination of several feature functions called potentials. A notion of an MRF is given in Definition 3.3.1. Logical languages can be used to define these potentials. This section introduces two probabilistic frameworks built on MRFs. It first introduces MLN, a popular representation for SRL which combines Markov networks and first-order logic. Then, it introduces PSL, a probabilistic programming language that makes HL-MRFs easy to define using first-order logic.

Definition 3.3.1 (MRF) Let x = (x1, ..., xn) be a vector of random variables and φ = (φ1, ..., φm) a vector of potentials, where each potential assigns a score to configurations of the variables, and let w = (w1, ..., wm) be a vector of weights. Then an MRF is a probability distribution:

P(x) ∝ exp(wᵀφ(x))    (3.5)

3.3.1 MLN

A classical rule-based probabilistic framework is MLN[36]. It incorporates first-order logic rules into MRFs to learn dependencies and handle uncertainty. It consists of a set of weighted formulas {wi, Fi} written in first-order logic. A higher weight means that the difference between a world that agrees with the formula and one that disagrees with it is larger. The more formulas a world agrees with, the more probable it is. Formally, an MLN defines a probability distribution over possible worlds as follows:

P(X = x) = (1/Z) exp( Σᵢ wᵢ nᵢ(x) )    (3.6)

where x is a possible world, nᵢ(x) is the number of true groundings of the i-th formula in x and wᵢ is its weight.

MLN is different from a regular knowledge graph. In an MLN, each node is a variable (i.e. a possible fact) and each statistical dependency between variables is an edge. If a ground atom is true, its value is 1; otherwise it is 0. Table 3.1 is an example of an MLN. Sm, Ca and Fr are predicates, meaning Smokes, Cancer and Friends. Given the constants A and B, we can generate a ground MLN as in Figure 3.2. The probability of a world can be computed from the number of true groundings of each formula, so P(X = x(A, B)) = (1/Z) exp(1.5 × 2 + 1.1 × 4).
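The sketch below reproduces this count in illustrative Python for an assumed possible world in which every grounding of both formulas in Table 3.1 is satisfied; it only shows how the counts n_i(x) and the unnormalised score are obtained, with Z left out.

    import math

    constants = ["A", "B"]
    # Assumed possible world: truth values of the ground atoms.
    Sm = {"A": True, "B": True}
    Ca = {"A": True, "B": True}
    Fr = {(x, y): True for x in constants for y in constants}

    def n1():  # true groundings of F1: Sm(x) => Ca(x)
        return sum((not Sm[x]) or Ca[x] for x in constants)

    def n2():  # true groundings of F2: Fr(x,y) => (Sm(x) <=> Sm(y))
        return sum((not Fr[(x, y)]) or (Sm[x] == Sm[y])
                   for x in constants for y in constants)

    score = math.exp(1.5 * n1() + 1.1 * n2())   # unnormalised, Z omitted
    print(n1(), n2(), score)                    # 2, 4 and e^(1.5*2 + 1.1*4)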

The weights can be pre-defined or learned using Voted Perceptron, Contrastive Divergence, per-weight learning rates, Diagonal Newton, etc. More technical details can be found in [27].

Challenge MLN has two limitations: i) each node in an MLN must take a Boolean value, making it difficult to express degrees of possibility; ii) the huge number of Boolean variables makes inference an intractable optimization problem.



Table 3.1: An example of MLN

Proposition                          First-Order logic                            Clausal form                    Weight
Smoking causes cancer                F1: ∀x, Sm(x) ⇒ Ca(x)                        ¬Sm(x) ∨ Ca(x)                  1.5
If two people are friends,           F2: ∀x ∀y, Fr(x, y) ⇒ (Sm(x) ⇔ Sm(y))        ¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)      1.1
either both smoke or neither does                                                 ¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)      1.1

Figure 3.2: An example of ground Markov Network

3.3.2 PSL

PSL[24] is a probabilistic programming language written in first-order logic that can easily define HL-MRFs. The next step is to infer the Most Probable Explanation (MPE) based on the HL-MRF using convex optimization software. PSL and its MPE algorithm are open source; the code is available on GitHub (https://github.com/linqs/psl). This thesis uses PSL as a rule-based probabilistic framework to infer new facts.

PSL has four kinds of inputs. The first is a set of closed predicates C, whose atoms are completely observed. The second is a set of open predicates O, whose atoms may be unobserved. The third input is the set A of all ground atoms under consideration; all atoms in A must have their predicate in either C or O. The last input is a set of functions F that map the ground atoms to a value between 0 and 1.

In MLN, logical clauses are interpreted using Boolean values, either 0 or 1. But in the real world, many facts contain uncertainty, so using Boolean values will cause bias. HL-MRFs are scalable probabilistic graphical models designed to model rich and structured data. What makes them different from MLN is that they are defined over continuous values from 0 to 1 instead of Boolean values. The feature functions are hinge functions with constraints, using Lukasiewicz operators[3], which can capture a wide range of useful relationships. Using continuous values enables the model to represent uncertainty, meaning that facts are neither completely true nor completely false. In knowledge reasoning, it's very important to use continuous values, as many facts are not entirely certain, especially those collected from questionable sources.

More specifically, Lukasiewicz logic can be interpreted as in formulas 3.7, 3.8 and 3.9. Figures 3.3a and 3.3b are visualizations of Lukasiewicz logic.

x1 ∧ x2 = max{x1 + x2 − 1, 0}    (3.7)

x1 ∨ x2 = min{x1 + x2, 1}    (3.8)

¬x = 1 − x    (3.9)
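These three operators are straightforward to implement; the following minimal sketch mirrors formulas 3.7 to 3.9.

    def l_and(x1, x2):
        # Lukasiewicz t-norm, formula 3.7
        return max(x1 + x2 - 1.0, 0.0)

    def l_or(x1, x2):
        # Lukasiewicz t-co-norm, formula 3.8
        return min(x1 + x2, 1.0)

    def l_not(x):
        # Lukasiewicz negation, formula 3.9
        return 1.0 - x

    print(l_and(0.9, 0.3), l_or(0.9, 0.3), l_not(0.9))   # 0.2, 1.0, 0.1 (up to rounding)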

Formally, we can define HL-MRFs as follows:

Definition 3.3.2 (HL-MRFs) A HL-MRF P over random variables Y, conditioned on random variables X, is a probability density:

P(Y | X) = (1/Z(λ)) exp[ −f_λ(Y, X) ]    (3.10)

where Z(λ) is a normalization term and f_λ(Y, X) is a constrained hinge-loss feature function:

f_λ(Y, X) = Σ_{j=1}^{m} λ_j φ_j(Y, X)    (3.11)

where λ = (λ1, ..., λm) are weights and φ_j(Y, X) are potentials represented by hinge-loss functions:

φ_j(Y, X) = [ max{ ℓ_j(Y, X), 0 } ]^{p_j}    (3.12)

where ℓ_j is a linear function of Y and X based on Lukasiewicz operators and p_j ∈ {1, 2}.

Figure 3.3: A visualization of Lukasiewicz logic: (a) A ∧ B, (b) A ∨ B

Formally, a PSL program can be defined as Definition 3.3.3[3].

Definition 3.3.3 (PSL) A PSL program is a probabilistic framework containing a set of first-order logic rules, each of which is a template for hinge-loss potentials. When grounded over a base of ground atoms, a PSL program induces a HL-MRF conditioned on any specific observations.

PSL uses the Lukasiewicz t-norm and its corresponding co-norm to compute the degree to which a ground rule is satisfied. For a rule rbody ⇒ rhead, an interpretation I indicates whether the rule is satisfied and, if not, the distance to satisfaction. A rule is satisfied only if I(rbody) ≤ I(rhead). The distance to satisfaction measures the degree to which this condition is violated, and is defined as follows:

d_r(I) = max{0, I(rbody) − I(rhead)}    (3.13)



Example 3.3.1 Assume that we have two rules:

0.3 : friend(B, A) ∧ votesFor(A, P) ⇒ votesFor(B, P)

0.8 : spouse(B, A) ∧ votesFor(A, P) ⇒ votesFor(B, P)

and we have scores for three atoms: spouse(b, a) → 1, votesFor(a, p) → 0.9, votesFor(b, p) → 0.3. For the second rule we get I(rbody) = max{0, 1 + 0.9 − 1} = 0.9, so the distance to satisfaction is d_r(I) = max{0, 0.9 − 0.3} = 0.6. The rule is satisfied if the head value is equal to or greater than 0.9.
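The small sketch below recomputes Example 3.3.1: the body truth value is obtained with the Lukasiewicz t-norm and the distance to satisfaction follows formula 3.13; the atom values are the assumed ones from the example.

    def lukasiewicz_and(values):
        # I(a ∧ b ∧ ...) = max(sum - (n - 1), 0) under the Lukasiewicz t-norm
        return max(sum(values) - (len(values) - 1), 0.0)

    def distance_to_satisfaction(body_values, head_value):
        # formula 3.13: d_r(I) = max{0, I(r_body) - I(r_head)}
        return max(0.0, lukasiewicz_and(body_values) - head_value)

    # spouse(b, a) -> 1, votesFor(a, p) -> 0.9, votesFor(b, p) -> 0.3
    print(distance_to_satisfaction([1.0, 0.9], 0.3))   # 0.6 (up to rounding)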

Given this context, we can rewrite formula 3.10; the probability density in PSL is:

f(I) = (1/Z) exp[ −Σ_{r∈R} λ_r (d_r(I))^p ]    (3.14)

The final step is to infer the MPE[37] based on the HL-MRF. MPE aims at finding the most likely state of the world given a set of observations. Formally, we can define MPE as follows:

Definition 3.3.4 (MPE) Given a probability distribution P(V) over a set of random variables V, the evidence E is a subset of V with known truth values, E ⊆ V. The task of MPE is to determine the truth values u of the remaining variables U ⊆ V with maximal probability, which is:

MPE(E) = argmax_u P(U = u | E = e)    (3.15)

While in PSL, the MPE problem is defined as:

argmax_y P(y | x) = argmin_{y ∈ [0,1]} wᵀφ(y, x)    (3.16)

As φ(y, x) is a potential using hinge-loss functions with continuous values, minimizing wᵀφ(y, x) is a convex optimization problem rather than a combinatorial one. There are many existing solutions for convex optimization. PSL proposes consensus optimization[6], a technique that divides an optimization task into independent subtasks and then iterates to reach a consensus on the optimum. It uses the alternating direction method of multipliers (ADMM)[3] to decompose the problem. In PSL, separate subproblems are created for each rule. More details about ADMM can be found in [3].
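To illustrate why this objective is easy to optimize, the sketch below poses a tiny MPE problem in the form of formula 3.16 (the two rules of Example 3.3.1, with an assumed value for friend(b, a)) as a box-constrained minimisation and hands it to a generic solver; this is not PSL's ADMM-based consensus optimization, only a toy stand-in.

    import numpy as np
    from scipy.optimize import minimize

    # Observed atom values (assumed): spouse(b,a), friend(b,a), votesFor(a,p).
    spouse_ba, friend_ba, votes_ap = 1.0, 0.5, 0.9
    w_friend, w_spouse = 0.3, 0.8                       # rule weights from Example 3.3.1

    def hinge(body, head):
        # Distance to satisfaction of a two-atom-body rule under Lukasiewicz logic.
        return max(0.0, max(sum(body) - 1.0, 0.0) - head)

    def objective(y):
        # y[0] is the unknown truth value of votesFor(b, p).
        return (w_friend * hinge([friend_ba, votes_ap], y[0])
                + w_spouse * hinge([spouse_ba, votes_ap], y[0]))

    res = minimize(objective, x0=np.array([0.5]), bounds=[(0.0, 1.0)])
    print(res.x)   # pushed up to at least 0.9, where both ground rules are satisfied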

3.4 Pilot Test

To get insights into the performance of directly combining AMIE+ and PSL, a pilot test on a sample dataset has been carried out. The goal is to understand how well the generated rules fit PSL by manually adjusting the PSL model from different aspects. This section first introduces the dataset used in the pilot test and then shows the experiments and results. At last, it discusses the observations and gives suggestions for the later optimization strategies.



3.4.1 Data

A Kinship dataset (https://archive.ics.uci.edu/ml/datasets/kinship) from the UCI Machine Learning Repository is used in the pilot test. It is a complete relational database consisting of 24 unique people in two families, where the two families have the same structure. Each family has 12 predicates such as father, mother, uncle, etc. The statistics of the dataset are shown in Table 3.2.

Table 3.2: Statistics of the Kinship dataset

             Entity    Relation    Facts
Family I     12        12          56
Family II    12        12          56

3.4.2 Experimentation

Before designing the experiments, the hypothesis is that the combination of AMIE+ and PSL can perform link prediction well and that the result truly reflects its performance. Each time, only one relation is predicted, so the experiment was run 12 times. First, the pairs with the target relation are separated into training (50%) and testing (50%) sets. Then, two subgraphs are generated based on the training and testing data; Chapter 4 explains the data pre-processing in more detail. After that, the subgraph of the training data is used as the input of AMIE+. The idea is to discover first-order logic rules from the observed data and apply these rules to infer the target relation between entities in the subgraph derived from the testing data. The parameters of AMIE+ are given in Table 3.3. Figure 3.4 shows the number of generated rules for each relation in Kinship. The initial weight of each rule is the average of its head coverage and its PCA confidence.

Figure 3.4: The number of generated rules in Kinship

The first hypothesis derived from Figure 3.4 is that not all of these rules make contributions. As the relations in the Kinship dataset are very straightforward, it's not necessary to apply such a number of rules. Moreover, it's essential to check the rules themselves and analyze the false predictions to find the correlation between the rules and the performance. The second hypothesis is that the weights of rules may play an important role in the experiment.




Table 3.3: Parameters of AMIE+

minHC      0.01
maxLen     3
minConf    0.1


To verify the two hypotheses, two corresponding experiments have been designed. The first experiment intends to change the number of rules and discover its effect on the performance. More specifically, each time the rules whose weights are lower than a threshold are removed. This procedure is repeated for each relation. The second experiment intends to change the weights of rules and discover their influence on the performance.

3.4.3 Discussion

Figure 3.5 shows the performance when changing the number of rules in the Kinship dataset. F-score is used as the metric to evaluate the methods; it considers both the precision and the recall to compute the score. The formula is given in formula 3.17.

F1 = 2 × (precision × recall) / (precision + recall)    (3.17)

In Figure 3.5, the horizontal axis represents the threshold on rule weights: the rules whose weights are lower than the threshold are removed, so the higher the threshold, the higher the quality of the rules left to perform inference. The vertical axis represents the F-score. It can be observed that the quantity of rules has a huge effect on the performance: using fewer but highly confident rules is better than using all the initial rules.

Figure 3.5: The performance of changing the number of rules in Kinship dataset



The main reason for not using all the initial rules is that doing so can lead to high recall but very low precision. For example, the target relation is father and its initial rules contain:

daughter(B, A) ⇒ father(A, B)    0.58

mother(E, B) ∧ wife(E, A) ⇒ father(A, B)    0.87

The first rule makes sense but it is not necessarily valid, because A and B can also have the relation mother. As its weight is positive, PSL will take it as a valid rule and do inference with it. Therefore, the takeaway message from this experiment is that the quantity of rules does have an effect on the performance, and using all the generated rules may mislead the probabilistic framework to some extent.

Figure 3.6 shows the performance when changing the weights of rules in the Kinship dataset. The horizontal axis represents the relations and the vertical axis represents the F-score. In this experiment, the quantity of rules is not changed but the weights are randomly assigned. Compared to the previous experiment, changing the weights has an effect on the performance, but not as large a one: the performance doesn't change much after randomly assigning weights. A conclusion from the two experiments is that it's worth developing automatic rule optimization strategies to improve the quality of rules, and that the quantity and the weights are two important factors.

Figure 3.6: The performance of changing the weights of rules in Kinship dataset


Chapter 4

Advanced Rule-based Probabilistic Framework

It has been discussed in Chapter 3 that automatic rule optimization strategies should be developed to improve the quality of rules, so that the rules fit PSL better and lead to better inference. This chapter, together with the next one, describes how this is implemented and evaluated. Section 4.1 gives an overview of the architecture of the system. Section 4.2 explains the preprocessing steps needed when the data is not in the correct format. Section 4.3 describes how to prepare training and testing data. Section 4.4 gives examples of the internal procedures of AMIE+. Section 4.5 introduces the proposed rule optimization strategies. As the inference part is the same as in the pilot test, the principles of MPE are not discussed again; more details about the inference are in Section 3.3.

4.1 Architecture

Figure 4.1 shows the architecture of the system. The system has four individual components. Data preparation prepares training and testing data according to the target relation R; this thesis uses quasi-local information of the node pairs of interest, so each target relation has its own training and testing sets. The rule mining model is used to automatically generate first-order logic rules directly from the training set; the output is a set of rules with three scores: head coverage, standard confidence and PCA confidence. The rule optimization model learns these rules, assigns new weights and then selects a high-quality subset of rules. The inference model is the final step, which incorporates the optimized rules into PSL to assign a probability to each possible fact.

4.2 Preprocessing

This thesis focuses on RDF-style knowledge bases, where each fact is represented as a triple r(e1, e2). In order to be computable, we use the tab-separated values (tsv) format to represent triples. This thesis only uses triples representing facts (i.e. A-Box triples); triples which represent classes, regions and domains of relations (i.e. T-Box triples) are not considered. The complexity of the preprocessing depends on the type of database.




Figure 4.1: An overview of the system showing all the individual components and their outputs


4.3 Data Preparation

The method uses quasi-local information of the node pairs of interest instead of their global information. The node pairs of interest are those connected by the target relation R. The idea is to mine cycles around these node pairs. A cycle in an RDF graph is a special path, consisting of connected individual paths, that starts and ends at the same node. In this thesis, we use cycles that contain no more than 3 individual paths. Figure 4.2 shows an example of a path, Figure 4.4a an example of a 2-hop cycle and Figure 4.4b an example of a 3-hop cycle.

There are three reasons for using cycles: i) since the node pairs of interest represent facts in the real world, the surrounding cycles should contain the most important and most related information; ii) the information contained in cycles is able to generate declarative rules; iii) for later steps, especially the rule mining model, it's more efficient to take a subgraph as input.

The detailed algorithm is in Appendix A. It takes a database G and a target relation R as inputs. The algorithm first selects all the node pairs having relation R. Then it splits the pairs into a training set (70%) and a testing set (30%). For each node pair, the algorithm finds two-hop cycles and three-hop cycles separately and adds these cycles to the corresponding set. Later, the rule mining model will generate rules from the training set and the inference model will work on the testing data. Note that the ground-truth data in the testing set should be removed before using the inference model. Therefore, the output is a training set and a testing set together with their surrounding cycles.
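A simplified sketch of this cycle collection is shown below; it treats the KB as an undirected graph (using networkx) and closes each short path between the node pair with the target edge, which is the spirit of the thesis' procedure rather than the exact algorithm of Appendix A.

    import networkx as nx

    def quasi_local_cycles(triples, x, y, max_path_len=2):
        # Every simple path of length <= max_path_len between x and y, closed by the
        # target edge (x, y), forms a 2-hop or 3-hop cycle around the node pair.
        g = nx.Graph()
        for s, p, o in triples:
            g.add_edge(s, o, predicate=p)
        return list(nx.all_simple_paths(g, x, y, cutoff=max_path_len))

    triples = [("Swedish", "language_of_country", "Sweden"),
               ("Swedish", "language_school_in_city", "Stockholm"),
               ("Stockholm", "city_located_in_country", "Sweden")]
    print(quasi_local_cycles(triples, "Stockholm", "Sweden"))
    # e.g. [['Stockholm', 'Sweden'], ['Stockholm', 'Swedish', 'Sweden']]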

Figure 4.2: A path

Figure 4.3: An example of path

Figure 4.4: Examples of cycles: (a) a 2-hop cycle, (b) a 3-hop cycle

4.4 Rule Mining Model

AMIE+ is used in this step to mine first-order logic rules from the training set. The principles of AMIE+ have been discussed in Chapter 3. This section gives examples that show more internal details. Example 4.4.1 shows the expected input and output of AMIE+ and how they can predict facts. Example 4.4.2 shows the internal procedure of AMIE+.

Example 4.4.1 In the training set, we have triples such as:

< Swedish, language_o f _country, Sweden >

< Mandarin, language_o f _country, China >

< Swedish, language_school_in_city, Stockholm >

< Mandarin, language_school_in_city, Shanghai >

< Shanghai, city_located_in_country, China >

< Stockholm, city_located_in_country, Sweden >

......

The rule mining model is able to generate a rule from these triples, which is

language_of_country(F, B) ∧ language_school_in_city(F, A) ⇒ city_located_in_country(A, B)    weight: 0.75

where B, F and A are variables that can be instantiated by any constant. The weight shows its confidence. Therefore, if the testing data contains triples like:

< Japanese, language_o f _country, Japan >

< Japanese, language_school_in_city, Tokyo >

We can predict the triple < Tokyo, city_located_in_country, Japan> according to the rule.
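The sketch below applies the mined rule of Example 4.4.1 to the two testing triples; the join is hard-coded for this particular rule shape and is only meant to illustrate how a first-order rule turns observed triples into a predicted fact.

    def apply_rule(triples):
        # language_of_country(F, B) ∧ language_school_in_city(F, A)
        #   ⇒ city_located_in_country(A, B), hard-coded for this rule shape.
        country_of = {s: o for s, p, o in triples if p == "language_of_country"}
        city_of    = {s: o for s, p, o in triples if p == "language_school_in_city"}
        return {(city_of[f], "city_located_in_country", country_of[f])
                for f in country_of.keys() & city_of.keys()}

    test = [("Japanese", "language_of_country", "Japan"),
            ("Japanese", "language_school_in_city", "Tokyo")]
    print(apply_rule(test))   # {('Tokyo', 'city_located_in_country', 'Japan')}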



Example 4.4.2 Assume that we want to mine rules whose head predicate is livesIn. Initially, each rule only contains the head atom livesIn. The algorithm then applies different operators to the head atom, adding either a dangling atom, an instantiated atom or a closing atom. So there are many options, such as livesIn(A, B), livesIn(A, England), etc. Each option can expand to more options. Considering the pruning strategy and the confidence threshold, if a closing atom is added to the rule, the rule is output. At last, a set of rules aiming at predicting livesIn is produced, such as:

livesIn(A, C) ∧ livesIn(United_Kingdom, C) ⇒ livesIn(A, England)

isMarriedTo(A, F) ∧ livesIn(F, B) ⇒ livesIn(A, B)

An interactive demo that shows the internal procedures of AMIE+ is available online (http://luisgalarraga.de:9080/AMIEDemo/).

4.5 Rule Optimization Model

Rule optimization is a step that evaluates the quality of the rules generated by AMIE+ and, at the same time, assigns new weights and/or uses pruning strategies to remove low-quality rules and reduce runtime. The reason for doing rule optimization is that in the rule mining model it's hard to precisely tune the parameters of AMIE+. In order not to miss any valuable rules, we use the same default parameters for every relation (minHC = 0.01, maxLen = 3, minConf = 0.1). So the rule mining model will generate all the reasonable rules for each relation. However, not all the reasonable rules work well on other data. For example, we discovered two rules from the training set saying that:

city_located_in_country(A, B) ⇒ country_cities(B, A)

subpart_of(A, B) ⇒ country_cities(B, A)

It’s easy to see that the first rule is certainly true, but for the second rule, it makes sensebut is not always tenable. This thesis uses Maximum Likelihood Estimation (MLE)technique[3] to maximize the likelihood distribution through updating the weights andthe quantity of rules. Two kinds of rule optimization strategies are proposed: fixed-threshold method and rule-learning method. Fixed-threshold method is similar to thefirst experimentation in the pilot test where only the quantity of rules is changed. Rule-learning method uses MLE to update both the weights and the quantity of rules. Thereare two specific approaches which are explained in this section.

4.5.1 Fixed-threshold

The fixed-threshold method is a manual way to get a high-quality subset of rules. Since it's unknown which subset of rules performs best, a naive way is to manually remove the x% least confident rules each time and then use the rest of the rules to do inference. The model will run 100/x times in total. This method can help to track how well rules fit the training data, but is not feasible when given a huge number of initial rules.
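The procedure can be summarised by the following sketch, where run_inference is an assumed callback that grounds the remaining rules in PSL and returns an F-score; the 10% step and the sorting by weight follow the description above.

    def fixed_threshold_search(rules, run_inference, drop_fraction=0.10):
        # rules: list of (rule, weight) pairs; each round drops the x% least
        # confident rules (x% of the initial set) and re-runs inference.
        step = max(1, int(len(rules) * drop_fraction))
        remaining = sorted(rules, key=lambda rw: rw[1])   # least confident first
        results = []
        while remaining:
            results.append((len(remaining), run_inference(remaining)))
            remaining = remaining[step:]                  # remove the least confident block
        return results   # roughly 100/x (rule count, F-score) pairs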




4.5.2 Rule-learning

The rule-learning method is an intelligent way to automatically get a high-quality subset of rules. Instead of running 100/x times, it only needs to run once, with several fast iterations. This thesis developed two different rule-learning methods: forward optimization and backward optimization. In forward optimization, the algorithm initially takes all the generated rules and learns their weights by maximizing the log-likelihood of the training data; in each iteration, rules whose weights fall below a threshold are removed, and the process is repeated until the loss converges. In backward optimization, the algorithm initially takes the highest-weighted rule; in each iteration, the next highest-weighted rule is added, and the process is repeated until the loss is lower than a threshold. The detailed algorithms are in Appendix C.1 and C.2.

Maximum Likelihood Estimation The key technique used to update the weights is to maximize the log-likelihood of the training data. The derivative of the log-likelihood with respect to a weight parameter is:

g = −∂ log P(y|x) / ∂W_q = E_W[Φ_q(y, x)] − Φ_q(y, x)    (4.1)

where x is the state of the evidence and y is the state of the non-evidence, E_W[Φ_q(y, x)] is the expected incompatibility and Φ_q(y, x) is the truth incompatibility. The voted perceptron[27] is used to approximate the expectations with the counts in the MPE state. At last, the weight is updated as:

w_{t+1} = w_t − η g    (4.2)

where η = 1/(iter + 1) is the step rate. The new weight is max(weight, 0.0).

Algorithm (forward) First, we set some parameters for the rule-learning process, including the minimum rule weight minWeight, the regularization terms l2 and l1, and the loss threshold tolerance. Then, we compute the truth incompatibility, which is the sum of each ground kernel's distance to satisfaction; the value of each atom is a Boolean value showing its correctness, and the distance is calculated following formula 3.13. After that, we start iterating. In each iteration, the first step is to compute the expected incompatibility based on the current rules; the values of body atoms are still Boolean, but the values of head atoms are computed based on the rule weights. The difference between the expected incompatibility and the truth incompatibility is divided by a scaling factor, multiplied by the step rate and added to the weight. The new weight must be a nonnegative value, otherwise it is set to zero. Finally, if a rule is assigned a weight lower than 0.05, it is removed. The iteration is repeated until the loss converges.
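A simplified sketch of this forward loop is given below. truth_incomp and expected_incomp are assumed inputs/callbacks (in the real system they come from the training data and from MPE inference respectively), and the update mirrors formulas 4.1 and 4.2 followed by the pruning step; it is an outline, not the PSL implementation.

    def forward_optimization(rules, weights, truth_incomp, expected_incomp,
                             min_weight=0.05, tolerance=1e-3, max_iters=50):
        # rules, weights and truth_incomp are parallel lists; expected_incomp(rules,
        # weights) returns the expected incompatibility of each rule under MPE.
        prev_loss = float("inf")
        for it in range(max_iters):
            step = 1.0 / (it + 1)                                          # formula 4.2: eta
            grads = [e - t for e, t in
                     zip(expected_incomp(rules, weights), truth_incomp)]   # formula 4.1
            weights = [max(w - step * g, 0.0) for w, g in zip(weights, grads)]
            keep = [i for i, w in enumerate(weights) if w >= min_weight]
            rules = [rules[i] for i in keep]                               # prune weak rules
            weights = [weights[i] for i in keep]
            truth_incomp = [truth_incomp[i] for i in keep]
            loss = sum(abs(g) for g in grads)
            if abs(prev_loss - loss) < tolerance:                          # loss has converged
                break
            prev_loss = loss
        return rules, weights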

Algorithm (backward) The inputs of backward optimization are minLoss and the regularization terms l2 and l1. Similar to forward optimization, we calculate the loss, which is the difference between the expected incompatibility and the truth incompatibility. In the first iteration, we take the highest-weighted rule. If the loss is above the threshold, we add the second highest-weighted rule in the next iteration. The iteration is repeated until the loss converges.

Page 44: Exploring declarative rule-based probabilistic frameworks ...1119117/FULLTEXT01.pdf · Alexandra, Omar, Marci, Sonja, David, Benny and Hannes, for giving me suggestions and supports

Chapter 5

Evaluation

This chapter shows the performance of the system on real-world datasets. All three methods (the base-line method, the fixed-threshold method and the rule-learning method) are evaluated on two different datasets: the NELL dataset, from a Never-Ending Language Learning project that extracts facts from web pages and stores them together with their confidence; and the Freebase15k dataset, a subset of a huge database structuring general human knowledge by gathering information from relatively structured knowledge repositories (e.g. Wikipedia). The statistics of the data used are shown in Table 5.1. The goal is to assess the utility of the system, consisting of a rule mining model, a rule optimization model and an inference model, at inferring the existence of the target relation R in a dataset. This chapter first introduces the evaluation methodology and then shows the performance on the two datasets.

Table 5.1: Statistics of the data used in experimentations

                                  NELL    Freebase15k
Entities                          53K     14K
Relations                         519     1345
Facts                             203K    592K
Avg. instances/relation           391     440
Relations to be tested            10      7
Avg. instances/tested relation    814     692

5.1 Evaluation

The real performance of link prediction approaches is always hard to establish, because evaluating link prediction is a big challenge. Link prediction problems fall into two categories: predicting the links that will be added in the future based on the current graph, and inferring missing links from the current graph. The first class concerns a dynamic network which takes the time dimension into consideration; the second class concerns a static network. This thesis focuses only on the second class, so the evaluation is suited for static networks.




Training and testing data Properly splitting training and testing data is the first challenge for performing an unbiased evaluation. Link prediction should be evaluated with complete and unsampled testing data. Training data should cover more of the data so that it can capture the general information, but on the other hand the testing data cannot be too small because it may then be too biased. Therefore, we do 75/25 data splits, meaning that 75% of the data is training data and the rest is testing data. As discussed in Chapter 4.3, cycles around the node pairs of interest are used to capture their quasi-local information.

Precision and recall curve The precision-recall curve (Figure 5.1a) is the best way to evaluate the performance of link prediction approaches[43]. It shows precision with respect to recall at different thresholds. It is better than a fixed threshold or the ROC curve (Figure 5.1b). A fixed threshold produces an unfair comparison between two approaches. ROC curves are suitable for imbalanced small data because the curve shows an expected performance, but for link prediction, where the imbalance ratio is lower, the ROC curve may be deceptive. For example, the point (0.05, 0.99) on a ROC curve means that the recall is 0.99 and the false positive rate is 0.05. It seems to be a good model when using the ROC curve, but the precision is unknown and can be very low, meaning that it's actually a bad model. Therefore, ROC curves cannot precisely depict the performance of link prediction approaches; instead, the PR curve is used. In this thesis, since there are too few points for a PR curve, the F-score is used instead.

Figure 5.1: A visualization of the PR curve (a) and the ROC curve (b)

Average Precision This work is compared with a state-of-the-art graph-based model for link prediction called VSP[19]. In that paper, average precision is used as the metric to evaluate the algorithm. Average precision considers the order in which the returned facts are presented: it computes the average value of precision over the interval from recall 0 to recall 1.0.
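For reference, average precision can be computed directly from the ranked predictions, for example with scikit-learn; the labels and scores below are assumed toy values, not results from this thesis.

    from sklearn.metrics import average_precision_score

    # 1 marks a true link in the testing data; the scores are the probabilities
    # assigned to the corresponding candidate facts.
    y_true  = [1, 0, 1, 1, 0, 0]
    y_score = [0.95, 0.80, 0.75, 0.40, 0.30, 0.10]
    print(average_precision_score(y_true, y_score))   # mean precision over the recall range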

5.2 NELL

This section shows the performance on the NELL dataset. First, we introduce what NELL is and the training and testing data used in the experiments. Then, we analyze the first-order logic rules generated by AMIE+. Finally, we show the performance of the different methods.

5.2.1 Data

NELL (Never-Ending Language Learning)[32] is a dataset iteratively generated by a never-ending machine learning system. The system has the ability to extract structured information from unstructured web pages. In each iteration, NELL generates a set of candidate facts based on facts learned in previous iterations and facts extracted from web pages. NELL promotes those candidate facts that have high confidence; promoted facts are regarded as true, and the other candidate facts are learned in the next iteration. So far, NELL has finished 1069 iterations (http://rtw.ml.cmu.edu/rtw/) and accumulated more than 50 million candidate facts.

In this thesis, we used the same NELL dataset (http://rtw.ml.cmu.edu/emnlp2014_vector_space_pra/) as the paper[19], which solves link prediction problems using a vector space similarity feature in random walks. It contains 10 relations, hand-selected as the relations with the largest number of known instances that had a reasonable precision. From Table 5.1, we can see that the average number of facts of each selected relation is much higher than the general average.

Data preparation For each relation, we selected all the node pairs that have this relation and split them into 80% training and 20% testing. Next, as described in Section 4.3, we collected cycles within 3 hops around the node pairs of interest. The detailed statistics of training and testing data are in Table 5.2.

Table 5.2: Statistics of training and testing data in NELL

Relation                             Pairs    Training    Testing    Training (inc. cycles)    Testing (inc. cycles)
river_flows_through_city             997      797         200        1665                      448
sports_team_position_for_sport       102      81          21         462                       78
city_located_in_country              495      369         99         1673                      353
athlete_plays_for_team               1347     1077        270        4936                      1372
writer_wrote_book                    1835     1468        367        3798                      968
actor_starred_in_movie               537      429         108        860                       216
journalist_writes_for_publication    1145     916         229        2497                      603
stadium_located_in_city              427      341         86         873                       234
state_has_lake                       103      82          21         168                       42
team_plays_in_league                 1151     920         231        7774                      2365

5.2.2 Rules

The second column in Table 5.3 shows the number of rules we got for each relation. For the relations sports_team_position_for_sport and state_has_lake, we couldn't get any specific rules. The reason is that there are not enough paths in the training set, so AMIE+ couldn't discover any rules with scores above the thresholds. The parameter thresholds we used are in Table 3.3. From Table 5.2 and Table 5.3, we can observe a tendency that more paths in the training set produce more rules. In this thesis, we only take the rules whose heads are the target relations for rule optimization and inference, because predicting any other relation would produce a lot of noisy and uncertain information. However, smartly predicting other useful relations could be future work of this thesis. Examples of selected rules can be found in Appendix D.1.

In the NELL dataset, we discovered many perfect rules. A perfect rule is a rule whose initial weight is 1.0: as long as the body atoms are satisfied, the head atom must be satisfied as well. Such perfect rules contribute a lot to the inference and also reduce the runtime.

Table 5.3: The number of generated rules from NELL

Relation                             No. of rules    No. of selected rules
river_flows_through_city             18              9
sports_team_position_for_sport       0               0
city_located_in_country              148             74
athlete_plays_for_team               244             22
writer_wrote_book                    12              3
actor_starred_in_movie               2               1
journalist_writes_for_publication    52              10
stadium_located_in_city              18              9
state_has_lake                       0               0
team_plays_in_league                 378             59

5.2.3 Rule optimization & Inference

This part describes how we do rule optimization using the two different strategies and shows their inference results. Besides, we compare the results with VSP[19]. The base-line approach means that PSL uses all the initial rules to do inference, without any rule optimization.

Fixed-threshold Among these 10 relations, three have more than 20 rules: city_located_in_country, athlete_plays_for_team and team_plays_in_league. For these, each time we remove the 10% least confident rules. For the others, each time we remove the single least confident rule. After that, PSL assigns a new weight to each rule. In this method, we manually change the quantity of rules and automatically change the quality of rules. The detailed results are in Appendix D.2.

Rule-learning The rule-learning method automatically changes both the quantity and the quality of rules. It is a one-time run which contains several iterations. In each iteration, PSL gets a new subset of rules by either removing low-quality rules or adding the next highly confident rule. The detailed results are in Appendix D.2.

5.2.4 Result analysis

Appendix D.2 shows the results of the three proposed methods and the best result from the VSP paper on the NELL dataset. Among all 10 relations, this thesis outperformed VSP on 7 relations. For the relations state_has_lake and sports_team_position_for_sport, we got 0 with all the methods, because AMIE+ wasn't able to generate any first-order logic rules from the training data. But in general, we can see a huge improvement compared with the VSP paper.

Base-line approach In the base-line approach, we didn't use any rule optimization strategy. Still, we can see that for certain relations the F-score reaches 1.0. This result has two reasons: i) from Table 5.3, it can be observed that those relations generated very limited rules, which means that the structure of the training data is not very rich, so the patterns are very general; ii) AMIE+ found some perfect rules with a high weight. For instance, for the relation writer_wrote_book, AMIE+ found a perfect rule saying that:

bookwriter(A, B) ⇒ writerwrote(B, A)    1.0

It means that every node pair with the relation bookwriter must also have the relation writerwrote. With such perfect rules, it's very easy to do correct inference, as long as the testing data has complete information.

Fixed-threshold approach In the fixed-threshold approach, we removed the 10% least confident rules each time. If the number of rules is very small, the algorithm will run fewer than 10 times. For instance, for the relation actor_starred_in_movie, AMIE+ only generated one rule saying that:

movie_star_actor(A, B) ⇒ actor_starred_in_movie(B, A)    1.0

So the algorithm only runs once. Since VSP uses precision as the metric, we also use precision. In general, the performance of the fixed-threshold method is equal to or greater than that of VSP and the base-line approach.

Rule-learning approach In the rule-learning approach, we use either forward optimization or backward optimization. It turns out that backward optimization works well when perfect rules are discovered, because perfect rules usually generate less loss, making the process stop faster. For NELL, the rule-learning approach performs equal to or better than the other approaches with less runtime.

5.3 Freebase15k

This section shows the performance on the Freebase15k dataset. The section follows the same structure as the NELL section.



5.3.1 Data

Freebase15k (https://www.microsoft.com/en-us/download/details.aspx?id=52312) is a subset of Freebase, a very large database of generic facts containing more than 1.3 billion triples and 90 million entities. Freebase is an online collection of structured data collected from many sources, including individual contributors and semi-structured sources like Wikipedia. Google's Knowledge Graph has been powered in part by Freebase since 2010.

Freebase15k was first introduced in 2013[5]. The goal was to make a smaller dataset for experimentation. The subset contains the entities that are also present in the Wikilinks database (https://code.google.com/archive/p/wiki-links/) and have at least 100 mentions in Freebase. Moreover, relationships like !/people/person/nationality, which are just the inverse of the relation /people/person/nationality, were removed as well. Table 5.1 shows the statistics of Freebase15k. In this thesis, we semi-randomly selected 7 relations.

Data preparation The data preparation process is the same as for the NELL dataset. The detailed statistics of training and testing data are in Table 5.4.

Table 5.4: Statistics of training and testing data in Freebase15k

Relation                  Pairs    Training    Testing    Training (inc. cycles)    Testing (inc. cycles)
film_director_film        1040     780         260        17k                       6920
tv_tv_program_language    313      234         79         2871                      1118
people_place_of_death     859      644         215        4977                      1957
music_genre               868      651         217        11k                       5673
sports_team_sport         523      392         131        4933                      1844
language_spoken           409      306         103        2653                      1229
sport_team_color          833      624         209        1057                      517

5.3.2 Rules

Table 5.5 shows the number of rules for each relation. Since Freebase15k is much richer than NELL, we got many more rules. Similar to NELL, we only take the rules whose heads are the target relations for rule optimization. Due to the incompleteness of Freebase, AMIE+ couldn't generate as many perfect rules as for NELL.

5.3.3 Result analysis

Appendix D.2 shows the results on Freebase15k. We used F-score as the metric. Among all 7 relations, the rule-learning method outperformed the others on 5 relations; for the other 2 relations, the distance to the best F-score is within 0.05. The runtime depends on the size of the relation: the largest relation in this thesis is music_genre, which takes less than 5 minutes.




Table 5.5: The number of generated rules from freebase15k

Relation                  No. of rules    No. of selected rules
film_director_film        20k             385
tv_tv_program_language    2984            106
people_place_of_death     6538            103
music_genre               369             25
sports_team_sport         3400            149
language_spoken           3623            152
sport_team_color          24              9

The average F-score of the base-line approach is 0.624. The average F-score of the rule-learning approach is 0.709. We used either forward optimization or backward optimization to learn rules, depending on the number of perfect rules: if perfect rules exist for the relation, we use backward optimization, otherwise forward optimization. Similar to NELL, backward optimization works well when perfect rules exist for the relation.


Chapter 6

Discussion and Conclusion

This chapter discusses different aspects of this thesis, including the results, the challenges and the future work.

6.1 Results

It can be observed that all three methods can solve link prediction problems in knowledge graphs. The rule-learning method outperforms the other proposed approaches in terms of F-score and runtime in most cases. Compared to graphical models and embedding models, rule-mining models are more tractable: by using first-order logic rules, we can easily understand the model and add expertise if needed. Moreover, HL-MRFs are a powerful probabilistic framework to capture the structure and uncertainty of the graph with continuous values from 0 to 1. HL-MRFs make MPE a convex problem which can be solved very efficiently.

One contribution of this thesis is to combine a rule-mining model with PSL. It dramatically reduces manual work, because previously people had to manually define rules for each relation in PSL. The rule-mining model targets RDF-style databases and produces several metrics to evaluate rules. To make it more efficient, quasi-local information is used. It turns out that this combination is successful in terms of efficiency and quality. In this thesis, the rule-mining model produces rules in less than 10 seconds. In the base-line approach on NELL and Freebase15k, we can see that without any rule optimization strategy the final F-score can reach 0.754 and 0.624, which to some extent proves the quality of the generated rules.

The other contribution is to apply different rule optimization strategies in PSL. The goal is to shrink the initial rules to a smaller, higher-quality subset. The fixed-threshold approach is a manual way to shrink rules; the two rule-learning methods shrink rules more intelligently by using MLE. It turns out that it is possible to get the best performance with the fixed-threshold approach, but the process is very time-consuming. Forward optimization and backward optimization work best in most cases. They can be applied under different scenarios: it's better to use backward optimization if perfect rules are discovered in the training data, otherwise it's better to use forward optimization.

6.2 Challenges

This thesis has the following challenges:

1. Missing truth data

Most current knowledge bases like NELL and Freebase are incomplete, so it may happen that a prediction we get is correct in the real world but does not exist in the testing data. In several research papers, researchers used Amazon Mechanical Turk to manually check the correctness of the predictions which are not in the testing data. This thesis doesn't use any manual work to check the predictions. Especially in Freebase15k, the entities are converted into unique ids, making it hard to understand their meanings. In order to avoid this problem as much as possible, the NELL dataset is taken from a state-of-the-art paper and Freebase15k is deemed a high-quality subset of Freebase.

2. Limited selected relations

Both NELL and Freebase15k have hundreds of relations. In this thesis, we don't perform link prediction on all the relations. To avoid selecting biased relations, we semi-randomly selected relations which have more instances than average and are more meaningful in the real world.

6.3 Future Work

This thesis can be extended and improved in different directions. This section introduces some possible future work.

6.3.1 Generating negative rules

This thesis only uses positive rules to perform link prediction, but it would be interesting if negative rules could be used as well. Not much work has focused on producing negative rules. Negative rules enlarge the expressive power of the rule language and give a wider coverage of the facts in the knowledge base. They also help to discover errors in the knowledge base. For example, a negative rule could look like:

spouse(A, B) ⇒ ¬ daughter(A, B)

6.3.2 Predicting multiple relations

In this work, we only take rules whose heads are target relations as initial rules, because predicting other relations would bring in noisy data. However, it is worth investigating how to leverage other relations to predict the target relations. For example, we found a rule saying that:

country_cities(A, B) ⇒ city_located_in_country(B, A)    1.0

and another rule saying that:

city_located_in_geopolitical_location(B, A) ⇒ country_cities(A, B)    0.5

So far, we cannot use the second rule. However, if we manage to predict more correct pairs for country_cities, there will be more chances to predict correct node pairs for the target relation city_located_in_country.

6.3.3 Alternative rule optimization strategy

There are also alternative ways to do rule optimization. One of them is to improve the backward optimization by using a rule cut: the model first takes the highest-weighted rule and checks how well it covers the training data; it then takes the second highest-weighted rule and checks how well it covers the training data that is not covered by the first rule, and so on. The process is repeated until a certain amount of the facts is covered; a sketch is given below. Another option could be to use the weight learning methods mentioned in [27], such as Diagonal Newton and per-weight learning rates.
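
A minimal sketch of this rule-cut idea follows (a toy illustration only: each rule is represented by the set of training facts it derives, which in the real system would come from grounding the rule against the knowledge base):

# Greedily add rules in order of decreasing weight, each time counting only the
# training facts not yet covered, until enough of the facts are covered.
def rule_cut(rules, training_facts, target_coverage=0.9):
    """rules: list of (weight, name, derived_facts) triples, derived_facts a set."""
    selected = []
    uncovered = set(training_facts)
    for weight, name, derived in sorted(rules, key=lambda r: r[0], reverse=True):
        newly_covered = uncovered & derived
        if newly_covered:
            selected.append(name)
            uncovered -= newly_covered
        if 1.0 - len(uncovered) / len(training_facts) >= target_coverage:
            break
    return selected

facts = {"f1", "f2", "f3", "f4", "f5"}
rules = [
    (0.9, "r1", {"f1", "f2", "f3"}),
    (0.7, "r2", {"f2", "f3"}),        # adds nothing new after r1
    (0.5, "r3", {"f4", "f5"}),
]
print(rule_cut(rules, facts))  # ['r1', 'r3']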

6.3.4 Applying on noisy data

As described in Chapter 2, one of the advantages of such a statistical relational learning model is that it can correct wrong information in knowledge bases. In this thesis, we assume that the data we obtain are all correct, so there are no uncertain facts in the databases. However, to show the ability of PSL to detect wrong information, we can remove some of the facts and add noisy data; a sketch of such an experiment follows below. The goal is to see whether PSL can flag the questionable facts based on the rules and the graph structure.
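
A small sketch of such a noise-injection experiment could look as follows (illustrative only; the triple layout and function names are assumptions, not the thesis code): drop a fraction of the true facts, add the same number of corrupted triples, and then check whether inference assigns the corrupted triples low truth values.

import random

def inject_noise(facts, entities, relation, noise_ratio=0.1, seed=42):
    """facts: set of (head, relation, tail) triples for one relation."""
    rng = random.Random(seed)
    fact_list = list(facts)
    n_noise = max(1, int(noise_ratio * len(fact_list)))
    rng.shuffle(fact_list)
    removed, kept = fact_list[:n_noise], fact_list[n_noise:]
    corrupted = set()
    while len(corrupted) < n_noise:
        triple = (rng.choice(entities), relation, rng.choice(entities))
        if triple not in facts:  # only add triples that are not known to be true
            corrupted.add(triple)
    # The noisy KB mixes the surviving true facts with the corrupted ones.
    return set(kept) | corrupted, set(removed), corrupted

entities = ["stockholm", "sweden", "paris", "france"]
facts = {("stockholm", "city_located_in_country", "sweden"),
         ("paris", "city_located_in_country", "france")}
noisy_kb, removed, added = inject_noise(facts, entities, "city_located_in_country", 0.5)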

6.4 Conclusion

This thesis focuses on solving link prediction problems in knowledge graphs by using rule-based probabilistic frameworks. The system has several components: generating training and testing data using quasi-local information, using AMIE+ to produce first-order logic rules from the training data, applying two different rule optimization strategies to obtain a high-quality subset of rules, and finally performing inference on the testing data. As of May 2017, this is the first time that a rule-mining model has been incorporated into PSL, a powerful rule-based framework built on HL-MRFs. Another contribution is performing rule optimization inside PSL. It turns out that the proposed methods perform link prediction well on different datasets. Among the three methods, the rule-learning optimization strategies outperform the others in most cases.


Bibliography

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. “Mining association rules between sets of items in large databases”. In: ACM SIGMOD Record. Vol. 22. 2. ACM, 1993, pp. 207–216.

[2] Sören Auer et al. “DBpedia: A nucleus for a web of open data”. In: The Semantic Web (2007), pp. 722–735.

[3] Stephen H Bach et al. “Hinge-loss Markov random fields and probabilistic soft logic”. In: arXiv preprint arXiv:1505.04406 (2015).

[4] Kurt Bollacker et al. “Freebase: a collaboratively created graph database for structuring human knowledge”. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1247–1250.

[5] Antoine Bordes et al. “Translating embeddings for modeling multi-relational data”. In: Advances in neural information processing systems. 2013, pp. 2787–2795.

[6] Stephen Boyd et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers”. In: Foundations and Trends in Machine Learning 3.1 (2011), pp. 1–122.

[7] Ronald J Brachman and James G Schmolze. “An overview of the KL-ONE knowledge representation system”. In: Cognitive Science 9.2 (1985), pp. 171–216.

[8] Sasa Buvac and Ian A Mason. “Propositional logic of context”. In: AAAI. 1993, pp. 412–419.

[9] Andrew Carlson et al. “Toward an Architecture for Never-Ending Language Learning.” In: AAAI. Vol. 5. 2010, p. 3.

[10] Yang Chen et al. “Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases”. In: (2016).

[11] William Clocksin and Christopher S Mellish. Programming in PROLOG. Springer Science & Business Media, 2003.

[12] Xin Dong et al. “Knowledge vault: A web-scale approach to probabilistic knowledge fusion”. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 601–610.

[13] Xin Luna Dong et al. “From data fusion to knowledge fusion”. In: Proceedings of the VLDB Endowment 7.10 (2014), pp. 881–892.

[14] Luis Galárraga. “Interactive Rule Mining in Knowledge Bases”. In: 31ème Conférence sur la Gestion de Données (BDA 2015), Île de Porquerolles. 2015.


[15] Luis Antonio Galárraga et al. “AMIE: association rule mining under incomplete evidence in ontological knowledge bases”. In: Proceedings of the 22nd international conference on World Wide Web. ACM, 2013, pp. 413–422.

[16] Herve Gallaire, Jack Minker, and Jean-Marie Nicolas. “Logic and databases: A deductive approach”. In: ACM Computing Surveys (CSUR) 16.2 (1984), pp. 153–185.

[17] Alberto García-Durán et al. “Combining two and three-way embedding models for link prediction in knowledge bases”. In: Journal of Artificial Intelligence Research 55 (2016), pp. 715–742.

[18] Matt Gardner and Tom M Mitchell. “Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction.” In: EMNLP. 2015, pp. 1488–1498.

[19] Matt Gardner et al. “Incorporating vector space similarity in random walk inference over knowledge bases”. In: (2014).

[20] Lise Getoor. Introduction to statistical relational learning. MIT Press, 2007.

[21] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. “Latent space approaches to social network analysis”. In: Journal of the American Statistical Association 97.460 (2002), pp. 1090–1098.

[22] Shanshan Huang and Xiaojun Wan. “AKMiner: Domain-specific knowledge graph mining from academic literatures”. In: International Conference on Web Information Systems Engineering. Springer, 2013, pp. 241–255.

[23] Michael Kifer, Georg Lausen, and James Wu. “Logical foundations of object-oriented and frame-based languages”. In: Journal of the ACM (JACM) 42.4 (1995), pp. 741–843.

[24] Angelika Kimmig et al. “A short introduction to probabilistic soft logic”. In: Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications. 2012, pp. 1–4.

[25] Ni Lao and William W Cohen. “Relational retrieval using a combination of path-constrained random walks”. In: Machine Learning 81.1 (2010), pp. 53–67.

[26] Hector J Levesque. “Knowledge representation and reasoning”. In: Annual Review of Computer Science 1.1 (1986), pp. 255–287.

[27] Daniel Lowd and Pedro Domingos. “Efficient weight learning for Markov logic networks”. In: European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2007, pp. 200–211.

[28] Farzaneh Mahdisoltani, Joanna Biega, and Fabian Suchanek. “Yago3: A knowledge base from multilingual wikipedias”. In: 7th Biennial Conference on Innovative Data Systems Research. CIDR Conference, 2014.

[29] Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval. Vol. 1. 1. Cambridge University Press, 2008.

[30] George A Miller. “WordNet: a lexical database for English”. In: Communications of the ACM 38.11 (1995), pp. 39–41.

[31] Kurt Miller, Michael I Jordan, and Thomas L Griffiths. “Nonparametric latent feature models for link prediction”. In: Advances in neural information processing systems. 2009, pp. 1276–1284.


[32] Tom M Mitchell et al. “Never-ending learning”. In: Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.

[33] Isadore Newman and Carolyn R Benz. Qualitative-quantitative research methodology: Exploring the interactive continuum. SIU Press, 1998.

[34] Maximilian Nickel et al. “A review of relational machine learning for knowledge graphs”. In: arXiv preprint arXiv:1503.00759 (2015).

[35] Akash Rajak and Mahendra Kumar Gupta. “Association rule mining: applications in various areas”. In: Proceedings of International Conference on Data Management, Ghaziabad, India. 2008, pp. 3–7.

[36] Matthew Richardson and Pedro Domingos. “Markov logic networks”. In: Machine Learning 62.1 (2006), pp. 107–136.

[37] Dimitar Shterionov et al. “The most probable explanation for probabilistic logic programs with annotated disjunctions”. In: Inductive Logic Programming. Springer, 2015, pp. 139–153.

[38] John F Sowa et al. Knowledge representation: logical, philosophical, and computational foundations. Vol. 13. MIT Press, 2000.

[39] Ricardo Usbeck et al. “AGDISTIS - graph-based disambiguation of named entities using linked data”. In: International Semantic Web Conference. Springer, 2014, pp. 457–471.

[40] Zhichun Wang and Juanzi Li. “RDF2Rules: learning rules from RDF knowledge bases by mining frequent predicate cycles”. In: arXiv preprint arXiv:1512.07734 (2015).

[41] Robert Andrew Wilson and Frank C Keil. The MIT encyclopedia of the cognitive sciences. MIT Press, 2001.

[42] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. In: arXiv preprint arXiv:1609.08144 (2016).

[43] Yang Yang, Ryan N Lichtenwalter, and Nitesh V Chawla. “Evaluating link prediction methods”. In: Knowledge and Information Systems 45.3 (2015), pp. 751–782.

[44] Xiaoxin Yin, Jiawei Han, and Philip S Yu. “Truth discovery with multiple conflicting information providers on the web”. In: IEEE Transactions on Knowledge and Data Engineering 20.6 (2008), pp. 796–808.


Appendix A

Data preparation algorithm

Algorithm 1: Training and testing datasets preparation

function DataPreparation(G, R)
    Input : A database G in tsv format and a target relation R
    Output: A training set train and a testing set test
    trainpairsR, testpairsR = SelectPairs(G, R)
    train = FetchSubgraph(G, trainpairsR)
    test = FetchSubgraph(G, testpairsR)

function FetchSubgraph(G, pairsR)
    Input : A database G in tsv format and a set of node pairs pairsR of R
    Output: A set of cycles subgraph around pairsR
    subgraph = ()
    for i ← 0 to size(pairsR) do
        e1, e2 = pairsR(i)
        twohop, twohop_inv = SearchPath(e1, e2)
        subgraph.add(twohop, twohop_inv)
        targetnodes = ()
        e1nbrOne = FindNeighbours(G, e1)
        for j ← 0 to size(e1nbrOne) do
            e1nbrTwo = FindNeighbours(G, e1nbrOne(j))
            if e2 in e1nbrTwo then
                targetnodes.add(e1nbrOne(j))
            else
                continue
            end
        end
        for z ← 0 to size(targetnodes) do
            threehop, threehop_inv = SearchPath(e1, targetnodes(z))
            subgraph.add(threehop, threehop_inv)
        end
    end


Appendix B

AMIE+ algorithm

B.1 Rule mining algorithm

Algorithm 2: AMIE+ rule mining algorithm

function RuleMining(G, minHC, maxLen, minConf)
    Input : Training set G, minimum head coverage minHC, maximum length of a rule maxLen, minimum confidence minConf
    Output: A set of rules with their head coverage, standard confidence and PCA confidence
    q = [r1(x, y), r2(x, y), ..., rm(x, y)]
    output = ()
    while q.isNotEmpty() do
        r = q.dequeue()
        if RuleOutputCheck(r, output, minConf) then
            output.add(r)
        else
            if length(r) < maxLen then
                R(r) = Refine(r)
                for rc ∈ R(r) do
                    if hc(rc) ≥ minHC and rc ∉ q then
                        q.enqueue(rc)
                    else
                        continue
                    end
                end
            else
                continue
            end
        end
    end
    return output


B.2 Rule output check

Algorithm 3: The algorithm to check if AMIE+ can output a rule

function RuleOutputCheck(r, output, minConf)
    Input : rule r, output set output, minimum confidence minConf
    Output: Boolean value
    if r.notClosed() ∨ pcaconf(r) < minConf then
        return false
    else
        parentRules = getParentRules(r)
        for rp ∈ parentRules do
            if pcaconf(r) ≤ pcaconf(rp) then
                return false
            else
                return true
            end
        end
    end

B.3 SPARQL examples

Adding a dangling atom to the rule:

marriedTo(x, z) ⇒ livesIn(x, y)

SELECT r, COUNT(livesIn(x, y))
WHERE livesIn(x, y) ∧ marriedTo(x, z) ∧ r(X, Y)
SUCH THAT COUNT(livesIn(x, y)) ≥ k
where r(X, Y) ∈ {r(y, w), r(w, y), r(z, w), r(w, z)}

Adding an instantiated atom to the same rule:

SELECT w, COUNT(livesIn(x, y))
WHERE livesIn(x, y) ∧ marriedTo(x, z) ∧ citizenOf(x, w)
SUCH THAT COUNT(livesIn(x, y)) ≥ k


Appendix C

Rule optimization algorithm

C.1 Forward optimization

Algorithm 4: Forward optimization

function doLearn(rules, minWeight, l2, l1)
    Input : An initial set of rules rules, a weight threshold minWeight, regularization terms l2, l1
    Output: A set of optimized rules rules
    iter = 0
    maxIter = 500
    truthIncompatibility = computeObservedIncomp()
    while iter < maxIter and violation > tolerance do
        expectedIncompatibility = computeExpectedIncomp()
        loss = truthIncompatibility - expectedIncompatibility
        violation = computeViolation()
        for r ∈ rules do
            weight = r.getWeight()
            currentStep = expectedIncompatibility(r) - truthIncompatibility(r) - l2 * weight - l1
            weight += currentStep / scaling
            newWeight = max(weight, 0.0)
            r.setWeight(newWeight)
        end
        for r ∈ rules do
            weight = r.getWeight()
            if weight < minWeight then
                rules.remove(r)
            else
                continue
            end
        end
    end
    return rules


C.2 Backward optimization

Algorithm 5: Backward optimization

function doLearn(rules, minLoss, l2, l1)
    Input : An initial set of rules rules, a loss threshold minLoss, regularization terms l2, l1
    Output: A set of optimized rules currentRules
    iter = 0
    maxIter = 500
    truthIncompatibility = computeObservedIncomp()
    currentRules = ()
    while iter < maxIter and loss > minLoss do
        currentRules.addNextRule()
        expectedIncompatibility = computeExpectedIncomp()
        loss = truthIncompatibility - expectedIncompatibility
        for r ∈ currentRules do
            weight = r.getWeight()
            currentStep = expectedIncompatibility(r) - truthIncompatibility(r) - l2 * weight - l1
            weight += currentStep / scaling
            newWeight = max(weight, 0.0)
            r.setWeight(newWeight)
        end
    end
    return currentRules


C.3 Incompatibility computation

Algorithm 6: Compute truthIncompatibility

function truthIncompatibility()
    Input : A set of grounding atoms A, a set of rules rules
    Output: The value of truth incompatibility for each rule tI
    tI = ()
    for r ∈ rules do
        for gk ∈ A do
            value = 1.0 * gk.body.exists() - 1.0 * gk.head.exists() - (gk.body.size() - 1)
            distance = max(value, 0.0)
            tI(r) += distance
        end
    end
    return tI

Algorithm 7: Compute expectedIncompatibility

function expectedIncompatibility()
    Input : A set of grounding atoms A, a set of rules rules
    Output: The value of expected incompatibility for each rule eI
    eI = ()
    for r ∈ rules do
        for gk ∈ A do
            value = 1.0 * gk.body.exists() - 1.0 * gk.head.expectedValue() - (gk.body.size() - 1)
            distance = max(value, 0.0)
            eI(r) += distance
        end
    end
    return eI


Appendix D

Results

D.1 Examples of generated rules from NELL

Table D.1: Example of generated rules from NELL

Rule                                                                HC    StdConf  PCAConf
?b movie_star_actor ?a ⇒ ?a actor_starred_in_movie ?b               1.0   1.0      1.0
?b team_member ?a ⇒ ?a athlete_plays_for_team ?b                    1.0   1.0      1.0
?e teammate ?a ∧ ?b team_member ?e ⇒ ?a athlete_plays_for_team ?b   0.01  0.92     1.0
?b country_capital ?a ⇒ ?a city_located_in_country ?b               0.13  0.94     0.98
?b created_by_agent ?a ⇒ ?a writer_wrote_book ?b                    0.29  1.0      1.0

D.2 Results of NELL & Freebase15k

Table D.2: Result table of NELL (AP)

Relation                            Best result in VSP   Baseline   90%     80%     70%     60%
river_flows_through_city            0.076                0.799      0.978   0.978   0.978   0.980
sports_team_position_for_sport      0.217                0.0        0.0     0.0     0.0     0.0
city_located_in_country             0.347                0.901      1.0     1.0     1.0     1.0
athlete_plays_for_team              0.589                0.983      /       /       /       /
writer_wrote_book                   0.202                1.0        /       /       /       /
actor_starred_in_movie              0.037                1.0        /       /       /       /
journalist_writes_for_publication   0.319                0.998      0.998   0.998   0.998   0.998
stadium_located_in_city             0.321                1.0        1.0     1.0     1.0     1.0
state_has_lake                      0.0                  0.0        0.0     0.0     0.0     0.0
team_plays_in_league                0.947                0.523      0.529   0.525   0.546   0.594


Table D.3: Cont. Result table of NELL (AP)

Relation                            50%     40%     30%     20%     10%     rule-learning (No. of optimized rules)
river_flows_through_city            0.980   0.980   0.980   0.980   /       0.981 (inv. 20%)
sports_team_position_for_sport      0.0     0.0     0.0     0.0     0.0     0.0
city_located_in_country             1.0     1.0     1.0     1.0     1.0     1.0 (inv. 20%)
athlete_plays_for_team              /       /       /       /       /       0.984 (100%)
writer_wrote_book                   /       /       /       /       /       1.0 (100%)
actor_starred_in_movie              /       /       /       /       /       1.0 (100%)
journalist_writes_for_publication   0.996   0.996   0.998   0.998   0.998   0.998 (90%)
stadium_located_in_city             1.0     1.0     1.0     1.0     1.0     1.0 (55%)
state_has_lake                      0.0     0.0     0.0     0.0     0.0     0.0
team_plays_in_league                0.599   0.596   0.906   0.908   0.612   0.582 (inv. 20%)

Table D.4: Result table of Freebase15k (best F1-score)

Relation                   Baseline   90%     80%     70%     60%     50%
film_director_film         0.546      0.558   0.586   0.613   0.651   0.661
tv_tv_program_language     0.892      0.901   0.919   0.921   0.921   0.921
person_place_of_death      0.408      0.409   0.402   0.397   0.371   0.361
music_genre                0.497      0.501   0.503   0.503   0.503   0.503
sports_team_sport          0.888      0.888   0.888   0.888   0.888   0.888
language_spoken            0.478      0.483   0.491   0.480   0.503   0.502
sport_team_color           0.662      0.662   0.662   0.674   0.681   0.297

Table D.5: Cont. Result table of Freebase15k (mean F1-score / best F1-score)

Relation                   40%     30%     20%     10%     rule-learning (No. cycles)
film_director_film         0.675   0.696   0.706   0.713   0.890 (inv. 3%)
tv_tv_program_language     0.928   0.934   0.926   0.921   0.916 (88%)
person_place_of_death      0.314   0.284   0.179   0.106   0.352 (54%)
music_genre                0.511   0.511   0.521   0.557   0.557 (inv. 10%)
sports_team_sport          0.888   0.891   0.891   0.891   0.897 (inv. 2%)
language_spoken            0.538   0.541   0.602   0.663   0.668 (inv. 3%)
sport_team_color           0.297   /       /       /       0.681 (60%)


TRITA-ICT-EX-2017:101

www.kth.se