# Compressing probabilistic Prolog programs



Mach Learn, DOI 10.1007/s10994-007-5030-x

Compressing probabilistic Prolog programs

L. De Raedt · K. Kersting · A. Kimmig · K. Revoredo · H. Toivonen

Received: 27 September 2007 / Accepted: 9 October 2007

© Springer Science+Business Media, LLC 2007

Abstract ProbLog is a recently introduced probabilistic extension of Prolog (De Raedt et al. in Proceedings of the 20th international joint conference on artificial intelligence, pp. 2468–2473, 2007). A ProbLog program defines a distribution over logic programs by specifying for each clause the probability that it belongs to a randomly sampled program, and these probabilities are mutually independent. The semantics of ProbLog is then defined by the success probability of a query in a randomly sampled program.

This paper introduces the theory compression task for ProbLog, which consists of selecting that subset of clauses of a given ProbLog program that maximizes the likelihood w.r.t. a set of positive and negative examples. Experiments in the context of discovering links in real biological networks demonstrate the practical applicability of the approach.

Editors: Stephen Muggleton, Ramon Otero, Simon Colton.

L. De Raedt · A. Kimmig
Departement Computerwetenschappen, K.U. Leuven, Celestijnenlaan 200A, bus 2402, 3001 Heverlee, Belgium

L. De Raedt
e-mail: luc.deraedt@cs.kuleuven.be

A. Kimmig (✉)
e-mail: angelika.kimmig@cs.kuleuven.be

K. Kersting · K. Revoredo
Institut für Informatik, Albert-Ludwigs-Universität, Georges-Köhler-Allee, Gebäude 079, 79110 Freiburg im Breisgau, Germany

K. Kersting
e-mail: kersting@informatik.uni-freiburg.de

K. Revoredo
e-mail: kate@cos.ufrj.br

H. Toivonen
Department of Computer Science, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland
e-mail: hannu.toivonen@cs.helsinki.fi


Keywords Probabilistic logic · Inductive logic programming · Theory revision · Compression · Network mining · Biological applications · Statistical relational learning

1 Introduction

The past few years have seen a surge of interest in the field of probabilistic logic learning or statistical relational learning (e.g. De Raedt and Kersting 2003; Getoor and Taskar 2007). In this endeavor, many probabilistic logics have been developed. Prominent examples include PHA (Poole 1993), PRISM (Sato and Kameya 2001), SLPs (Muggleton 1996), and probabilistic Datalog (pD) (Fuhr 2000). These frameworks attach probabilities to logical formulae, most often definite clauses, and typically impose further constraints to facilitate the computation of the probability of queries and simplify the learning algorithms for such representations.

Our work on this topic has been motivated by mining of large biological networks where edges are labeled with probabilities. Such networks of biological concepts (genes, proteins, phenotypes, etc.) can be extracted from public databases, and probabilistic links between concepts can be obtained by various prediction techniques (Sevon et al. 2006). Such networks can be easily modeled by our recent probabilistic extension of Prolog, called ProbLog (De Raedt et al. 2007). ProbLog is essentially Prolog where all clauses are labeled with the probability that they belong to a randomly sampled program, and these probabilities are mutually independent. A ProbLog program thus specifies a probability distribution over all possible non-probabilistic subprograms of the ProbLog program. The success probability of a query is then defined simply as the probability that it succeeds in a random subprogram. The semantics of ProbLog is not really new; it closely corresponds to that of pD (Fuhr 2000) and is closely related to, though different from, the pure distributional semantics of Sato and Kameya (2001); cf. also Sect. 8. However, the key contribution of ProbLog (De Raedt et al. 2007) is the introduction of an effective inference procedure for this semantics, which enables its application to the biological link discovery task.

The key contribution of the present paper is the introduction of the task of compressing a ProbLog theory using a set of positive and negative examples, and the development of an algorithm for realizing this. Theory compression refers to the process of removing as many clauses as possible from the theory in such a manner that the compressed theory explains the examples as well as possible. The compressed theory should be a lot smaller, and therefore easier to understand and employ. It will also contain the essential components of the theory needed to explain the data. The theory compression problem is again motivated by the biological application. In this application, scientists try to analyze large networks of links in order to obtain an understanding of the relationships amongst a typically small number of nodes. The idea now is to remove as many links from these networks as possible using a set of positive and negative examples. The examples take the form of relationships that are either interesting or uninteresting to the scientist. The result should ideally be a small network that contains the essential links and assigns high probabilities to the positive and low probabilities to the negative examples. This task is analogous to a form of theory revision (Wrobel 1996) where the only operation allowed is the deletion of rules or facts. The analogy explains why we have formalized the theory compression task within the ProbLog framework. Within this framework, examples are true and false ground facts, and the task is to find a small subset of a given ProbLog program that maximizes the likelihood of the examples.

This paper is organized as follows: in Sect. 2, the biological motivation for ProbLog and theory compression is discussed; in Sect. 3, the semantics of ProbLog are briefly reviewed; in Sect. 4, the inference mechanism for computing the success probability of ProbLog queries as introduced by De Raedt et al. (2007) is reviewed; in Sect. 5, the task of probabilistic theory compression is defined; and an algorithm for tackling the compression problem is presented in Sect. 6. Experiments that evaluate the effectiveness of the approach are presented in Sect. 7, and finally, in Sect. 8, we discuss some related work and conclude.

2 Example: ProbLog for biological graphs

As a motivating application, consider link mining in networks of biological concepts. Molecular biological data is available from public sources, such as Ensembl,¹ NCBI Entrez,² and many others. They contain information about various types of objects, such as genes, proteins, tissues, organisms, biological processes, and molecular functions. Information about their known or predicted relationships is also available, e.g., that gene A of organism B codes for protein C, which is expressed in tissue D, or that genes E and F are likely to be related since they co-occur often in scientific articles. Mining such data has been identified as an important and challenging task (cf. Perez-Iratxeta et al. 2002).

For instance, a biologist may be interested in the potential relationships between a given set of proteins. If the original graph contains more than some dozens of nodes, manual and visual analysis is difficult. Within this setting, our goal is to automatically extract a relevant subgraph which contains the most important connections between the given proteins. This result can then be used by the biologist to study the potential relationships much more efficiently.

A collection of interlinked heterogeneous biological data can be conveniently seen as a weighted graph or network of biological concepts, where the weight of an edge corresponds to the probability that the corresponding nodes are related (Sevon et al. 2006). A ProbLog representation of such a graph could simply consist of probabilistic edge/2 facts, though finer grained representations using relations such as codes/2, expresses/2 would also be possible. In a probabilistic graph, the strength of a connection between two nodes can be measured as the probability that a path exists between the two given nodes (Sevon et al. 2006).³ Such queries are easily expressed in ProbLog by defining the (non-probabilistic) predicate path(N1,N2) in the usual way, using probabilistic edge/2 facts. Obviously, logic (and ProbLog) can easily be used to express much more complex possible relations. For simplicity of exposition we here only consider a simple representation of graphs, and in the future will address more complex applications and settings.
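The path-based connection strength described above can be made concrete by brute force. The following Python sketch is not the inference procedure of De Raedt et al. (2007); it simply enumerates every subset of edges of a small directed graph, weights each subset by the product of the edge inclusion and exclusion probabilities, and sums the weights of the subsets in which the goal node is reachable. The graph, its node names, its probabilities, and the function names are all hypothetical, and the approach is exponential in the number of edges, so it only serves to illustrate the definition.

```python
from itertools import product

# Hypothetical probabilistic graph (not from the paper):
# (source, target) -> probability that the edge is present.
EDGES = {
    ("a", "b"): 0.8,
    ("b", "c"): 0.5,
    ("a", "c"): 0.4,
}


def reachable(present_edges, start, goal):
    """Depth-first search over the edges present in one sampled graph."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for (u, v) in present_edges:
            if u == node and v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def path_probability(edges, start, goal):
    """Sum the probabilities of all edge subsets in which goal is reachable."""
    items = list(edges.items())
    total = 0.0
    for mask in product([False, True], repeat=len(items)):
        weight = 1.0
        present = []
        for keep, (edge, p) in zip(mask, items):
            # Independent edges: multiply p if kept, (1 - p) if dropped.
            weight *= p if keep else 1.0 - p
            if keep:
                present.append(edge)
        if reachable(present, start, goal):
            total += weight
    return total


if __name__ == "__main__":
    print(path_probability(EDGES, "a", "c"))
```

In this toy graph the two routes from a to c (the direct edge, and the route via b) share no edges, so the enumeration agrees with 1 − (1 − 0.4)(1 − 0.8 · 0.5) = 0.64 up to floating-point rounding.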

3 ProbLog: probabilistic Prolog

A ProbLog program consists, as Prolog, of a set of definite clauses. However, in ProbLog every clause $c_i$ is labeled with the probability $p_i$ that it is true.

Example 1 Within bibliographic data analysis, the similarity structure among items can improve information retrieval results. Consider a collection of papers {a, b, c, d} and some pairwise similarities similar(a,c), e.g., based on key word analysis. Two items X and Y are related(X,Y) if they are similar (such as a and c) or if X is similar to some item Z which is related to Y. Uncertainty in the data and in the inference can elegantly be represented by the attached probabilities:

    1.0 : related(X,Y) :- similar(X,Y).
    0.8 : related(X,Y) :- similar(X,Z), related(Z,Y).
    0.9 : similar(a,c).    0.7 : similar(c,b).
    0.6 : similar(d,c).    0.9 : similar(d,b).

¹ www.ensembl.org.

² www.ncbi.nlm.nih.gov/Entrez/.

³ Sevon et al. (2006) view this strength or probability as the product of three factors, indicating the reliability, the relevance as well as the rarity (specificity) of the information.

A ProbLog program $T = \{p_1 : c_1, \ldots, p_n : c_n\}$ defines a probability distribution over logic programs $L \subseteq L_T = \{c_1, \ldots, c_n\}$ in the following way:

$$P(L \mid T) = \prod_{c_i \in L} p_i \prod_{c_i \in L_T \setminus L} (1 - p_i). \tag{1}$$
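As a small illustration of Eq. (1), the Python sketch below evaluates the probability of one subprogram of the theory from Example 1 and checks that the probabilities of all subprograms sum to one. The clause probabilities are taken from Example 1; the short clause labels (`r1`, `s_ac`, and so on) and the function names are hypothetical and introduced only for this sketch.

```python
from itertools import combinations
from math import isclose

# Clause probabilities from Example 1; the labels are hypothetical shorthand.
THEORY = {
    "r1": 1.0,    # related(X,Y) :- similar(X,Y).
    "r2": 0.8,    # related(X,Y) :- similar(X,Z), related(Z,Y).
    "s_ac": 0.9,  # similar(a,c).
    "s_cb": 0.7,  # similar(c,b).
    "s_dc": 0.6,  # similar(d,c).
    "s_db": 0.9,  # similar(d,b).
}


def subprogram_probability(theory, chosen):
    """Eq. (1): product of p_i over chosen clauses times (1 - p_i) over the rest."""
    prob = 1.0
    for clause, p in theory.items():
        prob *= p if clause in chosen else 1.0 - p
    return prob


def all_subprograms(theory):
    """Yield every subset of the clause labels as a frozenset."""
    labels = list(theory)
    for size in range(len(labels) + 1):
        for subset in combinations(labels, size):
            yield frozenset(subset)


if __name__ == "__main__":
    # Subprogram containing every clause except similar(d,b).
    L = frozenset(THEORY) - {"s_db"}
    print(subprogram_probability(THEORY, L))
    # The probabilities of all 2^6 subprograms sum to 1.
    total = sum(subprogram_probability(THEORY, s) for s in all_subprograms(THEORY))
    print(isclose(total, 1.0))
```

For the subprogram that omits only similar(d,b), Eq. (1) gives 1.0 · 0.8 · 0.9 · 0.7 · 0.6 · (1 − 0.9) = 0.03024. Because the first clause has probability 1.0, every subprogram that omits it receives probability 0.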

Unlike in Prolog, where one is typically interested in determining
