

1

Learning the Structure of Markov Logic Networks

Stanley Kok & Pedro Domingos

Dept. of Computer Science and Eng.

University of Washington

2

Overview: Motivation, Background, Structure Learning Algorithm, Experiments, Future Work & Conclusion

3

Motivation
Statistical Relational Learning (SRL) combines the benefits of:
- Statistical Learning: uses probability to handle uncertainty in a robust and principled way
- Relational Learning: models domains with multiple relations

4

Motivation
Many SRL approaches combine a logical language and Bayesian networks, e.g. Probabilistic Relational Models [Friedman et al., 1999]
The need to avoid cycles in Bayesian networks causes many difficulties [Taskar et al., 2002]
Researchers started using Markov networks instead

5–6

Motivation
Relational Markov Networks [Taskar et al., 2002]
- Conjunctive database queries + Markov networks
- Require space exponential in the size of the cliques
Markov Logic Networks [Richardson & Domingos, 2004]
- First-order logic + Markov networks
- Compactly represent large cliques
- Did not learn structure (used an external ILP system)
This paper develops a fast algorithm that learns MLN structure: the most powerful SRL learner to date

7

Overview: Motivation, Background, Structure Learning Algorithm, Experiments, Future Work & Conclusion

8

Markov Logic Networks
A first-order KB is a set of hard constraints: if a world violates even one formula, it has zero probability
MLNs soften the constraints: it is OK to violate formulas; the fewer formulas a world violates, the more probable it is
Each formula is given a weight that reflects how strong a constraint it is

9

MLN Definition
A Markov Logic Network (MLN) is a set of pairs (F, w), where
- F is a formula in first-order logic
- w is a real number
Together with a finite set of constants, it defines a Markov network with
- One node for each grounding of each predicate in the MLN
- One feature for each grounding of each formula F in the MLN, with the corresponding weight w

10

Ground Markov Network

Formula (weight 2.7): AdvisedBy(S,P) ⇒ Student(S) ∧ Professor(P)
Constants: STAN, PEDRO

Ground predicates (one node each): Student(STAN), Student(PEDRO), Professor(STAN), Professor(PEDRO), AdvisedBy(STAN,PEDRO), AdvisedBy(PEDRO,STAN), AdvisedBy(STAN,STAN), AdvisedBy(PEDRO,PEDRO)
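To make the grounding step concrete, here is a minimal sketch (not the authors' code) that enumerates the ground predicates becoming nodes of the ground network for this example's predicates and constants:

```python
from itertools import product

# Minimal sketch: enumerate the ground atoms that become nodes of the
# ground Markov network for the slide's example.
predicates = {            # predicate name -> arity
    "Student": 1,
    "Professor": 1,
    "AdvisedBy": 2,
}
constants = ["STAN", "PEDRO"]

ground_atoms = [
    f"{pred}({','.join(args)})"
    for pred, arity in predicates.items()
    for args in product(constants, repeat=arity)
]
print(ground_atoms)
# ['Student(STAN)', 'Student(PEDRO)', 'Professor(STAN)', 'Professor(PEDRO)',
#  'AdvisedBy(STAN,STAN)', 'AdvisedBy(STAN,PEDRO)', ...]
```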

11–15

MLN Model

$$P(X = x) \;=\; \frac{1}{Z}\,\exp\Big(\sum_i w_i\, n_i(x)\Big)$$

- x: vector of value assignments to the ground predicates
- Z: partition function; sums over all possible value assignments to the ground predicates
- w_i: weight of the ith formula
- n_i(x): number of true groundings of the ith formula
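A tiny brute-force sketch of this model (illustration only; real MLNs are far too large to enumerate): it computes the probability of one world over three ground atoms, using the single example formula from the previous slide.

```python
import itertools
import math

# Brute-force illustration of P(X = x) = exp(sum_i w_i * n_i(x)) / Z
# over a tiny set of ground atoms.
atoms = ["Student(STAN)", "Professor(PEDRO)", "AdvisedBy(STAN,PEDRO)"]

def n_1(world):
    # True groundings of AdvisedBy(S,P) => Student(S) ^ Professor(P),
    # restricted here to the single grounding S=STAN, P=PEDRO.
    advised = world["AdvisedBy(STAN,PEDRO)"]
    head = world["Student(STAN)"] and world["Professor(PEDRO)"]
    return int((not advised) or head)

weights = [2.7]           # weight of the single formula
features = [n_1]

def score(world):
    return math.exp(sum(w * f(world) for w, f in zip(weights, features)))

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
Z = sum(score(w) for w in worlds)                    # partition function
x = {"Student(STAN)": True, "Professor(PEDRO)": True, "AdvisedBy(STAN,PEDRO)": True}
print(score(x) / Z)                                  # probability of this world
```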

16–18

MLN Weight Learning
- The likelihood is a concave function of the weights
- Quasi-Newton methods, e.g. L-BFGS [Liu & Nocedal, 1989], can find the optimal weights
- But this is SLOW: computing the likelihood and its gradient requires inference over the ground network, a #P-complete problem
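For reference, the gradient of the log-likelihood with respect to each weight (a standard result for log-linear models) makes the bottleneck explicit; the expectation term requires summing over all possible worlds:

$$\frac{\partial}{\partial w_i}\log P_w(X = x) \;=\; n_i(x) \;-\; \mathbb{E}_w\big[n_i(X)\big] \;=\; n_i(x) \;-\; \sum_{x'} P_w(X = x')\, n_i(x')$$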

19–20

MLN Weight Learning
R&D instead optimized pseudo-likelihood [Besag, 1975], which conditions each ground predicate only on its Markov blanket and so avoids inference over the full ground network
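The (log) pseudo-likelihood being optimized, in Besag's standard form, where MB_x(X_l) is the state of X_l's Markov blanket in the data:

$$\log P^{*}_w(X = x) \;=\; \sum_{l=1}^{n} \log P_w\big(X_l = x_l \mid MB_x(X_l)\big)$$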

21

MLN Structure Learning
R&D “learned” MLN structure in two disjoint steps:
- Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN [De Raedt & Dehaspe, 1997])
- Learn clause weights by optimizing pseudo-likelihood
This is unlikely to give the best results, because CLAUDIEN
- finds clauses that hold with some accuracy/frequency in the data
- does not find clauses that maximize the data's (pseudo-)likelihood

22

Overview: Motivation, Background, Structure Learning Algorithm, Experiments, Future Work & Conclusion

23

This paper develops an algorithm that: Learns first-order clauses by directly optimizing

pseudo-likelihood Is fast enough Performs better than R&D, pure ILP,

purely KB and purely probabilistic approaches

MLN Structure Learning

24

Structure Learning Algorithm

High-level algorithm:
REPEAT
  MLN ← MLN ∪ FindBestClauses(MLN)
UNTIL FindBestClauses(MLN) returns NULL

FindBestClauses(MLN)
  Create candidate clauses
  FOR EACH candidate clause c
    Compute increase in evaluation measure of adding c to MLN
  RETURN k clauses with greatest increase
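A minimal Python sketch of this loop (create_candidate_clauses and score_gain are hypothetical placeholders, not the authors' implementation):

```python
# Sketch of the high-level structure learning loop.
def learn_structure(mln, k=1):
    while True:
        best = find_best_clauses(mln, k)
        if not best:                       # no clause improves the evaluation measure
            return mln
        mln.extend(best)                   # MLN <- MLN union FindBestClauses(MLN)

def find_best_clauses(mln, k):
    candidates = create_candidate_clauses(mln)         # apply the clause operators
    scored = [(score_gain(mln, c), c) for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for gain, c in scored[:k] if gain > 0]   # k clauses with greatest increase
```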

25

Structure Learning
- Evaluation measure
- Clause construction operators
- Search strategies
- Speedup techniques

26

Evaluation Measure
R&D used pseudo-log-likelihood
This gives undue weight to predicates with a large number of groundings

27–30

Evaluation Measure
Weighted pseudo-log-likelihood (WPLL):

$$\log P^{\bullet}_w(X = x) \;=\; \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w\big(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k})\big)$$

- the outer sum ranges over predicates r; c_r is the weight given to predicate r
- the inner sum ranges over the g_r groundings of predicate r
- each term is the CLL (conditional log-likelihood) of a grounding given its Markov blanket
Plus a Gaussian weight prior and a structure prior
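A small sketch of the WPLL computation, assuming the per-grounding conditional probabilities have already been computed elsewhere; choosing c_r = 1/g_r gives every predicate equal total weight regardless of its number of groundings, which counteracts the problem noted on the previous slide.

```python
import math

# Sketch of weighted pseudo-log-likelihood (WPLL) with c_r = 1/g_r.
def wpll(cond_probs_by_predicate):
    """cond_probs_by_predicate: dict mapping predicate name ->
    list of P(grounding takes its database value | its Markov blanket)."""
    total = 0.0
    for r, probs in cond_probs_by_predicate.items():
        c_r = 1.0 / len(probs)                       # predicate weight
        total += c_r * sum(math.log(p) for p in probs)
    return total

example = {"Student": [0.9, 0.8], "AdvisedBy": [0.7, 0.99, 0.95, 0.6]}
print(wpll(example))
```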

31

Clause Construction Operators
- Add a literal (negative/positive)
- Remove a literal
- Flip signs of literals
- Limit # of distinct variables to restrict the search space
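A toy sketch of these operators over a simple clause representation (a clause as a list of signed literals; purely illustrative, not the paper's data structures):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    predicate: str
    args: tuple           # variable names, e.g. ("S", "P")
    positive: bool = True

def add_literal(clause, lit, max_vars=4):
    new_clause = clause + [lit]
    n_vars = len({v for l in new_clause for v in l.args})
    return new_clause if n_vars <= max_vars else None   # limit distinct variables

def remove_literal(clause, i):
    return clause[:i] + clause[i + 1:]

def flip_sign(clause, i):
    l = clause[i]
    return clause[:i] + [Literal(l.predicate, l.args, not l.positive)] + clause[i + 1:]

clause = [Literal("AdvisedBy", ("S", "P"), False), Literal("Student", ("S",))]
print(flip_sign(clause, 1))
```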

32

Beam Search
- Same as that used in ILP & rule induction
- Repeatedly find the single best clause

33

Shortest-First Search (SFS)
1. Start from an empty or hand-coded MLN
2. FOR L ← 1 TO MAX_LENGTH
3.   Apply each literal addition & deletion to each clause, to create clauses of length L
4.   Repeatedly add the K best clauses of length L to the MLN, until no clause of length L improves WPLL
Similar to Della Pietra et al. (1997), McCallum (2003)
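A compact Python sketch of SFS (expand_to_length and wpll_gain are hypothetical stand-ins for the real candidate generation and WPLL scoring):

```python
# Sketch of shortest-first search over clause lengths.
def shortest_first_search(mln, max_length, k):
    for length in range(1, max_length + 1):
        while True:
            candidates = expand_to_length(mln, length)   # literal additions/deletions
            scored = sorted(((wpll_gain(mln, c), c) for c in candidates),
                            key=lambda t: t[0], reverse=True)
            best = [c for gain, c in scored[:k] if gain > 0]
            if not best:                 # no clause of this length improves WPLL
                break
            mln.extend(best)
    return mln
```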

34–37

Speedup Techniques
FindBestClauses(MLN)
  Create candidate clauses
  FOR EACH candidate clause c
    Compute increase in WPLL (using L-BFGS) of adding c to MLN
  RETURN k clauses with greatest increase

Bottlenecks:
- Many candidate clauses (SLOW)
- Many CLLs to compute (SLOW)
- Each CLL involves a #P-complete problem (SLOW)
- L-BFGS itself is not that fast

38–43

Speedup Techniques
- Clause Sampling
- Predicate Sampling
- Avoid Redundancy
- Loose Convergence Thresholds
- Ignore Unrelated Clauses
- Weight Thresholding
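A rough sketch of the two sampling ideas, clause sampling and predicate sampling (sample sizes and helper names are illustrative assumptions, not values from the paper):

```python
import random

def sampled_true_grounding_count(clause, groundings, sample_size=10000):
    """Clause sampling: estimate the number of true groundings of a clause
    from a uniform sample of its groundings (clause.is_true is hypothetical)."""
    sample = groundings if len(groundings) <= sample_size else random.sample(groundings, sample_size)
    frac_true = sum(clause.is_true(g) for g in sample) / len(sample)
    return frac_true * len(groundings)

def sampled_wpll(groundings_by_predicate, cll, sample_size=500):
    """Predicate sampling: estimate WPLL from a subsample of each predicate's
    groundings (cll(g) returns a grounding's conditional log-likelihood)."""
    total = 0.0
    for r, groundings in groundings_by_predicate.items():
        sample = random.sample(groundings, min(sample_size, len(groundings)))
        total += sum(cll(g) for g in sample) / len(sample)   # mean CLL per predicate
    return total
```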

44

Overview: Motivation, Background, Structure Learning Algorithm, Experiments, Future Work & Conclusion

45

Experiments
UW-CSE domain
- 22 predicates, e.g. AdvisedBy(X,Y), Student(X), etc.
- 10 types, e.g. Person, Course, Quarter, etc.
- # ground predicates ≈ 4 million
- # true ground predicates ≈ 3000
- Hand-coded KB with 94 formulas, e.g.:
  - Each student has at most one advisor
  - If a student is an author of a paper, so is her advisor
Cora domain
- Computer science research papers
- Collective deduplication of author, venue, title

46–49

Systems
- MLN(SLB): structure learning with beam search
- MLN(SLS): structure learning with SFS
- KB: hand-coded KB; CL: CLAUDIEN; FO: FOIL; AL: Aleph
- MLN(KB), MLN(CL), MLN(FO), MLN(AL): the above clause sets with MLN weight learning
- NB: Naïve Bayes; BN: Bayesian networks

50

Methodology
UW-CSE domain
- DB divided into 5 areas: AI, Graphics, Languages, Systems, Theory
- Leave-one-out testing by area
Measured:
- average CLL of the ground predicates
- average area under the precision-recall curve of the ground predicates (AUC)
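One way to compute these two metrics from predicted probabilities of the query ground predicates (a sketch; not necessarily the exact procedure used in the paper):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def average_cll(y_true, p_pred, eps=1e-6):
    # Mean log-probability assigned to each ground predicate's true value.
    p = np.clip(p_pred, eps, 1 - eps)
    return np.mean(np.where(y_true == 1, np.log(p), np.log(1 - p)))

def pr_auc(y_true, p_pred):
    # Area under the precision-recall curve.
    precision, recall, _ = precision_recall_curve(y_true, p_pred)
    return auc(recall, precision)

y_true = np.array([1, 0, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.4, 0.7, 0.1])
print(average_cll(y_true, p_pred), pr_auc(y_true, p_pred))
```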

51–54

Results: UW-CSE (leave-one-area-out averages)

System      CLL       AUC
MLN(SLS)    -0.061    0.533
MLN(SLB)    -0.088    0.472
MLN(CL)     -0.151    0.306
MLN(FO)     -0.208    0.140
MLN(AL)     -0.223    0.148
MLN(KB)     -0.142    0.429
CL          -0.574    0.170
FO          -0.661    0.131
AL          -0.579    0.117
KB          -0.812    0.266

55

Results: UW-CSE, MLN structure learning vs. purely probabilistic learners

System      CLL       AUC
MLN(SLS)    -0.061    0.533
MLN(SLB)    -0.088    0.472
NB          -0.370    0.390
BN          -0.166    0.397

56

Timing
MLN(SLS) on UW-CSE
- Cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines
- Without speedups: did not finish in 24 hrs
- With speedups: 5.3 hrs

57

Lesion Study
Disable one speedup technique at a time; SFS on UW-CSE (one fold). Runtimes in hours:

all speedups: 4.0
no clause sampling: 21.6
no predicate sampling: 8.4
don't avoid redundancy: 6.5
no loose convergence threshold: 4.1
no weight thresholding: 24.8

58

Overview: Motivation, Background, Structure Learning Algorithm, Experiments, Future Work & Conclusion

59

Future Work
- Speed up counting of # true groundings of a clause
- Probabilistically bound the loss in accuracy due to subsampling
- Probabilistic predicate discovery

60

Conclusion
- Markov logic networks: a powerful combination of first-order logic and probability
- Richardson & Domingos (2004) did not learn MLN structure
- We develop an algorithm that automatically learns both first-order clauses and their weights
- We develop speedup techniques to make our algorithm fast enough to be practical
- We show experimentally that our algorithm outperforms Richardson & Domingos, pure ILP, purely KB approaches, and purely probabilistic approaches

(For software, email: koks@cs.washington.edu)
