discriminative training of markov logic networks

Discriminative Training of Markov Logic Networks

Parag Singla & Pedro Domingos

Outline Motivation Review of MLNs Discriminative Training Experiments

Link Prediction Object Identification

Conclusion and Future Work

Outline MotivationMotivation Review of MLNs Discriminative Training Experiments



Markov Logic Networks(MLNs) AI systems must be able to learn,

reason logically and handle uncertainty Markov Logic Networks [Richardson and

Domingos, 2004]- an effective way to combine first order logic and probability

Markov Networks are used as underlying representation

Features specfied using arbitrary formulas in finite first order logic

Training of MLNs – Generative Approach Optimize the joint distribution of all the variables Parameters learnt independent of specific

inference task Maximum-likelihood (ML) training – computation

of the gradient involves inference – too slow! Use Psuedo-likelihood (PL) as an alternative –

easy to compute PL is suboptimal. Ignores any non-local

interactions between variables ML, PL – generative training approaches

Training of MLNs -Discriminative Approach No need to optimize the joint

distribution of all the variables Optimize the conditional likelihood (CL)

of non-evidence variables given evidence variables

Parameters learnt for a specific inference task

Tends to do better than generative training in general

Why is Discriminative Better?

Generative Parameters learnt are not optimized for the specific inference task.

Need to model all the dependencies in the data – learning might become complicated.

Example of generative models: MRFs

Discriminative Parameters learnt are optimized for the specific inference task.

Need not model dependencies between evidence variables – makes learning task easier.

Example of discriminative

models: CRFs [Lafferty, McCallum, Pereira 2001]

Outline Motivation Review of MLNsReview of MLNs Discriminative Training Experiments



Markov Logic Networks A Markov Logic Network (MLN) is a set of

pairs (F, w) where F is a formula in first-order logic w is a real number

Together with a finite set of constants,it defines a Markov network with One node for each grounding of each

predicate in the MLN One feature for each grounding of each

formula F in the MLN, with the corresponding weight w

Likelihood

Fiii xnw

ZXP )(exp

1)(

Iterate over all MLN clauses

# true groundings of ith clause

Iterate over all ground clauses

1 if jth ground clause is true, 0 otherwise

Gjjj xgw

Z)(exp

1

Gradient of Log-Likelihood

)()()(log xnExnxPw iwiwi

1st term: # true groundings of formula in DB2nd term: inference required (slow!)

Feature count according to data

Feature count according to model

Pseudo-Likelihood [Besag, 1975]

Likelihood of each ground atom given its Markov blanket in the data

Does not require inference at each step

Optimized using L-BFGS [Liu & Nocedal, 1989]

( ) ( | ( ))x

PL X P x MB x

Most terms not affected by changes in weights After initial setup, each iteration takes

O(# ground predicates x # first-order clauses)

( ) ( 0) ( 0) ( 1) ( 1)i i i ix

nsat x p x nsat x p x nsat x

Gradient ofPseudo-Log-Likelihood

where nsati(x=v) is the number of satisfied groundingsof clause i in the training data when x takes value v

Outline Motivation Review of MLNs Discriminative Training Discriminative Training Experiments



Conditional Likelihood (CL)

YFiii

x

yxnwZ

XYP ),(exp1

)|(

Normalize over all possible configurations of non-evidence variables

Iterate over all MLN clauses with at least one grounding containing query variables

Non-evidence variables

Evidence variables

Derivative of log CL

),(),()|(log yxnEyxnxyPw iwiwi

1st term: # true groundings (involving query variables) of formula in DB2nd term: inference required, as before (slow!)

Derivative of log CL

Approximate the expected count by MAP count

),(),()|(log *yxnyxnxyPw iiwi

MAP state

),(),()|(log yxnEyxnxyPw iwiwi

Approximating the Expected Count Use Voted Perceptron Algorithm

[Collins, 2002] Approximate the expected count by

count for the most likely state (MAP) state

Used successfully for linear chain Markov networks

MAP state found using Viterbi

Voted Perceptron Algorithm Initialize wi=0 For t=1 to T

Find the MAP configuration according to current set of weights.

wi,t= * (training count – MAP count)

wi= wi,t/T (Avoids over-fitting)

T

t 1

Generalizing Voted Perceptron Finding the MAP configuration NP

hard for the general case. Can be reduced to a weighted

satisfiability (MaxSAT) problem. Given a SAT formula in clausal form

e.g. (x1 V x3 V x5) … (x5 V x7 Vx50) with clause i having weight of wi

Find the assignment maximizing the sum of weights of satisfied clauses.

MaxWalkSAT [Kautz, Selman & Jiang 97] Assumes clauses with positive weights Mixes greedy search with random walks

Start with some configuration of variables. Randomly pick an unsatisfied clause. With probability p, flip the literal in the clause

which gives maximum gain. With probability 1-p flip a random literal in the clause.

Repeat for a pre-decided number of flips, storing the best seen configuration.

Handling the Negative Weights MLN allows formulas with negative

weights. A formula with weight w can be

replaced by its negation with weight –w in the ground Markov network.

(x1 x3 x5) [w] => (x1 x3 x5) [-w]

=> (x1 x3 x5) [-w] (x1 x3 x5) [-w] => x1 , x3 , x5 [ -w/3]

Weight Initialization and Learning Rate Weights initialized using log odds

of each clause being true in the data.

Determining the learning rate – use a validation set. Learning rate 1/#(ground predicates)


Link Prediction Link Prediction Object Identification


Link Prediction UW-CSE database

Used by Richardson & Domingos [2004] Database of people/courses/publications at UW-CSE 22 Predicates e.g. Student(P), Professor(P),

AdvisedBy(P1,P2) 1158 constants divided into 10 types 4,055,575 ground atoms 3212 true ground atoms 94 hand coded rules stating various regularities

Student(P) => !Professor(P) Predict AdvisedBy in the absence of information

about the predicates Professor and Student

Systems Compared MLN(VP) MLN(ML) MLN(PL) KB CL NB BN

Results on Link Prediction

0.033 0.063 0.034

0.530.693

1.237

0.0460

0.2

0.4

0.6

0.8

1

System

-CLL

Results on Link Prediction

0.295

0.077

0.2320.114

0.006 0.065 0.020

0.2

0.4

0.6

0.8

1

System

AU

C


Link Prediction Object IdentificationObject Identification


Object Identification Given a database of various records

referring to objects in the real world Each record represented by a set of

attribute values Want to find out which of the records

refer to the same object Example: A paper may have more than

one reference in a bibliography database

Why is it Important? Data Cleaning and Integration – first

step in the KDD process Merging of data from multiple sources

results in duplicates Entity Resolution: Extremely important

for doing any sort of data-mining State of the art – far from what is

required. Citeseer has 30 different entries for the

AI textbook by Russell and Norvig

Standard Approach [Fellegi & Sunter, 1969] Look at each pair of records independently Calculate the similarity score for each

attribute value pair based on some metric Find the overall similarity score Merge the records whose similarity is above

a threshold Take a transitive closure

An Example

Subset of a Bibliography Relation

Record Title Author Venue

B1 Object Identification using MLNs

Linda Stewart KDD 2004

B2 Object Identification using MLNs

Linda Stewart SIGKDD 10

B3 Learning Boolean Formulas Bill Johnson KDD 2004

B4 Learning of Boolean Formulas William Johnson

SIGKDD 10

Graphical Representation in Standard Model

b1=b2?

Sim(Linda Stewart, Linda Stewart)

b3=b4?

Author

Title

Venue

Sim(KDD 2004, SIGKDD 10)

Sim(Object Identification using MLNs, Object Identification using MLNs)

Sim(Bill Johnson, William Johnson)

Title

Author

Sim(Learning Boolean Formulas, Leraning of Boolean Formulas)


Venue

Record-pair node

Evidence node

What’s Missing?

b1=b2?


b3=b4?

Author

Title

Venue




Title

Author



Venue

If from b1=b2, you infer that “KDD 2004” is same as “SIGKDD 10”, how canyou use that to help figure out if b3=b4?

Collective Model – Basic Idea Perform simultaneous inference for

all the candidate pairs Facilitate flow of information

through shared attribute values

Representation in Standard Model

b1=b2?


b3=b4?

Author

Title

Venue




Title

Author



Venue

No sharing of nodes

Merging the Evidence Nodes

Author

Still does not solve the problem. Why?

b1=b2?


b3=b4?

Author

Title

Venue



Title

Author



Introducing Information Nodes

b1=b2?


b3=b4?

Author

Title

Venue

b1.T=b2.T?

b1.V=b2.V?b3.V=b4.V?

b3.A=b4.A?

b3.T=b4.T?

b1.A=b2.A?



Title

Author


Information node

Full representation in Collective Model


Flow of Information

b1=b2?


b3=b4?

Author

Title

Venue

b1.T=b2.T?

b1.V=b2.V?b3.V=b4.V?

b3.A=b4.A?

b3.T=b4.T?

b1.A=b2.A?



Title

Author



MLN Predicates for De-Duplicating Citation Databases If two bib entries are the same -

SameBib(b1,b2) If two field values are the same -

SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2)

If cosine based TFIDF score of two field values lies in a particular range (0, 0 - .2, .2 - .4, etc.) – 6 predicates for each field. E.g. AuthorTFIDF.8(a1,a2) is true if TFIDF

similarity score of a1,a2 is in the range (.2, .4]

MLN Rules for De-Duplicating Citation Databases Singleton Predicates

! SameBib(b1,b2) Two fields are same => corresponding bib entries are same.

Author(b1,a1) Author(b2,a2) SameAuthor(a1,a2)=> SameBib(b1,b2) Two papers are same => corresponding fields are same

Author(b1,a1) Author(b2,a2) SameBib(b1,b2)=> SameAuthor(a1,a2) High similarity score => two fields are same

AuthorTFIDF.8(a1,a2) =>SameAuthor(a1,a2) Transitive closure (currently not incorporated)

SameBib(b1,b2) SameBib(b2,b3) => SameBib(b1,b3) 25 first order predicates, 46 first order clauses.

Cora Database Cleaned up version of McCallum’s Cora

database. 1295 citations to 132 difference Computer

Science research papers, each citation described by author, venue, title fields.

401,552 ground atoms. 82,026 tuples (true ground atoms) Predict SameBib, SameAuthor, SameVenue

Systems Compared MLN(VP) MLN(ML) MLN(PL) KB CL NB BN

Results on CoraPredicting the Citation Matches

0.069

13.261

0.699

8.629

0.461

0.082 0.067

0

0.2

0.4

0.6

0.8

1

System

-CLL

Results on CoraPredicting the Citation Matches

0.973

0.111

0.722

0.149 0.187

0.945 0.951

0

0.2

0.4

0.6

0.8

1

System

AU

C

Results on CoraPredicting the Author Matches

0.069

12.973 3.062 8.096 2.375

0.203 0.203

0

0.2

0.4

0.6

0.8

1

System

-CLL

Results on CoraPredicting the Author Matches

0.969

0.162

0.3230.18

0.09

0.734 0.734

0

0.2

0.4

0.6

0.8

1

System

AU

C

Results on CoraPredicting the Venue Matches

0.232

13.38

0.708

8.475 1.261

0.233 0.233

0

0.2

0.4

0.6

0.8

1

System

-CLL

Results on CoraPredicting the Venue Matches

0.771

0.061

0.342

0.096 0.047

0.339 0.339

0

0.2

0.4

0.6

0.8

1

System

AU

C



Conclusion and Future WorkConclusion and Future Work

Conclusions Markov Logic Networks – a powerful

way of combining logic and probability. MLNs can be discriminatively trained

using a voted perceptron algorithm Discriminatively trained MLNs perform

better than purely logical approaches, purely probabilistic approaches as well as generatively trained MLNs.

Future Work Discriminative learning of MLN

structure Max-margin type training of MLNs Extensions of MaxWalkSAT Further application to the link

prediction, object identification and possibly other application areas.

discriminative training of markov logic networks

Documents

training data

training computation

specific inference tasktends

generative parameters

evidence variablesparameters

nonevidence variables

psuedolikelihood pl

example of generative