6.899 Relational Data Learning


Page 1: 6.899 Relational Data Learning

6.899 Relational Data Learning

Yuan Qi

MIT Media Lab

[email protected]

May 7, 2002

Page 2: 6.899 Relational Data Learning

Outline

Structure Learning Using Stochastic Logic Programming (SLP)

Text Classification Using Probabilistic Relational Models (PRM)

Page 3: 6.899 Relational Data Learning

Part 1: Structure Learning Using SLP

SLP defines prior over BN structures

MCMC sampling BN structures

New Sampling Method

Page 4: 6.899 Relational Data Learning

An SLP Defining a Prior over BN Structures

bn([],[],[]).
bn([RV|RVs],BN,AncBN) :-
    bn(RVs, BN2, AncBN2),
    connect_no_cycles(RV,BN2,AncBN2,BN,AncBN).

% An edge: RV parent of H
1/3 :: which_edge([H|T],RV,[H-RV|Rest]) :-
    choose_edges(T,RV,Rest).
% An edge: H parent of RV
1/3 :: which_edge([H|T],RV,[RV-H|Rest]) :-
    choose_edges(T,RV,Rest).
% No edge
1/3 :: which_edge([_H|T],RV,Rest) :-
    choose_edges(T,RV,Rest).
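
As a rough illustration (not the SLP semantics themselves), the following Python sketch mimics the generative process this program encodes: for each pair of variables it picks, with probability 1/3 each, one edge direction, the other, or no edge, and silently drops any choice that would create a cycle, playing the role of connect_no_cycles. The function names and the cycle check are hypothetical.

import random

def creates_cycle(edges, parent, child):
    """Return True if adding parent -> child would close a directed cycle."""
    # A cycle appears iff parent is already reachable from child.
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(c for (p, c) in edges if p == node)
    return False

def sample_bn_structure(variables, rng=random):
    """Sample a DAG over `variables` using the 1/3 : 1/3 : 1/3 edge prior."""
    edges = []
    for i, rv in enumerate(variables):
        for h in variables[i + 1:]:
            choice = rng.choice(["h_parent_of_rv", "rv_parent_of_h", "no_edge"])
            if choice == "h_parent_of_rv" and not creates_cycle(edges, h, rv):
                edges.append((h, rv))
            elif choice == "rv_parent_of_h" and not creates_cycle(edges, rv, h):
                edges.append((rv, h))
            # "no_edge" (or a cycle-creating choice) adds nothing
    return edges

print(sample_bn_structure(["A", "B", "C", "D"]))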

Page 5: 6.899 Relational Data Learning

Metropolis-Hastings Sampling

p(T) specifies the SLP-defined prior over BN structures. Sample T* from the transition distribution q(Ti, T*). Set Ti+1 = T* with probability given by the acceptance ratio

otherwise set Ti+1 = Ti.

\alpha(T_i, T^*) = \min\left\{ \frac{q(T^*, T_i)\, p(Y \mid X, T^*)\, p(T^*)}{q(T_i, T^*)\, p(Y \mid X, T_i)\, p(T_i)},\; 1 \right\}
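
For concreteness, here is a minimal Python sketch of this accept/reject step; log_post, propose, and log_q are placeholders for the unnormalized log posterior, the structure proposal, and its density, none of which are specified by the slides.

import math
import random

def mh_step(current, log_post, propose, log_q, rng=random):
    """One Metropolis-Hastings step over BN structures.

    log_post(T): log p(Y | X, T) + log p(T), the unnormalized log posterior.
    propose(T):  draw T* from the transition kernel q(T, T*).
    log_q(a, b): log q(a, b), the proposal density of moving from a to b.
    """
    proposal = propose(current)
    log_alpha = (log_post(proposal) + log_q(proposal, current)
                 - log_post(current) - log_q(current, proposal))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal, True    # accept: T_{i+1} = T*
    return current, False        # reject: T_{i+1} = T_i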

Page 6: 6.899 Relational Data Learning

The Transition Kernel (1)

The transition kernel can be implemented by generating a new derivation (yielding a new model M*) from the derivation that yields the current model Mi. To be specific, we:

Backtrack one step to the most recent choice point in the SLD-tree (i.e., the probability tree)

If at the top of the tree, stop. Otherwise, backtrack one more step to the next choice point with a predefined backtrack probability pb.

Page 7: 6.899 Relational Data Learning

The Transition Kernel (2)

Once backtracking stops, choose a new leaf M* from the choice point by selecting branches according to the probabilities attached to them (log-linear sampling). However, we may not choose the branch that leads back to Mi.
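
The Python sketch below illustrates this backtrack-and-resample proposal, assuming the derivation is represented as a list of branch indices with known branch probabilities at each choice point (all names are hypothetical); regrowing the derivation below the chosen branch is omitted.

import random

def propose_derivation(path, branch_probs, p_b, rng=random):
    """Backtrack-and-resample proposal over a derivation in the probability tree.

    path:         branch index chosen at each choice point of the current derivation.
    branch_probs: branch_probs[d] lists the branch probabilities at depth d
                  (each choice point is assumed to have at least two branches).
    p_b:          probability of backtracking one more step.
    """
    # Backtrack one step to the most recent choice point, then keep
    # backtracking with probability p_b until we stop or reach the top.
    depth = len(path) - 1
    while depth > 0 and rng.random() < p_b:
        depth -= 1
    # Resample the branch at the stopping point according to the attached
    # probabilities, excluding the branch that leads back to the current model.
    weights = list(branch_probs[depth])
    weights[path[depth]] = 0.0
    new_branch = rng.choices(range(len(weights)), weights=weights)[0]
    # The derivation below this point would then be regrown by following the
    # SLP's branch probabilities (omitted in this sketch).
    return path[:depth] + [new_branch]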

Page 8: 6.899 Relational Data Learning

Sampling Problems

Inefficiency of the previous Metropolis-Hastings sampler: with pb = 0.8, the acceptance ratio is only 4%.

– If pb is small, the samples move slowly, but the acceptance ratio is higher

– If pb is large, the samples make large moves, but the acceptance ratio is lower

A fixed pb must strike a balance between local jumps to neighboring models and big jumps to distant ones.

An improvement: a cyclic transition kernel with pb = 1 - 2^(-n) for n = 1, ..., 28.
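
As a quick illustration of that schedule (assuming the formula above), the cyclic kernel sweeps pb over 28 values from 0.5 up to nearly 1:

# Cyclic backtrack-probability schedule: pb = 1 - 2^(-n), n = 1, ..., 28.
pb_schedule = [1 - 2 ** (-n) for n in range(1, 29)]
print(pb_schedule[:3], pb_schedule[-1])   # [0.5, 0.75, 0.875] ... ~1.0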

Page 9: 6.899 Relational Data Learning

Adaptive Sampling Strategy: Re-Try the Proposals

Suppose a proposal T1, drawn from the proposal distribution q1(T0, T1), is tried and rejected. The rejection suggests that this proposal distribution may not be good, so a different proposal can be tried: a new sample T2 is drawn from a new proposal q2(T0, T1, T2).

But how do we obtain a valid Markov sampling chain?

Page 10: 6.899 Relational Data Learning

Adaptive Sampling Strategy: New Acceptance Ratio

If we use the following acceptance ratio:

\alpha_2(T_0, T_1, T_2) = \min\left\{ \frac{p(Y \mid X, T_2)\, p(T_2)\, q_1(T_2, T_1)\, \bigl(1 - \alpha_1(T_2, T_1)\bigr)\, q_2(T_2, T_1, T_0)}{p(Y \mid X, T_0)\, p(T_0)\, q_1(T_0, T_1)\, \bigl(1 - \alpha_1(T_0, T_1)\bigr)\, q_2(T_0, T_1, T_2)},\; 1 \right\}

then we have a valid MCMC sampler for the target distribution, that is, the posterior over BN structures. Here T_0 is the current structure, T_1 the rejected first proposal, T_2 the second proposal, and \alpha_1(\cdot, \cdot) is the single-stage Metropolis-Hastings acceptance ratio from Page 5 evaluated under q_1.
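
A generic Python sketch of one step of this re-try scheme (delayed rejection) is shown below; post is the unnormalized posterior p(Y|X,T) p(T), and q1, q2 with their samplers sample_q1, sample_q2 are placeholders for the two proposal distributions.

import random

def alpha1(post, q1, a, b):
    """First-stage Metropolis-Hastings acceptance ratio for a move a -> b."""
    return min(1.0, (post(b) * q1(b, a)) / (post(a) * q1(a, b)))

def delayed_rejection_step(t0, post, q1, sample_q1, q2, sample_q2, rng=random):
    """One step of the re-try sampler for the acceptance ratio above."""
    # First stage: ordinary Metropolis-Hastings with proposal q1.
    t1 = sample_q1(t0)
    if rng.random() < alpha1(post, q1, t0, t1):
        return t1
    # Second stage: the first proposal was rejected, so re-try with q2.
    t2 = sample_q2(t0, t1)
    num = post(t2) * q1(t2, t1) * (1 - alpha1(post, q1, t2, t1)) * q2(t2, t1, t0)
    den = post(t0) * q1(t0, t1) * (1 - alpha1(post, q1, t0, t1)) * q2(t0, t1, t2)
    if rng.random() < min(1.0, num / den):
        return t2
    return t0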

Page 11: 6.899 Relational Data Learning

Part 1: Conclusion

To adaptively sample BN structures, we can start with a large backtrack probability pb; whenever a proposal is rejected, we reduce pb and draw a new structure using the smaller backtrack probability. This process can be repeated.

The adaptive proposal distribution allows the SLP sampler to locally tune its parameter to achieve a good balance between local jumps to neighboring models and big jumps to distant ones. Therefore, we expect much more efficient sampling.

Page 12: 6.899 Relational Data Learning

Part 2: Text Classification Using Probabilistic Relational Models (PRM)

Why use PRMs?
SLP: discrete random variables
PRM: discrete and continuous random variables

Why relational modeling of text?
Author relations
Citation relations

Page 13: 6.899 Relational Data Learning

Modeling Relational Text Data

Figure 1. PRM modeling of text, by Taskar, Segal, and Koller

Unrolled Bayesian Network

Page 14: 6.899 Relational Data Learning

Transduction: Training and Testing Together

The test data are also included in the model

Transductive training: the EM algorithm
E step: belief propagation
M step: maximum likelihood re-estimation
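
A skeleton of this transductive EM loop, with the PRM-specific routines left as placeholder arguments (init_params, e_step, and m_step are hypothetical), might look like:

def transductive_em(init_params, e_step, m_step, labeled_docs, unlabeled_docs, n_iters=20):
    """Skeleton of the transductive EM loop described above.

    init_params: returns initial CPD parameters from the labeled documents.
    e_step:      belief propagation over the unrolled BN; returns posterior
                 marginals for the hidden topic variables of the unlabeled docs.
    m_step:      maximum-likelihood re-estimation of the parameters from the
                 labeled documents plus the E-step soft labels.
    """
    params = init_params(labeled_docs)
    for _ in range(n_iters):
        # E step: infer the hidden topics of the test (unlabeled) documents.
        marginals = e_step(params, labeled_docs, unlabeled_docs)
        # M step: re-estimate the parameters using labels plus soft labels.
        params = m_step(labeled_docs, unlabeled_docs, marginals)
    return params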

Page 15: 6.899 Relational Data Learning

Several Problems with the Modeling in Figure 1

Naïve Bayes (Independence) assumption on generating words

Wrong edge direction between words and topic nodes

Wrong edge direction between a paper and its citations.

Page 16: 6.899 Relational Data Learning

Drawbacks of EM Training and Transduction

High-dimensional data, relatively limited training points

Transduction helps training, but is very expensive at test time, since the whole model must be retrained for each new data point.

Page 17: 6.899 Relational Data Learning

New Modeling and Bayesian Training

The new node, h, models a classifier that takes as input the document's words, its aggregated citations, and its aggregated authors.
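
As a small sketch of what the input to h could look like, assuming each document carries a word-count vector and vector representations of its citations and authors (the averaging aggregation and the function name are illustrative choices, not the slides'):

import numpy as np

def classifier_input(word_vec, citation_vecs, author_vecs):
    """Input to the classifier node h: the document's own word vector
    concatenated with aggregated citation and author representations.
    Averaging is one simple aggregation; the lists are assumed non-empty."""
    agg_citations = np.mean(np.asarray(citation_vecs), axis=0)
    agg_authors = np.mean(np.asarray(author_vecs), axis=0)
    return np.concatenate([word_vec, agg_citations, agg_authors])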

Page 18: 6.899 Relational Data Learning

Training the new PRM

Unrolling this new PRM, we get a Bayesian network modeling the text data.

Training: expectation propagation, an extension of belief propagation.

We can also easily incorporate the kernel trick, as in SVMs or Gaussian processes, into the classifier h. Note that h models the conditional relation between the text class and the words, citations, and authors.
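
As a rough stand-in for such a kernelized classifier (the slides propose a Bayes Point Machine trained with expectation propagation; the sketch below uses plain kernel logistic regression instead), one could train dual coefficients over the aggregated features:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def train_kernel_classifier(X, y, gamma=1.0, lam=0.1, lr=0.1, n_iters=500):
    """Kernel logistic regression as a simple, non-Bayesian stand-in for h.

    X: (n, d) aggregated features (words + citations + authors), y in {0, 1}.
    Returns the dual coefficients alpha."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))      # predicted class probabilities
        grad = K @ (p - y + lam * alpha)          # gradient of NLL + (lam/2) a^T K a
        alpha -= lr * grad / len(y)
    return alpha

def predict(alpha, X_train, X_new, gamma=1.0):
    """Predicted probability of the positive class for new documents."""
    return 1.0 / (1.0 + np.exp(-rbf_kernel(X_new, X_train, gamma) @ alpha))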

Page 19: 6.899 Relational Data Learning

Part 2: Conclusion

Benefits of the new approach:

No overfitting, unlike ML (maximum likelihood) approaches

The choice of whether or not to use transduction

A much more powerful classifier: a Bayes Point Machine with kernel expansion, compared to the Naïve Bayes method

Better relational modeling