

Bayesian Networks
4th December 2009

Presented by Kwak, Nam-ju

The slides are based on "Data Mining: Practical Machine Learning Tools and Techniques", 2nd ed., written by Ian H. Witten & Eibe Frank. Images and materials are from the official lecture slides of the book.

Table of Contents
• Probability Estimate vs. Prediction
• What is a Bayesian Network?
• A Simple Example
• A Complex One
• Why does it work?
• Learning Bayesian Networks
• Overfitting
• Searching for a Good Network Structure
• K2 Algorithm
• Other Algorithms
• Conditional Likelihood
• Data Structures for Fast Learning

Probability Estimate vs. Prediction

• Naïve Bayes classifier, logistic regression models: probability estimates

• For each class, they estimate the probability that a given instance belongs to that class.

Probability Estimate vs. Prediction

• Why are probability estimates useful?
– They allow predictions to be ranked.
– Classification learning can be treated as the task of learning class probability estimates from the data.

• What is being estimated:
– The conditional probability distribution of the values of the class attribute given the values of the other attributes.

Probability Estimate vs. Prediction

• In this way, Naïve Bayes classifiers, logistic regression models and decision trees are ways of representing a conditional probability distribution.

What is a Bayesian Network?
• A theoretically well-founded way of representing probability distributions concisely and comprehensively in a graphical manner.
• They are drawn as a network of nodes, one for each attribute, connected by directed edges in such a way that there are no cycles.
– A directed acyclic graph (DAG)

A Simple Example

Pr[outlook=rainy | play=no]

Each such conditional distribution sums to 1.
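The conditional probability tables for this slide are in the omitted figure. As a stand-in, the minimal sketch below represents one node's table as a nested dict (the values are invented, not the book's weather-data figures) and checks the property just mentioned: each conditional distribution, such as Pr[outlook | play=no], sums to 1.

# Hypothetical conditional probability table for the "outlook" node
# given its parent "play" (illustrative values only).
cpt_outlook = {
    "yes": {"sunny": 0.25, "overcast": 0.40, "rainy": 0.35},
    "no":  {"sunny": 0.50, "overcast": 0.10, "rainy": 0.40},
}

# Each row is a conditional distribution, so it must sum to 1.
for parent_value, row in cpt_outlook.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
    print(f"Pr[outlook=rainy | play={parent_value}] = {row['rainy']}")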

A Complex One
• When outlook=rainy, temperature=cool, humidity=high, and windy=true…

• Let’s call E the situation given above.

A Complex One
• E: rainy, cool, high, and true
• Pr[play=no, E] = 0.0025
• Pr[play=yes, E] = 0.0077

Multiply all those!!
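The actual table entries are in the omitted slide images, so the sketch below uses made-up factor values and only illustrates the mechanics: for each value of the class, look up one entry per node for the instance E, multiply them all together, and then normalize so the class probabilities sum to 1 (the "summed up into 1" step shown two slides later).

# One looked-up CPT entry per node for E = (rainy, cool, high, true),
# for each value of the class. Numbers are illustrative only.
factors = {
    "no":  [0.357, 0.43, 0.38, 0.50, 0.43],
    "yes": [0.643, 0.25, 0.33, 0.35, 0.40],
}

joint = {}
for cls, fs in factors.items():
    p = 1.0
    for f in fs:
        p *= f              # "Multiply all those"
    joint[cls] = p          # Pr[play=cls, E]

# Normalizing the joint probabilities gives Pr[play=cls | E].
total = sum(joint.values())
posterior = {cls: p / total for cls, p in joint.items()}
print(joint)
print(posterior)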

An additional example of the calculation

A Complex One
• E: rainy, cool, high, and true
• Pr[play=no, E] = 0.0025
• Pr[play=yes, E] = 0.0077

A Complex One

Normalized so that they sum to 1: Pr[play=no | E] ≈ 0.245, Pr[play=yes | E] ≈ 0.755

Why does it work?
• Terminology
– T: all the nodes, P: parents, D: descendants
– Non-descendants: T - D

Why does it work?
• Assumption (conditional independence):
– Pr[node | parents plus any other set of non-descendants] = Pr[node | parents]

• Chain rule

• The nodes are ordered so that all ancestors of a node a_i have indices smaller than i. This is possible because the network is acyclic.
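The derivation itself is on an omitted formula slide; the following is a reconstruction from the surrounding text, written out in LaTeX:

% Chain rule over the ordering a_1, ..., a_n (ancestors before descendants):
\Pr[a_1, \dots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_{i-1}, \dots, a_1]
% The ordering guarantees that {a_1, ..., a_{i-1}} contains only parents and
% other non-descendants of a_i, so the conditional-independence assumption gives
\Pr[a_i \mid a_{i-1}, \dots, a_1] = \Pr[a_i \mid \mathrm{parents}(a_i)]
% Hence the joint distribution factorizes into the network's tables:
\Pr[a_1, \dots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid \mathrm{parents}(a_i)]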

Why does it work?

Ok, that’s what I’m talking about!!!

Learning Bayesian Networks
• Basic components of algorithms for learning Bayesian networks:
– Methods for evaluating the goodness of a given network
– Methods for searching through the space of possible networks

Learning Bayesian Networks
• Methods for evaluating the goodness of a given network
– Calculate the probability that the network accords to each instance and multiply these probabilities together.
– Alternatively, use the sum of their logarithms (see the sketch below).
• Methods for searching through the space of possible networks
– Search through the space of possible sets of edges.
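A minimal sketch of the scoring step, assuming a prob_of(instance) function (a placeholder, not defined here) that returns the probability the network accords to an instance:

import math

def log_likelihood(instances, prob_of):
    # Sum of log-probabilities of the training instances under a fixed network.
    # prob_of(instance) is assumed to return Pr[instance] as the product of the
    # network's conditional-probability-table entries for that instance.
    # Summing logs is equivalent to multiplying the probabilities but avoids underflow.
    return sum(math.log(prob_of(x)) for x in instances)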

Overfitting
• While maximizing the log-likelihood on the training data, the resulting network may overfit. What are the solutions?
– Cross-validation: split the data into training instances and validation instances (similar to 'early stopping' when training neural networks)
– Penalize the complexity of the network
– Assign a prior distribution over network structures and find the most likely network by combining that prior with the probability accorded to the network by the data

Overfitting
• Penalty for the complexity of the network
– Based on the total # of independent estimates in all the probability tables, which is called the # of parameters.

Overfitting
• Penalty for the complexity of the network
– K: the # of parameters
– LL: log-likelihood
– N: the # of instances in the training data
– AIC (Akaike Information Criterion) score = -LL + K
– MDL (Minimum Description Length) score = -LL + (K/2) log N
– Both scores are to be minimized.
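A small sketch transcribing the two formulas above (lower is better for both):

import math

def aic_score(log_lik, k):
    # AIC = -LL + K, where K is the number of parameters.
    return -log_lik + k

def mdl_score(log_lik, k, n):
    # MDL = -LL + (K/2) * log N, where N is the number of training instances.
    return -log_lik + (k / 2.0) * math.log(n)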

Overfitting
• Assign a prior distribution over network structures and find the most likely network by combining its prior probability with the probability accorded to the network by the data.

Searching for a Good Network Structure
• The probability of a single instance is the product of all the individual probabilities from the various conditional probability tables.
• The product can be rewritten to group together all factors relating to the same table.
• The log-likelihood can be grouped in the same way.

Searching for a Good Network Structure
• Therefore the log-likelihood can be optimized separately for each node.
• This is done by adding or removing edges from other nodes to the node being optimized (without creating cycles).

Which one is the best?

Searching for a Good Network Structure

• AIC and MDL can be dealt with in a similar way since they can be split into several components, one for each node.

K2 Algorithm
• Starts with a given ordering of the nodes (attributes)
• Processes each node in turn
• Greedily tries adding edges from previously processed nodes to the current node
• Moves to the next node when the current node can't be optimized further (a sketch follows below)

Result depends on the initial order
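A rough sketch of the greedy loop described above; score(node, parents) stands in for the node's local component of the log-likelihood (or of AIC/MDL), and details such as a cap on the number of parents are omitted.

def k2(nodes, score):
    # nodes: the attributes in the given order. Node i may only receive edges
    # from nodes[0:i], so no cycles can be created.
    # score(node, parents): goodness of that node's conditional probability table.
    parents = {n: set() for n in nodes}
    for i, node in enumerate(nodes):
        best = score(node, parents[node])
        improved = True
        while improved:
            improved = False
            # Greedily look for the single best edge to add from an earlier node.
            for candidate in nodes[:i]:
                if candidate in parents[node]:
                    continue
                s = score(node, parents[node] | {candidate})
                if s > best:
                    best, chosen, improved = s, candidate, True
            if improved:
                parents[node].add(chosen)
        # Move on to the next node once this one can't be improved further.
    return parents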

K2 Algorithm
• Some tricks
– Use a Naïve Bayes classifier as a starting point.
– Ensure that every node is in the Markov blanket of the class node. (Markov blanket: parents, children, and children's parents)
(Figures: Naïve Bayes classifier; Markov blanket)

Pictures from Wikipedia and http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

Other Algorithms
• Extended K2: sophisticated but slow
– Do not order the nodes.
– Greedily add or delete edges between arbitrary pairs of nodes.

• Tree Augmented Naïve Bayes (TAN)

Other Algorithms
• Tree Augmented Naïve Bayes (TAN)
– Augments a Naïve Bayes classifier with a tree over the attributes.
– When the class node and its outgoing edges are eliminated, the remaining edges should form a tree.
(Figures: Naïve Bayes classifier; Tree)

Pictures from http://www.usenix.org/events/osdi04/tech/full_papers/cohen/cohen_html/index.html

Other Algorithms
• Tree Augmented Naïve Bayes (TAN)
– A maximum-weight spanning tree over the attributes is the key to maximizing the likelihood (a sketch follows below).
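The slide's remark refers to the usual TAN construction: build a maximum-weight spanning tree over the attributes, with edge weights given by the conditional mutual information of the two attributes given the class. A compact Prim-style sketch, where mutual_info is assumed to be supplied rather than computed here:

def tan_tree(attributes, mutual_info):
    # Maximum-weight spanning tree over the attributes (Prim-style).
    # mutual_info(a, b): conditional mutual information of a and b given the
    # class; assumed to be precomputed elsewhere.
    in_tree = {attributes[0]}
    edges = []
    while len(in_tree) < len(attributes):
        a, b = max(
            ((u, v) for u in in_tree for v in attributes if v not in in_tree),
            key=lambda e: mutual_info(*e),
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges  # undirected tree edges; TAN then directs them away from a chosen root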

Conditional Likelihood
• What we actually need to know is the conditional likelihood, which is the conditional probability of the class given the other attributes.
• However, what we've tried to maximize is, in fact, just the likelihood.

(Formula slide omitted: O marks the conditional likelihood we want; X marks the plain likelihood we maximized.)

Conditional Likelihood
• Computing the conditional likelihood for a given network and dataset is straightforward (a sketch follows below).
• Maximizing it is what logistic regression does.
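A minimal sketch of that computation, assuming a joint_prob(cls, attrs) function that returns Pr[class, attributes] under the network, as in the earlier examples:

def conditional_likelihood(instances, classes, joint_prob):
    # Product over instances of Pr[true class | attributes].
    # joint_prob(cls, attrs) is assumed to return Pr[cls, attrs] under the network;
    # dividing by the sum over all classes turns it into a conditional probability.
    cl = 1.0
    for attrs, true_cls in instances:
        total = sum(joint_prob(c, attrs) for c in classes)
        cl *= joint_prob(true_cls, attrs) / total
    return cl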

Data Structures for Fast Learning

• Learning Bayesian networks involves a lot of counting.

• For each network structure considered in the search, the data must be scanned to compute the conditional probability tables. (Since the conditioning set of a node's table changes frequently during the search, the data would otherwise have to be rescanned many times to obtain the new conditional probabilities.)

Data Structures for Fast Learning

• Use a general hash table.
– Assume there are 5 attributes, 2 with 3 values and 3 with 2 values.
– Each attribute contributes its number of values plus one slot for "value unspecified" (i.e. null), so there are 4*4*3*3*3 = 432 possible categories.
– This can cause memory problems.

Data Structures for Fast Learning

• AD (all-dimensions) tree

– Using a general hash table, there will be 3*3*3 = 27 categories, even though only 8 categories are actually used.

Data Structures for Fast Learning

• AD (all-dimensions) tree

Only 8 categories are required, compared to 27.

Data Structures for Fast Learning

• AD (all-dimensions) tree: construction
– Assume each attribute in the data has been assigned an index.
– Then expand the node for attribute i with the values of all attributes j > i.
– Two important restrictions:
• The most populous expansion for each attribute is omitted (breaking ties arbitrarily).
• Expansions with counts that are zero are also omitted.
– The root node is given index zero.

Data Structures for Fast Learning

• AD (all-dimensions) tree

Data Structures for Fast Learning

• AD (all-dimensions) tree
Q. # of (humidity=normal, windy=true, play=no)?

Data Structures for Fast Learning

• AD (all-dimensions) tree
Q. # of (humidity=normal, windy=false, play=no)?

?

Data Structures for Fast Learning

• AD (all-dimensions) tree
Q. # of (humidity=normal, windy=false, play=no)?

#(humidity=normal, play=no) - #(humidity=normal, windy=true, play=no) = 1 - 1 = 0
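The subtraction works because the windy=false expansion (presumably the most populous value of windy, and therefore omitted when the tree was built) has no stored node; its count is recovered from the parent node's count. A tiny sketch with the slide's numbers:

# Counts actually stored in the AD tree for this query:
count_normal_no = 1          # #(humidity=normal, play=no)
count_normal_true_no = 1     # #(humidity=normal, windy=true, play=no)

# The windy=false expansion was omitted, so recover its count by subtraction:
count_normal_false_no = count_normal_no - count_normal_true_no
print(count_normal_false_no)  # 0, matching the slide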

Data Structures for Fast Learning

• AD trees only pay off if the data contains many thousands of instances.

Questions and Answers
• Any questions?

Pictures from http://news.ninemsn.com.au/article.aspx?id=805150
