
Learning with Bayesian Networks

Author: David Heckerman

Presented by: Yan Zhang (2006), Jeremy Gould (2013)

Outline
- Bayesian Approach
  - Bayesian vs. classical probability methods
  - Examples
- Bayesian Network
  - Structure
  - Inference
  - Learning Probabilities
  - Learning the Network Structure
  - Two-coin-toss example
- Conclusions
- Exam Questions

A BN is a graphical model for probabilistic relationships among a set of variables. It has become a popular representation for encoding uncertain knowledge in many applications, including medical diagnosis systems, gene regulatory networks, etc. The paper I am going to present today is a very good entry-level tutorial, which introduces many topics in BNs. In my talk I will focus on the fundamental topics outlined here. First, I will review the Bayesian approach, compare the Bayesian method with the classical probability method, and give an example as an illustration of the Bayesian method. After that, I will spend most of the time on several topics in BNs: their structure, inference in a BN, and how to learn parameters and structures. In the end I will provide a summary and discuss our exam questions.

Bayesian vs. the Classical Approach
The Bayesian probability of an event x represents a person's degree of belief or confidence in that event's occurrence, based on prior and observed facts.

Classical probability refers to the true or actual probability of the event and is not concerned with observed behavior.

If we compare the Bayesian method with the classical method, the difference shows up in a simple question. I have a coin here: if I toss it, what is the probability of the outcome being a head? Bayesian statisticians and classical statisticians reason about this question differently. Classical statisticians cannot answer it before performing precise measurements of the coin's physical properties; one thing that is certain for them is that this probability is a fixed number between 0 and 1, and no matter what observations we make, it will not change. Bayesian statisticians, however, will answer the question according to their personal belief.

Example: Is this Man a Martian Spy?

(Notes are for wiiiimps!)

Example
We start with two concepts:
- Hypothesis (H): He either is or is not a Martian spy.
- Data (D): Some set of information about the subject. Perhaps financial data, phone records, maybe we bugged his office.

Example
Frequentist says: Given a hypothesis (he IS a Martian) there is a probability P of seeing this data:

P( D | H )

(Considers absolute ground truth; the uncertainty/noise is in the data.)

Bayesian says: Given this data there is a probability P of this hypothesis being true:

P( H | D )

(This probability indicates our level of belief in the hypothesis.)

This is why a frequentist can't really say "There is a 70% chance of rain tomorrow." From his perspective it either will or will not rain. The Bayesian, on the other hand, is free to say "I am 70% confident that it will rain tomorrow."

Bayesian vs. the Classical Approach
The Bayesian approach restricts its prediction to the next (N+1) occurrence of an event, given the observed previous (N) events.

The classical approach is to predict the likelihood of any given event regardless of the number of occurrences.

NOTE: The Bayesian approach can be updated as new data is observed.

When a set of data is collected, say I have tossed this coin 100 times and got 65 heads and 35 tails, classical statisticians will still predict the likelihood of any given outcome the same way: no matter how many heads and tails I have seen in the previous experiment, the probability of the next toss being a head is always θ. Bayesian statisticians, on the other hand, keep updating their belief using the observed data, so the probability of the next toss being a head can change.
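A minimal sketch of that updating in Python (jumping ahead to the Beta(5, 5) prior introduced on the coin-toss slides below; the toss sequence is made up for illustration):

    # Sequential Bayesian updating: after each toss, the predictive probability
    # of heads is the posterior mean alpha / (alpha + beta) of the Beta belief.
    alpha, beta = 5, 5                          # Beta(5, 5) prior belief
    for toss in "HHTHHHTHHT":                   # hypothetical stream of observations
        if toss == "H":
            alpha += 1
        else:
            beta += 1
        print(toss, alpha / (alpha + beta))     # belief changes as data arrives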

Bayes' Theorem

p(A | B) = p(B | A) p(A) / p(B)

where

p(B) = Σ_A p(B | A) p(A)

For the continuous case, imagine an infinite number of infinitesimally small partitions (i.e. replace the sum with an integral over A).

I think all of you have already seen Bayes' theorem before. It states that the probability of hypothesis A given the observed evidence B can be calculated from this equation. If A is a continuous variable, p(B) can be calculated by integrating over A; if A is a discrete variable, p(B) can be calculated by summing over all possible values of A. Instead of A and B, we sometimes use other notations that are more meaningful in different contexts.
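A minimal sketch of the discrete form in Python (the spy prior and likelihood numbers below are invented purely for illustration):

    # Discrete Bayes' theorem: p(A | B) = p(B | A) p(A) / p(B),
    # where p(B) is obtained by summing p(B | A) p(A) over all values of A.
    def posterior(prior, likelihood):
        evidence = sum(likelihood[a] * prior[a] for a in prior)          # p(B)
        return {a: likelihood[a] * prior[a] / evidence for a in prior}   # p(A | B)

    # Hypothetical numbers for the Martian-spy example: prior belief in each
    # hypothesis, and how likely the observed data D would be under each one.
    prior = {"spy": 0.01, "not spy": 0.99}
    likelihood = {"spy": 0.80, "not spy": 0.05}
    print(posterior(prior, likelihood))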

Deprecated (see previous slide versions): For example, in this equation θ is a parameter, or a vector of parameters, of some pdf from which D is generated; S^h represents some network structure of a Bayesian network. We will come back to these equations later when I give examples. Specifically, the evidence B represents the data items we collected, A is some hypothesis we want to update given the observed data, p(B|A) is the likelihood function of B, p(A) is the prior probability of A, and p(B) is the probability of the data.

Example: Coin Toss
I want to toss a coin n = 100 times. Let's denote the random variable X as the outcome of one flip:

p(X = head) = θ,  p(X = tail) = 1 − θ

Before doing this experiment we have some belief in our mind: the prior probability p(θ). Let's assume that this belief has a Beta distribution (a common assumption):

p(θ) = Beta(θ | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)

Sample Beta distributions (figure: Beta densities for several settings of α and β).

In the following example, I will show you how the Bayesian method works in this coin toss problem.

α = β = 5 is saying: I believe that if I flip a coin 10 times I will see 5 heads and 5 tails.

Example: Coin Toss
If we assume a 50-50 coin we can use α = β = 5, which gives a prior Beta(θ | 5, 5) with mean α / (α + β) = 0.5. (Hopefully, what you were expecting!)

Note: Γ(n) = (n − 1)!

Example: Coin Toss
Now I can run my experiment. As I go, I can update my beliefs based on the observed heads (h) and tails (t) by applying Bayes' law to the Beta distribution:

p(θ | D, ξ) = p(D | θ, ξ) p(θ | ξ) / p(D | ξ)

p(θ | D, ξ) is the probability of θ given the data D and our state of information ξ.

h & t are the # of observed heads and tails

Notice I'm basically just changing alpha and beta.

Example: Coin Toss
Since we're assuming a Beta distribution this becomes:

p(θ | D, ξ) = Beta(θ | α + h, β + t),

our posterior probability. Supposing that we observed h = 65, t = 35, we would get:

p(θ | D, ξ) = Beta(θ | 5 + 65, 5 + 35) = Beta(θ | 70, 40)
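A quick numerical check of this update, sketched with SciPy (assuming scipy is available; α, β, h, t mirror the slide's numbers):

    from scipy import stats

    alpha, beta = 5, 5                              # prior Beta(5, 5)
    h, t = 65, 35                                   # observed heads and tails

    posterior = stats.beta(alpha + h, beta + t)     # posterior Beta(70, 40)
    print(posterior.mean())                         # ~0.636, updated belief in heads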

Example: Coin Toss

Dashed is prior belief, solid is belief modified by empirical data.

Integration
To find the probability that X_{n+1} = heads, we could also integrate over all possible values of θ to find the average value of θ, which yields:

p(X_{n+1} = heads | D, ξ) = ∫ θ · p(θ | D, ξ) dθ = (α + h) / (α + β + h + t)

This might be necessary if we were working with a distribution with a less obvious expected value.

Remember slide 8?
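A rough numerical sanity check of that integral in Python (assuming numpy and scipy; it just confirms the closed form):

    import numpy as np
    from scipy import stats

    alpha, beta, h, t = 5, 5, 65, 35
    grid = np.linspace(0.0, 1.0, 10001)
    density = stats.beta(alpha + h, beta + t).pdf(grid)    # posterior p(theta | D)

    dx = grid[1] - grid[0]
    p_heads = float(np.sum(grid * density) * dx)           # approximates the integral of theta * p(theta | D)
    print(p_heads, (alpha + h) / (alpha + beta + h + t))   # both ~= 0.636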

This is basically what we did before, except we already know the result of doing this to a Beta distribution.

More than Two Outcomes
In the previous example, we used a Beta distribution to encode the states of the random variable. This was possible because there were only 2 states/outcomes of the variable X. In general, if the observed variable X is discrete, having r possible states {1, ..., r}, the likelihood function is given by:

p(X = x^k | θ, ξ) = θ_k,  k = 1, ..., r

In this general case we can use a Dirichlet distribution instead:

p(θ | ξ) = Dir(θ | α_1, ..., α_r) = [Γ(α) / (Γ(α_1) ··· Γ(α_r))] ∏_k θ_k^(α_k − 1),  where α = α_1 + ... + α_r
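Sketched in Python (assuming numpy; the counts are invented), the Dirichlet posterior update is the same bookkeeping as the Beta case, just with r states:

    import numpy as np

    alpha = np.array([5.0, 5.0, 5.0])      # Dirichlet prior over a 3-state variable
    counts = np.array([20, 70, 10])        # observed counts N_1, ..., N_r (made up)

    posterior = alpha + counts             # Dir(alpha_k + N_k): posterior hyperparameters
    print(posterior / posterior.sum())     # predictive probability of each state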

Just using a different distribution, really. The rest of the math is the same. Could use Gaussian, Gamma, whatever.

Vocabulary Review
- Prior probability, p(θ | ξ): probability of a particular value of θ given no observed data (our previous belief).

- Posterior probability, p(θ | D, ξ): probability of a particular value of θ given that D has been observed (our final value of θ).

- Observed probability or likelihood, p(D | θ, ξ): likelihood of the sequence of coin tosses D being observed given that θ is a particular value.

- p(D | ξ): raw probability of D.

Outline
- Bayesian Approach
  - Bayes' Theorem
  - Bayesian vs. classical probability methods
  - Coin toss example
- Bayesian Network
  - Structure
  - Inference
  - Learning Probabilities
  - Learning the Network Structure
  - Two-coin-toss example
- Conclusions
- Exam Questions

That was a review of the Bayesian approach. Now I am going to introduce our main topic: Bayesian networks.

OK, But So What?
That's great, but this is Data Mining, not Philosophy of Mathematics.

Why should we care about all of this ugly math?


Bayesian Advantages
It turns out that the Bayesian technique permits us to do some very useful things from a mining perspective!

1. We can use the chain rule with Bayesian probabilities:

   p(X_1, ..., X_n) = p(X_1) p(X_2 | X_1) ··· p(X_n | X_1, ..., X_{n−1})

   This isn't something we can easily do with classical probability!

2. As we've already seen, using the Bayesian model permits us to update our beliefs based on new data.

Example Network
To create a Bayesian network we will ultimately need 3 things:
- A set of variables X = {X_1, ..., X_n}
- A network structure
- Conditional probability tables (CPTs)

Note that when we start we may not have any of these things, or a given element may be incomplete!

Let's start with a simple case where we are given all three things: a credit-fraud network designed to determine the probability of credit fraud.

Set of Variables
Each node represents a random variable. (Let's assume discrete for now.)

Network Structure
Each edge represents a conditional dependence between variables.

Conditional Probability Table
Each rule represents the quantification of a conditional dependency.

Since we've been given the network structure we can easily see the conditional dependencies (F = fraud, A = age, S = sex, G = gas, J = jewelry):

P(A | F) = P(A)
P(S | F, A) = P(S)
P(G | F, A, S) = P(G | F)
P(J | F, A, S, G) = P(J | F, A, S)
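A minimal sketch of how these factored CPTs might be stored and multiplied in Python; the probability numbers are invented, only the factorization p(f, a, s, g, j) = p(f) p(a) p(s) p(g | f) p(j | f, a, s) mirrors the network:

    # Hypothetical CPTs for the credit-fraud example (all values made up).
    P_F = {True: 0.001, False: 0.999}                   # fraud
    P_A = {"young": 0.4, "old": 0.6}                    # age
    P_S = {"m": 0.5, "f": 0.5}                          # sex
    P_G_GIVEN_F = {True: 0.20, False: 0.01}             # p(gas purchase | fraud)
    P_J_GIVEN_FAS = {                                   # p(jewelry purchase | fraud, age, sex)
        (True, "young", "m"): 0.05, (True, "young", "f"): 0.05,
        (True, "old", "m"): 0.05,   (True, "old", "f"): 0.05,
        (False, "young", "m"): 0.0001, (False, "young", "f"): 0.0004,
        (False, "old", "m"): 0.0002,   (False, "old", "f"): 0.0002,
    }

    def p_joint(f, a, s, g, j):
        # Multiply each node's CPT entry given its parents.
        pg = P_G_GIVEN_F[f] if g else 1 - P_G_GIVEN_F[f]
        pj = P_J_GIVEN_FAS[(f, a, s)] if j else 1 - P_J_GIVEN_FAS[(f, a, s)]
        return P_F[f] * P_A[a] * P_S[s] * pg * pj

    print(p_joint(True, "young", "m", True, True))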

Note that the absence of an edge indicates conditional independence:

P(A | G) = P(A)

Important note: the presence of a cycle will render one or more of the relationships intractable!

Intractability = chicken and egg. A Bayesian network must be a DAG (directed acyclic graph).

Inference
Now suppose we want to calculate (infer) our confidence level in a hypothesis on the fraud variable f given some knowledge about the other variables. This can be calculated directly via:

p(f | a, s, g, j) = p(f, a, s, g, j) / Σ_f' p(f', a, s, g, j)

(Kind of messy.) The first step is just the basic law of probability.

The mess just gets worse as the number of variables increases.

Inference
Fortunately, we can use the chain rule to simplify:

p(f | a, s, g, j) = p(f) p(g | f) p(j | f, a, s) / Σ_f' p(f') p(g | f') p(j | f', a, s)

This simplification is especially powerful when the network is sparse, which is frequently the case in real-world problems.
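Continuing the p_joint sketch above (same hypothetical CPTs), inference by enumeration over the fraud variable looks like this:

    # p(f | a, s, g, j) = p(f, a, s, g, j) / sum over f' of p(f', a, s, g, j)
    def p_fraud_given(a, s, g, j):
        numer = p_joint(True, a, s, g, j)
        denom = numer + p_joint(False, a, s, g, j)
        return numer / denom

    print(p_fraud_given("young", "m", True, True))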

This shows how we can use a Bayesian network to infer a probability not stored directly in the model. Factor out the P(A)P(S) and cancel (terms not dependent on f can come out of the sum).

Now for the Data Mining!
So far we haven't added much value to the data. So let's take advantage of the Bayesian model's ability to update our beliefs and learn from new data. First we'll rewrite our joint probability distribution in a more compact form:

p(x | θ_s, S^h) = ∏_{i=1..n} p(x_i | pa_i, θ_i, S^h)

The configuration vector pa_i refers to the local node's parents.

Learning Probabilities in a Bayesian Network
First we need to make two assumptions:
- There is no missing data (i.e. the data accurately describes the distribution).
- The parameter vectors are independent (generally a good assumption, at least locally).

Parameter independence. The problem of learning probabilities in a BN can now be stated simply: given a random sample D, compute the posterior distribution of θ_s.

X and Y are independent: P(X, Y) = P(X)P(Y). Independent for any given data set: P(X, Y | D) = P(X | D)P(Y | D).

Learning Probabilities in a Bayesian Network
If these assumptions hold we can express the posterior as:

p(θ_s | D, S^h) = ∏_{i=1..n} ∏_{j=1..q_i} p(θ_ij | D, S^h)

where θ_ij is the parameter vector for node i under the j-th configuration of its parents.
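A toy sketch of this per-node, per-parent-configuration update in Python (one binary node with one binary parent, a Beta(1, 1) prior per parent configuration, and invented complete data):

    import numpy as np

    # One row per parent configuration j; columns are counts for x = 0 and x = 1.
    prior = np.array([[1.0, 1.0],
                      [1.0, 1.0]])

    # Observed (parent, x) pairs -- a made-up complete-data sample.
    data = [(0, 1), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]

    counts = prior.copy()
    for parent, x in data:
        counts[parent, x] += 1                 # each theta_ij is updated independently

    print(counts / counts.sum(axis=1, keepdims=True))   # posterior-mean CPT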

Dealing with Unknowns
Whew! Now we know how to use our network to infer conditional relationships and how to update our network with new data. But what if we aren't given a well-defined network? We could start with missing or incomplete:
- Set of variables
- Conditional relationship data
- Network structure

Unknown Variable Set
Our goal when choosing variables is to:

Organize the domain into variables having mutually exclusive and collectively exhaustive states.

This is a problem shared by all data mining algorithms: what should we measure and why? There is not, and probably cannot be, an algorithmic solution to this problem, as arriving at any solution requires intelligent and creative thought.

Heckerman recommends the use of domain knowledge and some statistical aids, but admits that the problem remains non-trivial.

Unknown Conditional Relationships
This can be easy.

So long as we can generate a plausible initial belief about a conditional relationship, we can simply start with our assumption and let our data refine our model via the mechanism shown in the "Learning Probabilities in a Bayesian Network" slide.

Unknown Conditional Relationships
However, when our ignorance becomes serious enough that we no longer even know what is dependent on what, we segue into the unknown-structure scenario.

Learning the Network Structure
Sometimes the conditional relationships are not obvious. In this case we are uncertain about the network structure: we don't know where the edges should be.

Now, let us consider the problem of learning about both the structure and the probabilities of a BN given data. Assuming we think the structure can be improved, we must be uncertain about the network structure that encodes the physical joint probability distribution for x.

This is a problem so hard it can be used in cryptography!

Learning the Network Structure
Theoretically, we can use a Bayesian approach to get the posterior distribution of the network structure:

p(S^h | D) = p(S^h) p(D | S^h) / p(D)

Unfortunately, the number of possible network structures increases exponentially with n, the number of nodes. We're basically asking ourselves to consider every possible graph with n nodes!

Learning the Network Structure
Heckerman describes two main methods for shortening the search for a network model:
- Model selection: select a good model (i.e. a network structure) from all possible models, and use it as if it were the correct model.

- Selective model averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive.

The math behind both techniques is quite involved, so I'm afraid we'll have to content ourselves with a toy example today.

The tutorial introduces the criteria for model selection; I am not going to get into them here. For those of you who are interested, you can read through them in detail.

Two Coin Toss Example
- Experiment: flip two coins and observe the outcome.
- Propose two network structures: S^h1 or S^h2.
- Assume P(S^h1) = P(S^h2) = 0.5.
- After observing some data, which model is more accurate for this collection of data?

S^h1: X1 and X2 are independent, each with p(H) = p(T) = 0.5.
S^h2: X1 → X2, with p(H) = p(T) = 0.5 for X1 and P(X2 | X1) given by:
  p(H | H) = 0.1, p(T | H) = 0.9
  p(H | T) = 0.9, p(T | T) = 0.1

Here I am going to give you an example to show how to choose a network structure given observed data. I made up this example, which is far simpler than a realistic problem; I just want to give you a feeling for how it works.

This is a score-based assessment.

The assumption is that each model is equally likely.

Two Coin Toss Example
Observed data (trial: X1 X2):
 1: T T
 2: T H
 3: H T
 4: H T
 5: T H
 6: H T
 7: T H
 8: T H
 9: H T
10: H T

Recall that for S^h1, P(X1 | X2) = P(X1) and P(X2 | X1) = P(X2), due to the complete lack of conditional dependence.

Note that each pair of terms corresponds to one data point.
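A rough sketch of the comparison this example is driving at, in Python, using the CPTs from the structure slide and the ten observations above (an illustration of the idea, not the slides' exact algebra):

    data = ["TT", "TH", "HT", "HT", "TH", "HT", "TH", "TH", "HT", "HT"]

    # S^h1: X1 and X2 are independent fair coins.
    p_d_sh1 = 0.5 ** (2 * len(data))

    # S^h2: X1 fair, X2 depends on X1 with p(H|H) = 0.1 and p(H|T) = 0.9.
    p_x2_given_x1 = {("H", "H"): 0.1, ("H", "T"): 0.9,
                     ("T", "H"): 0.9, ("T", "T"): 0.1}    # keys are (X1, X2)
    p_d_sh2 = 1.0
    for x1, x2 in data:
        p_d_sh2 *= 0.5 * p_x2_given_x1[(x1, x2)]

    # With equal structure priors, the posterior odds equal the likelihood ratio.
    print(p_d_sh1, p_d_sh2, p_d_sh2 / p_d_sh1)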

Note that each pair of terms corresponds to 1 Data point43Two Coin Toss Example4444Two Coin Toss Example4545Two Coin Toss Example4646OutlineBayesian ApproachBayes TheromBayesian vs. classical probability methodscoin toss an exampleBayesian NetworkStructureInferenceLearning ProbabilitiesLearning the Network StructureTwo coin toss an exampleConclusionsExam Questions4747ConclusionsBayesian methodBayesian networkStructureInference Learn parameters and structureAdvantages

First, I reviewed Bayes' theorem and compared the Bayesian method with the classical method. I showed a coin toss example to illustrate how to encode personal belief as prior information and how to use observed data to update that belief. Second, I introduced several fundamental topics in BNs: the structure of a BN (variables, structure, CPT), how to make inferences in a BN, how to update the local probabilities in the CPT, and, briefly, how to refine the network structure (model selection and model averaging). I also talked about the advantages of BNs.

Question 1: What is Bayesian probability?
- A person's degree of belief in a certain event.
- Your own degree of certainty that a tossed coin will land heads.
- A degree of confidence in an outcome given some data.

Question 2: Compare the Bayesian and classical approaches to probability (any one point).

Bayesian approach:
+ Reflects an expert's knowledge.
+ The belief keeps updating as new data items arrive.
- Arbitrary (more subjective).
Wants P( H | D )

Classical probability:
+ Objective and unbiased.
- Generally not available: it takes a long time to measure the object's physical characteristics.
Wants P( D | H )

Question 3: Mention at least one advantage of Bayesian analysis.
- Handles incomplete data sets.
- Learning about causal relationships.
- Combines domain knowledge and data.
- Avoids overfitting.

The End
Any Questions?