BAYESIAN NETWORKS
TRANSCRIPT
SIGNIFICANCE OF CONDITIONAL INDEPENDENCE
Consider Grade(CS101), Intelligence, and SAT. Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores, but good students are more likely to get good SAT scores, so the two are not independent. It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence.
BAYESIAN NETWORK
Explicitly represent independence among propositions. Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly.

[Network: Intelligence → Grade, Intelligence → SAT]

P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

P(I):
  high   low
  0.3    0.7

P(G|I):
  I \ G   'A'    'B'    'C'
  low     0.2    0.34   0.46
  high    0.74   0.17   0.09

P(S|I):
  I \ S   low    high
  low     0.95   0.05
  high    0.2    0.8
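To make the factorization concrete, here is a minimal sketch in plain Python (the variable names are ours, not from the slides) that builds the joint P(I,G,S) from the three CPTs above:

```python
# CPT values taken from the tables on this slide.
P_I = {"high": 0.3, "low": 0.7}
P_G_given_I = {
    "low":  {"A": 0.2,  "B": 0.34, "C": 0.46},
    "high": {"A": 0.74, "B": 0.17, "C": 0.09},
}
P_S_given_I = {
    "low":  {"low": 0.95, "high": 0.05},
    "high": {"low": 0.2,  "high": 0.8},
}

def joint(i, g, s):
    """P(I=i, G=g, S=s) = P(G=g|I=i) * P(S=s|I=i) * P(I=i)."""
    return P_G_given_I[i][g] * P_S_given_I[i][s] * P_I[i]

# Sanity check: the factored probabilities sum to 1 over all assignments.
total = sum(joint(i, g, s)
            for i in P_I for g in "ABC" for s in ("low", "high"))
assert abs(total - 1.0) < 1e-9
```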
A MORE COMPLEX BN

[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls. Burglary and Earthquake are causes; JohnCalls and MaryCalls are effects.]

A Bayesian network is a directed acyclic graph. Intuitive meaning of an arc from x to y: "x has direct influence on y".

P(B) = 0.001      P(E) = 0.002

P(A|B,E):
  B  E    P(A|…)
  T  T    0.95
  T  F    0.94
  F  T    0.29
  F  F    0.001

P(J|A):            P(M|A):
  A    P(J|…)        A    P(M|…)
  T    0.90          T    0.70
  F    0.05          F    0.01

Size of the CPT for a node with k parents: 2^k
A MORE COMPLEX BN
10 probabilities (1 + 1 + 4 + 2 + 2), instead of the 31 (= 2^5 − 1) needed for the full joint distribution of the five propositions
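A quick sketch (plain Python, names ours) of where the count of 10 comes from:

```python
# Each Boolean node with k Boolean parents needs 2^k independent
# probabilities (one per parent assignment; the complementary value
# follows by normalization).
num_parents = {"Burglary": 0, "Earthquake": 0, "Alarm": 2,
               "JohnCalls": 1, "MaryCalls": 1}
bn_params = sum(2 ** k for k in num_parents.values())  # 1+1+4+2+2 = 10
joint_params = 2 ** len(num_parents) - 1               # 31 for 5 Booleans
print(bn_params, joint_params)                         # -> 10 31
```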
SIGNIFICANCE OF BAYESIAN NETWORKS
If we know that some variables are conditionally independent, we should be able to decompose the joint distribution to take advantage of it.
Bayesian networks are a way of efficiently factoring the joint distribution into conditional probabilities, and also of building complex joint distributions from smaller models of probabilistic relationships.
But… What knowledge does the BN encode about the distribution? How do we use a BN to compute probabilities of the variables that we are interested in?
WHAT DOES THE BN ENCODE?
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and Earthquake given Alarm or ¬Alarm. For example, John does not observe any burglaries directly.

P(B∧J) ≠ P(B) P(J)
P(B∧J|A) = P(B|A) P(J|A)
WHAT DOES THE BN ENCODE?
The beliefs JohnCalls and MaryCalls are independent given Alarm or ¬Alarm. For instance, the reasons why John and Mary may not call if there is an alarm are unrelated.

P(B∧J|A) = P(B|A) P(J|A)      P(J∧M|A) = P(J|A) P(M|A)
P(B∧J|¬A) = P(B|¬A) P(J|¬A)   P(J∧M|¬A) = P(J|¬A) P(M|¬A)

A node is independent of its non-descendants given its parents.
WHAT DOES THE BN ENCODE?
Burglary and Earthquake are independent. Again, this follows from the rule that a node is independent of its non-descendants given its parents: Burglary and Earthquake have no parents, and neither is a descendant of the other.
LOCALLY STRUCTURED WORLD
A world is locally structured (or sparse) if each of its components interacts directly with relatively few other components.
In a sparse world, the CPTs are small and the BN contains far fewer probabilities than the full joint distribution.
If the # of entries in each CPT is bounded by a constant, i.e., O(1), then the # of probabilities in a BN is linear in n (the # of propositions) instead of 2^n for the joint distribution.
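To see how quickly that gap grows, a tiny illustration with assumed numbers (n = 20 propositions, at most k = 3 parents per node):

```python
# With at most k parents per Boolean node, a BN needs at most n * 2^k
# probabilities, versus 2^n - 1 for the full joint distribution.
n, k = 20, 3
print(n * 2 ** k)   # 160
print(2 ** n - 1)   # 1048575
```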
BUT DOES A BN REPRESENT A BELIEF STATE?
IN OTHER WORDS, CAN WE COMPUTE THE FULL JOINT DISTRIBUTION OF THE PROPOSITIONS FROM IT?
CALCULATION OF JOINT PROBABILITY
P(J∧M∧A∧¬B∧¬E) = ??

P(J∧M∧A∧¬B∧¬E)
= P(J∧M|A,¬B,¬E) P(A∧¬B∧¬E)
= P(J|A,¬B,¬E) P(M|A,¬B,¬E) P(A∧¬B∧¬E)   (J and M are independent given A)

P(J|A,¬B,¬E) = P(J|A)   (J and B, and J and E, are independent given A)
P(M|A,¬B,¬E) = P(M|A)
P(A∧¬B∧¬E) = P(A|¬B,¬E) P(¬B|¬E) P(¬E) = P(A|¬B,¬E) P(¬B) P(¬E)   (B and E are independent)

P(J∧M∧A∧¬B∧¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062
CALCULATION OF JOINT PROBABILITY
In general, for any assignment (x1, x2, …, xn) to the variables:
P(x1∧x2∧…∧xn) = Πi=1,…,n P(xi | parents(Xi))
so the BN determines the full joint distribution table.
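A minimal sketch (plain Python; the dictionary encoding is ours) of this product rule on the burglary network, reproducing the 0.00062 result above:

```python
# Each entry gives P(var = True | parent values), read from the CPTs above.
P_true = {
    "B": lambda a: 0.001,
    "E": lambda a: 0.002,
    "A": lambda a: {(True, True): 0.95, (True, False): 0.94,
                    (False, True): 0.29, (False, False): 0.001}[(a["B"], a["E"])],
    "J": lambda a: 0.90 if a["A"] else 0.05,
    "M": lambda a: 0.70 if a["A"] else 0.01,
}

def joint(assignment):
    """P(assignment) as a product of one CPT entry per variable."""
    p = 1.0
    for var, value in assignment.items():
        pt = P_true[var](assignment)       # P(var = True | parents)
        p *= pt if value else 1.0 - pt
    return p

# P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) ≈ 0.00062, matching the slide.
print(joint({"B": False, "E": False, "A": True, "J": True, "M": True}))
```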
Since a BN defines the full joint distribution of a set of propositions, it represents a belief state.
WHAT DOES THE BN ENCODE?
Burglary ⊥ Earthquake
JohnCalls ⊥ MaryCalls | Alarm
JohnCalls ⊥ Burglary | Alarm
JohnCalls ⊥ Earthquake | Alarm
MaryCalls ⊥ Burglary | Alarm
MaryCalls ⊥ Earthquake | Alarm
A node is independent of its non-descendants, given its parents.
READING OFF INDEPENDENCE RELATIONSHIPS
How about Burglary ⊥ Earthquake | Alarm? No! Why?
P(B∧E|A) = P(A|B,E) P(B∧E) / P(A) = 0.00075
P(B|A) P(E|A) = 0.086
The two do not match, so Burglary and Earthquake are dependent given Alarm.
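A numeric check of this (plain Python, names ours), brute-forcing the B, E, A portion of the joint:

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def p(b, e, a):
    """P(B=b, E=e, A=a) = P(A=a|b,e) * P(b) * P(e)."""
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    return pa * (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)

pA = sum(p(b, e, True) for b, e in product([True, False], repeat=2))
pBE_given_A = p(True, True, True) / pA                         # ≈ 0.00075
pB_given_A = sum(p(True, e, True) for e in [True, False]) / pA
pE_given_A = sum(p(b, True, True) for b in [True, False]) / pA
print(pBE_given_A, pB_given_A * pE_given_A)   # ≈ 0.00075 vs ≈ 0.086
```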
READING OFF INDEPENDENCE RELATIONSHIPS
How about Burglary ⊥ Earthquake | JohnCalls? No! Why? Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.
INDEPENDENCE RELATIONSHIPS
Rough intuition (this holds for tree-like graphs, i.e., polytrees):
Evidence on the (directed) road between two variables makes them independent.
Evidence on an "A" node makes its descendants independent.
Evidence on a "V" node, or below the V, makes the ancestors of the variables dependent (otherwise they are independent).
Formal property in the general case: d-separation ⇒ independence (see R&N).
BENEFITS OF SPARSE MODELS
Modeling: fewer relationships need to be encoded (either through understanding or statistics), and large networks can be built up from smaller ones.
Intuition: dependencies/independencies between variables can be inferred through the network structure.
Tractable probabilistic inference.
PROBABILISTIC INFERENCE
…is the following problem. Given:
A belief state P(X1,…,Xn) in some form (e.g., a Bayes net or a joint probability table)
A query variable indexed by q
Some subset of evidence variables indexed by e1,…,ek
Find: P(Xq | Xe1,…,Xek)
PROBABILISTIC INFERENCE WITHOUT EVIDENCE
For the moment we’ll assume no evidence variables
Find P(Xq)
In a joint probability table, we can find P(Xq) through marginalization. Computational complexity: 2^(n−1) (assuming Boolean random variables, summing out the other n − 1 variables).
In Bayesian networks, we find P(Xq) using top-down inference.
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm).
1. P(A) = Σb,e P(A,b,e)
2. P(A) = Σb,e P(A|b,e) P(b) P(e)
3. P(A) = P(A|B,E) P(B) P(E) + P(A|B,¬E) P(B) P(¬E) + P(A|¬B,E) P(¬B) P(E) + P(A|¬B,¬E) P(¬B) P(¬E)
4. P(A) = 0.95×0.001×0.002 + 0.94×0.001×0.998 + 0.29×0.999×0.002 + 0.001×0.999×0.998 = 0.00252
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls).
1. P(M) = P(M|A) P(A) + P(M|¬A) P(¬A)
2. P(M) = 0.70×0.00252 + 0.01×(1 − 0.00252) ≈ 0.0117
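A sketch of these two computations in plain Python (names ours):

```python
P_B, P_E = 0.001, 0.002
P_A_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}
P_M_given_A = {True: 0.70, False: 0.01}

# Step 1: P(A) = sum over b, e of P(A|b,e) P(b) P(e)
P_A = sum(P_A_given[(b, e)]
          * (P_B if b else 1 - P_B)
          * (P_E if e else 1 - P_E)
          for b in (True, False) for e in (True, False))
print(round(P_A, 5))   # 0.00252

# Step 2: P(M) = P(M|A) P(A) + P(M|¬A) P(¬A)
P_M = P_M_given_A[True] * P_A + P_M_given_A[False] * (1 - P_A)
print(round(P_M, 4))   # 0.0117
```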
TOP-DOWN INFERENCE WITH EVIDENCE
Suppose we want to compute P(Alarm|Earthquake).
1. P(A|e) = Σb P(A,b|e)
2. P(A|e) = Σb P(A|b,e) P(b)   (using B ⊥ E, so P(b|e) = P(b))
3. P(A|e) = 0.95×0.001 + 0.29×0.999 = 0.29066
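And the evidence case as a short sketch (plain Python, names ours):

```python
# P(A | E=true) = sum_b P(A|b,E) P(b), using B ⊥ E so that P(b|E) = P(b).
P_B = 0.001
P_A_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}
P_A_given_E = sum(P_A_given[(b, True)] * (P_B if b else 1 - P_B)
                  for b in (True, False))
print(round(P_A_given_E, 5))   # 0.29066
```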
TOP-DOWN INFERENCE
Only works if:
The graph of ancestors of the query variable is a polytree
Evidence is given on ancestor(s) of the query variable
Efficient: O(d 2^k) time, where d is the number of ancestors of the variable and k is a bound on the # of parents. Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node.
QUERYING THE BN
The BN gives P(T|C). What about P(C|T)?

[Network: Cavity → Toothache]

P(C) = 0.1

P(T|C):
  C    P(T|C)
  T    0.4
  F    0.01111
APPLYING BAYES' RULE
Let A be a cause, B be an effect, and let's say we know P(B|A) and P(A) (conditional probability tables). What's P(B)?
P(B) = Σa P(B,A=a)   [marginalization]
P(B,A=a) = P(B|A=a) P(A=a)   [conditional probability]
So, P(B) = Σa P(B|A=a) P(A=a)
APPLYING BAYES' RULE
What's P(A|B)?
P(A|B) = P(B|A) P(A) / P(B)   [Bayes' rule]
P(B) = Σa P(B|A=a) P(A=a)   [last slide]
So, P(A|B) = P(B|A) P(A) / [Σa P(B|A=a) P(A=a)]
HOW DO WE READ THIS?
P(A|B) = P(B|A) P(A) / [Σa P(B|A=a) P(A=a)]
[An equation that holds for all values A can take on, and all values B can take on]
P(A=a|B=b) = P(B=b|A=a) P(A=a) / [Σa P(B=b|A=a) P(A=a)]
Are these the same a? NO! The summation needs its own index:
P(A=a|B=b) = P(B=b|A=a) P(A=a) / [Σa' P(B=b|A=a') P(A=a')]
Be careful about indices!
QUERYING THE BN
The BN gives P(T|C). What about P(C|T)?
P(Cavity|Toothache) = P(Toothache|Cavity) P(Cavity) / P(Toothache)   [Bayes' rule]
The denominator is computed by summing the numerator over Cavity and ¬Cavity.
Querying a BN is just applying Bayes' rule on a larger scale…
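A sketch of this query in plain Python (names ours); note the summation variable c_prime, which is exactly the index pitfall flagged above:

```python
# CPT values from the Cavity/Toothache slide.
P_C = {True: 0.1, False: 0.9}
P_T_given_C = {True: 0.4, False: 0.01111}

def p_cavity_given_toothache():
    """P(C|T) = P(T|C) P(C) / P(T), with P(T) by summing out Cavity."""
    numerator = P_T_given_C[True] * P_C[True]
    denominator = sum(P_T_given_C[c_prime] * P_C[c_prime]
                      for c_prime in (True, False))   # P(Toothache)
    return numerator / denominator

print(round(p_cavity_given_toothache(), 3))   # ≈ 0.8
```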
NAÏVE BAYES MODELS
P(Cause,Effect1,…,Effectn) = P(Cause) Πi P(Effecti | Cause)

[Network: Cause → Effect1, Effect2, …, Effectn]
NAÏVE BAYES CLASSIFIER
P(Class,Feature1,…,Featuren) = P(Class) Πi P(Featurei | Class)

[Network: Class → Feature1, Feature2, …, Featuren]

P(C|F1,…,Fk) = P(C,F1,…,Fk) / P(F1,…,Fk) = 1/Z P(C) Πi P(Fi|C)
Given the features, what class?
Examples: Spam / Not Spam; English / French / Latin; … with word occurrences as features.
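A minimal naive Bayes classifier sketch following this formula; the CPT numbers below are invented purely for illustration:

```python
def classify(features, P_class, P_feature_given_class):
    """Return P(class | features) via P(C) * prod_i P(F_i|C), normalized by Z."""
    scores = {}
    for c, p_c in P_class.items():
        score = p_c
        for i, f in enumerate(features):
            p_true = P_feature_given_class[c][i]   # P(F_i = True | C = c)
            score *= p_true if f else 1.0 - p_true
        scores[c] = score
    Z = sum(scores.values())                        # normalization constant
    return {c: s / Z for c, s in scores.items()}

# Hypothetical spam filter over three word-occurrence features.
P_class = {"spam": 0.3, "ham": 0.7}
P_feature_given_class = {"spam": [0.8, 0.6, 0.1], "ham": [0.05, 0.2, 0.3]}
print(classify([True, True, False], P_class, P_feature_given_class))
```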
MORE COMPLEX NETWORKS
Two limitations of the current discussion:
"Tree-like" networks
Evidence at specific locations
Next time:
Tractable exact inference without "loops"
With loops: NP-hard (compare to CSPs)
Approximate inference for general networks
SOME APPLICATIONS OF BN
Medical diagnosis
Troubleshooting of hardware/software systems
Fraud/uncollectible debt detection
Data mining
Analysis of genetic sequences
Data interpretation, computer vision, image understanding