
Page 1: Introduction to Bayesian Networks

Introduction to Bayesian Networks
Instructor: Dan Geiger

Web page: www.cs.technion.ac.il/~dang/courseBN http://webcourse.cs.technion.ac.il/236372

Email: [email protected]
Phone: 829 4339
Office: Taub 616.

What is all the fuss about? A happy marriage between probability theory and graph theory.

The successful offspring: algorithms for fault diagnosis, error-detecting codes, models of complex systems. Applications in a wide range of fields.

A course presented from a rather personal viewpoint on a topic I have researched for many years.

Page 2:

What is it all about?

• How to use graphs to represent probability distributions over thousands of random variables? (A small sketch follows this list.)

• How to encode conditional independence in directed and undirected graphs?

• How to use such representations for efficient computation of the probability of events of interest?

• How to learn such models from data?
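To make the first bullet concrete, here is a minimal Python sketch (my own illustration, not from the course; the variables A, B, C and all numbers are hypothetical) of how a graph's factorization stores a joint distribution as a product of small conditional tables instead of one exponential table:

# P(A, B, C) = P(A) * P(B) * P(C | A, B): three small tables instead of 2**3 joint entries.
p_a = {True: 0.01, False: 0.99}                      # P(A)
p_b = {True: 0.02, False: 0.98}                      # P(B)
p_c_given_ab = {                                     # P(C=True | A, B)
    (True, True): 0.95, (True, False): 0.80,
    (False, True): 0.30, (False, False): 0.01,
}

def joint(a, b, c):
    # P(A=a, B=b, C=c) is recovered on demand from the factorization.
    pc = p_c_given_ab[(a, b)]
    return p_a[a] * p_b[b] * (pc if c else 1.0 - pc)

# The full joint table is never stored; with bounded parent sets the number of
# parameters grows linearly in the number of variables rather than exponentially.
vals = (True, False)
print(sum(joint(a, b, c) for a in vals for b in vals for c in vals))  # ~1.0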

Page 3:

Course Information

Meetings:

Lecture: Mondays 8:30 –10:30 Tutorial: Mondays 12:30 – 13:30

Grade:
50% for 5 question sets. These question sets are obligatory. Each contains mostly theoretical problems. Submit in pairs before the due time (three weeks).
40% for a midterm quiz (bochan) on January 7. (Maybe 20%-20% with a lecture.)
10% for attending at least 12 lectures and recitation classes, required for a **passing grade**. (In very special circumstances, 2% per missing item.)

Prerequisites: Data Structures 1 (cs234218), Algorithms 1 (cs234247), Probability (any course)

Information and handouts: http://webcourse.cs.technion.ac.il/236372 http://www.cs.technion.ac.il/~dang/courseBN/ (Only lecture slides)

Page 4:

Relations to Some Other Courses

Introduction to Artificial Intelligence (cs236501)
Introduction to Machine Learning (cs236756)
Introduction to Neural Networks (cs236950)
Algorithms in Computational Biology (cs236522)
Error-correcting codes
Data mining

Tell me who your friends are and I will tell you who you are.

Page 5:

Mathematical Foundations (4 weeks including students’ lectures, based on Pearl’s Chapter 3 + papers).

1. Properties of Conditional Independence (Soundness and completeness of marginal independence, graphoid axioms and their interpretation as “irrelevance”, incompleteness of conditional independence, no disjunctive axioms possible.)

2. Properties of graph separation (Paz and Pearl 85, Theorem 3), soundness and completeness of saturated independence statements. Undirected Graphs as I-maps of probability distributions. Markov-Blankets, Pairwise independence basis. Representation theorems (Pearl and Paz, from each basis to I-maps). Markov networks, HC representation theorem, Completeness theorem. Markov chains

3. Bayesian Networks, d-separation, soundness, completeness.

4. Chordal graphs as the intersection of BN and Markov networks. Equivalence of their 4 definitions.

Combinatorial Optimization of Exact Inference in Graphical Models (3 weeks including students' lectures).

1. HMMs
2. Exact inference and its combinatorial optimization.
3. Clique tree algorithm. Conditioning.
4. Tree-width. Feedback Vertex Set.

Learning (5 weeks including students' lectures).
1. Introduction to Bayesian statistics
2. Learning Bayesian networks
3. Chow and Liu's algorithm; the TAN model
4. Structural EM
5. Searching for Bayesian networks

Applications (2 weeks including student lectures).

Page 6:

Homeworks

• HMW #1. Read Sections 3.1 & 3.2.1. Answer Questions 3.1 and 3.2, prove Eq. 3.5b, and fully expand/fix the proof of Theorem 2. Submit in pairs no later than 5/11/12 (two weeks from now).
• HMW #2. Read Chapter 3 in full. Answer an additional 5 questions of your choice at the end. Submit in pairs no later than 19/11/12.
• HMW #3. Read Chapter 2 in Pearl's book and answer 6 questions of your choice at the end. Submit in pairs no later than 3/12/12.
• HMW #4. Submit in pairs by 24/12/12.
• HMW #5. Submit in pairs by 14/1/13.

Pearl's book contains all the notations that I happen not to define in these slides, so consult it often; it is also a unique and interesting classic textbook.

Page 7:

The Traditional View of Probability in Text Books

• Probability theory, as presented in textbooks, gives the impression that we need to literally represent a joint distribution explicitly as P(x1,…,xn) over all propositions and their combinations. Such a representation is consistent and exhaustive.

• This representation stands in sharp contrast to human reasoning: it requires exponential computation to obtain marginal probabilities like P(x1) or conditionals like P(x1 | x2) (a quick count follows these bullets).

• Humans judge pairwise conditionals swiftly, while conjunctions are judged hesitantly. Numerical calculations do not reflect simple reasoning tasks.
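To back up the exponential-cost claim with a standard count (not on the slide): for n binary propositions the explicit joint P(x1,…,xn) has 2^n entries, and already a single marginal touches half of them,

P(x1) = Σ over x2,…,xn of P(x1, x2, …, xn),  a sum over 2^(n-1) table entries for each value of x1.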

Page 8:

The Traditional View of Probability in Text Books

[Figure residue: a diagram asking which quantities are given, which are estimated, and which are computed, e.g. P(e | h).]

Think of h as a rare viral disease and of e as a high fever.
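The slide's original diagram is not recoverable here, but the disease/fever framing is the classic setting for Bayes' rule, so a worked instance with hypothetical numbers may help: suppose P(h) = 0.0001, P(e | h) = 0.9 (estimated), and P(e) = 0.01. Then

P(h | e) = P(e | h) P(h) / P(e) = 0.9 * 0.0001 / 0.01 = 0.009,

so even a strongly indicative symptom leaves the rare disease quite unlikely.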

Page 9:

The Qualitative Notion of Dependence

• Marginal independence is defined numerically as P(x,y) = P(x) P(y). The truth of this equation is hard for humans to judge, while judging whether X and Y are dependent is often easy.

• “Burglary within a day” and “nuclear war within five years” are easily judged independent.

• Likewise, people tend to judge …

[Figure: nodes X, Y, Z.]

Page 10:

• The notions of relevance and dependence are far more basic than the numerical values. In a reasoning system, a dependence should be asserted once and should not be sensitive to numerical changes.

• Acquisition of new facts may destroy existing dependencies as well as create new ones.

• Learning a child's age Z destroys the dependency between height X and reading ability Y.

• Learning a symptom Z of a patient creates a dependency between the diseases X and Y that could account for it.

Probability theory provides such a device in principle, via P(X | Y, K) = P(X | K). But can we model the dynamics of dependency changes based on logic alone, without reference to numerical quantities?
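A small numeric sketch of the second phenomenon (my own toy example, not from the slides): two independent binary “diseases” X and Y become dependent once a common “symptom” Z = X or Y is observed.

from itertools import product

def joint(x, y, z):
    # X and Y are independent fair coins; Z is the deterministic symptom Z = X or Y.
    return 0.25 if z == (x or y) else 0.0

# Marginally, P(X=1, Y=1) = 0.25 = P(X=1) * P(Y=1): independence holds.
# Conditioned on Z=1, independence fails ("explaining away"):
pz1 = sum(joint(x, y, 1) for x, y in product((0, 1), repeat=2))   # 0.75
px1 = sum(joint(1, y, 1) for y in (0, 1)) / pz1                   # 2/3
py1 = sum(joint(x, 1, 1) for x in (0, 1)) / pz1                   # 2/3
pxy1 = joint(1, 1, 1) / pz1                                       # 1/3
print(pxy1, px1 * py1)   # 0.333... versus 0.444..., so X and Y are now dependent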

Page 11:

Definition of Marginal Independence

Definition: IP(X,Y) iff for all x ∈ DX and y ∈ DY

Pr(X=x, Y=y) = Pr(X=x) Pr(Y=y)

Comments: Each variable X has a domain DX with values (or states) x ∈ DX. We often abbreviate via P(x, y) = P(x) P(y). When Y is the empty set, the equation reduces to the trivial identity Pr(X=x) = Pr(X=x). Alternative notations to IP(X,Y) include I(X,Y) and X ⊥ Y. The next few slides on properties of marginal independence are based on “Axioms and algorithms for inferences involving probabilistic independence.”
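A direct, brute-force check of the definition above on an explicit joint table (a Python sketch; the table is a hypothetical example, not course material):

from itertools import product

# Hypothetical joint P(X, Y) over binary variables, stored as a table.
P = {(0, 0): 0.30, (0, 1): 0.30, (1, 0): 0.20, (1, 1): 0.20}

def marginally_independent(P, tol=1e-12):
    # True iff P(X=x, Y=y) = P(X=x) * P(Y=y) for all values x, y.
    px = {x: sum(p for (x2, _), p in P.items() if x2 == x) for (x, _) in P}
    py = {y: sum(p for (_, y2), p in P.items() if y2 == y) for (_, y) in P}
    return all(abs(P[(x, y)] - px[x] * py[y]) <= tol for (x, y) in P)

print(marginally_independent(P))   # True: this particular table factorizes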

Page 12:

Properties of Marginal Independence

Trivial Independence: Ip(X, ∅)

Symmetry: Ip(X,Y) ⇒ Ip(Y,X)

Decomposition: Ip(X,YW) ⇒ Ip(X,Y)

Mixing: Ip(X,Y) and Ip(XY,W) ⇒ Ip(X,YW)

Proof (Soundness).

Trivial independence and Symmetry follow from the definition.

Decomposition: Given P(x,y,w) = P(x) P(y,w), simply sum over w on both sides of the given equation.
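Written out in full (just expanding the “sum over w” step above): P(x, y) = Σw P(x, y, w) = Σw P(x) P(y, w) = P(x) Σw P(y, w) = P(x) P(y), which is exactly Ip(X,Y).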

Mixing: Given: P(x,y) = P(x) P(y) and P(x,y,w) = P(x,y) P(w).

Hence, P(x,y,w) = P(x) P(y) P(w) = P(x) P(y,w).

Page 13:

Properties of Marginal Independence

Are there more such independent properties of independence ?

No. There are none !

Horn axioms are of the form σ1 & … & σn ⇒ σ, where each σi and σ stands for an independence statement. We use the symbol Σ for a set of independence statements.

Namely: σ is derivable from Σ via these properties if and only if σ is entailed by Σ (i.e., σ holds in all probability distributions that satisfy Σ).

Put differently: for every set Σ and a statement σ not derivable from Σ, there exists a probability distribution P that satisfies Σ but not σ.

Page 14:

Properties of Marginal Independence

Can we use these properties to infer new independence statements from a set of given independence statements in polynomial time?

YES. The “membership algorithm” and the completeness proof are covered in the recitation class (Paper P2).

Comment. The question “does Σ entail σ?” could in principle be undecidable; it drops to being decidable thanks to a complete set of axioms, and then drops to polynomial time with this claim.

Page 15:

Properties of Marginal Independence

Can we check consistency of a set Σ+ of independence statements plus a set Σ− of negated independence statements?

The membership algorithm from the previous slide applies only when Σ− includes a single negated statement: simply use the algorithm to check that the statement is not entailed by Σ+.

But another property of independence, the existence of an “Armstrong relation”, guarantees that consistency is indeed verified by checking separately (in isolation) that each statement in Σ− is not entailed by Σ+.

Page 16:

Definitions of Conditional Independence

Ip(X,Z,Y) if and only if, whenever P(Y=y, Z=z) > 0,
Pr(X=x | Y=y, Z=z) = Pr(X=x | Z=z).    (3.1)
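A brute-force check of (3.1) on an explicit joint table (a Python sketch; the particular construction of P at the bottom is a hypothetical example):

from itertools import product

def cond_independent(P, tol=1e-12):
    # P maps (x, y, z) to a probability. Returns True iff Ip(X, Z, Y), i.e.
    # Pr(X=x | Y=y, Z=z) = Pr(X=x | Z=z) whenever Pr(Y=y, Z=z) > 0.
    xs = {x for x, _, _ in P}
    ys = {y for _, y, _ in P}
    zs = {z for _, _, z in P}
    for x, y, z in product(xs, ys, zs):
        pyz = sum(P.get((x2, y, z), 0.0) for x2 in xs)
        if pyz > 0:
            pz = sum(P.get((x2, y2, z), 0.0) for x2 in xs for y2 in ys)
            pxz = sum(P.get((x, y2, z), 0.0) for y2 in ys)
            if abs(P.get((x, y, z), 0.0) / pyz - pxz / pz) > tol:
                return False
    return True

# X and Y each copy Z with independent noise, so X and Y are independent given Z.
P = {(x, y, z): 0.5 * (0.9 if x == z else 0.1) * (0.8 if y == z else 0.2)
     for x, y, z in product((0, 1), repeat=3)}
print(cond_independent(P))   # True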

Page 17:

Properties of Conditional Independence

The same properties hold for conditional independence:

Symmetry: I(X,Z,Y) ⇒ I(Y,Z,X)

Decomposition: I(X,Z,YW) ⇒ I(X,Z,Y)

Mixing: I(X,Z,Y) and I(XY,Z,W) ⇒ I(X,Z,YW)

BAD NEWS. Are there more properties of independence ?

Yes, infinitely many independent Horn axioms. No answer to the membership problem, nor to the consistency problem.

Page 18:

Important Properties of Conditional Independence

Recall there are some more notations.

Page 19:

Graphical Interpretation

Page 20:

Use graphs and not pure logic

•Variables represented by nodes and dependencies by edges.

• Common in our language: “threads of thoughts”, “lines of reasoning”, “connected ideas”, “far-fetched arguments”.

•Still, capturing the essence of dependence is not an easy task. When modeling causation, association, and relevance, it is hard to distinguish between direct and indirect neighbors.

•If we just connect “dependent variables” we will get cliques.

Page 21:

Undirected Graphs can represent Independence

Let G be an undirected graph (V,E).

Define IG(X,Z,Y) for disjoint sets of nodes X,Y, and Z if and only if all paths between a node in X and a node in Y pass via a node in Z.

In the textbook, another notation used is <X|Z|Y>G.
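A short sketch of checking IG(X,Z,Y) directly from this definition: delete the nodes of Z and ask whether any path from X to Y survives (the graph, node names, and sets below are hypothetical):

from collections import deque

def separated(adj, X, Z, Y):
    # IG(X, Z, Y): True iff every path between X and Y passes through a node of Z.
    # adj maps each node to the set of its neighbors in the undirected graph G.
    blocked = set(Z)
    frontier = deque(x for x in X if x not in blocked)
    reached = set(frontier)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in blocked and v not in reached:
                reached.add(v)
                frontier.append(v)
    return reached.isdisjoint(Y)

# Chain a - b - c: a and c are separated by {b} but not by the empty set.
adj = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
print(separated(adj, {'a'}, {'b'}, {'c'}), separated(adj, {'a'}, set(), {'c'}))  # True False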

Page 22:

M = { IG(M1,{F1,F2},M2), IG(F1,{M1,M2},F2) + symmetry }

Another semantics: the color of each pixel depends on its neighbors.

Page 23:

Dependency Models – abstraction of Probability distributions

A dependency model M over a finite set of elements U is a rule that assigns truth values to the predicate IM(X,Z,Y) where X,Y, Z are (disjoint) subsets of U.

Page 24:

Definitions:
1. G=(U,E) is an I-map of a model M over U if IG(X,Z,Y) ⇒ IM(X,Z,Y) for all disjoint subsets X, Y, Z of U (a brute-force check of this condition is sketched after this list).

2. G is a D-map of M if IM(X,Z,Y) ⇒ IG(X,Z,Y) for all disjoint subsets X, Y, Z of U.

3. G is a perfect map of M if IG(X,Z,Y) ⇔ IM(X,Z,Y) for all disjoint subsets X, Y, Z of U.

4. M is graph-isomorph if there exists a graph G such that G is a perfect map of M.
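Given oracles for IG and IM, the I-map condition in definition 1 can be checked by brute force over all disjoint triples of subsets (a sketch of mine, exponential and meant only for tiny U; the oracles in the example are hypothetical):

from itertools import combinations

def subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def is_i_map(U, I_G, I_M):
    # True iff I_G(X, Z, Y) implies I_M(X, Z, Y) for all disjoint X, Y, Z of U
    # with X and Y nonempty.
    for X in subsets(U):
        for Y in subsets(U - X):
            for Z in subsets(U - X - Y):
                if X and Y and I_G(X, Z, Y) and not I_M(X, Z, Y):
                    return False
    return True

# Trivial example over U = {a, b}: an edgeless graph separates everything, and a
# model declaring every statement independent agrees, so the graph is an I-map.
U = frozenset({'a', 'b'})
print(is_i_map(U, lambda X, Z, Y: True, lambda X, Z, Y: True))   # True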

Page 25:

Representation of independencies with undirected graphs cannot always be perfect

Strong Union: IG(X,Z,Y) ⇒ IG(X,ZW,Y)

If G is an I-map of P, it can represent IP(X,Z,Y), but then it cannot represent the negation ¬IP(X,ZW,Y).

If G is a D-map of P, it can represent the negation ¬IP(X,ZW,Y), but then it cannot represent IP(X,Z,Y).

This property holds for graph separation but not for conditional independence in probability. (For instance, with independent binary X and Y and Z = X or Y, IP(X,∅,Y) holds but IP(X,{Z},Y) fails, as in the explaining-away example above.)