VC Dimension – cs.cmu.eduguestrin/Class/15781/slides/learningtheory-bns.pdf




©2005-2007 Carlos Guestrin 1

VC Dimension

Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University

October 29th, 2007


What about continuous hypothesis spaces?

Continuous hypothesis space: |H| = ∞ Infinite variance???

As with decision trees, only care about the maximum number of points that can be classified exactly!


How many points can a linear boundary classify exactly? (1-D)


How many points can a linear boundary classify exactly? (2-D)


How many points can a linear boundary classify exactly? (d-D)


PAC bound using VC dimension

Number of training points that can be classified exactly is VC dimension!!! Measures relevant size of hypothesis space, as with decision trees with k leaves.


Shattering a set of points


VC dimension


PAC bound using VC dimension

Number of training points that can be classified exactly is VC dimension!!! Measures relevant size of hypothesis space, as with decision trees with k leaves.

Bound for infinite-dimension hypothesis spaces:
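One common form of the bound (the exact constants vary across textbooks) replaces the ln |H| term of the finite-H bound with a term that grows with VC(H): with probability at least 1 − δ, for all h in H, error_true(h) ≤ error_train(h) + sqrt( (VC(H)·(ln(2m/VC(H)) + 1) + ln(4/δ)) / m ). A minimal sketch of how this gap behaves with the number of samples m:

```python
import math

def vc_generalization_gap(m, vc_dim, delta):
    """One common form of the VC generalization bound (constants vary
    by textbook): with probability >= 1 - delta, for every h in H,
    true_error(h) <= train_error(h) + this gap."""
    return math.sqrt((vc_dim * (math.log(2 * m / vc_dim) + 1)
                      + math.log(4 / delta)) / m)

# The gap shrinks as the number of training points m grows ...
g_small_m = vc_generalization_gap(m=1_000, vc_dim=10, delta=0.05)
g_large_m = vc_generalization_gap(m=100_000, vc_dim=10, delta=0.05)
assert g_large_m < g_small_m
# ... and grows with the VC dimension at fixed m.
assert vc_generalization_gap(1_000, 100, 0.05) > g_small_m
```

As with the finite-H bound, the message is the tradeoff: richer hypothesis spaces (larger VC dimension) need more data to keep the gap small.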


Examples of VC dimension

Linear classifiers: VC(H) = d+1, for d features plus constant term b

Neural networks: VC(H) = #parameters. Local minima means NNs will probably not find best parameters

1-Nearest neighbor?
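To make "VC(H) = d+1" concrete, the sketch below verifies that linear classifiers sign(w·x + b) in 2-D shatter three non-collinear points, i.e., realize all 2^3 = 8 labelings. It uses a plain perceptron, which is guaranteed to converge whenever a labeling is linearly separable; the points and helper names are illustrative, not from the slides.

```python
from itertools import product

def perceptron_fits(points, labels, max_epochs=1000):
    """Try to find (w, b) with sign(w.x + b) matching labels in {-1, +1}.
    The perceptron converges whenever the labeling is linearly separable."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified
                w[0] += y * x1; w[1] += y * x2; b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points: every one of the 8 labelings is linearly
# separable, so linear classifiers in 2-D shatter them (VC >= 3 = d + 1).
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shattered = all(perceptron_fits(pts, lab) for lab in product([-1, 1], repeat=3))
assert shattered
```

The matching upper bound (no 4 points in 2-D can be shattered, e.g., the XOR labeling) is what pins VC(H) down to exactly d + 1.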


Another VC dim. example - What can we shatter? What’s the VC dim. of decision stumps in 2d?


Another VC dim. example - What can’t we shatter? What’s the VC dim. of decision stumps in 2d?


What you need to know

Finite hypothesis space: derive results by counting the number of hypotheses; mistakes on training data

Complexity of the classifier depends on the number of points that can be classified exactly: finite case – decision trees; infinite case – VC dimension

Bias-variance tradeoff in learning theory. Remember: will your algorithm find the best classifier?


Bayesian Networks – Representation

Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University

October 29th, 2007


Handwriting recognition

Character recognition, e.g., kernel SVMs



Webpage classification

Company home page vs Personal home page vs University home page vs …


Handwriting recognition 2


Webpage classification 2


Today – Bayesian networks

One of the most exciting advancements in statistical AI in the last 10-15 years

Generalizes naïve Bayes and logistic regression classifiers

Compact representation for exponentially-large probability distributions

Exploit conditional independencies


Causal structure

Suppose we know the following: The flu causes sinus inflammation. Allergies cause sinus inflammation. Sinus inflammation causes a runny nose. Sinus inflammation causes headaches.

How are these connected?


Possible queries

Flu Allergy

Sinus

Headache Nose

Inference

Most probable explanation

Active data collection


Car starts BN

18 binary attributes

Inference P(BatteryAge|Starts=f)

2^16 terms, why so fast? Not impressed?

HailFinder BN – more than 3^54 = 58149737003040059690390169 terms


Factored joint distribution - Preview

Flu Allergy

Sinus

Headache Nose


Number of parameters

Flu Allergy

Sinus

Headache Nose


Key: Independence assumptions

Flu Allergy

Sinus

Headache Nose

Knowing sinus separates the variables from each other


(Marginal) Independence

Flu and Allergy are (marginally) independent

More Generally:

(2×2 tables over Flu = f/t and Allergy = f/t; the cell probabilities are not shown)


Marginally independent random variables

Sets of variables X, Y. X is independent of Y if P ⊨ (X = x ⊥ Y = y), ∀ x ∈ Val(X), y ∈ Val(Y)

Shorthand – marginal independence: P ⊨ (X ⊥ Y)

Proposition: P satisfies (X ⊥ Y) if and only if P(X,Y) = P(X) P(Y)
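The proposition can be checked numerically on a small joint table. A sketch for the Flu/Allergy example, with made-up probabilities:

```python
# Hypothetical joint P(Flu, Allergy) built as a product of marginals,
# so marginal independence P(F, A) = P(F) P(A) holds by construction.
p_flu = {True: 0.1, False: 0.9}          # illustrative numbers
p_allergy = {True: 0.3, False: 0.7}
joint = {(f, a): p_flu[f] * p_allergy[a] for f in p_flu for a in p_allergy}

def marginally_independent(joint, tol=1e-12):
    """Check P(X=x, Y=y) == P(X=x) P(Y=y) for all x, y."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol for (x, y) in joint)

assert marginally_independent(joint)
# A table that is NOT a product of its marginals fails the check.
dependent = {(True, True): 0.2, (True, False): 0.1,
             (False, True): 0.1, (False, False): 0.6}
assert not marginally_independent(dependent)
```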


Conditional independence

Flu and Headache are not (marginally) independent

Flu and Headache are independent given Sinus infection

More Generally:


Conditionally independent random variables

Sets of variables X, Y, Z. X is independent of Y given Z if P ⊨ (X = x ⊥ Y = y | Z = z), ∀ x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

Shorthand – conditional independence: P ⊨ (X ⊥ Y | Z). For P ⊨ (X ⊥ Y | ∅), write P ⊨ (X ⊥ Y)

Proposition: P satisfies (X ⊥ Y | Z) if and only if P(X,Y|Z) = P(X|Z) P(Y|Z)
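This proposition can also be verified numerically. A sketch for the chain Flu → Sinus → Headache, with made-up CPT numbers: the joint built from the chain satisfies Flu ⊥ Headache | Sinus exactly, yet Flu and Headache are marginally dependent.

```python
from itertools import product

# Illustrative CPTs for the chain Flu -> Sinus -> Headache (numbers made up).
p_flu = {True: 0.1, False: 0.9}
p_sinus = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}  # P(S|F)
p_head = {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}}   # P(H|S)

joint = {(f, s, h): p_flu[f] * p_sinus[f][s] * p_head[s][h]
         for f, s, h in product([True, False], repeat=3)}

def cond_indep_F_H_given_S(joint, tol=1e-12):
    """Check P(F,H | S=s) == P(F | S=s) P(H | S=s) for each value s."""
    for s in [True, False]:
        ps = sum(p for (fv, sv, hv), p in joint.items() if sv == s)
        for f, h in product([True, False], repeat=2):
            p_fh = sum(p for (fv, sv, hv), p in joint.items()
                       if fv == f and sv == s and hv == h) / ps
            p_f = sum(p for (fv, sv, hv), p in joint.items()
                      if fv == f and sv == s) / ps
            p_h = sum(p for (fv, sv, hv), p in joint.items()
                      if sv == s and hv == h) / ps
            if abs(p_fh - p_f * p_h) > tol:
                return False
    return True

assert cond_indep_F_H_given_S(joint)   # Flu and Headache independent given Sinus
# ... but NOT marginally independent: observing Headache changes belief in Flu.
pF = sum(p for (f, _, _), p in joint.items() if f)
pH = sum(p for (_, _, h), p in joint.items() if h)
pFH = sum(p for (f, _, h), p in joint.items() if f and h)
assert abs(pFH - pF * pH) > 1e-6
```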


Properties of independence

Symmetry: (X ⊥ Y | Z) ⇒ (Y ⊥ X | Z)

Decomposition: (X ⊥ Y,W | Z) ⇒ (X ⊥ Y | Z)

Weak union: (X ⊥ Y,W | Z) ⇒ (X ⊥ Y | Z,W)

Contraction: (X ⊥ W | Y,Z) & (X ⊥ Y | Z) ⇒ (X ⊥ Y,W | Z)

Intersection: (X ⊥ Y | W,Z) & (X ⊥ W | Y,Z) ⇒ (X ⊥ Y,W | Z). Only for positive distributions! P(α)>0, ∀α, α≠∅


The independence assumption

Flu Allergy

Sinus

Headache Nose

Local Markov Assumption: A variable X is independent of its non-descendants given its parents


Explaining away

Flu Allergy

Sinus

Headache Nose

Local Markov Assumption: A variable X is independent of its non-descendants given its parents


Naïve Bayes revisited

Local Markov Assumption: A variable X is independent of its non-descendants given its parents


What about probabilities? Conditional probability tables (CPTs)

Flu Allergy

Sinus

Headache Nose


Joint distribution

Flu Allergy

Sinus

Headache Nose

Why can we decompose? Markov Assumption!


The chain rule of probabilities

P(A,B) = P(A)P(B|A)

More generally: P(X1,…,Xn) = P(X1) · P(X2|X1) · … · P(Xn|X1,…,Xn-1)
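Combining the chain rule with the local Markov assumption yields the BN factorization. A sketch for the flu network (CPT numbers made up for the illustration): the factored product is a valid joint over 2^5 outcomes, yet needs only 10 free CPT parameters instead of the 2^5 − 1 = 31 of an unrestricted joint.

```python
from itertools import product

# Illustrative CPTs for {Flu, Allergy} -> Sinus -> {Headache, Nose}
# (all numbers made up for the sketch). Values are P(var = 1 | parents).
pF = {1: 0.1, 0: 0.9}
pA = {1: 0.2, 0: 0.8}
pS = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.6, (0, 0): 0.05}  # P(S=1 | F, A)
pH = {1: 0.8, 0: 0.1}                                        # P(H=1 | S)
pN = {1: 0.75, 0: 0.05}                                      # P(N=1 | S)

def joint(f, a, s, h, n):
    """BN factorization P(F)P(A)P(S|F,A)P(H|S)P(N|S): the chain rule
    with the local Markov assumption dropping non-parent conditions."""
    return (pF[f] * pA[a]
            * (pS[(f, a)] if s else 1 - pS[(f, a)])
            * (pH[s] if h else 1 - pH[s])
            * (pN[s] if n else 1 - pN[s]))

total = sum(joint(*x) for x in product([0, 1], repeat=5))
assert abs(total - 1.0) < 1e-12   # a valid distribution over 2^5 outcomes
# Free parameters: 1 (F) + 1 (A) + 4 (S) + 2 (H) + 2 (N) = 10, vs 31.
```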

(diagram: two-variable example with Flu and Sinus)


Chain rule & Joint distribution

Flu Allergy

Sinus

Headache Nose

Local Markov Assumption: A variable X is independent of its non-descendants given its parents


Two (trivial) special cases

Edgeless graph Fully-connected graph


The Representation Theorem – Joint Distribution to BN

BN: encodes independence assumptions. If the conditional independencies in the BN are a subset of the conditional independencies in P, then we obtain the joint probability distribution P from the BN.


Real Bayesian networks applications

Diagnosis of lymph node disease
Speech recognition
Microsoft Office and Windows: http://www.research.microsoft.com/research/dtg/
Study of the human genome
Robot mapping
Robots to identify meteorites to study
Modeling fMRI data
Anomaly detection
Fault diagnosis
Modeling sensor network data


A general Bayes net

Set of random variables

Directed acyclic graph: encodes independence assumptions

CPTs

Joint distribution:


How many parameters in a BN?

Discrete variables X1, …, Xn

Graph: defines parents of Xi, PaXi

CPTs – P(Xi| PaXi)
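The count works out mechanically: each Xi contributes (|Val(Xi)| − 1) free parameters for every joint assignment of its parents. A sketch, using the flu network as the example:

```python
def bn_param_count(cards, parents):
    """Number of free parameters in the CPTs of a BN: for each Xi,
    (|Val(Xi)| - 1) * product of its parents' cardinalities."""
    total = 0
    for x, k in cards.items():
        rows = 1
        for p in parents.get(x, []):
            rows *= cards[p]
        total += (k - 1) * rows
    return total

# The flu network, all variables binary:
cards = {'F': 2, 'A': 2, 'S': 2, 'H': 2, 'N': 2}
parents = {'S': ['F', 'A'], 'H': ['S'], 'N': ['S']}
assert bn_param_count(cards, parents) == 1 + 1 + 4 + 2 + 2   # = 10
# versus 2**5 - 1 == 31 free parameters for an unrestricted joint.
```

The gap widens dramatically as n grows: a BN with bounded in-degree needs O(n) parameters where the unrestricted joint needs exponentially many.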


Another example

Variables: B – Burglar, E – Earthquake, A – Burglar alarm, N – Neighbor calls, R – Radio report

Both burglars and earthquakes can set off the alarm. If the alarm sounds, a neighbor may call. An earthquake may be announced on the radio.


Another example – Building the BN

B – Burglar, E – Earthquake, A – Burglar alarm, N – Neighbor calls, R – Radio report


Independencies encoded in BN

We said: All you need is the local Markov assumption (Xi ⊥ NonDescendantsXi | PaXi)

But then we talked about other (in)dependencies e.g., explaining away

What are the independencies encoded by a BN? Only assumption is local Markov, but many others can be derived using the algebra of conditional independencies!!!


Understanding independencies in BNs – BNs with 3 nodes

(diagrams: four three-node BNs over X, Y, Z)

Local Markov Assumption: A variable X is independent of its non-descendants given its parents

Indirect causal effect: X → Z → Y

Indirect evidential effect: X ← Z ← Y

Common cause: X ← Z → Y

Common effect: X → Z ← Y


Understanding independencies in BNs– Some examples

(diagram: example BN over nodes A, B, C, D, E, F, G, H, I, J, K)


An active trail – Example

(diagram: example BN over nodes A–H, with additional nodes F′ and F′′)

When are A and H independent?


Active trails formalized

A path X1 – X2 – ··· – Xk is an active trail when variables O ⊆ {X1,…,Xn} are observed if for each consecutive triplet in the trail:

Xi−1 → Xi → Xi+1, and Xi is not observed (Xi ∉ O)

Xi−1 ← Xi ← Xi+1, and Xi is not observed (Xi ∉ O)

Xi−1 ← Xi → Xi+1, and Xi is not observed (Xi ∉ O)

Xi−1 → Xi ← Xi+1, and Xi is observed (Xi ∈ O), or one of its descendants is
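These four rules translate directly into a brute-force d-separation check for small graphs: enumerate the undirected simple paths between two nodes and test each consecutive triplet. A sketch on the flu network (the function names are illustrative):

```python
def descendants(g, x):
    """All descendants of x in DAG g (g maps node -> list of children)."""
    out, stack = set(), list(g.get(x, []))
    while stack:
        v = stack.pop()
        if v not in out:
            out.add(v)
            stack.extend(g.get(v, []))
    return out

def has_active_trail(g, src, dst, observed):
    """Brute-force check for small DAGs: enumerate undirected simple
    paths from src to dst and apply the four triplet rules above."""
    edges = {(u, v) for u in g for v in g.get(u, [])}
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)

    def active(path):
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            if (a, b) in edges and (c, b) in edges:      # collider a -> b <- c
                if b not in observed and not (descendants(g, b) & observed):
                    return False
            elif b in observed:                          # chain or fork, blocked
                return False
        return True

    def dfs(path):
        if path[-1] == dst:
            return active(path)
        return any(dfs(path + [v])
                   for v in nbrs.get(path[-1], ()) if v not in path)

    return dfs([src])

# Flu -> Sinus <- Allergy, Sinus -> Headache, Sinus -> Nose
g = {'F': ['S'], 'A': ['S'], 'S': ['H', 'N']}
assert not has_active_trail(g, 'F', 'A', set())    # marginally independent
assert has_active_trail(g, 'F', 'A', {'S'})        # explaining away
assert has_active_trail(g, 'F', 'A', {'N'})        # descendant of the collider
assert not has_active_trail(g, 'H', 'N', {'S'})    # blocked by Sinus
```

Note how observing Sinus (or its descendant Nose) activates the Flu–Allergy collider: exactly the explaining-away pattern from earlier.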


Active trails and independence?

Theorem: Variables Xi and Xj are independent given Z ⊆ {X1,…,Xn} if there is no active trail between Xi and Xj when variables Z ⊆ {X1,…,Xn} are observed

(diagram: example BN over nodes A, B, C, D, E, F, G, H, I, J, K)


The BN Representation Theorem

If we obtain the BN from the joint probability distribution (P factorizes according to the BN), then the conditional independencies in the BN are a subset of the conditional independencies in P. Important because: every P has at least one BN structure G.

Conversely, if the conditional independencies in the BN are a subset of the conditional independencies in P, then we obtain the joint probability distribution P from the BN. Important because: we can read independencies of P from the BN structure G.


“Simpler” BNs

A distribution can be represented by many BNs:

Simpler BN, requires fewer parameters


Learning Bayes nets

Four settings: known vs. unknown structure, and fully observable vs. missing data

Data: x(1), …, x(m)

Learn: structure and parameters (the CPTs – P(Xi | PaXi))


Learning the CPTs

Data: x(1), …, x(m). For each discrete variable Xi:
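With fully observable data, each CPT entry is just a normalized count: the maximum-likelihood estimate is P̂(Xi = x | PaXi = u) = Count(x, u) / Count(u). A sketch with toy samples (data and names made up for the illustration):

```python
from collections import Counter

def mle_cpt(samples, child, parents):
    """Maximum-likelihood CPT from fully observed data:
    P(child = x | parents = u) = Count(x, u) / Count(u).
    Each sample is a dict mapping variable name -> value."""
    joint, par = Counter(), Counter()
    for s in samples:
        u = tuple(s[p] for p in parents)
        joint[(s[child], u)] += 1
        par[u] += 1
    return {(x, u): c / par[u] for (x, u), c in joint.items()}

# Toy fully observed samples over Flu and Sinus.
data = [{'F': 1, 'S': 1}, {'F': 1, 'S': 1}, {'F': 1, 'S': 0},
        {'F': 0, 'S': 0}, {'F': 0, 'S': 0}, {'F': 0, 'S': 1}]
cpt = mle_cpt(data, child='S', parents=['F'])
assert abs(cpt[(1, (1,))] - 2 / 3) < 1e-12   # P(S=1 | F=1) = 2/3
assert abs(cpt[(0, (0,))] - 2 / 3) < 1e-12   # P(S=0 | F=0) = 2/3
```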


Queries in Bayes nets

Given BN, find: Probability of X given some evidence, P(X|e)

Most probable explanation, max over x1,…,xn of P(x1,…,xn | e)

Most informative query

Learn more about these next class


What you need to know – Bayesian networks

A compact representation for large probability distributions; not an algorithm

Semantics of a BN: conditional independence assumptions

Representation: variables, graph, CPTs

Why BNs are useful

Learning CPTs from fully observable data

Play with the applet!!!


Acknowledgements

JavaBayes applet: http://www.pmr.poli.usp.br/ltd/Software/javabayes/Home/index.html