Aprendizagem Computacional Gladys Castillo, UA
Bayesian Networks Classifiers
Gladys Castillo, University of Aveiro
Bayesian Networks Classifiers
Part I - Naive Bayes
Part II - Introduction to Bayesian Networks
The Supervised Classification Problem
A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.
Given: a dataset D with N labeled examples of <X, C>.
Build: a classifier, a hypothesis hC: X → C, that can correctly predict the class labels of new objects.

Learning Phase: a supervised learning algorithm takes the dataset D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>} as input and outputs the classifier hC.
Classification Phase: the class assigned to a new example x(N+1) is c(N+1) = hC(x(N+1)) ∈ C (inputs: the attribute values of x(N+1); output: its class).

[Figure: the dataset D shown as a table whose rows are the examples <x(i), c(i)> and whose columns are the attributes X1, …, Xn and the class C.]
[Figure: scatter plot of labeled examples over the attributes X1 and X2; the two classes (x and o) correspond to the decisions "give credit" and "don't give credit".]
Statistical Classifiers
Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is characterized by its probability density function.

[Figure: probability density function f(x) of a random variable and a few observations.]

Statistical classifiers give the probability P(cj | x) that x belongs to a particular class, rather than a hard classification: instead of having the map X → C, we have X → P(C | X). The class c* assigned to an example is the class with the highest P(cj | x).
Bayesian Classifiers
"Bayesian" because the class c* assigned to an example x is determined by Bayes' theorem. Writing the joint distribution both ways, P(X, C) = P(X | C) P(C) and P(X, C) = P(C | X) P(X), and equating gives

P(C | X) = P(C) P(X | C) / P(X)

Bayes' theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
Bayes Theorem: Example
Given: a doctor knows that meningitis causes stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having stiff neck is 1/20.

If a patient has stiff neck, what is the probability that he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

In P(C | X) = P(C) P(X | C) / P(X), P(X | C) is the likelihood, P(C) is the prior, and P(C | X) is the posterior: posterior ∝ prior × likelihood.

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
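The slide's computation can be reproduced directly; a minimal sketch in Python (the variable names are mine):

```python
# Bayes' theorem on the meningitis example:
# P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior: P(meningitis)
p_s = 1 / 20             # evidence: P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 4))  # 0.0002
```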
Bayesian Classifier
How to determine P(cj | x) for each class cj? By Bayes' theorem:

P(cj | x) = P(cj) P(x | cj) / P(x)

P(x) can be ignored because it is just a normalizing constant that ensures the posterior adds up to 1; if needed, it can be computed by summing the numerator over all possible values of C, i.e., P(x) = Σj P(cj) P(x | cj).

Bayesian Classification Rule (maximum a posteriori classification): classify x into the class with the highest posterior probability:

hBayes(x) = argmax_{j=1,…,m} P(cj) P(x | cj)
Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* assigned to an example x is determined by Bayes' theorem:

hBayes(x) = argmax_{j=1,…,m} P(cj) P(x | cj)

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions; under the independence assumption, P(x | cj) decomposes into a product of n terms, one term for each attribute, giving the

NB Classification Rule:

hNB(x) = argmax_{j=1,…,m} P(cj) ∏_{i=1,…,n} P(Xi = xi | cj)
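The NB rule can be sketched in a few lines of Python. This is an illustrative implementation for discrete attributes, assuming the priors and the per-attribute conditional tables have already been estimated (the data structures are my own choice, not from the slides):

```python
def nb_classify(x, priors, cond):
    """Return argmax over classes of P(c) * prod_i P(X_i = x_i | c).

    priors: dict mapping class -> P(class)
    cond:   dict mapping class -> list of dicts, one per attribute,
            each mapping attribute value -> P(X_i = value | class)
    """
    best_class, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for i, x_i in enumerate(x):
            score *= cond[c][i].get(x_i, 0.0)  # unseen value -> probability 0
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with the discrete attribute X1 of the worked example later on:
priors = {"+": 3 / 5, "-": 2 / 5}
cond = {"+": [{"a": 2 / 3, "b": 1 / 3}], "-": [{"a": 0.0, "b": 1.0}]}
print(nb_classify(["a"], priors, cond))  # +
```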
Naïve Bayes (NB): Learning Phase (Statistical Parameter Estimation)
Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:
   P̂(cj) = Nj / N

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj.

   Xi discrete:
   P̂(Xi = xk | cj) = Nijk / Nj

   Xi continuous, two options:
   - the attribute is discretized and then treated as a discrete attribute;
   - a Normal distribution is usually assumed:
     P(Xi = xk | cj) = g(xk; μij, σij), with
     g(x; μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
     The mean μij and the standard deviation σij are estimated from D.

where:
- Nj is the number of examples of the class cj
- Nijk is the number of examples of the class cj having the value xk for the attribute Xi
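For the continuous case, estimating μij and σij amounts to computing a sample mean and standard deviation. A small sketch (using the n − 1 denominator, which matches the numbers in the worked example on a later slide):

```python
from math import sqrt, pi, exp

def fit_gaussian(values):
    """Sample mean and standard deviation (n - 1 denominator)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """g(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)"""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# X2 values of the positive class in the worked example:
mu, sigma = fit_gaussian([1.0, 1.2, 3.0])
print(round(mu, 2), round(sigma, 2))  # 1.73 1.1
```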
Continuous Attributes: Normal or Gaussian Distribution
2. Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj.
For real attributes a Normal distribution is usually assumed:
Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D.

The probability density function is

g(x; μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

and P(Xi = xk | cj) = g(xk; μij, σij). f(x) is symmetrical around its mean.

[Figure: density curves of N(0, 2).]

Example: for a variable X ~ N(74, 36), the density at the value 66 is f(66) = g(66; 74, 6) = 0.0273.
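As a sanity check, the slide's value g(66; 74, 6) ≈ 0.0273 can be recomputed with a quick sketch:

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# X ~ N(74, 36), i.e. mu = 74 and sigma = sqrt(36) = 6:
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```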
Naïve Bayes Probability Estimates
Binary classification problem. Two classes: + (positive) and - (negative). Two attributes: X1, discrete, taking values a and b; X2, continuous.

Training dataset D:

Class  X1  X2
+      a   1.0
+      b   1.2
+      a   3.0
-      b   4.4
-      b   4.5

1. Estimate P(cj) for each class cj:
   P̂(C = +) = 3/5,  P̂(C = -) = 2/5

2. Estimate P(Xi = xk | cj) for each value of X1 and each class cj:
   P̂(X1 = a | +) = 2/3,  P̂(X1 = b | +) = 1/3
   P̂(X1 = a | -) = 0/2,  P̂(X1 = b | -) = 2/2

   For the continuous X2 a Normal distribution is assumed:
   μ2+ = 1.73, σ2+ = 1.10, so P(X2 = x | +) = g(x; 1.73, 1.10)
   μ2- = 4.45, σ2- = 0.07, so P(X2 = x | -) = g(x; 4.45, 0.07)

Example from John & Langley (1995)
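All the estimates on this slide can be reproduced from the five training examples; a minimal sketch (the helper names are mine):

```python
from math import sqrt

# Training data from the slide: (class, X1, X2)
data = [("+", "a", 1.0), ("+", "b", 1.2), ("+", "a", 3.0),
        ("-", "b", 4.4), ("-", "b", 4.5)]

def estimates(cls):
    """Prior, P(X1 = a | cls), and the Gaussian parameters of X2 | cls."""
    rows = [r for r in data if r[0] == cls]
    n = len(rows)
    prior = n / len(data)                     # N_j / N
    p_a = sum(r[1] == "a" for r in rows) / n  # N_ijk / N_j
    mu = sum(r[2] for r in rows) / n
    sigma = sqrt(sum((r[2] - mu) ** 2 for r in rows) / (n - 1))
    return prior, p_a, mu, sigma

prior_pos, p_a_pos, mu_pos, sd_pos = estimates("+")
print(prior_pos, round(p_a_pos, 2), round(mu_pos, 2), round(sd_pos, 2))
# 0.6 0.67 1.73 1.1
```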
Naïve Bayes Performance
NB is one of the simplest and most effective classifiers. However, NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Figure: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training sets of 500 to 12,500 examples; the error (0%-18% axis) is split into its bias and variance components.]

In practice the independence assumption is violated, giving HIGH BIAS, which can lead to poor classification. However, NB has few parameters to estimate, giving LOW VARIANCE, which keeps it efficient and effective.
Improving Naïve Bayes
Two directions:
- reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian Network Classifiers
- reducing the bias of the parameter estimates, by improving the probability estimates computed from data

Relevant works:
- Friedman, Geiger and Goldszmidt (1998), "Bayesian Network Classifiers", in Machine Learning, 29
- Pazzani (1995), "Searching for attribute dependencies in Bayesian Network Classifiers", in Proc. of the 5th Workshop of Artificial Intelligence and Statistics
- Keogh and Pazzani (1999), "Learning augmented Bayesian classifiers…", in Theoretical Computer Science, v. 292
- Webb and Pazzani (1998), "Adjusted probability naive Bayesian induction", in LNCS v. 1502
- J. Gama (2001, 2003), "Iterative Bayes", in Theoretical Computer Science, v. 292
Bayesian Networks Classifiers
Part II - Introduction to Bayesian Networks
Reasoning under Uncertainty: Probabilistic Approach
A problem domain is modeled by a set of random variables X1, X2, …, Xn. Knowledge about the problem domain is represented by a joint probability distribution P(X1, X2, …, Xn).

Example: ALARM network (Pearl 1988)
Story: In LA, burglary and earthquake are not uncommon. Either can cause the alarm to go off. In case of alarm, two neighbors, John and Mary, may call.
Problem: Estimate the probability of a burglary based on who has or has not called.
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M).
Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M).
Inference with the Joint Probability Distribution
To specify P(X1, X2, …, Xn) we need at least 2^n − 1 probability values: an exponential model size. Knowledge acquisition is difficult (complex, unnatural), and storage and inference are exponential.

Solution: by exploiting conditional independence we can obtain an appropriate factorization of the joint probability distribution, so that the number of parameters can be substantially reduced. A good factorization is provided by Bayesian Networks.
Bayesian Networks
Pearl (1988, 2000); Jensen (1996); Lauritzen (1996)

Bayesian Networks (BNs) graphically represent the joint probability distribution of a set X of random variables in a problem domain, combining probability theory and graph theory.

A BN B = (S, θS) consists of two components:
- Qualitative part: the structure S, a Directed Acyclic Graph (DAG); nodes are random variables and arcs represent direct dependence between variables.
- Quantitative part: the set of parameters θS, a set of conditional probability distributions (CPDs). For discrete variables θS is multinomial (the CPDs are CPTs); for continuous variables θS is Normal or Gaussian.

Markov condition: each node is independent of its non-descendants given its parents in S. As a result, the joint distribution is factorized:

P(X1, …, Xn) = ∏i P(Xi | pa(Xi))

[Figure: the "Wet grass" BN from Hugin.]
Example: Alarm Network (Pearl) - Bayesian Network
X = {B - Burglary, E - Earthquake, A - Alarm, J - JohnCalls, M - MaryCalls}
The Bayesian Network has two components:
- the structure: a Directed Acyclic Graph with pa(B) = { }, pa(E) = { }, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}
- the parameters: a set of conditional probability tables (CPTs) for P(B), P(E), P(A | B, E), P(J | A), P(M | A)

Model size reduced from 31 to 1 + 1 + 4 + 2 + 2 = 10 parameters.
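To make the factorization concrete, here is a sketch that evaluates the joint P(B, E, A, J, M) from the five local models. The CPT numbers below are illustrative assumptions, not given on the slide:

```python
# Joint probability of the Alarm network via the BN factorization
# P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A).
# CPT values are illustrative assumptions only.
P_B = {True: 0.001, False: 0.999}                     # P(B = true)
P_E = {True: 0.002, False: 0.998}                     # P(E = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}    # P(A = true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J = true | A)
P_M = {True: 0.70, False: 0.01}                       # P(M = true | A)

def joint(b, e, a, j, m):
    """Product of the five local models for one full instantiation."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# e.g. alarm sounds and both neighbors call, with no burglary or earthquake:
p = joint(False, False, True, True, True)
print(p)
```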
Example: Alarm Network (Pearl) - Factored Joint Probability Distribution
X = {B (Burglary), E (Earthquake), A (Alarm), J (JohnCalls), M (MaryCalls)}

P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
Bayesian Networks: Summary
Bayesian Networks are an efficient and effective representation of the joint probability distribution of a set of random variables.

Efficient: local models; independence (d-separation).
Effective: algorithms take advantage of the structure to compute posterior probabilities, compute the most probable instantiation, and support decision making.
But there is more: statistical induction, i.e. LEARNING.

Adapted from © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.