
Page 1:

Building Classifiers using Bayesian Networks

Nir Friedman and Moises Goldszmidt, 1997

Presented by Brian Collins and Lukas Seitlinger

Page 2:

Paper Summary
• The Naive Bayes classifier has reasonable performance compared to more sophisticated methods.
• Naive Bayes classifiers can be represented by Bayesian networks.
• The paper explores the application of Bayesian networks to classification tasks. This could lead to better performance, but is computationally expensive.
• Proposes the Tree Augmented Naive Bayes (TAN) form of restricted Bayesian networks, which performs better than Naive Bayes in most cases. An efficient algorithm for learning TAN networks is provided.
• Extensive empirical results are presented comparing different classification methods on 22 different datasets. TAN appears to have the highest overall performance.

Page 3:

Naive Bayes
• Classification task: determine $p(C \mid A_1, \ldots, A_n)$ for data instances $\{c, a_1, \ldots, a_n\}$.
• Assume attributes are conditionally independent given the class label $c$. Formally:

$$p(C, A_1, \ldots, A_n) = p(C) \prod_{i=1}^{n} p(A_i \mid C)$$

• This strong independence assumption does not hold for many data sets.
• The conditional distribution of each attribute given the class is modelled. Continuous distributions such as Gaussians can be used, but discrete representations are used in this paper.
• Naive Bayes requires minimal storage and computation compared to more sophisticated methods (a sketch of the decision rule follows below).
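As an illustration of the decision rule above, here is a minimal sketch (not from the paper; the names and the dict-based probability tables are hypothetical, and the probabilities are assumed to be estimated elsewhere):

```python
import math

def naive_bayes_predict(x, class_priors, cond_probs):
    """Pick the class maximising p(c) * prod_i p(a_i | c).

    x:            attribute values [a_1, ..., a_n] (hypothetical encoding)
    class_priors: dict mapping class c -> p(c)
    cond_probs:   dict mapping (i, a_i, c) -> p(a_i | c)
    """
    best_c, best_score = None, float("-inf")
    for c, prior in class_priors.items():
        # Sum logs instead of multiplying probabilities to avoid underflow.
        score = math.log(prior) + sum(
            math.log(cond_probs[(i, a, c)]) for i, a in enumerate(x)
        )
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```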

Page 4:

Bayesian Networks
• Provide an efficient framework for representing independence assertions.
• Are directed acyclic graphs (DAGs) representing the joint probability distribution of a set of random variables (nodes).
• Edges represent direct correlations.

Figure: Naive Bayes classifier as a Bayesian Network

Page 5:

Bayesian Networks for Classification
• Allow arbitrary connections between the class C and the attributes.
• Each node stores the conditional distribution of the corresponding random variable given its parents.
• For a fixed network structure, these distributions are trivial to extract for discrete data when no data is missing: calculate the frequencies in the data.
• To classify a data instance, use Bayes' rule to calculate the posterior probability of each class, and choose the class with the highest value (see the sketch below).

Figure: general Bayesian network for classification

$$P(c \mid A_1, \ldots, A_n) \propto p(A_1, \ldots, A_n \mid c)\, p(c)$$
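A sketch of that classification step, assuming a hypothetical helper joint_prob(c, x) that multiplies out each node's CPT entry given its parents' values (the helper and its interface are assumptions, not the paper's code):

```python
def classify(x, classes, joint_prob):
    """Return argmax_c P(c | a_1..a_n), using P(c | a) ∝ P(c, a).

    joint_prob(c, x): P_B(c, a_1, ..., a_n) computed from the network's
    conditional probability tables (assumed to exist already).
    """
    scores = {c: joint_prob(c, x) for c in classes}
    total = sum(scores.values())
    posterior = {c: s / total for c, s in scores.items()}  # normalise
    return max(posterior, key=posterior.get)
```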

Page 6:

Learning Bayesian Networks
• Similar to unsupervised learning, since we are trying to learn the probability distribution of the data while treating the class value like an attribute.
• Finding the best network structure is hard; the first requirement is a scoring criterion to determine which network is best.
• Log likelihood of the data:

$$LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i, a_1^i, \ldots, a_n^i)$$

• Parameters for a fixed network structure that maximise the log likelihood are easy to compute: simply store the conditional probability of each variable given its parents (a counting sketch follows below).
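A minimal sketch of that parameter step, assuming discrete data stored as dicts of variable name to value (the representation is an assumption of this sketch, not the paper's):

```python
from collections import Counter

def fit_cpt(data, child, parents):
    """Maximum-likelihood CPT for one node: conditional frequencies.

    data:    list of dicts mapping variable name -> discrete value
    child:   name of the variable whose CPT we estimate
    parents: names of its parents in the fixed network structure
    Returns {(child_value, parent_config): P(child_value | parent_config)}.
    """
    joint = Counter()
    parent_counts = Counter()
    for row in data:
        config = tuple(row[p] for p in parents)
        joint[(row[child], config)] += 1
        parent_counts[config] += 1
    return {k: n / parent_counts[k[1]] for k, n in joint.items()}
```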

Page 7:

Learning Bayesian Networks (contd.)
• A fully connected network will always have the highest log likelihood on the training data, but overfitting tends to occur and the learned parameters have extremely high variance (when trained on different datasets). This would not be a problem if very large amounts of training data were available.
• Finding the best network structure is intractable: there is no known polynomial-time algorithm, and the number of possible network structures is exponential in the number of attributes, so exhaustive search seems to be required.
• Greedy search over network structures is used in the paper (a skeleton follows below); edges are added, deleted, or reversed in each step, and changes are kept if the scoring criterion improves.
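A skeleton of such a hill-climbing loop, with the scoring function and the neighbour generator left abstract (both interfaces are assumptions, not code from the paper):

```python
def greedy_structure_search(initial_net, score, neighbours):
    """Greedy search: accept an edge change only if the score improves.

    score(net):      scoring criterion; here lower is better (e.g. MDL)
    neighbours(net): structures one edge addition, deletion, or
                     reversal away (only acyclic candidates)
    """
    current, current_score = initial_net, score(initial_net)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(current):
            candidate_score = score(candidate)
            if candidate_score < current_score:
                # Keep the change and restart from the new structure.
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current
```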

Page 8:

Minimum Description Length
• Trades off log likelihood against network complexity.
• Based on information theory: represents the minimum number of bits needed to transmit the network parameters and the data.
• Defined as (a code sketch follows below):

$$MDL(B \mid D) = \frac{\log N}{2}\,|B| - LL(B \mid D)$$

• $|B|$ is the number of network parameters and $N$ the number of data instances. The first term is the theoretical minimum number of bits needed to represent the parameters; the negative log likelihood is the minimum number of bits required to encode the data under the model.
• Would indicate the best solution if we had infinite training data.
• When training data is limited, MDL does not always indicate the best network for classification tasks, particularly when there are more than about 20 attributes.
• MDL might give better results for general inference tasks in networks.
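The score itself is a one-liner once $LL(B \mid D)$ and $|B|$ are available; this sketch uses natural logs throughout, which only rescales the score as long as the base is consistent:

```python
import math

def mdl_score(log_likelihood, num_params, num_instances):
    """MDL(B | D) = (log N / 2) * |B| - LL(B | D); lower is better.

    log_likelihood must use the same log base as math.log here.
    """
    return 0.5 * math.log(num_instances) * num_params - log_likelihood
```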

Page 9:

Other Scoring Functions
• Similar scoring functions, such as the Bayesian scoring function, have similar problems finding the best network for the classification task.
• Cross-validation is a computationally expensive alternative, but may provide a better indication of performance.
• Potential solution: modify the scoring function to suit the classification task, i.e. the conditional log likelihood.

Page 10:

Conditional Log Likelihood
• The log likelihood can be decomposed as follows:

$$LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i, a_1^i, \ldots, a_n^i) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \ldots, a_n^i) + \sum_{i=1}^{N} \log P_B(a_1^i, \ldots, a_n^i)$$

• The first term represents how well the network estimates the probability of the class given the attributes; the second term represents the joint distribution of the attributes.
• Only the first term affects classification performance, so define the conditional log likelihood (CLL) from the first term alone:

$$CLL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \ldots, a_n^i)$$

• Unfortunately, there is no known closed-form solution for the parameters that maximise the CLL for a fixed network structure; EM or gradient-descent methods are needed.
• A conditional MDL (CMDL) could be defined by replacing LL with CLL in the MDL equation, but evaluating CMDL requires much more computation than MDL (an evaluation sketch follows below):

$$CMDL(B \mid D) = \frac{\log N}{2}\,|B| - CLL(B \mid D)$$
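A direct way to evaluate the CLL for a given network, again assuming a hypothetical joint_prob(c, x) helper over the network's CPTs; the class posterior is obtained by normalising the joint over all classes:

```python
import math

def conditional_log_likelihood(data, classes, joint_prob):
    """CLL(B | D) = sum_i log P_B(c^i | a_1^i, ..., a_n^i).

    data:       iterable of (class_value, attributes) pairs
    joint_prob: hypothetical helper returning P_B(c, attributes)
    """
    cll = 0.0
    for c_i, x_i in data:
        # P(c_i | x_i) = P(c_i, x_i) / sum_c P(c, x_i)
        evidence = sum(joint_prob(c, x_i) for c in classes)
        cll += math.log(joint_prob(c_i, x_i) / evidence)
    return cll
```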

Page 11:

Empirical Results: Naive Bayes vs. Bayesian Networks (with best MDL scores)
• Results for 22 different datasets.
• Separate test and training sets for the larger datasets.
• 5-fold cross-validation for the smaller datasets.

Page 12:

Unrestricted Bayesian Network Summary

• Bayesian networks are a very powerful tool. The best network would perform no worse than the naive Bayes classifier.

• Exhaustively searching for the best network structure is intractable.

• Scoring functions do not always indicate the best network for the classification task.

• Scoring functions specialised for classification are harder to optimise for a fixed network structure.

Page 13:

Restricted Bayesian Networks
• Based on the naive Bayes network structure
  – Every attribute has the class as a parent
• Allow attributes to be connected with "correlation" edges
  – Two attributes need no longer be conditionally independent given the class

Page 14:

Learning the Restricted Network
• Learning a restricted network, even when based on the naive Bayes structure, is still an intractable problem.
  – Essentially we are trying to learn a Bayesian network over all the attributes.
• So add more restrictions:
  – We will construct a directed acyclic spanning tree of the attribute graph,
  – i.e. any node may have at most one correlation edge pointing to it from another attribute.
  – We call this the Tree Augmented Naive Bayes (TAN).
  – An algorithm for constructing this network exists (Chow & Liu).

Page 15:

Construction of a Maximal Log Likelihood TAN Structure
• Compute the mutual information between each pair of attributes:

$$I(X_i; X_j) = \sum_{x_i, x_j} P_D(x_i, x_j) \log \frac{P_D(x_i, x_j)}{P_D(x_i)\, P_D(x_j)}$$

  – Measures the information gained about one attribute when the value of another is known.
  – This will be zero for independent attributes.
• For our purposes (classification) we introduce the conditional mutual information (a sketch of its computation follows below):

$$I(X_i; X_j \mid C) = \sum_{x_i, x_j, c} P_D(x_i, x_j, c) \log \frac{P_D(x_i, x_j \mid c)}{P_D(x_i \mid c)\, P_D(x_j \mid c)}$$
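A sketch of the empirical computation from three aligned columns of discrete observations (the list-based interface is an assumption of this sketch):

```python
import math
from collections import Counter

def conditional_mutual_information(xi, xj, c):
    """Empirical I(Xi; Xj | C) from aligned lists of discrete values."""
    n = len(c)
    n_xyz = Counter(zip(xi, xj, c))
    n_xz = Counter(zip(xi, c))
    n_yz = Counter(zip(xj, c))
    n_z = Counter(c)
    cmi = 0.0
    for (x, y, z), count in n_xyz.items():
        # P(x,y|z) / (P(x|z) P(y|z)) simplifies to count * n_z / (n_xz * n_yz).
        cmi += (count / n) * math.log(
            count * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)])
        )
    return cmi
```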

Page 16:

Construction (contd.)
• Build a fully connected undirected graph with a vertex for each attribute, and set the weight between two vertices to the conditional mutual information of the corresponding variables.
• Now build the maximum weighted spanning tree (MaxST) of the graph.
  – The MaxST is a spanning tree whose total edge weight is greater than or equal to that of any other spanning tree of the graph.
• Convert the undirected tree to a directed tree by choosing a root node and setting the direction of all edges to be outward from it (see the sketch below).
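A compact sketch of these two steps using Prim's algorithm; the choice of Prim, the weight dict keyed by sorted index pairs, and rooting the tree at attribute 0 are all assumptions of this sketch, not prescribed by the paper:

```python
def tan_tree_edges(weights, n):
    """Maximum weighted spanning tree, directed outward from attribute 0.

    weights: dict mapping (i, j) with i < j -> conditional mutual information
    n:       number of attributes
    Returns a list of directed (parent, child) attribute edges.
    """
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Heaviest edge crossing from the current tree to a new vertex.
        u, v = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: weights[(min(e), max(e))],
        )
        edges.append((u, v))  # u is already in the tree, so the edge points outward
        in_tree.add(v)
    return edges
```

To obtain the full TAN network, the class node is then added as a parent of every attribute, as in the restricted structure of Page 13.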

Page 17:

Time Complexity of the Construction Algorithm
• Overall time complexity is $O(n^2 \cdot N)$.
• Computing the mutual information terms is $O(n^2 \cdot N)$, while construction of the maximum spanning tree is $O(n^2 \log n)$.
  – In general $N > \log n$, hence the overall time complexity above.

Page 18:

Adjusting the Parameters
• When assigning the parameters to the network we estimate conditional frequencies of the form $\hat{P}_D(X \mid \Pi_X)$, where $\Pi_X$ denotes the parents of $X$; each entry is a parameter $\theta_{x \mid \Pi_x}$.
• For these conditional frequencies we partition the data according to the possible values of $\Pi_X$ before computing probabilities.
  – At least twice as many partitions as in naive Bayes, which partitions on the class variable only.
• This reduces the reliability of estimates where few data instances are available.

Page 19:

Adjusting Parameters (contd.)
• In order to deal with unreliable estimates due to few instances in a partition, introduce a smoothed estimate with a bias towards the marginal probability of an attribute $X$:

$$\theta^s(x \mid \Pi_x) = \alpha \cdot \hat{P}_D(x \mid \Pi_x) + (1 - \alpha) \cdot \hat{P}_D(x), \qquad \text{where } \alpha = \frac{N \cdot \hat{P}_D(\Pi_x)}{N \cdot \hat{P}_D(\Pi_x) + s}$$

and $s$ is the smoothing parameter (see Dirichlet priors).
  – Applying this to the existing TAN algorithm gives the smoothed TAN algorithm (a sketch of the estimate follows below).
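As a sketch, one smoothed estimate per parameter; the argument names and the default value of s are assumptions, not the paper's exact interface:

```python
def smoothed_estimate(n_parent, p_conditional, p_marginal, s=5.0):
    """theta^s(x | pa) = alpha * P_D(x | pa) + (1 - alpha) * P_D(x).

    n_parent:      N * P_D(pa), the number of instances matching the
                   parent configuration pa
    p_conditional: empirical P_D(x | pa) from that partition
    p_marginal:    empirical P_D(x) over the whole data set
    s:             smoothing strength (Dirichlet-prior style; value assumed)
    """
    alpha = n_parent / (n_parent + s)
    return alpha * p_conditional + (1 - alpha) * p_marginal
```

Note the intended behaviour: when a partition has few instances, n_parent is small, alpha shrinks, and the estimate is pulled towards the more reliable marginal probability.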

Page 20:

Experimental Results
• Smoothed TAN performs at least as well as, and in many cases better than, unsmoothed TAN.
• Comparison of naive Bayes, unsupervised Bayesian networks, TAN, C4.5 (decision tree) and the selective naive Bayes classifier on 22 datasets:
  – TAN performs competitively with all the other classifiers, and when it performs better it occasionally does so by a large margin.
• For evaluation, 5-fold cross-validation is used on the majority of the data sets.

Page 21:

Comparison of TAN to C4.5 and Naive Bayes

Page 22:

THE END

Questions?