
    MULTI-DIMENSIONAL BAYESIAN NETWORK CLASSIFIERS

    Pedro Larrañaga
    Computational Intelligence Group, Artificial Intelligence Department
    Technical University of Madrid

    EAIA 2018 - DS4BD. Advanced School on Data Science for Big Data
    Porto, July 4, 2018


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Simultaneous object recognition in images (multi-label)


    Medical diagnosis (multi-label)


    Multiple fault diagnosis (multi-label)


    Single label classification versus multi-label classification

    X1   X2   X3   X4   X5   C
    3.2  1.4  4.7  7.5  3.7  1
    2.8  6.3  1.6  4.7  2.7  0
    7.7  6.2  4.1  3.3  7.7  1
    9.2  0.4  2.8  0.5  3.9  0
    5.5  5.3  4.9  0.6  6.6  1

    Single label classification

    X1   X2   X3   X4   X5   C1  C2  C3  C4
    3.2  1.4  4.7  7.5  3.7   1   0   1   1
    2.8  6.3  1.6  4.7  2.7   0   0   1   0
    7.7  6.2  4.1  3.3  7.7   1   0   1   1
    9.2  0.4  2.8  0.5  3.9   0   1   0   0
    5.5  5.3  4.9  0.6  6.6   1   1   0   1

    Multi-label classification


    Multi-label classification vs multi-dimensional classification

    X1   X2   X3   X4   X5   C1  C2  C3  C4
    3.2  1.4  4.7  7.5  3.7   1   0   1   1
    2.8  6.3  1.6  4.7  2.7   0   0   1   0
    7.7  6.2  4.1  3.3  7.7   1   0   1   1
    9.2  0.4  2.8  0.5  3.9   0   1   0   0
    5.5  5.3  4.9  0.6  6.6   1   1   0   1

    Multi-label classification

    X1   X2   X3   X4   X5   C1  C2  C3  C4
    3.2  1.4  4.7  7.5  3.7   1   0   2   4
    2.8  6.3  1.6  4.7  2.7   0   0   1   0
    7.7  6.2  4.1  3.3  7.7   3   0   2   1
    9.2  0.4  2.8  0.5  3.9   2   1   0   2
    5.5  5.3  4.9  0.6  6.6   3   1   0   3

    Multi-dimensional classification


    Neuronal cell-type classification (multi-dimensional)

    X1, ...,Xn are morphological variables

    C1, ..., Cd:
    Animal species: rat, mouse, monkey, cat, human, ...
    Neuronal cell type: pyramidal, interneuron, Purkinje, Martinotti, ...
    Brain region: amygdala, cerebral cortex, hippocampus, ...
    Age of the animal: young, adult


    Dow Jones companies stock market values prediction

    (multi-dimensional + time )

    X1, ...,Xn are the stock market values of the companies during the last week

    C1, ..., Cd:
    Company 1: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)
    Company 2: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)
    Company 3: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)
    Company 4: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)


    Weather forecast (multi-dimensional + time + spatial)

    X1, ..., Xn are variables extracted from meteorological stations: thermometer, barometer, hygrometer, anemometer, wind vane, rain gauge, disdrometer, transmissometer, ceiling projector, ...

    C1, ..., Cd:
    Temperature: [-10º, 0º), [0º, 10º), [10º, 25º), [25º, 40º)
    Relative humidity: [0, 30), [30, 60), [60, 80), [80, 100)
    Rain: light rain (precipitation rate < 2.5 mm/h), moderate rain (2.5 mm/h - 10 mm/h), heavy rain (10 mm/h - 50 mm/h), violent rain (> 50 mm/h)
    Wind: calm (< 1 km/h), light air (1 km/h - 5 km/h), ..., storm (103 km/h - 117 km/h), hurricane (> 117 km/h)


    Learning multi-label classification models from data

    A multi-label (multi-dimensional) data set is D = {(x(1), c(1)), ..., (x(N), c(N))}, where x(i) ∈ ΩX = ∏_{j=1}^n Ω_Xj and c(i) ∈ ΩC = ∏_{k=1}^d Ω_Ck

    The learning task for a multi-label (multi-dimensional) classification paradigm is to output a function φ:

    φ: ΩX ≡ Ω_X1 × · · · × Ω_Xn → ΩC ≡ Ω_C1 × · · · × Ω_Cd
       x ≡ (x1, ..., xn) ↦ c ≡ (c1, ..., cd)

    Reviews on this topic: Tsoumakas and Katakis (2007),Zhang and Zhou (2014) and Gibaja and Ventura (2015)


    Overview of learning methods for multi-label classification

    Problem transformation methods
    They transform the learning task into one or more single-label classification tasks
    They are algorithm independent
    (a) Transform to binary classification: binary relevance (Godbole and Sarawagi, 2004), classifier chains (Read et al., 2011)
    (b) Transform to multiclass classification: label powerset (Boutell et al., 2004), RAKEL (Tsoumakas et al., 2010)
    (c) Identifying label dependencies: correlation-based pruning (Tsoumakas et al., 2009), LBPR algorithm (Tenenboim et al., 2010)

    Algorithm adaptation methods
    They extend specific learning algorithms in order to handle multi-label data directly
    Examples: classification trees (Clare and King, 2001), neural networks (Zhang and Zhou, 2006), k-nearest neighbors (Zhang and Zhou, 2007), support vector machines (Elisseeff and Weston, 2001), random forests (Madjarov et al., 2012), Bayesian networks (van der Gaag and de Waal, 2006)


    Mean accuracy and exact match for multi-label classification

    xi   C1 C2 C3 C4 C5   Ĉ1 Ĉ2 Ĉ3 Ĉ4 Ĉ5
    x1    1  0  1  0  0    1  0  0  1  0
    x2    0  1  0  1  0    0  1  0  1  0
    x3    1  0  0  1  0    1  0  0  1  0
    x4    0  1  1  0  0    0  1  0  0  0
    x5    1  0  0  0  0    1  0  0  1  0

    Mean accuracy computes the mean of the accuracies over each of the d class variables:

    Mean accuracy(φ) = (1/d) ∑_{j=1}^d (1/N) ∑_{i=1}^N I(ĉ_j^i = c_j^i)

    where I(true) = 1 and I(false) = 0. In the table: Mean accuracy(φ) = (1/5)(1 + 1 + 0.6 + 0.6 + 1) = 0.84

    Exact match computes the fraction of correctly classified instances. An instance is correctly classified if the binary vector containing the values of each binary class variable coincides with the binary vector containing their predictions:

    Exact match(φ) = (1/N) ∑_{i=1}^N I(ĉ_i = c_i)

    In the table: Exact match(φ) = (1/5)(0 + 1 + 1 + 0 + 0) = 0.4
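    A minimal sketch of both metrics, assuming 0/1 numpy arrays (the array literals simply transcribe the table above):

```python
import numpy as np

# True labels and predictions from the table (rows = instances, cols = class variables)
C_true = np.array([[1,0,1,0,0],[0,1,0,1,0],[1,0,0,1,0],[0,1,1,0,0],[1,0,0,0,0]])
C_pred = np.array([[1,0,0,1,0],[0,1,0,1,0],[1,0,0,1,0],[0,1,0,0,0],[1,0,0,1,0]])

def mean_accuracy(true, pred):
    # Average, over the d class variables, of each per-class accuracy
    return (true == pred).mean(axis=0).mean()

def exact_match(true, pred):
    # Fraction of instances whose whole label vector is predicted correctly
    return (true == pred).all(axis=1).mean()

print(mean_accuracy(C_true, C_pred))  # 0.84
print(exact_match(C_true, C_pred))    # 0.4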


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Binary relevance (Godbole and Sarawagi, 2004)

    X     C1 C2 C3 C4
    x(1)   1  0  0  1
    x(2)   0  0  1  1
    x(3)   1  0  0  1
    x(4)   0  1  0  0
    x(5)   1  1  1  0

    X     C1     X     C2     X     C3     X     C4
    x(1)   1     x(1)   0     x(1)   0     x(1)   1
    x(2)   0     x(2)   0     x(2)   1     x(2)   1
    x(3)   1     x(3)   0     x(3)   0     x(3)   1
    x(4)   0     x(4)   1     x(4)   0     x(4)   0
    x(5)   1     x(5)   1     x(5)   1     x(5)   0

    Learns one binary classifier for each label independently of the rest of labels

    Outputs the concatenation of their predictions

    Does not consider label relationships

    Any supervised classification method can be used, e.g., Bayesian network-based classifiers
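    A sketch of the binary relevance transformation, using scikit-learn's GaussianNB as a stand-in base classifier (any supervised method would do; the class name BinaryRelevance is ours):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

class BinaryRelevance:
    """Train one independent base classifier per label; concatenate predictions."""
    def __init__(self, base=GaussianNB):
        self.base, self.models = base, []

    def fit(self, X, Y):  # Y: (N, d) binary label matrix
        self.models = [self.base().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models])

# Toy data transcribed from the slides: five instances, four labels
X = np.array([[3.2,1.4,4.7,7.5,3.7],[2.8,6.3,1.6,4.7,2.7],[7.7,6.2,4.1,3.3,7.7],
              [9.2,0.4,2.8,0.5,3.9],[5.5,5.3,4.9,0.6,6.6]])
Y = np.array([[1,0,0,1],[0,0,1,1],[1,0,0,1],[0,1,0,0],[1,1,1,0]])
Y_hat = BinaryRelevance().fit(X, Y).predict(X)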


    Binary relevance with Bayesian network classifiers

    c* = arg max_c p(c|x) = arg max_c p(x, c)

    Naive Bayes

    Selective naive Bayes

    Semi-naive Bayes

    ODE

    TAN

    SPODE

    k-DB

    BAN

    Markov blanket-based

    Unrestricted

    Bayesian multinet

    [Figure: taxonomy of discrete Bayesian network classifiers according to the factorization of p(x, c), ranging from naive Bayes, p(c)∏_{i=1}^n p(xi|c), to unrestricted factorizations, p(c|pa(c))∏_{i=1}^n p(xi|pa(xi)) (Bielza and Larrañaga, 2014)]

    Decision boundary for discrete Bayesian network classifiers (Varando et al.,2015)


    Binary relevance with Bayesian network classifiers

    Naive Bayes (Minsky, 1961)

    p(c|x) ∝ p(c) ∏_{i=1}^n p(xi|c)

    A naive Bayes structure from which p(c|x) ∝ p(c)p(x1|c)p(x2|c)p(x3|c)p(x4|c)p(x5|c)


    Binary relevance with Bayesian network classifiers

    Selective naive Bayes (Langley and Sage, 1994)

    p(c|x) ∝ p(c|xF) ∝ p(c) ∏_{i∈F} p(xi|c)

    xF denotes the projection of x onto the selected feature subset F ⊆ {1, 2, ..., n}

    A selective naive Bayes structure from which p(c|x) ∝ p(c)p(x1|c)p(x2|c)p(x4|c). The variables in the shaded nodes have not been selected

    Feature subset selection: filter approaches (univariate and multivariate), based on information-theory measures such as mutual information, and wrapper approaches
    For multivariate filter and wrapper approaches, a heuristic search (best first, floating search, simulated annealing, tabu search, genetic algorithms, estimation of distribution algorithms) in the huge space of cardinality 2^n must be carried out
    See Saeys et al. (2007) for a review of feature subset selection


    Binary relevance with Bayesian network classifiers

    One-dependence estimators (ODEs). TAN (Friedman et al. 1997)

    One-dependence estimators generalize naive Bayes by allowing each predictor variable to depend on at most one other predictor in addition to the class. Tree-augmented naive Bayes (TAN):

    p(c|x) ∝ p(c)p(xr|c) ∏_{i=1, i≠r}^n p(xi|c, xj(i))

    where Xr denotes the root node and {Xj(i)} = Pa(Xi) \ {C}, for any i ≠ r

    (a) A TAN structure, whose root node is X3, from which p(c|x) ∝ p(c)p(x1|c, x2)p(x2|c, x3)p(x3|c)p(x4|c, x3)p(x5|c, x4);
    (b) Selective TAN (Blanco et al., 2005), from which p(c|x) ∝ p(c)p(x2|c, x3)p(x3|c)p(x4|c, x3)


    Binary relevance with Bayesian network classifiers

    One-dependence estimators (ODEs). TAN (Friedman et al. 1997)

    Algorithm 1: Learning a TAN structure
    Input: A data set D = {(x(1), c(1)), ..., (x(N), c(N))} with X = (X1, ..., Xn)
    Output: A TAN structure

    1 For i < j, i, j = 1, ..., n, compute
      MI(Xi, Xj|C) = ∑_{xi, xj, c} p(xi, xj, c) log [p(xi, xj|c) / (p(xi|c)p(xj|c))]
    2 Build a complete undirected graph whose nodes are X1, ..., Xn. Annotate the weight of the edge connecting Xi and Xj by MI(Xi, Xj|C)
    3 Build a maximum weighted spanning tree:
      3a Select the two edges with the heaviest weights
      3b While the tree contains fewer than n − 1 edges: select the next heaviest edge; if it does not form a cycle with the previously selected edges, add it; otherwise reject it and continue
    4 Transform the resulting undirected tree into a directed tree by choosing a root node and setting the direction of all edges to be outward from this node
    5 Construct the TAN structure by adding a node C and an arc from C to each Xi
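    A sketch of Algorithm 1 for discrete data, assuming networkx is available; cond_mutual_info and tan_structure are our own helper names, and a standard maximum weighted spanning tree stands in for steps 2-3:

```python
import itertools
import numpy as np
import networkx as nx

def cond_mutual_info(data, i, j, c):
    """MI(Xi, Xj | C) estimated from counts in a discrete data matrix."""
    N = len(data)
    mi = 0.0
    for xi, xj, cv in itertools.product(*(np.unique(data[:, k]) for k in (i, j, c))):
        n_ijc = np.sum((data[:, i] == xi) & (data[:, j] == xj) & (data[:, c] == cv))
        n_c = np.sum(data[:, c] == cv)
        n_ic = np.sum((data[:, i] == xi) & (data[:, c] == cv))
        n_jc = np.sum((data[:, j] == xj) & (data[:, c] == cv))
        if n_ijc > 0:  # p(xi,xj,c) log [p(xi,xj|c) / (p(xi|c) p(xj|c))]
            mi += (n_ijc / N) * np.log(n_ijc * n_c / (n_ic * n_jc))
    return mi

def tan_structure(data, n_features, class_col, root=0):
    G = nx.Graph()
    for i, j in itertools.combinations(range(n_features), 2):
        G.add_edge(i, j, weight=cond_mutual_info(data, i, j, class_col))
    tree = nx.maximum_spanning_tree(G)             # steps 2-3 of the algorithm
    arcs = list(nx.bfs_edges(tree, source=root))   # step 4: orient away from the root
    arcs += [("C", i) for i in range(n_features)]  # step 5: class arc to every feature
    return arcs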


    Binary relevance with Bayesian network classifiers

    One-dependence estimators (ODEs). TAN (Friedman et al. 1997)

    MI(X1,X3|C) > MI(X2,X4|C) > MI(X1,X2|C) > MI(X3,X4|C) > MI(X1,X4|C) > MI(X3,X5|C) > MI(X1,X5|C) > MI(X2,X3|C) > MI(X2,X5|C) > MI(X4,X5|C)

    An example of TAN structure construction. (a-c) Edges are added according to the conditional mutual information quantities, taken in decreasing order. (d-e) Edges X3 − X4 and X1 − X4 (dashed lines) cannot be added since they would form a cycle. (f) Maximum weighted spanning tree. (g) The corresponding directed tree, obtained by choosing X1 as the root node. (h) Final TAN structure.


    Binary relevance with Bayesian network classifiers

    Markov blanket-based Bayesian classifier (Koller and Sahami, 1996)

    If C has parents:

    p(c|x) ∝ p(c|pa(c)) ∏_{i=1}^n p(xi|pa(xi))

    The Markov blanket of C (its parents, children, and the parents of its children) is the only knowledge needed to predict its behavior

    Bayesian classifiers based on identifying the Markov blanket of the class variable

    A Markov blanket structure for C, MB(C) = {X1, X2, X3, X4}, from which p(c|x) ∝ p(c|x2)p(x1|c)p(x2)p(x3)p(x4|c, x3)


    Discrete Bayesian network classifiers

    c* = arg max_c p(c|x) = arg max_c p(x, c)

    [Figure: taxonomy of discrete Bayesian network classifiers according to the factorization of p(x, c) (Bielza and Larrañaga, 2014)]


    Classifier chain with Bayesian network classifiers

    Classifier chains (Read et al., 2011) overcome the label independence assumption by transforming the multi-label learning problem into a chain of binary classification problems

    Each binary classifier in the chain is built upon the (probabilistic) predictions of the preceding classifiers

    A classifier chain learns d functions φi on augmented input spaces, taking ĉ1, ..., ĉi−1 as additional features:

    φi: ΩX × {0, 1}^{i−1} → [0, 1]
        (x, ĉ1, ..., ĉi−1) ↦ p(ci|x, ĉ1, ..., ĉi−1)

    φi can be learnt using c1, ..., ci−1 instead of ĉ1, ..., ĉi−1. However, when applied to new instances, the classifier chain method necessarily uses the predictions ĉ1, ..., ĉi−1
    A total order among the class variables must be defined beforehand (the result depends on that order)
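    A sketch of the chaining idea under a fixed label order, again with GaussianNB as a placeholder base classifier:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

class ClassifierChain:
    """Fit labels in a fixed order; each classifier sees X plus earlier labels."""
    def __init__(self, base=GaussianNB):
        self.base, self.models = base, []

    def fit(self, X, Y):
        Z = X
        for j in range(Y.shape[1]):
            self.models.append(self.base().fit(Z, Y[:, j]))
            Z = np.column_stack([Z, Y[:, j]])  # training may use the true labels c_j
        return self

    def predict(self, X):
        Z, preds = X, []
        for m in self.models:
            c = m.predict(Z)
            preds.append(c)
            Z = np.column_stack([Z, c])        # prediction must chain the ĉ_j
        return np.column_stack(preds)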


    Classifier chain with Bayesian network classifiers

    Tree naive Bayesian classifier chain (TNBCC) (Sucar et al., 2014)

    Only one parent per class in a chain: a TAN structure for the class variables

    The root of the TAN can be selected in d different ways

    TNBCC uses naive Bayes as the baseline classifiers

    Structure of a TNBCC

    Several heuristics for selecting the root node: at random, the node with the largest number of incident edges, the node with the highest mutual information, ...

    A single TNBCC versus an ensemble of TNBCCs (the ensemble gives the best results)


    Label powerset. Boutell et al. (2004)

    X     C1 C2 C3 C4   C
    x(1)   1  0  0  1   10
    x(2)   0  0  1  1    4
    x(3)   1  0  0  1   10
    x(4)   0  1  0  0    5
    x(5)   1  1  1  0   15

    Each distinct set of labels becomes a different class in a new single-label classification task

    Most implementations of label powerset classifiers essentially ignore label combinations that are not present in the training set (they cannot predict unseen label sets)

    Limited training examples for many classes
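    A minimal sketch of the label powerset transformation (powerset_encode is our own helper name):

```python
import numpy as np

def powerset_encode(Y):
    """Map each distinct label vector to a single class id (and keep the mapping)."""
    keys = [tuple(row) for row in Y]
    classes = sorted(set(keys))
    to_id = {k: i for i, k in enumerate(classes)}
    return np.array([to_id[k] for k in keys]), classes

Y = np.array([[1,0,0,1],[0,0,1,1],[1,0,0,1],[0,1,0,0],[1,1,1,0]])
y, classes = powerset_encode(Y)   # rows 1 and 3 share the same class id
# A decoded prediction can only be one of the label sets seen during training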


    Label powerset. Multiple diagnosis problem with probabilistic models. Peng and Reggia (1987a; 1987b)

                   X1    ...  Xm     C1    ...  Cd
    (x(1), c(1))   x1(1) ...  xm(1)  c1(1) ...  cd(1)
    (x(2), c(2))   x1(2) ...  xm(2)  c1(2) ...  cd(2)
    ...            ...        ...    ...        ...
    (x(N), c(N))   x1(N) ...  xm(N)  c1(N) ...  cd(N)

    Optimal diagnosis as abductive inference: searching for the most probable explanation (MPE)

    (c1*, ..., cd*) = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd | X1 = x1, ..., Xm = xm)
                    = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd) p(X1 = x1, ..., Xm = xm | C1 = c1, ..., Cd = cd)

    Number of parameters to be estimated: 2^d − 1 + 2^d(2^m − 1)


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Bayesian networks (Pearl (1988) and Koller and Friedman (2009))

    A Bayesian network is a compact representation of the joint probability distribution (JPD) p(X1, ..., Xn)

    Bayesian networks achieve this compactness by using the concept of conditional independence between triplets of variables

    Two random variables X and Y are conditionally independent (c.i.) given another random variable Z if p(x|y, z) = p(x|z) for all x, y, z
    An equivalent definition is p(x, y|z) = p(x|z)p(y|z) for all x, y, z. Let Ip(X, Y|Z) denote this condition

    Suppose that for each Xi there is a subset Pa(Xi) ⊆ {X1, ..., Xi−1} such that, given Pa(Xi), Xi is conditionally independent of all variables in {X1, ..., Xi−1} \ Pa(Xi), i.e., p(Xi|X1, ..., Xi−1) = p(Xi|Pa(Xi)). Then the JPD factorizes as

    p(X1, ..., Xn) = p(X1)p(X2|X1)p(X3|X1, X2) · · · p(Xn|X1, ..., Xn−1) = p(X1|Pa(X1)) · · · p(Xn|Pa(Xn))
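    A minimal numeric illustration of the factorization, with hypothetical CPTs over three binary variables where Pa(X2) = Pa(X3) = {X1}:

```python
# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x1): the chain rule with parent sets
p_x1 = {0: 0.4, 1: 0.6}                                    # made-up numbers
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}
p_x3_given_x1 = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1}

def joint(x1, x2, x3):
    # Product of one local conditional distribution per variable
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x1[(x3, x1)]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12   # a valid JPD sums to 1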


    Bayesian networks. Structure and parameters. Example: Factory production

    Years (Y) is the factory's age, where y denotes 'more than 10 years' and ¬y denotes 'less than 10 years'
    Employees (E) represents the number of employees: more (e) or less (¬e) than 100 employees
    Machines (M) has two values: m and ¬m, for 'more than 20 machines' and 'less than 20 machines'
    Pieces (P) also has two options: more (p) or less (¬p) than 10,000 produced pieces per year
    Failures (F) includes f, which stands for 'more than two failures on average per month'; otherwise the state is ¬f

    [Figure: the network structure Y → E, Y → M, E → P, M → P, M → F with its conditional probability tables p(Y), p(E|Y), p(M|Y), p(P|E, M), p(F|M)]

    Hypothetical Bayesian network modeling factory production. To fully specify the JPD, 2^5 − 1 = 31 parameters are needed; the Bayesian network representation requires 17 input conditional probabilities

    The Bayesian network factorizes the JPD as p(Y, E, M, P, F) = p(Y)p(E|Y)p(M|Y)p(P|E, M)p(F|M)


    Bayesian networks. Markov condition

    Markov condition or local directed Markov property

    The descendant nodes of Xi are the nodes reachable from Xi by repeatedly following the arcs; following the arcs in the opposite direction, we find the ancestors
    Let ND(Xi) denote the non-descendant nodes of Xi

    In a Bayesian network, each node is conditionally independent of its non-descendants given its parents: Ip(Xi, ND(Xi)|Pa(Xi))

    [Figure: the factory production Bayesian network and its conditional probability tables, as above]

    All nodes are descendants of Y. The descendants of M are P and F. The Markov condition for M states that E and M are c.i. given Y

    All nodes are non-descendants of P, and hence P and {Y, F} are c.i. given {E, M}


    Bayesian networks. u-separation

    u-separation (Lauritzen et al., 1990): a graphical criterion for finding additional conditional independences, apart from those given by the Markov condition

    If X is u-separated from Y given Z, then X and Y are c.i. given Z, for any disjoint random vectors X, Y, Z

    Checking whether X and Y are u-separated by Z is a three-step procedure (see the sketch below):

    1 Get the smallest subgraph containing X, Y and Z and their ancestors. This is called the ancestral graph
    2 Moralize the ancestral graph, i.e., add an undirected link between parents having a common child, and then drop the directions of all arcs
    3 Z u-separates X and Y whenever every path between X and Y contains a node of Z
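    A sketch of the three-step check on top of networkx (assuming its ancestors and moral_graph helpers; u_separated is our own name):

```python
import networkx as nx

def u_separated(dag, X, Y, Z):
    """Check u-separation of node sets X and Y given Z in a DiGraph dag."""
    anc = set(X) | set(Y) | set(Z)
    for v in list(anc):
        anc |= nx.ancestors(dag, v)           # step 1: ancestral subgraph
    moral = nx.moral_graph(dag.subgraph(anc)) # step 2: moralize, drop directions
    moral.remove_nodes_from(Z)                # step 3: does Z block every X-Y path?
    return all(not nx.has_path(moral, x, y)
               for x in X for y in Y if x in moral and y in moral)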

    [Figure: moralized ancestral graph of the factory production Bayesian network]

    Let us check whether P and F are u-separated by {E, M}. Since E or M is always found in every path (in the moralized ancestral graph) from P to F, P and F are u-separated by {E, M}. Hence P and F are c.i. given {E, M}
    However, Y and F are not u-separated by P, because we can go from F to Y through M without crossing P


    Bayesian networks. Types of inference

    Probabilistic reasoning: p(Xi|e), where E = e is the observed evidence
    Abductive inference finds the values of a set of variables that best explain the observed evidence
        Total abduction: arg max_u p(u|e), i.e., we find the most probable explanation (MPE)
        Partial abduction solves the same problem for a subset of variables in u (the explanation set), referred to as the partial maximum a posteriori (MAP)

    Predictive reasoning: we predict the effect (produced pieces) given a cause (machines), p(p|m) = 0.62 (in the figure) and p(p|¬m) = 0.30 (not shown)

    [Figure: inference on the factory production example. (a) Prior distributions p(Xi). (b) p(Xi|m)]

    Diagnostic reasoning: we diagnose the causes given the effects. Given the effect F = f, the probability of the cause being many machines is p(m|f) = 0.86
    Intercausal reasoning: if we know that the factory has many employees (E = e), this would explain the observed high number of produced pieces p and would lower the probability of Machines = m being the cause: p(m|p, e) = 0.32 < p(m|p) = 0.45


    Bayesian networks. Inference methods

    Exact inference
        Brute-force approach:

        p(P) = ∑_{Y,E,M,F} p(Y, E, M, F, P) = ∑_{Y,E,M,F} p(Y)p(E|Y)p(M|Y)p(P|E, M)p(F|M)

        Variable elimination algorithm (Zhang and Poole, 1994), which pushes each sum inwards past the factors it does not involve (see the sketch after this list):

        p(P) = ∑_Y p(Y) ∑_E p(E|Y) ∑_M p(M|Y)p(P|E, M) ∑_F p(F|M)

        Junction tree algorithm (Lauritzen and Spiegelhalter, 1988), based on message passing on a junction tree

    Approximate inference
        Probabilistic logic sampling (Henrion, 1988)
        Likelihood weighting (Shachter and Peot, 1989)
        Gibbs sampling (Pearl, 1987)
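    A sketch contrasting the brute-force sum with variable elimination on the factory network, using randomly generated, purely illustrative CPTs:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
def cpt(*shape):  # random conditional table, normalized over the first axis
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

pY, pE_Y, pM_Y, pF_M = cpt(2), cpt(2, 2), cpt(2, 2), cpt(2, 2)
pP_EM = cpt(2, 2, 2)  # indexed [p, e, m]

# Brute force: sum the full joint over Y, E, M, F
brute = np.zeros(2)
for y, e, m, f, p in itertools.product(range(2), repeat=5):
    brute[p] += pY[y] * pE_Y[e, y] * pM_Y[m, y] * pP_EM[p, e, m] * pF_M[f, m]

# Variable elimination: push each sum past the factors it does not touch
phiF = pF_M.sum(axis=0)                                 # sum_F p(F|M)
phiM = np.einsum('my,pem,m->pey', pM_Y, pP_EM, phiF)    # sum_M
phiE = np.einsum('ey,pey->py', pE_Y, phiM)              # sum_E
pP = np.einsum('y,py->p', pY, phiE)                     # sum_Y
assert np.allclose(brute, pP)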


    Bayesian networks. Learning parameters

    For parameter learning, we need to have the structure

    Ri = |Ω_Xi|
    qi = |Ω_Pa(Xi)|, i.e., Ω_Pa(Xi) = {pa_i^1, ..., pa_i^qi}

    The parameter θijk = p(Xi = k | Pa(Xi) = pa_i^j) is the conditional probability that Xi takes its k-th value given that its parents take their j-th value

    The θijk parameters are organized into θ = (θ1, ..., θn), with a total of ∑_{i=1}^n Ri qi components, and are estimated from D = {x(1), ..., x(N)}

    Nijk is the number of cases in D where Xi = k and Pa(Xi) = pa_i^j are observed at the same time

    Nij is the number of cases in D in which Pa(Xi) = pa_i^j is observed (Nij = ∑_{k=1}^{Ri} Nijk)


    Bayesian networks. Learning parameters

    Maximum likelihood estimation finds the θ̂^ML that maximizes the likelihood of the data set given the model:

    θ̂^ML = arg max_θ L(θ|D, G) = arg max_θ p(D|G, θ) = arg max_θ ∏_{h=1}^N p(x(h)|G, θ)

    Under the global parameter independence and local parameter independence assumptions (Spiegelhalter and Lauritzen, 1990):

    L(θ|D, G) = ∏_{i=1}^n ∏_{j=1}^{qi} ∏_{k=1}^{Ri} θijk^Nijk   and   θ̂ijk^ML = Nijk / Nij

    With sparse data sets the Laplace estimator is used: θ̂ijk^Lap = (Nijk + 1) / (Nij + Ri)

    If D has incomplete instances (with missing values), the estimates can be computed with the EM algorithm (Lauritzen, 1995), or even with the structural EM (Friedman, 1998), where not only parameters but also structures can be updated at each EM iteration
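    A sketch of the ML and Laplace estimators from counts (estimate_cpt is our own helper; the values of Xi are assumed coded as 0, ..., Ri − 1):

```python
import numpy as np

def estimate_cpt(data, i, parents, R_i, laplace=True):
    """theta_ijk = p(X_i = k | Pa(X_i) = pa_j), ML or Laplace, from a data matrix."""
    theta = {}
    # Parent configurations observed in the data (a single empty one if no parents)
    configs = {tuple(row[parents]) for row in data} if parents else {()}
    for pa in configs:
        mask = (np.all(data[:, parents] == pa, axis=1)
                if parents else np.ones(len(data), dtype=bool))
        N_ij = int(mask.sum())
        for k in range(R_i):
            N_ijk = int(np.sum(data[mask, i] == k))  # joint counts under this config
            theta[pa, k] = ((N_ijk + 1) / (N_ij + R_i) if laplace
                            else N_ijk / N_ij)
    return theta

# e.g. estimate_cpt(data, i=2, parents=[0, 1], R_i=2) gives the CPT of X3 | X1, X2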


    Bayesian networks. Learning parameters

    (a) A Bayesian network structure with arcs X1 → X3, X2 → X3 and X3 → X4, with |Ω_Xi| = 2 for i = 1, 3, 4, |Ω_X2| = 3, and q1 = q2 = 0, q3 = 6, q4 = 2. (b) A data set with N = 6 for {X1, ..., X4}, from which the structure in (a) has been learned:

    X1 X2 X3 X4
     1  2  2  1
     1  1  2  1
     2  1  2  1
     2  1  2  2
     1  2  2  1
     1  3  1  1

    Parameters                           Meaning
    θ1 = (θ1−1, θ1−2)                    (p(X1 = 1), p(X1 = 2))
    θ2 = (θ2−1, θ2−2, θ2−3)              (p(X2 = 1), p(X2 = 2), p(X2 = 3))
    θ3 = (θ311, θ312, ..., θ361, θ362)   (p(X3 = 1|X1 = 1, X2 = 1), p(X3 = 2|X1 = 1, X2 = 1), ..., p(X3 = 1|X1 = 2, X2 = 3), p(X3 = 2|X1 = 2, X2 = 3))
    θ4 = (θ411, θ412, θ421, θ422)        (p(X4 = 1|X3 = 1), p(X4 = 2|X3 = 1), p(X4 = 1|X3 = 2), p(X4 = 2|X3 = 2))

    To estimate θ1−1 = p(X1 = 1), we find four out of six instances with X1 = 1 in the X1 column, and hence θ̂1−1^ML = 4/6 = 2/3

    To estimate θ321 = p(X3 = 1|X1 = 1, X2 = 2), we find that neither of the two instances with X1 = 1, X2 = 2 includes X3 = 1, so θ̂321^ML = 0 (this is a case where Nijk = 0)

    To estimate θ361 = p(X3 = 1|X1 = 2, X2 = 3), we find that θ̂361^ML is undefined, since there are no instances with X1 = 2, X2 = 3 (i.e., Nij = 0). The Laplace estimates yield θ̂321^Lap = 1/4, θ̂361^Lap = 1/2


    Bayesian networks. Learning structures. Constraint-based methods

    Constraint-based methods use the data to statistically test conditionalindependences among triplets of variables

    The goal is to build a DAG that represents a large percentage (and wheneverpossible all) of the identified conditional independence constraints

    The PC algorithm (Spirtes and Glymour, 1991) starts with all nodes connectedby edges and follows three steps:

    1 Step 1 outputs the adjacencies in the graph, i.e., the skeleton of thelearned structure

    2 Step 2 identifies colliders (a collider or converging connection at node X isY → X ← Z )

    3 Step 3 orients the remaining edges and outputs the completed partially directed acyclic graph (CPDAG), which represents the Markov equivalence class of DAGs


    Bayesian networks. Learning structures. Constraint-based methods

    Algorithm 2: Step 1 of the PC algorithm: estimation of the skeleton
    Input: A complete undirected graph and an ordering σ on the variables {X1, ..., Xn}
    Output: Skeleton G of the learned structure

    1 Form the complete undirected graph G on nodes {X1, ..., Xn}
    2 t = −1
    3 Repeat
        t = t + 1
        Repeat
    4     Select a pair of adjacent nodes Xi − Xj in G using ordering σ
    5     Find S ⊆ Adj(Xi) \ {Xj} in G with |S| = t (if any) using ordering σ
    6     Remove edge Xi − Xj from G iff Xi and Xj are c.i. given S
        until all ordered pairs of adjacent nodes have been tested
      until all adjacent pairs Xi − Xj in G satisfy |Adj(Xi) \ {Xj}| ≤ t


    Bayesian networks. Score and search methods

    [Figure: methods for Bayesian network structure learning based on score and search]
    Search spaces: DAGs, equivalence classes, orderings
    Scores: penalized likelihood (AIC, BIC), Bayesian (BD, K2, BDe, BDeu)
    Search: approximate (greedy, simulated annealing, EDAs, genetic algorithms, MCMC), exact (dynamic programming, branch & bound, mathematical programming)


    Bayesian networks. Score and search methods

    Finding a network structure that optimizes a score is NP-hard (Chickering, 1996)

    Different search spaces:
    Space of DAGs, whose cardinality is given by Robinson's (1977) recurrence

    f(n) = ∑_{i=1}^n (−1)^{i+1} C(n, i) 2^{i(n−i)} f(n − i), for n > 2

    with f(0) = f(1) = 1
    Space of Markov equivalence classes, smaller than the space of DAGs: the #DAGs/#CPDAGs ratio approaches an asymptote of about 3.7 (Gillispie and Perlman, 2002)
    Space of orderings of the variables: some learning algorithms (e.g., the K2 algorithm) only work with a fixed order. Cardinality: n!
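    Robinson's recurrence is easy to evaluate directly; a minimal sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(None)
def num_dags(n):
    """Robinson's recurrence for the number of DAGs on n labelled nodes."""
    if n <= 1:
        return 1
    return sum((-1)**(i + 1) * comb(n, i) * 2**(i * (n - i)) * num_dags(n - i)
               for i in range(1, n + 1))

# num_dags(2) == 3, num_dags(3) == 25, num_dags(4) == 543: the space explodes fast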


    Bayesian networks. Score and search methods. Scores: the need for penalized likelihood scores

    The estimated log-likelihood of the data given the Bayesian network:

    log L(θ̂|D, G) = log p(D|G, θ̂) = log ∏_{i=1}^n ∏_{j=1}^{qi} ∏_{k=1}^{Ri} θ̂ijk^Nijk = ∑_{i=1}^n ∑_{j=1}^{qi} ∑_{k=1}^{Ri} Nijk log θ̂ijk

    where θ̂ijk = θ̂ijk^ML = Nijk/Nij, the frequency counts in D
    This score increases monotonically with the complexity of the model
    The optimal structure would be the complete graph

    [Figure: likelihood versus DAG complexity, for training data and test data, over increasingly dense structures on X1, ..., X4]
    Structural overfitting: the likelihood of the training data is higher for denser graphs, but it degrades for the test data


    Bayesian networks. Score and search methods. Scores

    Penalized likelihood scores

    General expression of a family of penalized log-likelihood scores:

    QPen(D, G) = ∑_{i=1}^n ∑_{j=1}^{qi} ∑_{k=1}^{Ri} Nijk log (Nijk/Nij) − dim(G) pen(N)

    where dim(G) = ∑_{i=1}^n (Ri − 1)qi denotes the model dimension (number of parameters needed in the Bayesian network), and pen(N) is a non-negative penalization function

    Depending on pen(N):
    If pen(N) = 1, the score is Akaike's information criterion (AIC) (Akaike, 1974)
    If pen(N) = (1/2) log N, the score is the Bayesian information criterion (BIC) (Schwarz, 1978)
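    A sketch of the penalized log-likelihood score from the Nijk counts (the counts layout is our own choice):

```python
import numpy as np

def penalized_score(counts, pen):
    """Q_Pen(D, G) = sum_ijk N_ijk log(N_ijk / N_ij) - dim(G) * pen.

    counts: one (q_i, R_i) array of N_ijk per node X_i."""
    ll, dim = 0.0, 0
    for N in counts:
        N_ij = N.sum(axis=1, keepdims=True)   # N_ij = sum_k N_ijk
        mask = N > 0                           # 0 log 0 is taken as 0
        ll += float((N[mask] * np.log((N / np.maximum(N_ij, 1))[mask])).sum())
        q_i, R_i = N.shape
        dim += (R_i - 1) * q_i                 # model dimension dim(G)
    return ll - dim * pen

# AIC: pen = 1;  BIC: pen = 0.5 * np.log(N_instances)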


    Bayesian networks. Score and search methods. Scores

    Bayesian scores

    The Bayesian approach to structure learning aims to find the G that maximizes its posterior probability given the data, i.e., find arg max_G p(G|D)
    Using Bayes' formula: p(G|D) ∝ p(D, G) = p(D|G)p(G)
    The second factor, p(G), is the prior distribution over structures
    The first factor, p(D|G), is the marginal likelihood of the data, defined as

    p(D|G) = ∫ p(D|G, θ) f(θ|G) dθ

    where p(D|G, θ) is the likelihood of the data given the Bayesian network (structure G and parameters θ), and f(θ|G) is the prior distribution over the parameters


    Bayesian networks. Score and search methods. Scores: Bayesian scores

    Depending on f(θ|G), we have different scores. If f(θij|G) follows a Dirichlet distribution with parameters αij1, ..., αijRi, we have the Bayesian Dirichlet score (BD score). A Dirichlet distribution is determined by hyperparameters αijk for all i, j, k

    The K2 score (Cooper and Herskovits, 1992) uses the uninformative assignment αijk = 1 for all i, j, k, resulting in

    QK2(D, G) = p(G) ∏_{i=1}^n ∏_{j=1}^{qi} [(Ri − 1)! / (Nij + Ri − 1)!] ∏_{k=1}^{Ri} Nijk!

    The K2 algorithm uses a greedy search method and the K2 score. The user gives a node ordering and the maximum number of parents that any node is permitted to have. Starting with an empty structure, the algorithm incrementally adds, from the set of nodes that precede each node Xi in the node ordering, the parent whose addition most increases the function

    g(Xi, Pa(Xi)) = ∏_{j=1}^{qi} [(Ri − 1)! / (Nij + Ri − 1)!] ∏_{k=1}^{Ri} Nijk!

    When the score does not increase further with the addition of a single parent, no more parents are added to node Xi, and we move on to the next node in the ordering

    The likelihood-equivalent Bayesian Dirichlet score (BDe score) (Heckerman et al., 1995) sets the hyperparameters as αijk = α p(Xi = k, Pa(Xi) = pa_i^j | G). The equivalent sample size α expresses the user's confidence in the prior network
    In the BDeu score (Buntine, 1991), αijk = α / (qi Ri)
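    A sketch of the (log of the) K2 local function g, computed with log-gamma to avoid factorial overflow:

```python
from math import lgamma

def log_g(counts):
    """log of K2's g(X_i, Pa(X_i)); counts[j][k] = N_ijk for parent config j."""
    total = 0.0
    for row in counts:                               # one parent configuration per row
        N_ij, R_i = sum(row), len(row)
        total += lgamma(R_i) - lgamma(N_ij + R_i)    # log (R_i-1)!/(N_ij+R_i-1)!
        total += sum(lgamma(n + 1) for n in row)     # log prod_k N_ijk!
    return total

# The K2 algorithm greedily adds, among the admissible predecessors of X_i,
# the parent whose addition most increases log_g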


    Dynamic, temporal and continuous time Bayesian networks

    Dynamic Bayesian networks (Dean and Kanazawa, 1989)

    [Figure: a dynamic Bayesian network structure with four variables X1, X2, X3 and X4 and three time slices (T = 3): (a) prior Bayesian network; (b) transition network, under the first-order Markovian transition assumption; (c) dynamic Bayesian network unfolded in time for three time slices]

    Temporal nodes Bayesian networks (Galán et al., 2007)

    Continuous time Bayesian networks (Nodelman et al., 2002)


    Bayesian networks. Software

    HUGIN¹, GeNIe², OpenMarkov³, gRain⁴, bnlearn⁵

    Exact inference (junction tree), approximate inference (probabilistic logic sampling), constraint-based learning (the PC algorithm) and score-and-search learning (the K2 algorithm) are each available in several of these packages

    1 https://www.hugin.com/
    2 https://www.bayesfusion.com/
    3 http://www.openmarkov.org/
    4 https://CRAN.R-project.org/package=gRain
    5 http://www.bnlearn.com/


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Multiple diagnosis problem. Direct approach

                   X1    ...  Xm     C1    ...  Cd
    (x(1), c(1))   x1(1) ...  xm(1)  c1(1) ...  cd(1)
    (x(2), c(2))   x1(2) ...  xm(2)  c1(2) ...  cd(2)
    ...            ...        ...    ...        ...
    (x(N), c(N))   x1(N) ...  xm(N)  c1(N) ...  cd(N)

    Optimal diagnosis as abductive inference: searching for the most probable explanation (MPE)

    (c1*, ..., cd*) = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd | X1 = x1, ..., Xm = xm)
                    = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd) p(X1 = x1, ..., Xm = xm | C1 = c1, ..., Cd = cd)

    Number of parameters to be estimated for the case of binary predictors and classes: 2^d − 1 + 2^d(2^m − 1)


    Multiple diagnosis with multi-dimensional Bayesian network classifiers (MBCs) (van der Gaag and de Waal, 2006)

    In a multi-dimensional Bayesian network classifier (MBC), the set of vertices V is partitioned into:

    VC = {C1, ..., Cd} of class variables and
    VX = {X1, ..., Xm} of feature variables

    with d + m = n

    Three subgraphs in the structure of a multi-dimensional Bayesian network classifier:

    Class subgraph: GC
    Bridge subgraph: GCX
    Feature subgraph: GX


    Examples of MBC structures

    (a) Empty-empty MBC

    (b) Tree-tree MBC

    (c) Polytree-DAG MBC


    Tractability of MPE in class-bridge decomposable MBCs (Bielza et al., 2011)

    MPE is generally NP-hard in Bayesian networks (Kwisthout, 2011)

    An MBC is class-bridge decomposable (a CB-decomposable MBC) (Bielza et al., 2011) if:

    1 GC ∪ GCX can be decomposed as GC ∪ GCX = ∪_{i=1}^r (GC_i ∪ G(CX)_i), where the GC_i ∪ G(CX)_i, i = 1, ..., r, are its maximal connected components
    2 Non-shared children: Ch(VC_i) ∩ Ch(VC_j) = ∅ for i, j = 1, ..., r, i ≠ j, where Ch(VC_i) denotes the children of all the variables in VC_i

    (a) A CB-decomposable MBC

    (b) Its two maximal connected components


    Tractability of MPE in class-bridge decomposable MBCs (Bielza et al., 2011)

    Theorem (Bielza et al., 2011)
    Given a CB-decomposable MBC where Ii = ∏_{C∈VC_i} ΩC denotes the sample space associated with VC_i, then

    max_{c1,...,cd} p(C1 = c1, ..., Cd = cd | X1 = x1, ..., Xm = xm)
    ∝ ∏_{i=1}^r max_{c↓VC_i ∈ Ii} ∏_{C∈VC_i} p(c|pa(c)) ∏_{X∈Ch(VC_i)} p(x|paVC(x), paVX(x))

    where c↓VC_i represents the projection of vector c onto the coordinates found in VC_i

    Example:

    max_{c1,...,c5} p(C1 = c1, ..., C5 = c5 | X1 = x1, ..., X6 = x6)
    ∝ max_c p(c1)p(c2)p(c3|c2)p(c4)p(c5|c4) p(x1|c1)p(x2|c1, c2, x1)p(x3|c3)p(x4|c3, x3, x5, x6)p(x5|c4, c5, x6)p(x6|c5)
    = max_{c1,c2,c3} p(c1)p(c2)p(c3|c2)p(x1|c1)p(x2|c1, c2, x1)p(x3|c3)p(x4|c3, x3, x5, x6) · max_{c4,c5} p(c4)p(c5|c4)p(x5|c4, c5, x6)p(x6|c5)

    Here VC1 = {C1, C2, C3}, VC2 = {C4, C5}, Ch(VC1) = {X1, X2, X3, X4} and Ch(VC2) = {X5, X6}. Note that Ch(VC1) ∩ Ch(VC2) = ∅, as required by the theorem


    Tractability of MPE in MBCs according to the treewidth

    Exact methods for MPE computation in Bayesian networks are exponential in the treewidth of G

    The treewidth is the size of the largest clique in the triangulated graph (a graph in which all cycles of four or more nodes have a chord)

    Several results bound the treewidth in MBCs:
    1 For MBCs with an empty feature subgraph (Pastink and van der Gaag, 2015): treewidth(G) < treewidth(G′), where G′ is the pruned graph (first moralize G and then remove the feature nodes from the moral graph)
    2 For general MBCs (DAG-DAG MBCs) (de Waal and van der Gaag, 2007): treewidth(G) ≤ treewidth(GX) + d
    3 For CB-decomposable MBCs (Kwisthout, 2011): treewidth(G) ≤ treewidth(GX) + |dmax|, where |dmax| is the number of class variables of the component with the maximum number of class variables
    4 For general MBCs (DAG-DAG MBCs), Benjumeda et al. (2018) present under which circumstances MPE computation can be done in polynomial time


    Learning MBCs from data. Cardinality of the search space

    Theorem (Bielza et al., 2011)
    The number of all possible MBC structures with d class variables and m feature variables, MBC(d, m), is

    MBC(d, m) = S(d) · 2^{dm} · S(m)

    where S(m) = ∑_{i=1}^m (−1)^{i+1} C(m, i) 2^{i(m−i)} S(m − i) is Robinson's formula counting the number of possible DAG structures on m nodes, initialized as S(0) = S(1) = 1

    Theorem (Bielza et al., 2011)
    The number of all possible bridge subgraphs, BRS(d, m), m ≥ d, for MBCs satisfying the two conditions (a) for each Xi ∈ VX there is a Cj ∈ VC with (Cj, Xi) ∈ ACX, and (b) for each Cj ∈ VC there is an Xi ∈ VX with (Cj, Xi) ∈ ACX, is given by the recursive formula

    BRS(d, m) = 2^{dm} − ∑_{k=0}^{m−1} C(dm, k) − ∑_{k=m}^{dm} ∑_{x≤d, y≤m, k≤xy≤dm−d} C(d, x) C(m, y) BRS(x, y, k)

    where BRS(x, y, k) denotes the number of bridge subgraphs with k arcs in an MBC with x class variables and y feature variables, initialized as BRS(1, 1, 1) = BRS(1, 2, 2) = BRS(2, 1, 2) = 1


    Learning MBCs from data. Empty-empty MBC

    Empty-empty MBC learned: multi-label naive Bayes (MLNB) (Zhang et al., 2009)

    A two-stage filter-wrapper feature selection strategy is incorporated:
    First stage (filter): feature extraction techniques based on principal component analysis (PCA)
    Second stage (wrapper): subset selection techniques based on a genetic algorithm (GA) (on the space of principal components) are used to choose the most appropriate subset of features for classification

    For continuous features, a Gaussian assumption is made: the density of the feature variables given the class values follows a Gaussian distribution


    Learning MBCs from data. Tree-tree MBC

    Tree-tree MBC learned with a three-step algorithm (van der Gaag and de Waal, 2006)

    1 The class subgraph is learnt by searching for the maximum weighted undirected spanning tree and transforming it into a directed tree using Chow and Liu's (1968) algorithm
      The weight of an edge is the mutual information between a pair of class variables
    2 For a fixed bridge subgraph, the feature subgraph is then learnt by building a maximum weighted directed spanning tree
      The weight of an arc is the conditional mutual information between a pair of feature variables, given the parents (classes) of the second feature as determined by the bridge subgraph
    3 The bridge subgraph is greedily changed in a wrapper-like way, trying to improve the considered metric (e.g., exact match)


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    Score: penalized likelihood or Bayesian score
    Greedy search in five steps:

    1 Learn the class subgraph
    2 Learn the feature subgraph
    3 Propose a candidate bridge subgraph
    4 Obtain a candidate feature subgraph
    5 Decide on the bridge and feature subgraph candidates


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    1. Learn the class subgraph


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    2. Learn the feature subgraph and
    3. Propose a candidate bridge subgraph


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    2. Learn the feature subgraph and
    3. Propose a candidate bridge subgraph


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    4. Obtain a candidate feature subgraph and
    5. Decide on the bridge and feature subgraph candidates


    Learning MBCs from data. DAG-DAG MBC. Wrapper

    DAG-DAG MBC (Bielza et al., 2011)

    i = 0
    1. G(i) = ∅; Acc = Acc(i)
    2. While there are arcs that can be added to G(i) (and not previously discarded): add one arc to GC(i), GCX(i) or GX(i) and obtain the new G(i+1) and Acc(i+1)
    3. If Acc(i+1) > Acc(i): Acc = Acc(i+1), i = i + 1, and go to 2; else discard the arc and go to 2
    4. Stop and return G(i) and Acc (see the sketch below)
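    A sketch of this wrapper loop, where accuracy is a caller-supplied function that re-trains the MBC for a given arc set and returns its (e.g., exact match) accuracy on validation data; all names here are ours:

```python
def greedy_wrapper(candidate_arcs, accuracy):
    """Greedy arc addition over G_C, G_CX and G_X, guided by wrapper accuracy."""
    G, best = set(), accuracy(set())
    remaining = set(candidate_arcs)
    improved = True
    while improved:
        improved = False
        for arc in sorted(remaining):      # try each not-yet-discarded arc
            score = accuracy(G | {arc})    # re-train the MBC and evaluate it
            remaining.discard(arc)         # each arc is considered at most once
            if score > best:
                G, best = G | {arc}, score
                improved = True
    return G, best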


    Learning CB-decomposable MBCs from data (Borchani et al., 2010)

    Phase I. Learn bridge subgraph

    [Figure: a selective naive Bayes per class variable C1, ..., C4 over the features X1, ..., X6]

    Starting from an empty graphical structure, learn a selective naive Bayes for each class variable

    [Figure: the induced bridge subgraph after removing shared children]

    Check the non-shared children property to induce an initial CB-decomposable MBC (remove all common children based on two criteria: feature insertion rank and accuracy)
    The result of this phase is a simple CB-decomposable MBC where only the bridge subgraph is defined (the class and feature subgraphs are empty)


    Learning CB-decomposable MBCs from data (Borchani et al., 2010)

    Phase II. Learn feature subgraph

    [Figure: the CB-decomposable MBC after adding arcs between feature variables]

    Introduce the dependence relationships between the feature variables in the feature subgraph

    Fix the maximum number of iterations

    In each iteration, an arc between a pair of feature variables is selected at random. If there is an accuracy improvement, the arc is added; otherwise it is discarded


    Learning CB-decomposable MBCs from data (Borchani et al., 2010)

    Phase III. Merge maximal connected components

    [Figure: merging the maximal connected components; bridge subgraph update]

    All possible arc additions between the class variables are evaluated, adding the arc that improves the accuracy the most (going from r to r − 1 maximal connected components)
    A bridge update step is performed inside the newly induced maximal connected component

    [Figure: feature subgraph update]

    Update the feature subgraph by inserting, one by one, additional arcs between feature variables

    This phase iterates over these three steps (stopping criteria: no further component merging improves the accuracy, or r = 1)


    Learning MBCs by a constraint-based approach (Borchani et al., 2013)

    Markov blanket MBC (MB-MBC) learning algorithm

    Apply the HITON algorithm (Aliferis et al., 2003) to each class variable to determine its Markov blanket

    Given the MBC definition, the direct parents of any class variable Ci, i = 1, ..., d, can only be among the remaining class variables, whereas the direct children or spouses of Ci can include either class or feature variables

    The MBC subgraphs are built from the results of the HITON algorithm:

    Class subgraph: first insert an edge between each class variable Ci and any class variable belonging to its corresponding parents-children set PC(Ci); then direct all these edges using the PC algorithm's edge orientation rules
    Bridge subgraph: built by inserting an arc from each class variable Ci to every feature variable belonging to PC(Ci)
    Feature subgraph: for every feature X in the set MB(Ci) \ PC(Ci), i.e., for every spouse X, insert an arc from X to the corresponding common child given by PC(X) ∩ PC(Ci)


    Multi-dimensional Bayesian network classifier trees (Gil-Begue et al., 2018)

    A multi-dimensional Bayesian network classifier tree (MBCTree) is a classification tree with MBCs in the leaves

    An internal node of an MBCTree corresponds to a feature variable Xi, as in standard classification trees, and has a labelled branch to a child for each of its possible values
    A leaf of an MBCTree is an MBC over all the class variables and those features not present in the path from the root to the leaf

    Wrapper approach (greedy search) guided by the exact match accuracy for learningMBCTrees


    Detecting multi-dimensional concept drift in data streams (Borchani et al., 2016)

    Concept drift detection is usually based on updating an ensemble
    Here a single MBC is used, and the drift detection method operates locally
    The average local log-likelihood of each variable Xi in the MBC network at time step s:

    ll_i^s = (1/N^s) ∑_{j=1}^{qi} ∑_{k=1}^{ri} N_ijk^s log (N_ijk^s / N_ij^s)

    Change point detection with the Page-Hinkley test (see the sketch below):

    CUM^s = ∑_{t=1}^s (LL^t − mean_LL^t − δ), where mean_LL^t = (1/t) ∑_{h=1}^t LL^h denotes the mean of the LL^1, ..., LL^t values, and δ is a tolerance parameter

    MAX^s = max{CUM^t, t = 1, ..., s}

    The Page-Hinkley test value: PH^s = MAX^s − CUM^s. If PH^s > λ, the null hypothesis is rejected and the Page-Hinkley test signals a change; otherwise, no change is signaled
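    A sketch of the test on a stream of LL values (the delta and lam defaults are arbitrary placeholders):

```python
def page_hinkley(ll_stream, delta=0.005, lam=50.0):
    """Return the time steps s at which PH^s = MAX^s - CUM^s exceeds lambda."""
    cum, mx, mean, alarms = 0.0, 0.0, 0.0, []
    for s, ll in enumerate(ll_stream, start=1):
        mean += (ll - mean) / s        # running mean of LL^1, ..., LL^s
        cum += ll - mean - delta       # CUM^s
        mx = max(mx, cum)              # MAX^s
        if mx - cum > lam:             # PH^s > lambda: signal a change
            alarms.append(s)
    return alarms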

    Evolution of the average local log-likelihood values of four different class variables

    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    [Figure: a cell with HIV-1; reverse transcription]


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    Therapies for HIV-1 are combinations, or cocktails, of antiretroviral drugs

    We aim to gain insight into the different interactions between drugs and resistance mutations

    Reverse transcriptase inhibitors (RTIs) comprise two groups of antiretroviral drugs preventing HIV-1 replication: nucleoside and nucleotide reverse transcriptase inhibitors (NRTIs) and non-nucleoside reverse transcriptase inhibitors (NNRTIs)

    NRTIs consist of seven drugs: Abacavir (ABC), Didanosine (DDI), Emtricitabine (FTC), Lamivudine (3TC), Stavudine (D4T), Tenofovir (TDF), and Zidovudine (AZT)

    NNRTIs consist of three drugs: Efavirenz (EFV), Nevirapine (NVP), and Delavirdine (DLV)

    A total of 38 mutations are associated with resistance to RTIs: 22 associated with NRTIs and 16 with NNRTIs


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    We analyzed reverse transcriptase and protease data sets obtained from the online Stanford HIV-1 database (Rhee et al., 2003)
    Treatment histories from 2855 patients that received either NRTIs, NNRTIs or both
    The data set contained a total of 4884 samples
    The number of RTIs varies from 1 to 8 drugs: 17 samples with 1 RTI, 25 with 2 RTIs, 157 with 3 RTIs, 698 with 4 RTIs, 1852 with 5 RTIs, 1600 with 6 RTIs, 483 with 7 RTIs, and 56 with 8 RTIs

                            Mean accuracy      Exact match
    MB-MBC  maxCS = 1     0.7108 ± 0.0221    0.1151 ± 0.0466
            maxCS = 2     0.7062 ± 0.0191    0.0881 ± 0.0403
            maxCS = 3     0.7019 ± 0.0153    0.0780 ± 0.0363
            maxCS = 4     0.6995 ± 0.0145    0.0701 ± 0.0336
            maxCS = 5     0.6978 ± 0.0106    0.0646 ± 0.0241
    Tree-Tree             0.6968 ± 0.0163    0.0364 ± 0.0101
    DAG-DAG Filter        0.7074 ± 0.0063    0.0240 ± 0.0066
    DAG-DAG Wrapper       0.7095 ± 0.0040    0.0291 ± 0.0008
    CB-MBC                0.7261 ± 0.0113    0.0382 ± 0.0105

    Estimated performance metrics (mean ± std. deviation) for the RTIs data set


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    The graphical structure of the MBC learnt using the RTIs data set


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    Parkinson disease motor symptoms


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    PDQ-39 and EQ-5D: quality-of-life instruments to measure the degree of disability in Parkinson's disease

    39-item Parkinson's Disease Questionnaire: a specific instrument

    PDQ-39 captures the patient's perception of the illness, covering 8 dimensions:

    1 Mobility
    2 Activities of daily living
    3 Emotional well-being
    4 Stigma
    5 Social support
    6 Cognitions
    7 Communication
    8 Bodily discomfort


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    European Quality of Life - 5 Dimensions: a generic instrument

    EQ-5D is a generic measure of health for clinical and economic appraisal


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    Mapping PDQ-39 to EQ-5D

    PDQ1 PDQ2 ... ... PDQ39 EQ1 EQ2 EQ3 EQ4 EQ53 1 ... ... 3 1 3 3 2 12 3 ... ... 2 1 1 2 3 25 2 ... ... 4 1 3 3 1 2... ... ... ... ... ... ... ... ... ...4 4 ... ... 3 3 1 2 3 24 4 ... ... 3 3 1 2 3 25 5 ... ... 4 2 3 2 3 3

φ : (PDQ1, ..., PDQ39) → (EQ1, ..., EQ5)
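As a rough illustration of how such a mapping φ can be estimated, the sketch below fits one classifier per EQ-5D dimension on the 39 PDQ-39 items. This is a simple per-dimension baseline, not the MB-MBC model of these slides; the synthetic data and the choice of scikit-learn's CategoricalNB are assumptions for illustration:

```python
# Hedged sketch: phi as five per-dimension classifiers (baseline, not MB-MBC).
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(488, 39))  # stand-in PDQ-39 items, scored 1..5
Y = rng.integers(1, 4, size=(488, 5))   # stand-in EQ-5D dimensions, levels 1..3

# One classifier per class variable EQ_j, ignoring dependencies among them.
models = [CategoricalNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def phi(x_new: np.ndarray) -> np.ndarray:
    """Predict the 5-dimensional EQ-5D state from PDQ-39 responses (n, 39)."""
    return np.column_stack([m.predict(x_new) for m in models])

print(phi(X[:3]))  # predicted (EQ1, ..., EQ5) for the first three patients
```

An MBC improves on this kind of baseline precisely by modelling the dependencies among EQ1, ..., EQ5 instead of predicting each dimension in isolation.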


    488 Parkinson’s patients. Estimated measures over 5-fold cross-validation

Method   Mean accuracy     Exact match
MB-MBC   0.7119 ± 0.0338   0.2030 ± 0.0718
CB-MBC   0.6807 ± 0.0285   0.1865 ± 0.0429
MNL      0.6926 ± 0.0430   0.1802 ± 0.0713
OLS      0.4201 ± 0.0252   0.0123 ± 0.0046
CLAD     0.4254 ± 0.0488   0.0143 ± 0.0171
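Figures of this kind come from a cross-validation loop. The hedged sketch below shows the protocol only, with a per-dimension naive Bayes stand-in (a hypothetical choice) in place of the MB-MBC, CB-MBC, MNL, OLS and CLAD learners actually compared:

```python
# Hedged sketch of 5-fold CV estimating mean accuracy and exact match.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import CategoricalNB

def evaluate_5cv(X, Y, n_levels=6, seed=0):
    """X: (n, p) integer features; Y: (n, d) integer class variables."""
    accs, exacts = [], []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        # One stand-in classifier per class variable; min_categories guards
        # against feature levels that appear only in the test fold.
        preds = np.column_stack([
            CategoricalNB(min_categories=n_levels)
            .fit(X[train], Y[train, j]).predict(X[test])
            for j in range(Y.shape[1])
        ])
        accs.append(np.mean(preds == Y[test]))                    # mean accuracy
        exacts.append(np.mean(np.all(preds == Y[test], axis=1)))  # exact match
    return (np.mean(accs), np.std(accs)), (np.mean(exacts), np.std(exacts))
```

Running evaluate_5cv on the 488-patient data would yield mean ± standard deviation pairs of the kind tabulated above.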


    MB-MBC graphical structure


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


MULTI-DIMENSIONAL BAYESIAN NETWORK CLASSIFIERS

Specialized Bayesian networks for solving multi-label and multi-dimensional problems

Advantages:

Transparency and interpretability

Variety of inference (MPE) and learning algorithms

A hierarchy of models organized by structural complexity

Competitive results with state-of-the-art methods when d is on the order of dozens

Disadvantages:

Wrapper approaches for learning are very computationally demanding

In large-d problems it is difficult to compete with state-of-the-art methods


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    References

H. Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

C. F. Aliferis, I. Tsamardinos and A. Statnikov (2003). HITON: A novel Markov blanket algorithm for optimal variable selection. AMIA Annual Symposium Proceedings, 21-25.

M. Benjumeda, C. Bielza and P. Larrañaga (2018). Tractability of most probable explanations in multidimensional Bayesian network classifiers. International Journal of Approximate Reasoning, 93, 74-87.

C. Bielza, G. Li and P. Larrañaga (2011). Multi-dimensional classification with Bayesian networks. International Journal of Approximate Reasoning, 52, 705-727.

C. Bielza and P. Larrañaga (2014). Discrete Bayesian network classifiers: A survey. ACM Computing Surveys, 47(1), Article 5.

R. Blanco, I. Inza, M. Merino, J. Quiroga and P. Larrañaga (2005). Feature selection in Bayesian classifiers for the prognosis of survival of cirrhotic patients treated with TIPS. Journal of Biomedical Informatics, 38(5), 376-388.

H. Borchani, C. Bielza and P. Larrañaga (2010). Learning CB-decomposable multi-dimensional Bayesian network classifiers. Proceedings of the 5th European Workshop on Probabilistic Graphical Models, 25-32.

H. Borchani, C. Bielza, P. Martínez-Martín and P. Larrañaga (2012). Multidimensional Bayesian network classifiers applied to predict the European quality of life-5 dimensions (EQ-5D) from the 39-item Parkinson's disease questionnaire (PDQ-39). Journal of Biomedical Informatics, 45, 1175-1184.

H. Borchani, C. Bielza and P. Larrañaga (2013). Predicting human immunodeficiency virus inhibitors using multi-dimensional Bayesian network classifiers. Artificial Intelligence in Medicine, 57(3), 219-229.

H. Borchani, P. Larrañaga, J. Gama and C. Bielza (2016). Mining multi-dimensional concept-drifting data streams using Bayesian network classifiers. Intelligent Data Analysis, 20(2), 257-280.

M. R. Boutell, J. Luo, X. Shen and C. M. Brown (2004). Learning multi-label scene classification. Pattern Recognition, 37, 1757-1771.

W. L. Buntine (1991). Theory refinement on Bayesian networks. Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence, 52-60.

D. M. Chickering (1996). Learning Bayesian networks is NP-complete. Learning from Data: Artificial Intelligence and Statistics V, Springer, 121-130.


C. K. Chow and C. N. Liu (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 462-467.

A. Clare and R. D. King (2001). Knowledge discovery in multi-label phenotype data. Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, 42-53.

G. F. Cooper and E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

P. R. de Waal and L. van der Gaag (2007). Inference and learning in multi-dimensional Bayesian network classifiers. Proceedings of the 9th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Springer, 501-511.

T. Dean and K. Kanazawa (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142-150.

K. Deb, A. Pratap, S. Agarwal and T. Meyarivan (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.

A. Elisseeff and J. Weston (2002). A kernel method for multi-labelled classification. Advances in Neural Information Processing Systems 14, 681-687.

N. Friedman (1998). The Bayesian structural EM algorithm. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 129-138.

N. Friedman, D. Geiger and M. Goldszmidt (1997). Bayesian network classifiers. Machine Learning, 29, 131-163.

S. F. Galán, G. Arroyo-Figueroa, F. J. Díez and L. E. Sucar (2007). Comparison of two types of event Bayesian networks: A case study. Applied Artificial Intelligence, 21(3), 185-209.

S. Gil-Begue, C. Bielza and P. Larrañaga (2018). Multi-dimensional Bayesian network classifier trees. The 9th International Conference on Probabilistic Graphical Models, submitted.

S. B. Gillispie and M. D. Perlman (2002). The size distribution for Markov equivalence classes of acyclic digraph models. Artificial Intelligence, 141(1/2), 137-155.

E. Gibaja and S. Ventura (2015). A tutorial on multi-label learning. ACM Computing Surveys, 47(3), Article 52.


S. Godbole and S. Sarawagi (2004). Discriminative methods for multi-labeled classification. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 22-30.

D. Heckerman, D. Geiger and D. M. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

M. Henrion (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. Uncertainty in Artificial Intelligence 2, Elsevier Science, 149-163.

D. Koller and N. Friedman (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.

D. Koller and M. Sahami (1996). Toward optimal feature selection. Proceedings of the 13th International Conference on Machine Learning, 284-292.

J. Kwisthout (2011). Most probable explanations in Bayesian networks: Complexity and tractability. International Journal of Approximate Reasoning, 52(9), 1452-1469.

P. Langley and S. Sage (1994). Induction of selective Bayesian classifiers. Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 399-406.

S. L. Lauritzen (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191-201.

S. L. Lauritzen, A. P. Dawid, B. N. Larsen and H. G. Leimer (1990). Independence properties of directed Markov fields. Networks, 20(5), 491-505.

S. L. Lauritzen and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2), 157-224.

G. Madjarov, D. Kocev, D. Gjorgjevikj and S. Džeroski (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9), 3084-3104.

M. Minsky (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49, 8-30.

U. Nodelman, C. R. Shelton and D. Koller (2002). Continuous time Bayesian networks. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 378-387.


A. Pastink and L. van der Gaag (2015). Multi-classifiers of small treewidth. Proceedings of the 13th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Springer, 199-209.

J. Pearl (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2), 245-257.

    J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Y. Peng and J. A. Reggia (1987a). A probabilistic causal model for diagnostic problem solving. Part I: Integrating symbolic causal inference with numeric probabilistic inference. IEEE Transactions on Systems, Man and Cybernetics, 17(2), 146-162.

Y. Peng and J. A. Reggia (1987b). A probabilistic causal model for diagnostic problem solving. Part II: Diagnostic strategy. IEEE Transactions on Systems, Man and Cybernetics, 17(3), 395-406.

J. Read, B. Pfahringer, G. Holmes and E. Frank (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333-359.

S. Y. Rhee, M. J. Gonzales, R. Kantor, J. Betts, J. Ravela and R. W. Shafer (2003). Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Research, 31(1), 298-303.

R. Robinson (1977). Counting unlabeled acyclic digraphs. Lecture Notes in Mathematics, 622, Springer, 28-43.

Y. Saeys, I. Inza and P. Larrañaga (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507-2517.

    G. Schwarz (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.

R. D. Shachter and M. A. Peot (1989). Simulation approaches to general probabilistic inference on belief networks. Proceedings of the 5th Annual Conference on Uncertainty in Artificial Intelligence, 221-234.

D. J. Spiegelhalter and S. L. Lauritzen (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579-605.

P. Spirtes and C. Glymour (1991). An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1), 62-72.


E. Sucar, C. Bielza, E. F. Morales, P. Hernandez-Leal, J. H. Zaragoza and P. Larrañaga (2014). Multi-label classification with Bayesian network-based chain classifiers. Pattern Recognition Letters, 41, 14-22.

L. Tenenboim, L. Rokach and B. Shapira (2010). Identification of label dependencies for multi-label classification. Proceedings of the 2nd International Workshop on Learning from Multi-Label Data, 53-60.

G. Tsoumakas, A. Dimou, E. Spyromitros, V. Mezaris, I. Kompatsiaris and I. Vlahavas (2009). Correlation-based pruning of stacked binary relevance models for multi-label learning. Proceedings of the 1st International Workshop on Learning from Multi-Label Data, 101-116.

G. Tsoumakas, I. Katakis and I. Vlahavas (2010). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 23(7), 1079-1089.

G. Tsoumakas and I. Katakis (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3, 1-13.

L. C. van der Gaag and P. R. de Waal (2006). Multi-dimensional Bayesian network classifiers. Proceedings of the 3rd European Workshop on Probabilistic Graphical Models, 107-114.

G. Varando, C. Bielza and P. Larrañaga (2015). Decision boundary for discrete Bayesian network classifiers. Journal of Machine Learning Research, 16, 2725-2749.

G. I. Webb, J. Boughton and Z. Wang (2002). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58, 5-24.

M. L. Zhang, J. M. Peña and V. Robles (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179(19), 3218-3229.

N. Zhang and D. Poole (1994). A simple approach to Bayesian network computations. Proceedings of the 10th Biennial Canadian Conference on Artificial Intelligence, 171-178.

M. L. Zhang and Z. H. Zhou (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338-1351.

M. L. Zhang and Z. H. Zhou (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038-2048.

M. L. Zhang and Z. H. Zhou (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819-1837.
