
    MULTI-DIMENSIONAL BAYESIAN NETWORK CLASSIFIERS

    Pedro Larrañaga
    Computational Intelligence Group, Artificial Intelligence Department
    Technical University of Madrid

    EAIA 2018 - DS4BD. Advanced School on Data Science for Big Data
    Porto, July 4, 2018


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Simultaneous object recognition in images (multi-label)


    Medical diagnosis (multi-label)


    Multiple fault diagnosis (multi-label)


    Single label classification versus multi-label classification

    X1   X2   X3   X4   X5   C
    3.2  1.4  4.7  7.5  3.7  1
    2.8  6.3  1.6  4.7  2.7  0
    7.7  6.2  4.1  3.3  7.7  1
    9.2  0.4  2.8  0.5  3.9  0
    5.5  5.3  4.9  0.6  6.6  1

    Single label classification

    X1   X2   X3   X4   X5   C1  C2  C3  C4
    3.2  1.4  4.7  7.5  3.7   1   0   1   1
    2.8  6.3  1.6  4.7  2.7   0   0   1   0
    7.7  6.2  4.1  3.3  7.7   1   0   1   1
    9.2  0.4  2.8  0.5  3.9   0   1   0   0
    5.5  5.3  4.9  0.6  6.6   1   1   0   1

    Multi-label classification


    Multi-label classification vs multi-dimensional classification

    X1   X2   X3   X4   X5   C1  C2  C3  C4
    3.2  1.4  4.7  7.5  3.7   1   0   1   1
    2.8  6.3  1.6  4.7  2.7   0   0   1   0
    7.7  6.2  4.1  3.3  7.7   1   0   1   1
    9.2  0.4  2.8  0.5  3.9   0   1   0   0
    5.5  5.3  4.9  0.6  6.6   1   1   0   1

    Multi-label classification

    X1   X2   X3   X4   X5   C1  C2  C3  C4
    3.2  1.4  4.7  7.5  3.7   1   0   2   4
    2.8  6.3  1.6  4.7  2.7   0   0   1   0
    7.7  6.2  4.1  3.3  7.7   3   0   2   1
    9.2  0.4  2.8  0.5  3.9   2   1   0   2
    5.5  5.3  4.9  0.6  6.6   3   1   0   3

    Multi-dimensional classification


    Neuronal cell-type classification (multi-dimensional)

    X1, ...,Xn are morphological variables

    C1, ..., Cd:
    Animal species: rat, mouse, monkey, cat, human, ...
    Neuronal cell type: pyramidal, interneuron, Purkinje, Martinotti, ...
    Brain region: amygdala, cerebral cortex, hippocampus, ...
    Age of the animal: young, adult


    Dow Jones companies stock market values prediction

    (multi-dimensional + time )

    X1, ...,Xn are the stock market values of the companies during the last week

    C1, ..., Cd:
    Company 1: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)
    Company 2: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)
    Company 3: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)
    Company 4: [-5%,-2%), [-2%,-1%), [-1%,0%), [0%,1%), [1%,2%), [2%,5%)


    Weather forecast (multi-dimensional + time + spatial)

    X1, ..., Xn are variables extracted from meteorological stations: thermometer, barometer, hygrometer, anemometer, wind vane, rain gauge, disdrometer, transmissometer, ceiling projector, ...

    C1, ..., Cd:
    Temperature: [-10º, 0º), [0º, 10º), [10º, 25º), [25º, 40º)
    Relative humidity: [0, 30), [30, 60), [60, 80), [80, 100)
    Rain: light rain (precipitation rate < 2.5 mm/h), moderate rain (2.5 mm/h - 10 mm/h), heavy rain (10 mm/h - 50 mm/h), violent rain (> 50 mm/h)
    Wind: calm (< 1 km/h), light air (1 km/h - 5 km/h), ..., storm (103 km/h - 117 km/h), hurricane (> 117 km/h)


    Learning multi-label classification models from data

    A multi-label (multi-dimensional) data set is D = {(x(1), c(1)), ..., (x(N), c(N))}, where x(i) ∈ ΩX = ∏_{j=1}^n Ω_Xj and c(i) ∈ ΩC = ∏_{k=1}^d Ω_Ck

    The learning task for a multi-label (multi-dimensional) classification paradigm is to output a function φ:

    φ: ΩX ≡ Ω_X1 × · · · × Ω_Xn → ΩC ≡ Ω_C1 × · · · × Ω_Cd
       x ≡ (x1, ..., xn) ↦ c ≡ (c1, ..., cd)

    Reviews on this topic: Tsoumakas and Katakis (2007),Zhang and Zhou (2014) and Gibaja and Ventura (2015)


    Overview of learning methods for multi-label classification

    Problem transformation methods
    They transform the learning task into one or more single-label classification tasks
    They are algorithm independent
    (a) Transform to binary classification: binary relevance (Godbole and Sarawagi, 2004), classifier chains (Read et al., 2011)
    (b) Transform to multiclass classification: label powerset (Boutell et al., 2004), RAKEL (Tsoumakas et al., 2010)
    (c) Identifying label dependencies: correlation-based pruning (Tsoumakas et al., 2009), LBPR algorithm (Tenenboim et al., 2010)

    Algorithm adaptation methods
    They extend specific learning algorithms in order to handle multi-label data directly
    Examples: classification trees (Clare and King, 2001), neural networks (Zhang and Zhou, 2006), k-nearest neighbors (Zhang and Zhou, 2007), support vector machines (Elisseeff and Weston, 2001), random forests (Madjarov et al., 2012), Bayesian networks (van der Gaag and de Waal, 2006)


    Mean accuracy and exact match for multi-label classification

    xi   C1 C2 C3 C4 C5   Ĉ1 Ĉ2 Ĉ3 Ĉ4 Ĉ5
    x1    1  0  1  0  0    1  0  0  1  0
    x2    0  1  0  1  0    0  1  0  1  0
    x3    1  0  0  1  0    1  0  0  1  0
    x4    0  1  1  0  0    0  1  0  0  0
    x5    1  0  0  0  0    1  0  0  1  0

    Mean accuracy computes the mean of the accuracies over each of the d class variables:

    Mean accuracy(φ) = (1/d) ∑_{j=1}^d (1/N) ∑_{i=1}^N I(ĉ_j^i = c_j^i)

    where I(true) = 1 and I(false) = 0. In the table: Mean accuracy(φ) = (1/5)(1 + 1 + 0.6 + 0.6 + 1) = 0.84

    Exact match computes the fraction of correctly classified instances. An instance is correctly classified if the binary vector containing the values of each binary class variable coincides with the binary vector containing their predictions:

    Exact match(φ) = (1/N) ∑_{i=1}^N I(ĉ_i = c_i)

    In the table: Exact match(φ) = (1/5)(0 + 1 + 1 + 0 + 0) = 0.4
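    A minimal sketch of both metrics, assuming 0/1 numpy arrays (the array literals simply transcribe the table above):

```python
import numpy as np

# True labels and predictions from the table (rows = instances, cols = class variables)
C_true = np.array([[1,0,1,0,0],[0,1,0,1,0],[1,0,0,1,0],[0,1,1,0,0],[1,0,0,0,0]])
C_pred = np.array([[1,0,0,1,0],[0,1,0,1,0],[1,0,0,1,0],[0,1,0,0,0],[1,0,0,1,0]])

def mean_accuracy(true, pred):
    # Average, over the d class variables, of each per-class accuracy
    return (true == pred).mean(axis=0).mean()

def exact_match(true, pred):
    # Fraction of instances whose whole label vector is predicted correctly
    return (true == pred).all(axis=1).mean()

print(mean_accuracy(C_true, C_pred))  # 0.84
print(exact_match(C_true, C_pred))    # 0.4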


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Binary relevance (Godbole and Sarawagi, 2004)

    X     C1 C2 C3 C4
    x(1)   1  0  0  1
    x(2)   0  0  1  1
    x(3)   1  0  0  1
    x(4)   0  1  0  0
    x(5)   1  1  1  0

    X     C1     X     C2     X     C3     X     C4
    x(1)   1     x(1)   0     x(1)   0     x(1)   1
    x(2)   0     x(2)   0     x(2)   1     x(2)   1
    x(3)   1     x(3)   0     x(3)   0     x(3)   1
    x(4)   0     x(4)   1     x(4)   0     x(4)   0
    x(5)   1     x(5)   1     x(5)   1     x(5)   0

    Learns one binary classifier for each label independently of the rest of labels

    Outputs the concatenation of their predictions

    Does not consider label relationships

    Any supervised classification method can be used, e.g., Bayesian network-based classifiers
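    A sketch of the binary relevance transformation, using scikit-learn's GaussianNB as a stand-in base classifier (any supervised method would do; the class name BinaryRelevance is ours):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

class BinaryRelevance:
    """Train one independent base classifier per label; concatenate predictions."""
    def __init__(self, base=GaussianNB):
        self.base, self.models = base, []

    def fit(self, X, Y):  # Y: (N, d) binary label matrix
        self.models = [self.base().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models])

# Toy data transcribed from the slides: five instances, four labels
X = np.array([[3.2,1.4,4.7,7.5,3.7],[2.8,6.3,1.6,4.7,2.7],[7.7,6.2,4.1,3.3,7.7],
              [9.2,0.4,2.8,0.5,3.9],[5.5,5.3,4.9,0.6,6.6]])
Y = np.array([[1,0,0,1],[0,0,1,1],[1,0,0,1],[0,1,0,0],[1,1,1,0]])
Y_hat = BinaryRelevance().fit(X, Y).predict(X)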


    Binary relevance with Bayesian network classifiers

    c* = arg max_c p(c|x) = arg max_c p(x, c)

    Naive Bayes

    Selective naive Bayes

    Semi-naive Bayes

    ODE

    TAN

    SPODE

    k-DB

    BAN

    Markov blanket-based

    Unrestricted

    Bayesian multinet

    [Figure: taxonomy of discrete Bayesian network classifiers according to the factorization of p(x, c), ranging from naive Bayes, p(c)∏_{i=1}^n p(xi|c), to unrestricted factorizations, p(c|pa(c))∏_{i=1}^n p(xi|pa(xi)) (Bielza and Larrañaga, 2014)]

    Decision boundary for discrete Bayesian network classifiers (Varando et al.,2015)


    Binary relevance with Bayesian network classifiers

    Naive Bayes (Minsky, 1961)

    p(c|x) ∝ p(c) ∏_{i=1}^n p(xi|c)

    A naive Bayes structure from which p(c|x) ∝ p(c)p(x1|c)p(x2|c)p(x3|c)p(x4|c)p(x5|c)


    Binary relevance with Bayesian network classifiers

    Selective naive Bayes (Langley and Sage, 1994)

    p(c|x) ∝ p(c|xF) ∝ p(c) ∏_{i∈F} p(xi|c)

    xF denotes the projection of x onto the selected feature subset F ⊆ {1, 2, ..., n}

    A selective naive Bayes structure from which p(c|x) ∝ p(c)p(x1|c)p(x2|c)p(x4|c). The variables in the shaded nodes have not been selected

    Feature subset selection: filter approaches (univariate and multivariate), based on information-theory measures such as mutual information, and wrapper approaches
    For multivariate filter and wrapper approaches, a heuristic search (best first, floating search, simulated annealing, tabu search, genetic algorithms, estimation of distribution algorithms) in the huge space of cardinality 2^n must be carried out
    See Saeys et al. (2007) for a review of feature subset selection


    Binary relevance with Bayesian network classifiers

    One-dependence estimators (ODEs). TAN (Friedman et al. 1997)

    One-dependence estimators generalize naive Bayes by allowing each predictor variable to depend on at most one other predictor in addition to the class. Tree-augmented naive Bayes (TAN):

    p(c|x) ∝ p(c)p(xr|c) ∏_{i=1, i≠r}^n p(xi|c, xj(i))

    where Xr denotes the root node and {Xj(i)} = Pa(Xi) \ {C}, for any i ≠ r

    (a) A TAN structure, whose root node is X3, from which p(c|x) ∝ p(c)p(x1|c, x2)p(x2|c, x3)p(x3|c)p(x4|c, x3)p(x5|c, x4);
    (b) Selective TAN (Blanco et al., 2005), from which p(c|x) ∝ p(c)p(x2|c, x3)p(x3|c)p(x4|c, x3)


    Binary relevance with Bayesian network classifiers

    One-dependence estimators (ODEs). TAN (Friedman et al. 1997)

    Algorithm 1: Learning a TAN structure
    Input: A data set D = {(x(1), c(1)), ..., (x(N), c(N))} with X = (X1, ..., Xn)
    Output: A TAN structure

    1 For i < j, i, j = 1, ..., n, compute
      MI(Xi, Xj|C) = ∑_{xi, xj, c} p(xi, xj, c) log [p(xi, xj|c) / (p(xi|c)p(xj|c))]
    2 Build a complete undirected graph whose nodes are X1, ..., Xn. Annotate the weight of the edge connecting Xi and Xj by MI(Xi, Xj|C)
    3 Build a maximum weighted spanning tree:
      3a Select the two edges with the heaviest weights
      3b While the tree contains fewer than n − 1 edges: select the next heaviest edge; if it does not form a cycle with the previously selected edges, add it; otherwise reject it and continue
    4 Transform the resulting undirected tree into a directed tree by choosing a root node and setting the direction of all edges to be outward from this node
    5 Construct the TAN structure by adding a node C and an arc from C to each Xi
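    A sketch of Algorithm 1 for discrete data, assuming networkx is available; cond_mutual_info and tan_structure are our own helper names, and a standard maximum weighted spanning tree stands in for steps 2-3:

```python
import itertools
import numpy as np
import networkx as nx

def cond_mutual_info(data, i, j, c):
    """MI(Xi, Xj | C) estimated from counts in a discrete data matrix."""
    N = len(data)
    mi = 0.0
    for xi, xj, cv in itertools.product(*(np.unique(data[:, k]) for k in (i, j, c))):
        n_ijc = np.sum((data[:, i] == xi) & (data[:, j] == xj) & (data[:, c] == cv))
        n_c = np.sum(data[:, c] == cv)
        n_ic = np.sum((data[:, i] == xi) & (data[:, c] == cv))
        n_jc = np.sum((data[:, j] == xj) & (data[:, c] == cv))
        if n_ijc > 0:  # p(xi,xj,c) log [p(xi,xj|c) / (p(xi|c) p(xj|c))]
            mi += (n_ijc / N) * np.log(n_ijc * n_c / (n_ic * n_jc))
    return mi

def tan_structure(data, n_features, class_col, root=0):
    G = nx.Graph()
    for i, j in itertools.combinations(range(n_features), 2):
        G.add_edge(i, j, weight=cond_mutual_info(data, i, j, class_col))
    tree = nx.maximum_spanning_tree(G)             # steps 2-3 of the algorithm
    arcs = list(nx.bfs_edges(tree, source=root))   # step 4: orient away from the root
    arcs += [("C", i) for i in range(n_features)]  # step 5: class arc to every feature
    return arcs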


    Binary relevance with Bayesian network classifiers

    One-dependence estimators (ODEs). TAN (Friedman et al. 1997)

    MI(X1,X3|C) > MI(X2,X4|C) > MI(X1,X2|C) > MI(X3,X4|C) > MI(X1,X4|C) > MI(X3,X5|C) > MI(X1,X5|C) > MI(X2,X3|C) > MI(X2,X5|C) > MI(X4,X5|C)

    An example of TAN structure construction. (a-c) Edges are added according to the conditional mutual information quantities, taken in decreasing order. (d-e) Edges X3 − X4 and X1 − X4 (dashed lines) cannot be added since they would form a cycle. (f) Maximum weighted spanning tree. (g) The corresponding directed tree, obtained by choosing X1 as the root node. (h) Final TAN structure.


    Binary relevance with Bayesian network classifiers

    Markov blanket-based Bayesian classifier (Koller and Sahami, 1996)

    If C has parents:

    p(c|x) ∝ p(c|pa(c)) ∏_{i=1}^n p(xi|pa(xi))

    The Markov blanket of C (its parents, children, and the parents of its children) is the only knowledge needed to predict its behavior

    Bayesian classifiers based on identifying the Markov blanket of the class variable

    A Markov blanket structure for C, MB(C) = {X1, X2, X3, X4}, from which p(c|x) ∝ p(c|x2)p(x1|c)p(x2)p(x3)p(x4|c, x3)


    Discrete Bayesian network classifiers

    c* = arg max_c p(c|x) = arg max_c p(x, c)

    [Figure: taxonomy of discrete Bayesian network classifiers according to the factorization of p(x, c) (Bielza and Larrañaga, 2014)]


    Classifier chain with Bayesian network classifiers

    Classifier chains (Read et al., 2011) overcome the label independence assumption by transforming the multi-label learning problem into a chain of binary classification problems

    Each binary classifier in the chain is built upon the (probabilistic) predictions of the preceding classifiers

    A classifier chain learns d functions φi on augmented input spaces, taking ĉ1, ..., ĉi−1 as additional features:

    φi: ΩX × {0, 1}^{i−1} → [0, 1]
        (x, ĉ1, ..., ĉi−1) ↦ p(ci|x, ĉ1, ..., ĉi−1)

    φi can be learnt using c1, ..., ci−1 instead of ĉ1, ..., ĉi−1. However, when applied to new instances, the classifier chain method necessarily uses the predictions ĉ1, ..., ĉi−1
    A total order among the class variables must be defined beforehand (the result depends on that order)
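    A sketch of the chaining idea under a fixed label order, again with GaussianNB as a placeholder base classifier:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

class ClassifierChain:
    """Fit labels in a fixed order; each classifier sees X plus earlier labels."""
    def __init__(self, base=GaussianNB):
        self.base, self.models = base, []

    def fit(self, X, Y):
        Z = X
        for j in range(Y.shape[1]):
            self.models.append(self.base().fit(Z, Y[:, j]))
            Z = np.column_stack([Z, Y[:, j]])  # training may use the true labels c_j
        return self

    def predict(self, X):
        Z, preds = X, []
        for m in self.models:
            c = m.predict(Z)
            preds.append(c)
            Z = np.column_stack([Z, c])        # prediction must chain the ĉ_j
        return np.column_stack(preds)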


    Classifier chain with Bayesian network classifiers

    Tree naive Bayesian classifier chain (TNBCC) (Sucar et al., 2014)

    Only one parent per class in a chain: a TAN structure for the class variables

    The root of the TAN can be selected in d different ways

    TNBCC uses naive Bayes as the baseline classifiers

    Structure of a TNBCC

    Several heuristics for selecting the root node: at random, the node with the largest number of incident edges, the node with the highest mutual information, ...

    A single TNBCC versus an ensemble of TNBCCs (the ensemble gives the best results)


    Label powerset. Boutell et al. (2004)

    X     C1 C2 C3 C4   C
    x(1)   1  0  0  1   10
    x(2)   0  0  1  1    4
    x(3)   1  0  0  1   10
    x(4)   0  1  0  0    5
    x(5)   1  1  1  0   15

    Each distinct set of labels becomes a different class in a new single-label classification task

    Most implementations of label powerset classifiers essentially ignore label combinations that are not present in the training set (they cannot predict unseen label sets)

    Limited training examples for many classes
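    A minimal sketch of the label powerset transformation (powerset_encode is our own helper name):

```python
import numpy as np

def powerset_encode(Y):
    """Map each distinct label vector to a single class id (and keep the mapping)."""
    keys = [tuple(row) for row in Y]
    classes = sorted(set(keys))
    to_id = {k: i for i, k in enumerate(classes)}
    return np.array([to_id[k] for k in keys]), classes

Y = np.array([[1,0,0,1],[0,0,1,1],[1,0,0,1],[0,1,0,0],[1,1,1,0]])
y, classes = powerset_encode(Y)   # rows 1 and 3 share the same class id
# A decoded prediction can only be one of the label sets seen during training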


    Label powerset. Multiple diagnosis problem with probabilistic models. Peng and Reggia (1987a; 1987b)

                   X1    ...  Xm     C1    ...  Cd
    (x(1), c(1))   x1(1) ...  xm(1)  c1(1) ...  cd(1)
    (x(2), c(2))   x1(2) ...  xm(2)  c1(2) ...  cd(2)
    ...            ...        ...    ...        ...
    (x(N), c(N))   x1(N) ...  xm(N)  c1(N) ...  cd(N)

    Optimal diagnosis as abductive inference: searching for the most probable explanation (MPE)

    (c1*, ..., cd*) = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd | X1 = x1, ..., Xm = xm)
                    = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd) p(X1 = x1, ..., Xm = xm | C1 = c1, ..., Cd = cd)

    Number of parameters to be estimated: 2^d − 1 + 2^d(2^m − 1)


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Bayesian networks (Pearl (1988) and Koller and Friedman (2009))

    A Bayesian network is a compact representation of the joint probability distribution (JPD) p(X1, ..., Xn)

    Bayesian networks achieve this compactness by using the concept of conditional independence between triplets of variables

    Two random variables X and Y are conditionally independent (c.i.) given another random variable Z if p(x|y, z) = p(x|z) for all x, y, z
    An equivalent definition is p(x, y|z) = p(x|z)p(y|z) for all x, y, z. Let Ip(X, Y|Z) denote this condition

    Suppose that for each Xi there is a subset Pa(Xi) ⊆ {X1, ..., Xi−1} such that, given Pa(Xi), Xi is conditionally independent of all variables in {X1, ..., Xi−1} \ Pa(Xi), i.e., p(Xi|X1, ..., Xi−1) = p(Xi|Pa(Xi)). Then the JPD factorizes as

    p(X1, ..., Xn) = p(X1)p(X2|X1)p(X3|X1, X2) · · · p(Xn|X1, ..., Xn−1) = p(X1|Pa(X1)) · · · p(Xn|Pa(Xn))
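    A minimal numeric illustration of the factorization, with hypothetical CPTs over three binary variables where Pa(X2) = Pa(X3) = {X1}:

```python
# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x1): the chain rule with parent sets
p_x1 = {0: 0.4, 1: 0.6}                                    # made-up numbers
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}
p_x3_given_x1 = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1}

def joint(x1, x2, x3):
    # Product of one local conditional distribution per variable
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x1[(x3, x1)]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12   # a valid JPD sums to 1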


    Bayesian networks. Structure and parameters. Example: Factory production

    Years (Y) is the factory's age, where y denotes 'more than 10 years' and ¬y denotes 'less than 10 years'
    Employees (E) represents the number of employees: more (e) or less (¬e) than 100 employees
    Machines (M) has two values: m and ¬m, for 'more than 20 machines' and 'less than 20 machines'
    Pieces (P) also has two options: more (p) or less (¬p) than 10,000 produced pieces per year
    Failures (F) includes f, which stands for 'more than two failures on average per month'; otherwise the state is ¬f

    [Figure: the network structure Y → E, Y → M, E → P, M → P, M → F with its conditional probability tables p(Y), p(E|Y), p(M|Y), p(P|E, M), p(F|M)]

    Hypothetical Bayesian network modeling factory production. To fully specify the JPD, 2^5 − 1 = 31 parameters are needed; the Bayesian network representation requires 17 input conditional probabilities

    The Bayesian network factorizes the JPD as p(Y, E, M, P, F) = p(Y)p(E|Y)p(M|Y)p(P|E, M)p(F|M)


    Bayesian networks. Markov condition

    Markov condition or local directed Markov property

    The descendant nodes of Xi are the nodes reachable from Xi by repeatedly following the arcs; following the arcs in the opposite direction, we find the ancestors
    Let ND(Xi) denote the non-descendant nodes of Xi

    In a Bayesian network, each node is conditionally independent of its non-descendants given its parents: Ip(Xi, ND(Xi)|Pa(Xi))

    [Figure: the factory production Bayesian network and its conditional probability tables, as above]

    All nodes are descendants of Y. The descendants of M are P and F. The Markov condition for M states that E and M are c.i. given Y

    All nodes are non-descendants of P, and hence P and {Y, F} are c.i. given {E, M}


    Bayesian networks. u-separation

    u-separation (Lauritzen et al., 1990): a graphical criterion for finding additional conditional independences, apart from those given by the Markov condition

    If X is u-separated from Y given Z, then X and Y are c.i. given Z, for any disjoint random vectors X, Y, Z

    Checking whether X and Y are u-separated by Z is a three-step procedure (see the sketch below):

    1 Get the smallest subgraph containing X, Y and Z and their ancestors. This is called the ancestral graph
    2 Moralize the ancestral graph, i.e., add an undirected link between parents having a common child, and then drop the directions of all arcs
    3 Z u-separates X and Y whenever every path between X and Y contains a node of Z
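    A sketch of the three-step check on top of networkx (assuming its ancestors and moral_graph helpers; u_separated is our own name):

```python
import networkx as nx

def u_separated(dag, X, Y, Z):
    """Check u-separation of node sets X and Y given Z in a DiGraph dag."""
    anc = set(X) | set(Y) | set(Z)
    for v in list(anc):
        anc |= nx.ancestors(dag, v)           # step 1: ancestral subgraph
    moral = nx.moral_graph(dag.subgraph(anc)) # step 2: moralize, drop directions
    moral.remove_nodes_from(Z)                # step 3: does Z block every X-Y path?
    return all(not nx.has_path(moral, x, y)
               for x in X for y in Y if x in moral and y in moral)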

    [Figure: moralized ancestral graph of the factory production Bayesian network]

    Let us check whether P and F are u-separated by {E, M}. Since E or M is always found in every path (in the moralized ancestral graph) from P to F, P and F are u-separated by {E, M}. Hence P and F are c.i. given {E, M}
    However, Y and F are not u-separated by P, because we can go from F to Y through M without crossing P


    Bayesian networks. Types of inference

    Probabilistic reasoning: p(Xi|e), where E = e is the observed evidence
    Abductive inference finds the values of a set of variables that best explain the observed evidence
        Total abduction: arg max_u p(u|e), i.e., we find the most probable explanation (MPE)
        Partial abduction solves the same problem for a subset of variables in u (the explanation set), referred to as the partial maximum a posteriori (MAP)

    Predictive reasoning: we predict the effect (produced pieces) given a cause (machines), p(p|m) = 0.62 (in the figure) and p(p|¬m) = 0.30 (not shown)

    [Figure: inference on the factory production example. (a) Prior distributions p(Xi). (b) p(Xi|m)]

    Diagnostic reasoning: we diagnose the causes given the effects. Given the effect F = f, the probability of the cause being many machines is p(m|f) = 0.86
    Intercausal reasoning: if we know that the factory has many employees (E = e), this would explain the observed high number of produced pieces p and would lower the probability of Machines = m being the cause: p(m|p, e) = 0.32 < p(m|p) = 0.45


    Bayesian networks. Inference methods

    Exact inference
        Brute-force approach:

        p(P) = ∑_{Y,E,M,F} p(Y, E, M, F, P) = ∑_{Y,E,M,F} p(Y)p(E|Y)p(M|Y)p(P|E, M)p(F|M)

        Variable elimination algorithm (Zhang and Poole, 1994), which pushes each sum inwards past the factors it does not involve (see the sketch after this list):

        p(P) = ∑_Y p(Y) ∑_E p(E|Y) ∑_M p(M|Y)p(P|E, M) ∑_F p(F|M)

        Junction tree algorithm (Lauritzen and Spiegelhalter, 1988), based on message passing on a junction tree

    Approximate inference
        Probabilistic logic sampling (Henrion, 1988)
        Likelihood weighting (Shachter and Peot, 1989)
        Gibbs sampling (Pearl, 1987)
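    A sketch contrasting the brute-force sum with variable elimination on the factory network, using randomly generated, purely illustrative CPTs:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
def cpt(*shape):  # random conditional table, normalized over the first axis
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

pY, pE_Y, pM_Y, pF_M = cpt(2), cpt(2, 2), cpt(2, 2), cpt(2, 2)
pP_EM = cpt(2, 2, 2)  # indexed [p, e, m]

# Brute force: sum the full joint over Y, E, M, F
brute = np.zeros(2)
for y, e, m, f, p in itertools.product(range(2), repeat=5):
    brute[p] += pY[y] * pE_Y[e, y] * pM_Y[m, y] * pP_EM[p, e, m] * pF_M[f, m]

# Variable elimination: push each sum past the factors it does not touch
phiF = pF_M.sum(axis=0)                                 # sum_F p(F|M)
phiM = np.einsum('my,pem,m->pey', pM_Y, pP_EM, phiF)    # sum_M
phiE = np.einsum('ey,pey->py', pE_Y, phiM)              # sum_E
pP = np.einsum('y,py->p', pY, phiE)                     # sum_Y
assert np.allclose(brute, pP)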


    Bayesian networks. Learning parameters

    For parameter learning, we need to have the structure

    Ri = |Ω_Xi|
    qi = |Ω_Pa(Xi)|, i.e., Ω_Pa(Xi) = {pa_i^1, ..., pa_i^qi}

    The parameter θijk = p(Xi = k | Pa(Xi) = pa_i^j) is the conditional probability that Xi takes its k-th value given that its parents take their j-th value

    The θijk parameters are organized into θ = (θ1, ..., θn), with a total of ∑_{i=1}^n Ri qi components, and are estimated from D = {x(1), ..., x(N)}

    Nijk is the number of cases in D where Xi = k and Pa(Xi) = pa_i^j are observed at the same time

    Nij is the number of cases in D in which Pa(Xi) = pa_i^j is observed (Nij = ∑_{k=1}^{Ri} Nijk)


    Bayesian networks. Learning parameters

    Maximum likelihood estimation finds the θ̂^ML that maximizes the likelihood of the data set given the model:

    θ̂^ML = arg max_θ L(θ|D, G) = arg max_θ p(D|G, θ) = arg max_θ ∏_{h=1}^N p(x(h)|G, θ)

    Under the global parameter independence and local parameter independence assumptions (Spiegelhalter and Lauritzen, 1990):

    L(θ|D, G) = ∏_{i=1}^n ∏_{j=1}^{qi} ∏_{k=1}^{Ri} θijk^Nijk   and   θ̂ijk^ML = Nijk / Nij

    With sparse data sets the Laplace estimator is used: θ̂ijk^Lap = (Nijk + 1) / (Nij + Ri)

    If D has incomplete instances (with missing values), the estimates can be computed with the EM algorithm (Lauritzen, 1995), or even with the structural EM (Friedman, 1998), where not only parameters but also structures can be updated at each EM iteration
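    A sketch of the ML and Laplace estimators from counts (estimate_cpt is our own helper; the values of Xi are assumed coded as 0, ..., Ri − 1):

```python
import numpy as np

def estimate_cpt(data, i, parents, R_i, laplace=True):
    """theta_ijk = p(X_i = k | Pa(X_i) = pa_j), ML or Laplace, from a data matrix."""
    theta = {}
    # Parent configurations observed in the data (a single empty one if no parents)
    configs = {tuple(row[parents]) for row in data} if parents else {()}
    for pa in configs:
        mask = (np.all(data[:, parents] == pa, axis=1)
                if parents else np.ones(len(data), dtype=bool))
        N_ij = int(mask.sum())
        for k in range(R_i):
            N_ijk = int(np.sum(data[mask, i] == k))  # joint counts under this config
            theta[pa, k] = ((N_ijk + 1) / (N_ij + R_i) if laplace
                            else N_ijk / N_ij)
    return theta

# e.g. estimate_cpt(data, i=2, parents=[0, 1], R_i=2) gives the CPT of X3 | X1, X2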


    Bayesian networks. Learning parameters

    (a) A Bayesian network structure with arcs X1 → X3, X2 → X3 and X3 → X4, with |Ω_Xi| = 2 for i = 1, 3, 4, |Ω_X2| = 3, and q1 = q2 = 0, q3 = 6, q4 = 2. (b) A data set with N = 6 for {X1, ..., X4}, from which the structure in (a) has been learned:

    X1 X2 X3 X4
     1  2  2  1
     1  1  2  1
     2  1  2  1
     2  1  2  2
     1  2  2  1
     1  3  1  1

    Parameters                           Meaning
    θ1 = (θ1−1, θ1−2)                    (p(X1 = 1), p(X1 = 2))
    θ2 = (θ2−1, θ2−2, θ2−3)              (p(X2 = 1), p(X2 = 2), p(X2 = 3))
    θ3 = (θ311, θ312, ..., θ361, θ362)   (p(X3 = 1|X1 = 1, X2 = 1), p(X3 = 2|X1 = 1, X2 = 1), ..., p(X3 = 1|X1 = 2, X2 = 3), p(X3 = 2|X1 = 2, X2 = 3))
    θ4 = (θ411, θ412, θ421, θ422)        (p(X4 = 1|X3 = 1), p(X4 = 2|X3 = 1), p(X4 = 1|X3 = 2), p(X4 = 2|X3 = 2))

    To estimate θ1−1 = p(X1 = 1), we find four out of six instances with X1 = 1 in the X1 column, and hence θ̂1−1^ML = 4/6 = 2/3

    To estimate θ321 = p(X3 = 1|X1 = 1, X2 = 2), we find that neither of the two instances with X1 = 1, X2 = 2 includes X3 = 1, so θ̂321^ML = 0 (this is a case where Nijk = 0)

    To estimate θ361 = p(X3 = 1|X1 = 2, X2 = 3), we find that θ̂361^ML is undefined, since there are no instances with X1 = 2, X2 = 3 (i.e., Nij = 0). The Laplace estimates yield θ̂321^Lap = 1/4, θ̂361^Lap = 1/2


    Bayesian networks. Learning structures. Constraint-based methods

    Constraint-based methods use the data to statistically test conditionalindependences among triplets of variables

    The goal is to build a DAG that represents a large percentage (and wheneverpossible all) of the identified conditional independence constraints

    The PC algorithm (Spirtes and Glymour, 1991) starts with all nodes connectedby edges and follows three steps:

    1 Step 1 outputs the adjacencies in the graph, i.e., the skeleton of thelearned structure

    2 Step 2 identifies colliders (a collider or converging connection at node X isY → X ← Z )

    3 Step 3 orients the remaining edges and outputs the completed partially directed acyclic graph (CPDAG), which represents the Markov equivalence class of DAGs


    Bayesian networks. Learning structures. Constraint-based methods

    Algorithm 2: Step 1 of the PC algorithm: estimation of the skeleton
    Input: A complete undirected graph and an ordering σ on the variables {X1, ..., Xn}
    Output: Skeleton G of the learned structure

    1 Form the complete undirected graph G on nodes {X1, ..., Xn}
    2 t = −1
    3 Repeat
        t = t + 1
        Repeat
    4     Select a pair of adjacent nodes Xi − Xj in G using ordering σ
    5     Find S ⊆ Adj(Xi) \ {Xj} in G with |S| = t (if any) using ordering σ
    6     Remove edge Xi − Xj from G iff Xi and Xj are c.i. given S
        until all ordered pairs of adjacent nodes have been tested
      until all adjacent pairs Xi − Xj in G satisfy |Adj(Xi) \ {Xj}| ≤ t


    Bayesian networks. Score and search methods

    [Figure: methods for Bayesian network structure learning based on score and search]
    Search spaces: DAGs, equivalence classes, orderings
    Scores: penalized likelihood (AIC, BIC), Bayesian (BD, K2, BDe, BDeu)
    Search: approximate (greedy, simulated annealing, EDAs, genetic algorithms, MCMC), exact (dynamic programming, branch & bound, mathematical programming)


    Bayesian networks. Score and search methods

    Finding a network structure that optimizes a score is NP-hard (Chickering, 1996)

    Different search spaces:
    Space of DAGs, whose cardinality is given by Robinson's (1977) recurrence

    f(n) = ∑_{i=1}^n (−1)^{i+1} C(n, i) 2^{i(n−i)} f(n − i), for n > 2

    with f(0) = f(1) = 1
    Space of Markov equivalence classes, smaller than the space of DAGs: the #DAGs/#CPDAGs ratio approaches an asymptote of about 3.7 (Gillispie and Perlman, 2002)
    Space of orderings of the variables: some learning algorithms (e.g., the K2 algorithm) only work with a fixed order. Cardinality: n!
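    Robinson's recurrence is easy to evaluate directly; a minimal sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(None)
def num_dags(n):
    """Robinson's recurrence for the number of DAGs on n labelled nodes."""
    if n <= 1:
        return 1
    return sum((-1)**(i + 1) * comb(n, i) * 2**(i * (n - i)) * num_dags(n - i)
               for i in range(1, n + 1))

# num_dags(2) == 3, num_dags(3) == 25, num_dags(4) == 543: the space explodes fast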


    Bayesian networks. Score and search methods. Scores: the need for penalized likelihood scores

    The estimated log-likelihood of the data given the Bayesian network:

    log L(θ̂|D, G) = log p(D|G, θ̂) = log ∏_{i=1}^n ∏_{j=1}^{qi} ∏_{k=1}^{Ri} θ̂ijk^Nijk = ∑_{i=1}^n ∑_{j=1}^{qi} ∑_{k=1}^{Ri} Nijk log θ̂ijk

    where θ̂ijk = θ̂ijk^ML = Nijk/Nij, the frequency counts in D
    This score increases monotonically with the complexity of the model
    The optimal structure would be the complete graph

    [Figure: likelihood versus DAG complexity, for training data and test data, over increasingly dense structures on X1, ..., X4]
    Structural overfitting: the likelihood of the training data is higher for denser graphs, but it degrades for the test data


    Bayesian networks. Score and search methods. Scores

    Penalized likelihood scores

    General expression of a family of penalized log-likelihood scores:

    QPen(D, G) = ∑_{i=1}^n ∑_{j=1}^{qi} ∑_{k=1}^{Ri} Nijk log (Nijk/Nij) − dim(G) pen(N)

    where dim(G) = ∑_{i=1}^n (Ri − 1)qi denotes the model dimension (number of parameters needed in the Bayesian network), and pen(N) is a non-negative penalization function

    Depending on pen(N):
    If pen(N) = 1, the score is Akaike's information criterion (AIC) (Akaike, 1974)
    If pen(N) = (1/2) log N, the score is the Bayesian information criterion (BIC) (Schwarz, 1978)
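    A sketch of the penalized log-likelihood score from the Nijk counts (the counts layout is our own choice):

```python
import numpy as np

def penalized_score(counts, pen):
    """Q_Pen(D, G) = sum_ijk N_ijk log(N_ijk / N_ij) - dim(G) * pen.

    counts: one (q_i, R_i) array of N_ijk per node X_i."""
    ll, dim = 0.0, 0
    for N in counts:
        N_ij = N.sum(axis=1, keepdims=True)   # N_ij = sum_k N_ijk
        mask = N > 0                           # 0 log 0 is taken as 0
        ll += float((N[mask] * np.log((N / np.maximum(N_ij, 1))[mask])).sum())
        q_i, R_i = N.shape
        dim += (R_i - 1) * q_i                 # model dimension dim(G)
    return ll - dim * pen

# AIC: pen = 1;  BIC: pen = 0.5 * np.log(N_instances)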


    Bayesian networks. Score and search methods. Scores

    Bayesian scores

    The Bayesian approach to structure learning aims to find the G that maximizes its posterior probability given the data, i.e., find arg max_G p(G|D)
    Using Bayes' formula: p(G|D) ∝ p(D, G) = p(D|G)p(G)
    The second factor, p(G), is the prior distribution over structures
    The first factor, p(D|G), is the marginal likelihood of the data, defined as

    p(D|G) = ∫ p(D|G, θ) f(θ|G) dθ

    where p(D|G, θ) is the likelihood of the data given the Bayesian network (structure G and parameters θ), and f(θ|G) is the prior distribution over the parameters


    Bayesian networks. Score and search methods. Scores: Bayesian scores

    Depending on f(θ|G), we have different scores. If f(θij|G) follows a Dirichlet distribution with parameters αij1, ..., αijRi, we have the Bayesian Dirichlet score (BD score). A Dirichlet distribution is determined by hyperparameters αijk for all i, j, k

    The K2 score (Cooper and Herskovits, 1992) uses the uninformative assignment αijk = 1 for all i, j, k, resulting in

    QK2(D, G) = p(G) ∏_{i=1}^n ∏_{j=1}^{qi} [(Ri − 1)! / (Nij + Ri − 1)!] ∏_{k=1}^{Ri} Nijk!

    The K2 algorithm uses a greedy search method and the K2 score. The user gives a node ordering and the maximum number of parents that any node is permitted to have. Starting with an empty structure, the algorithm incrementally adds, from the set of nodes that precede each node Xi in the node ordering, the parent whose addition most increases the function

    g(Xi, Pa(Xi)) = ∏_{j=1}^{qi} [(Ri − 1)! / (Nij + Ri − 1)!] ∏_{k=1}^{Ri} Nijk!

    When the score does not increase further with the addition of a single parent, no more parents are added to node Xi, and we move on to the next node in the ordering

    The likelihood-equivalent Bayesian Dirichlet score (BDe score) (Heckerman et al., 1995) sets the hyperparameters as αijk = α p(Xi = k, Pa(Xi) = pa_i^j | G). The equivalent sample size α expresses the user's confidence in the prior network
    In the BDeu score (Buntine, 1991), αijk = α / (qi Ri)
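    A sketch of the (log of the) K2 local function g, computed with log-gamma to avoid factorial overflow:

```python
from math import lgamma

def log_g(counts):
    """log of K2's g(X_i, Pa(X_i)); counts[j][k] = N_ijk for parent config j."""
    total = 0.0
    for row in counts:                               # one parent configuration per row
        N_ij, R_i = sum(row), len(row)
        total += lgamma(R_i) - lgamma(N_ij + R_i)    # log (R_i-1)!/(N_ij+R_i-1)!
        total += sum(lgamma(n + 1) for n in row)     # log prod_k N_ijk!
    return total

# The K2 algorithm greedily adds, among the admissible predecessors of X_i,
# the parent whose addition most increases log_g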


    Dynamic, temporal and continuous time Bayesian networks

    Dynamic Bayesian networks (Dean and Kanazawa, 1989)

    [Figure: a dynamic Bayesian network structure with four variables X1, X2, X3 and X4 and three time slices (T = 3): (a) prior Bayesian network; (b) transition network, under the first-order Markovian transition assumption; (c) dynamic Bayesian network unfolded in time for three time slices]

    Temporal nodes Bayesian networks (Galán et al., 2007)

    Continuous time Bayesian networks (Nodelman et al., 2002)


    Bayesian networks. Software

    HUGIN¹, GeNIe², OpenMarkov³, gRain⁴, bnlearn⁵

    Exact inference (junction tree), approximate inference (probabilistic logic sampling), constraint-based learning (the PC algorithm) and score-and-search learning (the K2 algorithm) are each available in several of these packages

    1 https://www.hugin.com/
    2 https://www.bayesfusion.com/
    3 http://www.openmarkov.org/
    4 https://CRAN.R-project.org/package=gRain
    5 http://www.bnlearn.com/


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Multiple diagnosis problem. Direct approach

                   X1    ...  Xm     C1    ...  Cd
    (x(1), c(1))   x1(1) ...  xm(1)  c1(1) ...  cd(1)
    (x(2), c(2))   x1(2) ...  xm(2)  c1(2) ...  cd(2)
    ...            ...        ...    ...        ...
    (x(N), c(N))   x1(N) ...  xm(N)  c1(N) ...  cd(N)

    Optimal diagnosis as abductive inference: searching for the most probable explanation (MPE)

    (c1*, ..., cd*) = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd | X1 = x1, ..., Xm = xm)
                    = arg max_{(c1,...,cd)} p(C1 = c1, ..., Cd = cd) p(X1 = x1, ..., Xm = xm | C1 = c1, ..., Cd = cd)

    Number of parameters to be estimated for the case of binary predictors and classes: 2^d − 1 + 2^d(2^m − 1)


    Multiple diagnosis with multi-dimensional Bayesian network classifiers (MBCs) (van der Gaag and de Waal, 2006)

    In a multi-dimensional Bayesian network classifier (MBC), the set of vertices V is partitioned into:

    VC = {C1, ..., Cd} of class variables and
    VX = {X1, ..., Xm} of feature variables

    with d + m = n

    Three subgraphs in the structure of a multi-dimensional Bayesian network classifier:

    Class subgraph: GC
    Bridge subgraph: GCX
    Feature subgraph: GX


    Examples of MBC structures

    (a) Empty-empty MBC

    (b) Tree-tree MBC

    (c) Polytree-DAG MBC


    Tractability of MPE in class-bridge decomposable MBCs (Bielza et al., 2011)

    MPE is generally NP-hard in Bayesian networks (Kwisthout, 2011)

    An MBC is class-bridge decomposable (a CB-decomposable MBC) (Bielza et al., 2011) if:

    1 GC ∪ GCX can be decomposed as GC ∪ GCX = ∪_{i=1}^r (GC_i ∪ G(CX)_i), where the GC_i ∪ G(CX)_i, i = 1, ..., r, are its maximal connected components
    2 Non-shared children: Ch(VC_i) ∩ Ch(VC_j) = ∅ for i, j = 1, ..., r, i ≠ j, where Ch(VC_i) denotes the children of all the variables in VC_i

    (a) A CB-decomposable MBC

    (b) Its two maximal connected components


    Tractability of MPE in class-bridge decomposable MBCs (Bielza et al., 2011)

    Theorem (Bielza et al., 2011)
    Given a CB-decomposable MBC where Ii = ∏_{C∈VC_i} ΩC denotes the sample space associated with VC_i, then

    max_{c1,...,cd} p(C1 = c1, ..., Cd = cd | X1 = x1, ..., Xm = xm)
    ∝ ∏_{i=1}^r max_{c↓VC_i ∈ Ii} ∏_{C∈VC_i} p(c|pa(c)) ∏_{X∈Ch(VC_i)} p(x|paVC(x), paVX(x))

    where c↓VC_i represents the projection of vector c onto the coordinates found in VC_i

    Example:

    max_{c1,...,c5} p(C1 = c1, ..., C5 = c5 | X1 = x1, ..., X6 = x6)
    ∝ max_c p(c1)p(c2)p(c3|c2)p(c4)p(c5|c4) p(x1|c1)p(x2|c1, c2, x1)p(x3|c3)p(x4|c3, x3, x5, x6)p(x5|c4, c5, x6)p(x6|c5)
    = max_{c1,c2,c3} p(c1)p(c2)p(c3|c2)p(x1|c1)p(x2|c1, c2, x1)p(x3|c3)p(x4|c3, x3, x5, x6) · max_{c4,c5} p(c4)p(c5|c4)p(x5|c4, c5, x6)p(x6|c5)

    Here VC1 = {C1, C2, C3}, VC2 = {C4, C5}, Ch(VC1) = {X1, X2, X3, X4} and Ch(VC2) = {X5, X6}. Note that Ch(VC1) ∩ Ch(VC2) = ∅, as required by the theorem


    Tractability of MPE in MBCs according to the treewidth

    Exact methods for MPE computation in Bayesian networks are exponential in the treewidth of G

    The treewidth is the size of the largest clique in the triangulated graph (a graph in which all cycles of four or more nodes have a chord)

    Several results bound the treewidth in MBCs:
    1 For MBCs with an empty feature subgraph (Pastink and van der Gaag, 2015): treewidth(G) < treewidth(G′), where G′ is the pruned graph (first moralize G and then remove the feature nodes from the moral graph)
    2 For general MBCs (DAG-DAG MBCs) (de Waal and van der Gaag, 2007): treewidth(G) ≤ treewidth(GX) + d
    3 For CB-decomposable MBCs (Kwisthout, 2011): treewidth(G) ≤ treewidth(GX) + |dmax|, where |dmax| is the number of class variables of the component with the maximum number of class variables
    4 For general MBCs (DAG-DAG MBCs), Benjumeda et al. (2018) present under which circumstances MPE computation can be done in polynomial time


    Learning MBCs from data. Cardinality of the search space

    Theorem (Bielza et al., 2011)
    The number of all possible MBC structures with d class variables and m feature variables, MBC(d, m), is

    MBC(d, m) = S(d) · 2^{dm} · S(m)

    where S(m) = ∑_{i=1}^m (−1)^{i+1} C(m, i) 2^{i(m−i)} S(m − i) is Robinson's formula counting the number of possible DAG structures on m nodes, initialized as S(0) = S(1) = 1

    Theorem (Bielza et al., 2011)
    The number of all possible bridge subgraphs, BRS(d, m), m ≥ d, for MBCs satisfying the two conditions (a) for each Xi ∈ VX there is a Cj ∈ VC with (Cj, Xi) ∈ ACX, and (b) for each Cj ∈ VC there is an Xi ∈ VX with (Cj, Xi) ∈ ACX, is given by the recursive formula

    BRS(d, m) = 2^{dm} − ∑_{k=0}^{m−1} C(dm, k) − ∑_{k=m}^{dm} ∑_{x≤d, y≤m, k≤xy≤dm−d} C(d, x) C(m, y) BRS(x, y, k)

    where BRS(x, y, k) denotes the number of bridge subgraphs with k arcs in an MBC with x class variables and y feature variables, initialized as BRS(1, 1, 1) = BRS(1, 2, 2) = BRS(2, 1, 2) = 1


    Learning MBCs from data. Empty-empty MBC

    Empty-empty MBC learned: multi-label naive Bayes (MLNB) (Zhang et al., 2009)

    A two-stage filter-wrapper feature selection strategy is incorporated:
    First stage (filter): feature extraction techniques based on principal component analysis (PCA)
    Second stage (wrapper): subset selection techniques based on a genetic algorithm (GA) (on the space of principal components) are used to choose the most appropriate subset of features for classification

    For continuous features, a Gaussian assumption is made: the density of the feature variables given the class values follows a Gaussian distribution


    Learning MBCs from data. Tree-tree MBC

    Tree-tree MBC learned with a three-step algorithm (van der Gaag and de Waal, 2006)

    1 The class subgraph is learnt by searching for the maximum weighted undirected spanning tree and transforming it into a directed tree using Chow and Liu's (1968) algorithm
      The weight of an edge is the mutual information between a pair of class variables
    2 For a fixed bridge subgraph, the feature subgraph is then learnt by building a maximum weighted directed spanning tree
      The weight of an arc is the conditional mutual information between a pair of feature variables, given the parents (classes) of the second feature as determined by the bridge subgraph
    3 The bridge subgraph is greedily changed in a wrapper-like way, trying to improve the considered metric (e.g., exact match)


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    Score: penalized likelihood or Bayesian score
    Greedy search in five steps:

    1 Learn the class subgraph
    2 Learn the feature subgraph
    3 Propose a candidate bridge subgraph
    4 Obtain a candidate feature subgraph
    5 Decide on the bridge and feature subgraph candidates


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    1. Learn the class subgraph


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    2. Learn the feature subgraph and
    3. Propose a candidate bridge subgraph


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    2. Learn the feature subgraph and
    3. Propose a candidate bridge subgraph


    Learning MBCs from data. DAG-DAG MBC. Filter

    DAG-DAG MBC (Bielza et al., 2011)

    4. Obtain a candidate feature subgraph and
    5. Decide on the bridge and feature subgraph candidates


    Learning MBCs from data. DAG-DAG MBC. Wrapper

    DAG-DAG MBC (Bielza et al., 2011)

    i = 0
    1. G(i) = ∅; Acc = Acc(i)
    2. While there are arcs that can be added to G(i) (and not previously discarded): add one arc to GC(i), GCX(i) or GX(i) and obtain the new G(i+1) and Acc(i+1)
    3. If Acc(i+1) > Acc(i): Acc = Acc(i+1), i = i + 1, and go to 2; else discard the arc and go to 2
    4. Stop and return G(i) and Acc (see the sketch below)
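    A sketch of this wrapper loop, where accuracy is a caller-supplied function that re-trains the MBC for a given arc set and returns its (e.g., exact match) accuracy on validation data; all names here are ours:

```python
def greedy_wrapper(candidate_arcs, accuracy):
    """Greedy arc addition over G_C, G_CX and G_X, guided by wrapper accuracy."""
    G, best = set(), accuracy(set())
    remaining = set(candidate_arcs)
    improved = True
    while improved:
        improved = False
        for arc in sorted(remaining):      # try each not-yet-discarded arc
            score = accuracy(G | {arc})    # re-train the MBC and evaluate it
            remaining.discard(arc)         # each arc is considered at most once
            if score > best:
                G, best = G | {arc}, score
                improved = True
    return G, best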


    Learning CB-decomposable MBCs from data (Borchani et al., 2010)

    Phase I. Learn bridge subgraph

    [Figure: a selective naive Bayes per class variable C1, ..., C4 over the features X1, ..., X6]

    Starting from an empty graphical structure, learn a selective naive Bayes for each class variable

    [Figure: the induced bridge subgraph after removing shared children]

    Check the non-shared children property to induce an initial CB-decomposable MBC (remove all common children based on two criteria: feature insertion rank and accuracy)
    The result of this phase is a simple CB-decomposable MBC where only the bridge subgraph is defined (the class and feature subgraphs are empty)


    Learning CB-decomposable MBCs from data (Borchani et al., 2010)

    Phase II. Learn feature subgraph

    [Figure: the CB-decomposable MBC after adding arcs between feature variables]

    Introduce the dependence relationships between the feature variables in the feature subgraph

    Fix the maximum number of iterations

    In each iteration, an arc between a pair of feature variables is selected at random. If there is an accuracy improvement, the arc is added; otherwise it is discarded


    Learning CB-decomposable MBCs from data (Borchani et al., 2010)

    Phase III. Merge maximal connected components

    [Figure: merging the maximal connected components; bridge subgraph update]

    All possible arc additions between the class variables are evaluated, adding the arc that improves the accuracy the most (going from r to r − 1 maximal connected components)
    A bridge update step is performed inside the newly induced maximal connected component

    [Figure: feature subgraph update]

    Update the feature subgraph by inserting, one by one, additional arcs between feature variables

    This phase iterates over these three steps (stopping criteria: no further component merging improves the accuracy, or r = 1)


    Learning MBCs by a constraint-based approach (Borchani et al., 2013)

    Markov blanket MBC (MB-MBC) learning algorithm

    Apply the HITON algorithm (Aliferis et al., 2003) to each class variable to determine its Markov blanket

    Given the MBC definition, the direct parents of any class variable Ci, i = 1, ..., d, can only be among the remaining class variables, whereas the direct children or spouses of Ci can include either class or feature variables

    The MBC subgraphs are built from the results of the HITON algorithm:

    Class subgraph: first insert an edge between each class variable Ci and any class variable belonging to its corresponding parents-children set PC(Ci); then direct all these edges using the PC algorithm's edge orientation rules
    Bridge subgraph: built by inserting an arc from each class variable Ci to every feature variable belonging to PC(Ci)
    Feature subgraph: for every feature X in the set MB(Ci) \ PC(Ci), i.e., for every spouse X, insert an arc from X to the corresponding common child given by PC(X) ∩ PC(Ci)


    Multi-dimensional Bayesian network classifier trees (Gil-Begue et al., 2018)

    A multi-dimensional Bayesian network classifier tree (MBCTree) is a classification tree with MBCs in the leaves

    An internal node of an MBCTree corresponds to a feature variable Xi, as in standard classification trees, and has a labelled branch to a child for each of its possible values
    A leaf of an MBCTree is an MBC over all the class variables and those features not present in the path from the root to the leaf

    Wrapper approach (greedy search) guided by the exact match accuracy for learningMBCTrees


    Detecting multi-dimensional concept drift in data streams (Borchani et al., 2016)

    Concept drift detection is usually based on updating an ensemble
    Here a single MBC is used, and the drift detection method operates locally
    The average local log-likelihood of each variable Xi in the MBC network at time step s:

    ll_i^s = (1/N^s) ∑_{j=1}^{qi} ∑_{k=1}^{ri} N_ijk^s log (N_ijk^s / N_ij^s)

    Change point detection with the Page-Hinkley test (see the sketch below):

    CUM^s = ∑_{t=1}^s (LL^t − mean_LL^t − δ), where mean_LL^t = (1/t) ∑_{h=1}^t LL^h denotes the mean of the LL^1, ..., LL^t values, and δ is a tolerance parameter

    MAX^s = max{CUM^t, t = 1, ..., s}

    The Page-Hinkley test value: PH^s = MAX^s − CUM^s. If PH^s > λ, the null hypothesis is rejected and the Page-Hinkley test signals a change; otherwise, no change is signaled
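    A sketch of the test on a stream of LL values (the delta and lam defaults are arbitrary placeholders):

```python
def page_hinkley(ll_stream, delta=0.005, lam=50.0):
    """Return the time steps s at which PH^s = MAX^s - CUM^s exceeds lambda."""
    cum, mx, mean, alarms = 0.0, 0.0, 0.0, []
    for s, ll in enumerate(ll_stream, start=1):
        mean += (ll - mean) / s        # running mean of LL^1, ..., LL^s
        cum += ll - mean - delta       # CUM^s
        mx = max(mx, cum)              # MAX^s
        if mx - cum > lam:             # PH^s > lambda: signal a change
            alarms.append(s)
    return alarms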

    Evolution of the average local log-likelihood values of four different class variables

    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    [Figure: a cell with HIV-1; reverse transcription]


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    Therapies for HIV-1 are combinations, or cocktails, of antiretroviral drugs

    We aim to gain insight into the different interactions between drugs and resistance mutations

    Reverse transcriptase inhibitors (RTIs) comprise two groups of antiretroviral drugs preventing HIV-1 replication: nucleoside and nucleotide reverse transcriptase inhibitors (NRTIs) and non-nucleoside reverse transcriptase inhibitors (NNRTIs)

    NRTIs consist of seven drugs: Abacavir (ABC), Didanosine (DDI), Emtricitabine (FTC), Lamivudine (3TC), Stavudine (D4T), Tenofovir (TDF), and Zidovudine (AZT)

    NNRTIs consist of three drugs: Efavirenz (EFV), Nevirapine (NVP), and Delavirdine (DLV)

    A total of 38 mutations are associated with resistance to RTIs: 22 associated with NRTIs and 16 with NNRTIs


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    We analyzed reverse transcriptase and protease data sets obtained from the online Stanford HIV-1 database (Rhee et al., 2003)
    Treatment histories from 2855 patients that received either NRTIs, NNRTIs or both
    The data set contained a total of 4884 samples
    The number of RTIs varies from 1 to 8 drugs: 17 samples with 1 RTI, 25 with 2 RTIs, 157 with 3 RTIs, 698 with 4 RTIs, 1852 with 5 RTIs, 1600 with 6 RTIs, 483 with 7 RTIs, and 56 with 8 RTIs

                            Mean accuracy      Exact match
    MB-MBC  maxCS = 1     0.7108 ± 0.0221    0.1151 ± 0.0466
            maxCS = 2     0.7062 ± 0.0191    0.0881 ± 0.0403
            maxCS = 3     0.7019 ± 0.0153    0.0780 ± 0.0363
            maxCS = 4     0.6995 ± 0.0145    0.0701 ± 0.0336
            maxCS = 5     0.6978 ± 0.0106    0.0646 ± 0.0241
    Tree-Tree             0.6968 ± 0.0163    0.0364 ± 0.0101
    DAG-DAG Filter        0.7074 ± 0.0063    0.0240 ± 0.0066
    DAG-DAG Wrapper       0.7095 ± 0.0040    0.0291 ± 0.0008
    CB-MBC                0.7261 ± 0.0113    0.0382 ± 0.0105

    Estimated performance metrics (mean ± std. deviation) for the RTIs data set


    Predicting human immunodeficiency virus (HIV) type 1 inhibitors (Borchani et al., 2013)

    The graphical structure of the MBC learnt using the RTIs data set


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    Parkinson disease motor symptoms


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    PDQ-39 and EQ-5D: quality-of-life instruments to measure the degree of disability in Parkinson's disease

    39-item Parkinson's Disease Questionnaire: a specific instrument

    PDQ-39 captures the patient's perception of the illness, covering 8 dimensions:

    1 Mobility
    2 Activities of daily living
    3 Emotional well-being
    4 Stigma
    5 Social support
    6 Cognitions
    7 Communication
    8 Bodily discomfort


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    European Quality of Life - 5 Dimensions: a generic instrument

    EQ-5D is a generic measure of health for clinical and economic appraisal


    EQ-5D health states from PDQ-39 in Parkinson (Borchani et al., 2012)

    Mapping PDQ-39 to EQ-5D

    PDQ1 PDQ2 ... ... PDQ39 EQ1 EQ2 EQ3 EQ4 EQ53 1 ... ... 3 1 3 3 2 12 3 ... ... 2 1 1 2 3 25 2 ... ... 4 1 3 3 1 2... ... ... ... ... ... ... ... ... ...4 4 ... ... 3 3 1 2 3 24 4 ... ... 3 3 1 2 3 25 5 ... ... 4 2 3 2 3 3

φ : (PDQ1, ..., PDQ39) → (EQ1, ..., EQ5)
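As a rough illustration of how such a mapping φ can be estimated, the sketch below fits one classifier per EQ-5D dimension on the 39 PDQ-39 items. This is a simple per-dimension baseline, not the MB-MBC model of these slides; the synthetic data and the choice of scikit-learn's CategoricalNB are assumptions for illustration:

```python
# Hedged sketch: phi as five per-dimension classifiers (baseline, not MB-MBC).
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(488, 39))  # stand-in PDQ-39 items, scored 1..5
Y = rng.integers(1, 4, size=(488, 5))   # stand-in EQ-5D dimensions, levels 1..3

# One classifier per class variable EQ_j, ignoring dependencies among them.
models = [CategoricalNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def phi(x_new: np.ndarray) -> np.ndarray:
    """Predict the 5-dimensional EQ-5D state from PDQ-39 responses (n, 39)."""
    return np.column_stack([m.predict(x_new) for m in models])

print(phi(X[:3]))  # predicted (EQ1, ..., EQ5) for the first three patients
```

An MBC improves on this kind of baseline precisely by modelling the dependencies among EQ1, ..., EQ5 instead of predicting each dimension in isolation.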


    488 Parkinson’s patients. Estimated measures over 5-fold cross-validation

Method   Mean accuracy     Exact match
MB-MBC   0.7119 ± 0.0338   0.2030 ± 0.0718
CB-MBC   0.6807 ± 0.0285   0.1865 ± 0.0429
MNL      0.6926 ± 0.0430   0.1802 ± 0.0713
OLS      0.4201 ± 0.0252   0.0123 ± 0.0046
CLAD     0.4254 ± 0.0488   0.0143 ± 0.0171
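Figures of this kind come from a cross-validation loop. The hedged sketch below shows the protocol only, with a per-dimension naive Bayes stand-in (a hypothetical choice) in place of the MB-MBC, CB-MBC, MNL, OLS and CLAD learners actually compared:

```python
# Hedged sketch of 5-fold CV estimating mean accuracy and exact match.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import CategoricalNB

def evaluate_5cv(X, Y, n_levels=6, seed=0):
    """X: (n, p) integer features; Y: (n, d) integer class variables."""
    accs, exacts = [], []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        # One stand-in classifier per class variable; min_categories guards
        # against feature levels that appear only in the test fold.
        preds = np.column_stack([
            CategoricalNB(min_categories=n_levels)
            .fit(X[train], Y[train, j]).predict(X[test])
            for j in range(Y.shape[1])
        ])
        accs.append(np.mean(preds == Y[test]))                    # mean accuracy
        exacts.append(np.mean(np.all(preds == Y[test], axis=1)))  # exact match
    return (np.mean(accs), np.std(accs)), (np.mean(exacts), np.std(exacts))
```

Running evaluate_5cv on the 488-patient data would yield mean ± standard deviation pairs of the kind tabulated above.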


    MB-MBC graphical structure


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


MULTI-DIMENSIONAL BAYESIAN NETWORK CLASSIFIERS

Specialized Bayesian networks for solving multi-label and multi-dimensional problems

Advantages:

Transparency and interpretability

Variety of inference (MPE) and learning algorithms

A hierarchy of models organized by structural complexity

Competitive results with state-of-the-art methods when d is on the order of dozens

Disadvantages:

Wrapper approaches for learning are very computationally demanding

In large-d problems it is difficult to compete with state-of-the-art methods


    Outline

    1 Multi-label classification and multi-dimensional classification

    2 Binary relevance, classifier chain and label power set

    3 Bayesian networks

    4 Multi-dimensional Bayesian network classifiers

    5 Applications

    6 Conclusions

    7 References


    References

H. Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

C. F. Aliferis, I. Tsamardinos and A. Statnikov (2003). HITON: A novel Markov blanket algorithm for optimal variable selection. AMIA Annual Symposium Proceedings, 21-25.

M. Benjumeda, C. Bielza and P. Larrañaga (2018). Tractability of most probable explanations in multidimensional Bayesian network classifiers. International Journal of Approximate Reasoning, 93, 74-87.

C. Bielza, G. Li and P. Larrañaga (2011). Multi-dimensional classification with Bayesian networks. International Journal of Approximate Reasoning, 52, 705-727.

C. Bielza and P. Larrañaga (2014). Discrete Bayesian network classifiers: A survey. ACM Computing Surveys, 47(1), Article 5.

R. Blanco, I. Inza, M. Merino, J. Quiroga and P. Larrañaga (2005). Feature selection in Bayesian classifiers for the prognosis of survival of cirrhotic patients treated with TIPS. Journal of Biomedical Informatics, 38(5), 376-388.

H. Borchani, C. Bielza and P. Larrañaga (2010). Learning CB-decomposable multi-dimensional Bayesian network classifiers. Proceedings of the 5th European Workshop on Probabilistic Graphical Models, 25-32.

H. Borchani, C. Bielza, P. Martínez-Martín and P. Larrañaga (2012). Multidimensional Bayesian network classifiers applied to predict the European quality of life-5 dimensions (EQ-5D) from the 39-item Parkinson's disease questionnaire (PDQ-39). Journal of Biomedical Informatics, 45, 1175-1184.

H. Borchani, C. Bielza and P. Larrañaga (2013). Predicting human immunodeficiency virus inhibitors using multi-dimensional Bayesian network classifiers. Artificial Intelligence in Medicine, 57(3), 219-229.

H. Borchani, P. Larrañaga, J. Gama and C. Bielza (2016). Mining multi-dimensional concept-drifting data streams using Bayesian network classifiers. Intelligent Data Analysis, 20(2), 257-280.

M. R. Boutell, J. Luo, X. Shen and C. M. Brown (2004). Learning multi-label scene classification. Pattern Recognition, 37, 1757-1771.

W. L. Buntine (1991). Theory refinement on Bayesian networks. Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence, 52-60.

D. M. Chickering (1996). Learning Bayesian networks is NP-complete. Learning from Data: Artificial Intelligence and Statistics V, Springer, 121-130.


C. K. Chow and C. N. Liu (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 462-467.

A. Clare and R. D. King (2001). Knowledge discovery in multi-label phenotype data. Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, 42-53.

G. F. Cooper and E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

P. R. de Waal and L. van der Gaag (2007). Inference and learning in multi-dimensional Bayesian network classifiers. Proceedings of the 9th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Springer, 501-511.

T. Dean and K. Kanazawa (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142-150.

K. Deb, A. Pratap, S. Agarwal and T. Meyarivan (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.

A. Elisseeff and J. Weston (2002). A kernel method for multi-labelled classification. Advances in Neural Information Processing Systems 14, 681-687.

N. Friedman (1998). The Bayesian structural EM algorithm. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 129-138.

N. Friedman, D. Geiger and M. Goldszmidt (1997). Bayesian network classifiers. Machine Learning, 29, 131-163.

S. F. Galán, G. Arroyo-Figueroa, F. J. Díez and L. E. Sucar (2007). Comparison of two types of event Bayesian networks: A case study. Applied Artificial Intelligence, 21(3), 185-209.

S. Gil-Begue, C. Bielza and P. Larrañaga (2018). Multi-dimensional Bayesian network classifier trees. The 9th International Conference on Probabilistic Graphical Models, submitted.

S. B. Gillispie and M. D. Perlman (2002). The size distribution for Markov equivalence classes of acyclic digraph models. Artificial Intelligence, 141(1/2), 137-155.

E. Gibaja and S. Ventura (2015). A tutorial on multi-label learning. ACM Computing Surveys, 47(3), Article 52.


S. Godbole and S. Sarawagi (2004). Discriminative methods for multi-labeled classification. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 22-30.

D. Heckerman, D. Geiger and D. M. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

M. Henrion (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. Uncertainty in Artificial Intelligence 2, Elsevier Science, 149-163.

D. Koller and N. Friedman (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.

D. Koller and M. Sahami (1996). Toward optimal feature selection. Proceedings of the 13th International Conference on Machine Learning, 284-292.

J. Kwisthout (2011). Most probable explanations in Bayesian networks: Complexity and tractability. International Journal of Approximate Reasoning, 52(9), 1452-1469.

P. Langley and S. Sage (1994). Induction of selective Bayesian classifiers. Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 399-406.

S. L. Lauritzen (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191-201.

S. L. Lauritzen, A. P. Dawid, B. N. Larsen and H. G. Leimer (1990). Independence properties of directed Markov fields. Networks, 20(5), 491-505.

S. L. Lauritzen and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2), 157-224.

G. Madjarov, D. Kocev, D. Gjorgjevikj and S. Džeroski (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9), 3084-3104.

M. Minsky (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49, 8-30.

U. Nodelman, C. R. Shelton and D. Koller (2002). Continuous time Bayesian networks. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 378-387.


A. Pastink and L. van der Gaag (2015). Multi-classifiers of small treewidth. Proceedings of the 13th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Springer, 199-209.

J. Pearl (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2), 245-257.

    J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Y. Peng and J. A. Reggia (1987a). A probabilistic causal model for diagnostic problem solving. Part I: Integrating symbolic causal inference with numeric probabilistic inference. IEEE Transactions on Systems, Man and Cybernetics, 17(2), 146-162.

Y. Peng and J. A. Reggia (1987b). A probabilistic causal model for diagnostic problem solving. Part II: Diagnostic strategy. IEEE Transactions on Systems, Man and Cybernetics, 17(3), 395-406.

J. Read, B. Pfahringer, G. Holmes and E. Frank (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333-359.

S. Y. Rhee, M. J. Gonzales, R. Kantor, J. Betts, J. Ravela and R. W. Shafer (2003). Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Research, 31(1), 298-303.

R. Robinson (1977). Counting unlabeled acyclic digraphs. Lecture Notes in Mathematics, 622, Springer, 28-43.

Y. Saeys, I. Inza and P. Larrañaga (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507-2517.

    G. Schwarz (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.

R. D. Shachter and M. A. Peot (1989). Simulation approaches to general probabilistic inference on belief networks. Proceedings of the 5th Annual Conference on Uncertainty in Artificial Intelligence, 221-234.

D. J. Spiegelhalter and S. L. Lauritzen (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579-605.

P. Spirtes and C. Glymour (1991). An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1), 62-72.


E. Sucar, C. Bielza, E. F. Morales, P. Hernandez-Leal, J. H. Zaragoza and P. Larrañaga (2014). Multi-label classification with Bayesian network-based chain classifiers. Pattern Recognition Letters, 41, 14-22.

L. Tenenboim, L. Rokach and B. Shapira (2010). Identification of label dependencies for multi-label classification. Proceedings of the 2nd International Workshop on Learning from Multi-Label Data, 53-60.

G. Tsoumakas, A. Dimou, E. Spyromitros, V. Mezaris, I. Kompatsiaris and I. Vlahavas (2009). Correlation-based pruning of stacked binary relevance models for multi-label learning. Proceedings of the 1st International Workshop on Learning from Multi-Label Data, 101-116.

G. Tsoumakas, I. Katakis and I. Vlahavas (2010). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 23(7), 1079-1089.

G. Tsoumakas and I. Katakis (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3, 1-13.

L. C. van der Gaag and P. R. de Waal (2006). Multi-dimensional Bayesian network classifiers. Proceedings of the 3rd European Workshop on Probabilistic Graphical Models, 107-114.

G. Varando, C. Bielza and P. Larrañaga (2015). Decision boundary for discrete Bayesian network classifiers. Journal of Machine Learning Research, 16, 2725-2749.

G. I. Webb, J. Boughton and Z. Wang (2002). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58, 5-24.

M. L. Zhang, J. M. Peña and V. Robles (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179(19), 3218-3229.

N. Zhang and D. Poole (1994). A simple approach to Bayesian network computations. Proceedings of the 10th Biennial Canadian Conference on Artificial Intelligence, 171-178.

M. L. Zhang and Z. H. Zhou (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338-1351.

M. L. Zhang and Z. H. Zhou (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038-2048.

M. L. Zhang and Z. H. Zhou (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819-1837.
