Hindawi Publishing Corporation
Mathematical Problems in Engineering
Volume 2013, Article ID 297121, 10 pages
http://dx.doi.org/10.1155/2013/297121

Research Article
Extraction of Belief Knowledge from a Relational Database for Quantitative Bayesian Network Inference

LiMin Wang

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, ChangChun City, JiLin Province 130012, China

Correspondence should be addressed to LiMin Wang; [email protected]

Received 22 May 2013; Revised 5 August 2013; Accepted 19 August 2013

Academic Editor: Praveen Agarwal

Copyright © 2013 LiMin Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The problem of extracting knowledge from a relational database for probabilistic reasoning is still unsolved. On the basis of a three-phase learning framework, we propose the integration of a Bayesian network (BN) with the functional dependency (FD) discovery technique. Association rule analysis is employed to discover FDs and expert knowledge encoded within a BN; that is, key relationships between attributes are emphasized. Moreover, the BN can be updated by using an expert-driven annotation process wherein redundant nodes and edges are removed. Experimental results show the effectiveness and efficiency of the proposed approach.

1. Introduction

The challenge of extracting knowledge for an uncertain inference is related to the study of Bayesian networks (BNs), which represent one of the key research areas of knowledge discovery and machine learning [1, 2]. The advantages of using prior knowledge by Bayesian statistical methods and clear semantic expression have increased the application of BNs in medical diagnosis, manufacturing, and pattern recognition.

A BN consists of two parts: a qualitative part and a quantitative part. The qualitative part denotes the graphical structure of the network, whereas the quantitative part consists of the conditional probability tables (CPTs) in the network. Although BNs support efficient inference algorithms, the quantitative part is a complex component, and learning an optimal BN structure from existing data has been proven to be an NP-hard problem [3, 4]. Researchers have used qualitative research to improve the efficiency of probabilistic inferences by introducing domain knowledge. Domain experts have proposed ideas on qualitative relations that should be required in the model, such as the requirement of an arc between two nodes. However, domain knowledge may have a negative effect when attributes have similar properties and crossed functional zones. Thus, some attributes are necessary but redundant. Although these attributes do not bias the network structure, because they do not contribute any negative information, they increase the computational complexity of the network. Given the limited instances available for parameter estimation, these redundant attributes should be removed to build a robust structure. Furthermore, the definition and application of expert knowledge in the learning procedure of a BN are still unsolved.

In the real world, the widespread use of databases has created a considerable need for knowledge discovery methodologies. Database systems are designed to manage large quantities of information, define structures for storage, and provide mechanisms for mass data manipulation. Functional dependencies (FDs) are a key concept in relational theory and are the foundation of data organization in relational database design. A FD is treated as a constraint that needs to be enforced by the database system to preserve data integrity. A FD can also be used as domain knowledge for knowledge representation and reasoning.

Researchers have recently suggested the linking of the relational database and probabilistic reasoning model to construct a BN from a new perspective. FDs are important data dependencies that provide conditional independency information in relational databases. FDs can be generated from an entity-relationship diagram instead of being mined from data. Therefore, constructing a BN from FDs is interesting and useful, particularly when data are incomplete and inaccurate. Researchers have indicated the similarities between the BN and relational model. Jaehui and Sang-goo [5] proposed a probabilistic ranking model to exploit statistical relationships that exist in relational data of categorical attributes. To quantify the information, the extent of the dependency between correlative attribute values is computed on a Bayesian network. Thimm and Kern-Isberner [6] took an approach of lifting maximum entropy methods to the relational case by employing a relational version of probabilistic conditional logic. They addressed the problems of ambiguity that are raised by the difference between subjective and statistical views, and developed a comprehensive list of desirable properties for inductive model-based probabilistic inference in relational frameworks. Liu et al. [7] discussed the relationship between fuzzy FD and BN.

Considering the 0/1 data analysis, the association rule mining technique is quite popular and has been studied extensively from computational and objective perspectives by using measures such as frequency and confidence. Many algorithms have been designed to compute frequent and valid association rules. The algorithm proposed in this paper, namely, the tree-augmented Naive Bayes (TAN) classifier with FDs (i.e., TAN-FDA), applies extracted FDs based on association rule analysis. TAN-FDA has two focuses: (1) the use of association rules to infer belief FDs; redundant attributes deduced from FDs can be proven from the viewpoints of information theory and probability theory; (2) the effective simplification and estimation of BN structures and probability distributions, respectively. The feasibility and accuracy of the proposed method are also proven to explain the necessity of finding and eliminating redundant attributes.

This paper is organized as follows. Section 2 introduces related background theories wherein the FD rules of probability are proposed to link FD and probability distribution. Sections 3 and 4 introduce the theoretical foundation used in this paper and the learning procedure of TAN-FDA, respectively. Section 5 compares TAN-FDA with other algorithms. Section 6 concludes.

2. Related Background Theory

2.1. Functional Dependency Rules of Probability. In the following discussion, we use Greek letters (α, β, γ, ...) to denote attribute sets. Lower-case letters represent the specific values used by the corresponding attributes (e.g., x_i represents X_i = x_i). P(·) denotes the probability. Given a relation R (in a relational database), attribute Y of R is functionally dependent on attribute X of R, and X of R functionally determines Y of R (in symbols, X → Y). Armstrong [8] proposed a set of axioms (inference rules) to infer all FDs on a relational database that represents the expert knowledge of organizational data and their inherent relationships. The axioms mainly include the following rules; a short closure-computation sketch follows the list.

(i) Augmentation rule: if α → β is true and γ is a set of attributes, then αγ → βγ.
(ii) Transitivity rule: if α → β and β → γ are true, then α → γ.
(iii) Union rule: if α → β and α → γ are true, then α → βγ.
(iv) Decomposition rule: if α → βγ is true, then α → β and α → γ.
(v) Pseudotransitivity rule: if α → β and γβ → δ are true, then αγ → δ.
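Taken together, these rules allow the closure α+ of an attribute set α (the set of all attributes functionally determined by α) to be computed, a notion that Section 3.2 and Algorithm 4 rely on. The following is a minimal Python sketch of that fixed-point computation; it is not part of the original paper, and the example FD set (which mirrors Example 3) is purely illustrative.

def attribute_closure(alpha, fds):
    """Compute the closure alpha+ of attribute set alpha under a list of FDs.
    Each FD is a pair (lhs, rhs) of attribute sets, read as lhs -> rhs."""
    closure = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If every attribute on the left-hand side is already determined,
            # the right-hand side is determined as well (transitivity).
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# Hypothetical example: F = {X2 -> X1, X3 -> X4}
fds = [({"X2"}, {"X1"}), ({"X3"}, {"X4"})]
print(attribute_closure({"X2", "X3"}, fds))  # {'X1', 'X2', 'X3', 'X4'}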

On the basis of the aforementioned rules, we use the FD rules of probability in [9, 10] to link FD and probability theory. The following rules are included in the FD-probability theory link.

(i) Representation equivalence of probability: assume that data set S consists of two attribute sets {α, β} and that β can be inferred by α; that is, FD α → β is true; then the following joint probability distribution is true:

    P(α) = P(α, β).    (1)

(ii) Augmentation rule of probability: if α → β is true and γ is an attribute set, then the following joint probability distribution is true:

    P(α, γ) = P(α, β, γ).    (2)

(iii) Transitivity rule of probability: if α → β and β → γ are true, then the following joint probability distribution is true:

    P(α) = P(α, γ).    (3)

(iv) Pseudotransitivity rule of probability: if γβ → δ and α → β are true, then the following joint probability distribution is true:

    P(α, γ) = P(α, γ, δ).    (4)

2.2. Unrestrictive BN of Naive Bayes and TAN. A BN is a directed acyclic graph that represents a joint probabilistic distribution wherein the nodes denote the domain variables U = (X_1, X_2, ..., X_n) and a graph of conditional dependencies denotes a network structure among the variables in U. A conditional dependency connects a child variable with a set of parent variables. This dependency is represented by a table of conditional distributions of the child variable for each combination of the parent variables. Dependency-analysis-based algorithms construct a BN by using dependency information. The mutual information is used to determine the BN structure, which explains the causal relationship between random variables. Hence, dependency-analysis-based algorithms construct a BN by testing the validity of any independence assertions, which leads to an NP-hard computational problem.


Figure 1: Network structure of NB.

Definition 1. Entropy is a measure of the uncertainty of random variable X:

    H(X) = −∑_{x∈X} P(x) log P(x),    (5)

where P(x) is the probability distribution of X.

Definition 2. Mutual information is the reduction of entropy about variable X after observing Y:

    I(X; Y) = H(X) − H(X | Y) = ∑_{y∈Y} ∑_{x∈X} P(x, y) log P(x | y) − ∑_{x∈X} P(x) log P(x).    (6)

The mutual information between X and Y measures the expected information gained about X after observing the value of Y. If two nodes are dependent in a BN, knowledge of the value of one node will provide information about the value of the other node. Hence, the mutual information between two nodes can show the dependency of the two nodes and the degree of their relationship. To avoid the NP-hard computational complexity, Naive Bayes (NB) [11] assumes that the attributes are independent given the class. NB has a simple structure that contains arcs from the class node to each other node and no arcs between the other nodes (Figure 1). Although the independence assumption is unrealistic in many practical scenarios, NB has exhibited competitive accuracy with other learning algorithms. Researchers have tried to adjust the Naive strategy to allow for violations of the independence assumption and improve the prediction accuracy of NB. One straightforward approach to overcoming NB limitations involves extending the NB structure to represent explicitly the dependencies among attributes. Friedman et al. [12] presented a compromise representation, that is, TAN, which allows arcs between the children of the class attribute, thereby relaxing the assumption of conditional independence (CI) (Figure 2). Given the independence assumption, NB and TAN are both considered restrictive BNs.
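As an illustration of Definitions 1 and 2, and of the conditional mutual information I(X_i; X_j | C) used below for TAN, the quantities can be estimated from empirical frequencies as in the following Python sketch. This is only an illustrative estimator under the assumption of discrete attributes stored as aligned lists; it is not the implementation used in the paper.

import math
from collections import Counter

def entropy(*cols):
    """Joint entropy H of one or more aligned columns of discrete values."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * math.log(c / n) for c in Counter(rows).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mutual_information(x, y, c):
    """I(X; Y | C) = H(X, C) + H(Y, C) - H(X, Y, C) - H(C)."""
    return entropy(x, c) + entropy(y, c) - entropy(x, y, c) - entropy(c)

# Toy example with hypothetical data
X = [0, 0, 1, 1]; Y = [0, 1, 0, 1]; C = [0, 0, 1, 1]
print(mutual_information(X, Y), conditional_mutual_information(X, Y, C))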

Figure 2: Network structure of TAN.

2.3. Association Rule to Belief Functional Dependency. Discerning FDs from existing databases is an important issue and has been investigated for a number of years [13]. This issue has recently been addressed in a novel and efficient manner from a data-mining viewpoint. Rather than exhibiting the set of all FDs in a relation, related works have aimed to discover a small cover equivalent of this set. This problem is known as the FD inference problem. Association rules are used to discover the relationships and potential associations of items or attributes in large quantities of data. These rules can be effective in uncovering unknown relationships and providing results that can be the basis of forecasts and decisions. Association rules have also been proven useful tools for enterprises in improving competitiveness and profitability.

Agrawal and Srikant [14] proposed the Apriori association rule algorithm, which can discover meaningful item sets and construct association rules within large databases. However, this algorithm generates a large number of candidate item sets from single item sets and requires comparisons to be performed against the whole database level by level when creating association rules. Given a data set S, an association rule is a rule of the form α → β [s, c], which satisfies the following:

(i) α ⊆ X, β ⊆ X,
(ii) α ∩ β = ∅,
(iii) s = N_{αβ}/N_r and c = N_{αβ}/N_α.

Here α and β are attribute-value vectors, N_r is the number of samples in S, N_{αβ} is the number of samples that contain the set of items α ∪ β, and N_α is the number of samples that contain α. s and c are the support and confidence of the rule, respectively. The pseudocode of the Apriori association rule algorithm is shown in Algorithm 1.

Association rules can then be extracted from frequent item sets. When an association rule has a nonzero support and a confidence of 100%, such an association rule can be interpreted as a FD and is introduced in this paper for preprocessing. Considering that probability theory is one of the bases of BN, a BN requires mass data to ensure precise parameter estimations. For a restrictive BN, the structure can only be learned from the training data because the testing data is incomplete, thus making the confidence level relatively low. Some researchers have tried to apply semisupervised learning or active learning to use the useful information in


C_k: candidate item set of size k
L_k: frequent item set of size k

L_1 = {frequent items};
for (k = 1; L_k ≠ ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in the database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with minimum support;
end
return ∪_k L_k;

Algorithm 1

the testing data and increase the robustness of the final model. These algorithms may cause noise propagation. However, FDs transformed from the association rules will naturally decrease noise to a minimum. Given data set U with variables {X_1, X_2, ..., X_n} and class label C, suppose an association rule α → β [s, c] has been deduced from the whole data set and α, β ⊂ U; such an association rule is transformed to be a FD as α → β. From the FD rule of probability, we can obtain the following:

    P(α) = P(α, β).    (7)

By applying the augmentation rule of probability,

    P(α, c) = P(α, β, c) or P(α | c) = P(α, β | c).    (8)

Thus, FD α → β still occurs in the training data; hence, we can use the whole data set to extract FDs with high confidence.
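A compact Python rendering of the frequent-itemset loop of Algorithm 1, followed by the extraction of the 100%-confidence rules that are reinterpreted here as belief FDs, might look as follows. Transactions are encoded as frozensets of (attribute, value) pairs; the data and the minimum-support threshold are hypothetical, and the sketch favors clarity over the efficiency of the original Apriori algorithm.

from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in level if support(s) / n >= min_support}
    freq, k = {}, 1
    while level:
        for s in level:
            freq[s] = support(s)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates if support(c) / n >= min_support}
        k += 1
    return freq

def belief_fds(freq, transactions):
    """Rules alpha -> beta with nonzero support and confidence 1 are treated as FDs."""
    fds = []
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # confidence = count / freq[lhs]; it equals 1 exactly when the counts match
                if freq.get(lhs) == count:
                    fds.append((lhs, itemset - lhs))
    return fds

# Hypothetical transactions over attribute-value pairs
T = [frozenset({("X2", 1), ("X1", 1)}),
     frozenset({("X2", 1), ("X1", 1)}),
     frozenset({("X2", 0), ("X1", 0)})]
print(belief_fds(apriori(T, min_support=0.3), T))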

3. BN Learning Method

3.1. Redundant Attributes for Restrictive and Unrestrictive BN. Given that a BN is a complete model for the variables and their relationships, a BN can be used to answer probabilistic queries about variables. For example, the network can be used to find updated knowledge on the state of a subset of variables when other variables (e.g., evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. If FD x_1 → x_2 exists, the following is obtained by the augmentation rule of probability:

    P(x_2 | x_1, c) = P(x_2, x_1, c) / P(x_1, c) = P(x_1, c) / P(x_1, c) = 1.    (9)

Thereafter, the conditional entropy H(X_2 | X_1, C) = 0, and the conditional mutual information I(X_2; X_1 | C) = H(X_2 | C) reaches its maximum value. Thus, a definite arc exists between X_2 and X_1 in the restrictive structure of BN.

Furthermore, from the viewpoint of information theory, the information quantity supplied by X_1 and X_2 to any other attribute set Y is expressed as follows:

    I(Y; X_1, X_2) = H(Y) − H(Y | X_1, X_2)
                   = ∑_{y∈Y} P(y, x_1, x_2) log P(y | x_1, x_2) − ∑_{y∈Y} P(y) log P(y)
                   = ∑_{y∈Y} P(y, x_1) log P(y | x_1) − ∑_{y∈Y} P(y) log P(y)
                   = H(Y) − H(Y | X_1) = I(Y; X_1).    (10)

Hence, from the viewpoints of probability and information theory, X_2 is a redundant attribute that does not need to be considered during structure learning. If the information supplied by X_1 is implicated in the information supplied by X_2, then X_1 and X_2 have a strong relationship, but X_1 is redundant for modeling. By contrast, if the information boundaries of X_1 and X_2 are crossed, a relationship exists at some point between X_1 and X_2. If the boundaries are disjoint, X_1 and X_2 are independent. This result is clearly displayed in Figure 3.

3.2. Structure Simplification for BN Learning. In relational database theory, the minimal FD set can be inferred by canonical cover analysis. Let F_c be a canonical cover for a set of simple FDs over U, and let the closure of α be denoted as α+, which represents all attributes functionally determined by α. Redundant attributes can be found from the viewpoints of the chain formula and the FD rules of probability.

A joint probability distribution can be expressed by a chain formula that includes construction information for a BN. However, constructing the structure solely from a joint probability distribution without using any conditional independencies is impractical because such an approach will require an exponentially large number of variable combinations. We can use FDs to simplify the structure of a BN, which is represented by a set of probability distributions [15] (see Algorithm 2).

We use the following example to elaborate the algorithm.


Figure 3: Information boundaries of attributes X_1 and X_2 (containing, crossed, and disjoint).

Input: FDs and probability distribution sets (S_1, S_2, ..., S_n), where S_1 = P(A_1), S_2 = P(A_2 | A_1), ..., S_n = P(A_n | A_1, A_2, ..., A_{n−1}).
Output: A chain formula with independence conditions implied by FDs.
Begin
    For every S_k (k > 2) Do
        If FD x_k → a_k holds, substitute P(a_k | a_1, a_2, ..., a_{k−1}) with P(a_k | x_k),
        where x_k is a minimal subset of (a_1, a_2, ..., a_{k−1}).
    End For
End.

Algorithm 2
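A literal Python rendering of Algorithm 2 on symbolic factors is sketched below: each factor P(A_k | A_1, ..., A_{k-1}) is stored as a (child, parent set) pair, and any FD x_k → a_k whose left-hand side lies inside the current parent set replaces that parent set by x_k. The variable ordering and FD set follow Example 3 below; everything else is an illustrative assumption rather than the paper's code.

def simplify_chain(order, fds):
    """order: list of attributes A_1..A_n; fds: list of (lhs_set, rhs_attribute) pairs.
    Returns the chain-formula factors (child, parents) after applying the FDs."""
    factors = []
    for k, child in enumerate(order):
        parents = set(order[:k])
        for lhs, rhs in fds:
            # If an FD x_k -> a_k has x_k inside the current parents and a_k == child,
            # the conditional reduces to P(child | x_k).
            if rhs == child and lhs <= parents:
                parents = set(lhs)
                break
        factors.append((child, parents))
    return factors

# Example 3: order (X2, X3, X1, X4), FDs {X2 -> X1, X3 -> X4}
order = ["X2", "X3", "X1", "X4"]
fds = [({"X2"}, "X1"), ({"X3"}, "X4")]
for child, parents in simplify_chain(order, fds):
    print(f"P({child} | {sorted(parents)})")
# -> P(X2 | []), P(X3 | ['X2']), P(X1 | ['X2']), P(X4 | ['X3'])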

Example 3. Let R(U + F) be a probabilistic scheme; U = {X_1, X_2, X_3, X_4} and F = {x_2 → x_1, x_3 → x_4} are sets of FDs over R.

According to the order (X_2, X_3, X_1, X_4), the joint probability distribution P(x_1, x_2, x_3, x_4) is expressed as follows:

    P(x_1, x_2, x_3, x_4) = P(x_2) P(x_3 | x_2) P(x_1 | x_2, x_3) P(x_4 | x_1, x_2, x_3).    (11)

The following results can be generated based on Algorithm 2.

(1) By using FD x_2 → x_1, P(x_1 | x_2, x_3) = P(x_1 | x_2).
(2) By using FD x_3 → x_4, P(x_4 | x_1, x_2, x_3) = P(x_4 | x_3).

The joint probability distribution can then be simplified as follows:

    P(x_1, x_2, x_3, x_4) = P(x_2) P(x_3 | x_2) P(x_1 | x_2) P(x_4 | x_3).    (12)

The corresponding BN structure of the joint probability distribution is shown in Figure 4(a). From the FD rule of probability, given that P(x_1 | x_2) = P(x_4 | x_3) = 1 is true, (12) reduces to the following expression:

    P(x_1, x_2, x_3, x_4) = P(x_2) · P(x_3 | x_2) · 1 · 1 = P(x_2) · P(x_3 | x_2).    (13)

The corresponding structure is shown in Figure 4(b).

Theorem 4 (see [9]). Given two attribute sets {α, β} and class label C, if α → β is true, then the following conditional probability distribution is true:

    P(c | α) = P(c | α, β).    (14)

Theorem 5 (see [16]). Let U be a set of attributes and F an acyclic set of simple FDs over U. One canonical cover will exist for F; that is, F_1 = F_2 if F_1 and F_2 are canonical covers for F.

The chosen canonical cover is the set of minimal dependencies. Such a cover is information lossless and is considerably smaller than the set of all valid dependencies. These qualities are particularly important for the end user because of the provision of relevant knowledge wherein redundancy is minimized and extraneous information is discarded.

Lemma 6. Let F_c be a canonical cover for F, which is a set of simple FDs, and let α and β be the union of the left-hand sides of all dependencies in F_c and F, respectively. The following conditional probability distribution is true:

    P(c | α) = P(c | β+).    (15)

Lemma 7. Let U be a set of attributes and C the class label. Let F_c be a canonical cover for F, which is a set of simple FDs, and let α be the union of the left-hand sides of all dependencies in F_c. The following conditional probability distribution is true:

    P(c | α, U − α+) = P(c | U),    (16)

where {U − α+} represents the difference set between U and α+. Therefore, the difference set {α+ − α} is redundant for classification. From Lemma 7 and the chain formula rule,

    arg max_c P(c | x_1, x_2, x_3, x_4)
        = arg max_c P(x_2 | c) P(x_3 | x_2, c) P(x_1 | x_2, x_3, c) P(x_4 | x_1, x_2, x_3, c) P(c)
        = arg max_c P(x_2 | c) P(x_3 | x_2, c) P(x_1 | x_2) P(x_4 | x_3) P(c).    (17)


Figure 4: Two simplification steps of the BN structure by applying FDs: (a) structure over X_1, X_2, X_3, X_4; (b) simplified structure over X_2 and X_3.

The corresponding structure is shown in Figure 5(a). By applying the FD rule of probability, (17) is changed into the following:

    arg max_c P(c | x_1, x_2, x_3, x_4)
        = arg max_c P(x_2 | c) P(x_3 | x_2, c) · 1 · 1 · P(c)
        = arg max_c P(x_2 | c) P(x_3 | x_2, c) P(c).    (18)

The corresponding structure is shown in Figure 5(b).

4. TAN-FDA

TAN allows tree-like structures to be used in representing dependencies among attributes. The class node directly points to all attribute nodes, and an attribute node can have only one parent from another attribute node (in addition to the class node).

The architecture of the TAN model can be depicted as shown in Algorithm 3.

The proposed TAN-FDA has three phases: drafting, thinning, and thickening. Given that TAN performs well in the experimental study, the drafting phase does not start from the empty graph but from the TAN structure, which is expected to be close to the correct graph. Moreover, we propose another method for performing the thinning phase. This new methodology uses a data-mining technique to infer a number of FDs, which describe the correlation of events and can be viewed as probabilistic rules. Two events are correlated if they are frequently observed together. Each FD can be used to remove redundant attributes, thus simplifying the network structure. Some relationships can be neglected for TAN with a limited number of training samples. However, after the thinning phase, the attribute space decreases, and the training samples can provide more useful information. In the thickening phase, the conditional mutual information is computed to obtain a new TAN structure with the rest of the attributes. We then compare the differences between the TAN structures before and after the thickening phase. If a new edge exists, the edge is added to the graph in the thickening phase. The graph produced by this phase will contain all the edges of the underlying dependency model.

Table 1: Description of the data sets used in the experiments.

Data set        Size   Attributes   Number of classes
Anneal           798       39              6
Balance-scale    625        5              3
German          1000       24              2
Glass            214       10              7
Heart-c          303       14              5
Ionosphere       351       34              2
Pima Indians     768        8              2
Audiology        226       70             24
Sonar            208       61              2

The learning procedure of the TAN-FDA algorithm is described as shown in Algorithm 4.

5. Experiments

To verify the efficiency and effectiveness of the proposed TAN-FDA, we conduct experiments on nine data sets from the UCI machine-learning repository (Table 1). For each benchmark data set, we compare two Bayes models with the selected attributes. The following abbreviations are used for the different classification approaches: SNB-FSS [17], the selective Naive Bayes classifier with forward sequential selection; TAN-CFS [18], the tree-augmented Naive Bayes classifier with classical floating search.

SNB-FSS selects a subset of attributes by using leave-one-out cross validation as a selection criterion and establishes an NB with these attributes. SNB-FSS uses the reverse search direction; that is, SNB-FSS iteratively adds attributes to improve accuracy, starting with the empty set of attributes. Independence is assumed among the resulting attributes in a given class.

TAN-CFS includes new features that maximize the criterion J by means of the sequential forward selection procedure, starting from the current feature set. Thereafter, the conditional exclusions of previously updated subsets are implemented. If no feature can be excluded, the algorithm proceeds with the sequential forward selection algorithm.


Figure 5: Two simplification steps of the restrictive BN by applying FDs: (a) structure over X_1, X_2, X_3, X_4 and class C; (b) simplified structure over X_2, X_3, and C.

Input: Training set T, attribute set X = {X_1, ..., X_n}, and class C.
Output: TAN constructed by the conditional mutual information metric.
Step 1. Compute the conditional mutual information I(X_i; X_j | C) between each pair of attributes {X_i, X_j}, i ≠ j.
Step 2. Build a complete undirected graph wherein the vertices are the attributes {X_1, ..., X_n}. Annotate the weight of the edge connecting X_i to X_j by I(X_i; X_j | C).
Step 3. Build a maximum weighted spanning tree.
Step 4. Transform the resulting undirected tree into a directed tree by choosing a root node and setting the direction of all edges outward from it.
Step 5. Build a TAN model by adding a vertex labeled by C and adding an arc from C to each X_i.

Algorithm 3
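Steps 1-5 correspond to the Chow-Liu-style TAN construction of Friedman et al. [12]. A condensed Python sketch is given below; the conditional mutual information estimator repeats the one sketched in Section 2.2, the spanning tree is grown Prim-style, and the data layout (a dict of aligned columns) is an assumption rather than the author's implementation.

import math
from collections import Counter

def _H(*cols):
    """Joint entropy of one or more aligned columns of discrete values."""
    rows = list(zip(*cols)); n = len(rows)
    return -sum((c / n) * math.log(c / n) for c in Counter(rows).values())

def _cmi(x, y, c):
    """I(X; Y | C) = H(X, C) + H(Y, C) - H(X, Y, C) - H(C)."""
    return _H(x, c) + _H(y, c) - _H(x, y, c) - _H(c)

def build_tan(data, class_col):
    """data: dict column_name -> list of discrete values.
    Returns a dict child -> parent describing the attribute tree (root has parent None);
    the class node is implicitly a parent of every attribute (Step 5)."""
    attrs = [a for a in data if a != class_col]
    C = data[class_col]
    # Steps 1-2: pairwise conditional mutual information as edge weights.
    w = {frozenset((a, b)): _cmi(data[a], data[b], C)
         for i, a in enumerate(attrs) for b in attrs[i + 1:]}
    # Steps 3-4: Prim-style maximum weighted spanning tree rooted at the first attribute.
    parent = {attrs[0]: None}
    while len(parent) < len(attrs):
        u, v = max(((p, q) for p in parent for q in attrs if q not in parent),
                   key=lambda e: w[frozenset(e)])
        parent[v] = u
    return parent

# Hypothetical toy data set
data = {"X1": [0, 0, 1, 1], "X2": [0, 1, 0, 1], "X3": [0, 0, 1, 1], "C": [0, 0, 1, 1]}
print(build_tan(data, "C"))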

The floating methods are allowed to correct the wrong decisions made in the previous steps. These methods also approximate the optimal solution better than sequential feature selection methods.

The current implementation of TAN-FDA is limited to categorical data. Hence, we assess only the relative capacities of these algorithms with respect to categorical data, and all numeric attributes are discretized. When MDL discretization [19], which is a common discretization method for NB, is used to discretize quantitative attributes within each cross-validation fold, many attributes will have only one value. In these experiments, we discretize quantitative attributes by using three-bin equal-frequency discretization prior to classification.

The base probabilities are estimated by using m-estimation [20] (m = 1), which often results in more accurate probabilities than the Laplace estimation for NB and TAN. The above experiments are coded in Matlab 7.0 on an Intel Pentium 2.93 GHz computer with 1 GB of RAM.
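For reference, the m-estimate referred to here smooths a conditional probability as P(x | c) = (n_xc + m·p) / (n_c + m), where n_xc and n_c are counts and p is a prior for x; with m = 1 and a uniform prior p = 1/v over the v values of X, this is a mild smoothing (the uniform prior is our assumption, since the paper does not spell it out). A one-function sketch:

def m_estimate(count_xc, count_c, n_values, m=1.0):
    """m-estimated conditional probability P(x | c).
    count_xc: joint count of (x, c); count_c: count of c;
    n_values: number of distinct values of X (uniform prior p = 1 / n_values assumed)."""
    p = 1.0 / n_values
    return (count_xc + m * p) / (count_c + m)

print(m_estimate(count_xc=3, count_c=10, n_values=3))  # (3 + 1/3) / 11 ≈ 0.303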

The main advantage of the proposed method is that it uses FDs to simplify the learning procedure and build a robust BN structure. For different testing samples, different FDs may be applied and the final BN structure may differ greatly, which makes TAN-FDA much more flexible. In our experiments, we try to use the most general FDs as inductive rules, wherein the number of attributes (denoted as l in the following discussion) on the left side of each FD is less than three. We first study the performance of the state-of-the-art classifiers NB and TAN to reveal how performance varies with changing l. To explore how the classification performance of TAN-FDA compares with SNB-FSS and TAN-CFS, we estimate Pcost, the mean posterior probability of the submodels. The probabilistic costing is equivalent to the accumulated code length of the total test data encoded by an inferred model. Given that an optimal code length can only be achieved by encoding the real probability distribution of the test data, a smaller probabilistic costing indicates a better performance on overall probabilistic prediction. For each benchmark data set, the classification performance is evaluated by fourfold and tenfold cross-validation.

5.1. Comparison with State-of-the-Art Algorithms. We choose NB and TAN as comparator algorithms because they are relatively unparameterized and can readily produce clearly understood performance outcomes. We first consider the relative performance when the induction rules or FDs take different l values, that is, one or two. To assess these predictions, we calculated the mean error, bias, and variance for NB and TAN over the nine data sets. The experimental results are presented in Tables 2 and 3. We observe


Input: Training set T_1, testing set T_2, and attribute set X = {X_1, ..., X_n}.
Output: Restrictive TAN model.
Step 1. In the drafting phase, the TAN model is used as the basic structure.
Step 2. Mine association rules from the data set {T_1 + T_2}, and transform these rules into FDs. Thereafter, obtain the closure of the FDs.
Step 3. In the thinning phase, remove redundant attributes and the corresponding edges that originate from these attributes. Thereafter, obtain the simplified structure.
Step 4. Learn the TAN model as the mapping structure with the rest of the attributes.
Step 5. In the thickening phase, compare the mapping structure and the simplified structure. If a new edge exists in the mapping structure, add the edge to the simplified structure.

Algorithm 4
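Read as code, the five steps above might be organized as in the following Python outline. It reuses the kind of helpers sketched earlier (apriori and belief_fds from Section 2.3, attribute_closure from Section 2.1, build_tan from Algorithm 3), treats the data as dicts of aligned columns, and omits the final edge-by-edge comparison of Step 5 for brevity; it is an illustrative reconstruction of the control flow, not the author's implementation.

def tan_fda(train, test, class_col, min_support=0.05):
    """Illustrative control flow of TAN-FDA: drafting, thinning, thickening.
    Assumes apriori, belief_fds, attribute_closure, and build_tan are defined
    as in the earlier sketches; train/test map column names to value lists."""
    # Step 1 (drafting): start from a TAN structure over all attributes.
    draft = build_tan(train, class_col)

    # Step 2: mine association rules from the whole data set and keep the
    # 100%-confidence rules as FDs, reduced to attribute-level left/right sides.
    whole = {col: train[col] + test[col] for col in train}
    transactions = [frozenset((col, whole[col][i]) for col in whole if col != class_col)
                    for i in range(len(whole[class_col]))]
    rules = belief_fds(apriori(transactions, min_support), transactions)
    fds = [({a for a, _ in lhs}, {a for a, _ in rhs}) for lhs, rhs in rules]

    # Step 3 (thinning): attributes in alpha+ but outside alpha are redundant.
    redundant = set()
    for lhs, _ in fds:
        redundant |= attribute_closure(lhs, fds) - lhs

    # Step 4: relearn the TAN structure over the remaining attributes.
    kept = {col: vals for col, vals in train.items()
            if col == class_col or col not in redundant}
    mapping = build_tan(kept, class_col)

    # Step 5 (thickening) would compare `draft` and `mapping` edge by edge here.
    return mapping, redundant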

Table 2: Comparative error, bias, and variance for NB.

Classifier        NB (l = 1)    NB (l = 2)    NB
Mean error        0.109         0.097         0.112
Mean bias         0.121         0.117         0.135
Mean variance     0.035         0.037         0.032

Table 3: Comparative error, bias, and variance for TAN.

Classifier        TAN (l = 1)   TAN (l = 2)   TAN
Mean error        0.093         0.087         0.105
Mean bias         0.105         0.093         0.124
Mean variance     0.028         0.033         0.027

Table 4: Comparative error, bias, and variance for TAN-FDA versusSNB-FSS.

Classifier        TAN-FDA (l = 1)   TAN-FDA (l = 2)   SNB-FSS
Mean error        0.091             0.088             0.097
Mean bias         0.157             0.166             0.143
Mean variance     0.022             0.032             0.036

Table 5: Comparative error, bias, and variance for TAN-FDA versusTAN-CFS.

Classifier        TAN-FDA (l = 1)   TAN-FDA (l = 2)   TAN-CFS
Mean error        0.091             0.088             0.089
Mean bias         0.157             0.166             0.120
Mean variance     0.022             0.032             0.044

that increasing l from one to two consistently decreases bias at the cost of an increase in variance. This trade-off delivers low errors for NB and TAN. If our reasoning on the expected bias profiles of these algorithms is accepted, the performance of NB or TAN should increase with increasing data quantity, and unrestricted BN classifiers should also achieve low errors.

5.2. Comparison with SNB-FSS and TAN-CFS. Tables 4 and 5 show the mean error, bias, and variance results of TAN-FDA, SNB-FSS, and TAN-CFS. TAN-FDA has higher biases, lower variances, and lower mean errors than SNB-FSS and TAN-CFS. When l increases from one to two, fewer instances may satisfy the stopping criterion of the association rule. Moreover, the rules or extracted FDs may only be coincidental and not real domain knowledge. The occurrence of low variance is less obvious than the occurrence of high mean bias. This phenomenon remains an interesting unexplained topic that is worthy of further investigation.

To obtain Pcost for a BN, the probabilistic prediction for each test instance is calculated by arithmetically averaging the probabilistic predictions submitted by each iteration. We observe that TAN-FDA achieves better (lower) probabilistic costing than SNB-FSS and TAN-CFS on almost all data sets in terms of the logarithm-of-probability Pcost bit costing (Figure 6). The superior performance of TAN-FDA on probabilistic prediction can be attributed to the belief that FDs are extracted from the whole data set, irrelevant attributes are not included, and classifiers are based on subsets of selected attributes. This restricted network structure maximizes the classification performance by removing irrelevant attributes and relaxing the independence assumptions between correlated attributes. The computational demands for determining the network structure are low, particularly if a large number of attributes are available.

The size of the CPTs of the nodes increases exponentially with the number of parents, thus possibly resulting in unreliable probability estimates for nodes that have a large number of parents. However, the introduction of FDs reduces this negative effect significantly.

6. Conclusion

For high-dimensional data, only very low-dimensional forms of BN are robust. FDs provide a novel way of solving this common problem in machine-learning techniques. Moreover, we have established that higher-dimensional variants are likely to deliver greater accuracy than lower-dimensional alternatives when provided with reliable domain knowledge. Thus, a promising direction for future research is the development of computationally efficient techniques for approximating TAN-FDA with high l values.

A further unresolved issue is the manner of selecting an appropriate l value for any specific data set T. Furthermore, a number of techniques have been developed for extending TAN to handle numeric data [21]. Extending the current study to the general TAN-FDA framework is needed.


Figure 6: Comparison of the probabilistic costing of TAN-FDA relative to (a) SNB-FSS and (b) TAN-CFS on the benchmark data sets, under fourfold and tenfold cross-validation.

We have presented a strategy for deriving FDs and removing redundant nodes. However, mining the relationships between different FDs should also be important. Reducing the errors of BNs has been proven possible by appropriate feature selection and submodel weighting [22]. Therefore, exploring efficient methods for removing redundant FD parts and obtaining high l values is significant. If fast classification is required and training time is not constrained, approaches that use search methods to select small numbers of FDs from a TAN-FDA model are likely to be appropriate. If sufficient training time is available, searching for appropriate l values will be useful.

For forward sequential selection (FSS) to consider joining two attributes, FSS must first construct a classifier wherein one of the attributes is used independently. However, on exclusive-or problems, a Bayesian classifier with only one relevant attribute will not perform better than a simple guess on the most frequent class, and correct relevant attributes will rarely be joined. By contrast, classical floating search (CFS) starts with the TAN classifier of all attributes and considers all pairs of joins. CFS always considers joining two relevant attributes (specifically on larger data sets) and has high accuracy. A potential weakness of the algorithm for joining attributes is that, to join more than two attributes, two attributes must first be joined before joining other attributes. This process will not occur unless the formation of the first pair results in an increase in accuracy. This problem is common in many algorithms. In this paper, we have developed a generative learning algorithm that generalizes the principles underlying TAN-FDA. We have shown that searching for dependencies among attributes while learning Bayesian classifiers results in significant increases in accuracy. All redundant attributes will be considered only once to improve the accuracy of the Bayesian classifier when more than two interacting attributes are present.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61272209) and the Postdoctoral Science Foundation of China (Grant no. 20100481053).

References

[1] V.-F. Joan, S. Juan, and S.-O. Emilio, "Expert system for predicting unstable angina based on Bayesian networks," Expert Systems with Applications, vol. 40, no. 12, pp. 5004–5010, 2013.

[2] J. Yu Zhang and H. Song, "Bayesian-network-based reliability analysis of PLC systems," IEEE Transactions on Industrial Electronics, vol. 60, no. 11, pp. 5325–5336, 2013.

[3] P. Larranaga, H. Karshenas, C. Bielza, and R. Santana, "A review on evolutionary algorithms in Bayesian network learning and inference tasks," Information Sciences, vol. 233, pp. 109–125, 2013.

[4] S. S. Young and L. A. Sung, "Bayesian network analysis for the dynamic prediction of early stage entrepreneurial activity index," Expert Systems with Applications, vol. 40, no. 10, pp. 4003–4009, 2013.

[5] P. Jaehui and L. Sang-goo, "Pragmatic correlation analysis for probabilistic ranking over relational data," Expert Systems with Applications, vol. 40, no. 7, pp. 2649–2658, 2013.

[6] M. Thimm and G. Kern-Isberner, "On probabilistic inference in relational conditional logics," Logic Journal of the IGPL, vol. 20, no. 5, pp. 872–908, 2012.

[7] W.-Y. Liu, K. Yue, and W.-H. Li, "Constructing the Bayesian network structure from dependencies implied in multiple relational schemas," Expert Systems with Applications, vol. 38, no. 6, pp. 7123–7134, 2011.

[8] W. W. Armstrong, "Dependency structures of data base relationships," in Proceedings of the International Federation for Information Processing Congress (IFIP '74), pp. 580–583, 1974.

[9] L. Wang, G. Yao, and X. Li, "Extracting logical rules and attribute subset from confidence domain," Information, vol. 15, no. 1, pp. 173–180, 2012.

[10] L. M. Wang, X. H. Wei, and L. Y. Dong, "Bayesian network inference based on functional dependency mining of relational database," Information, vol. 15, no. 6, pp. 2441–2446, 2012.

[11] Z. Xie, W. Hsu, and Z. Liu, "SNNB: a selective neighborhood based naive Bayes for lazy learning," in Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 104–114, 2002.

[12] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.


[13] F. Ferrarotti, S. Hartmann, and S. Link, "Reasoning about functional and full hierarchical dependencies over partial relations," Information Sciences, vol. 235, pp. 150–173, 2013.

[14] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499, 1994.

[15] S. S. Liao, H. Q. Wang, Q. D. Li, and W. Y. Liu, "A functional-dependencies-based Bayesian networks learning method and its application in a mobile commerce system," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 36, no. 3, pp. 660–671, 2006.

[16] J. Lechtenborger, "Computing unique canonical covers for simple FDs via transitive reduction," Information Processing Letters, vol. 92, no. 4, pp. 169–174, 2004.

[17] I. Inza, P. Larranaga, and B. Sierra, "Feature subset selection by Bayesian networks: a comparison with genetic and sequential algorithms," International Journal of Approximate Reasoning, vol. 27, no. 2, pp. 143–164, 2001.

[18] F. Pernkopf and P. O'Leary, "Floating search algorithm for structure learning of Bayesian network classifiers," Pattern Recognition Letters, vol. 24, no. 15, pp. 2839–2848, 2003.

[19] U. M. Fayyad and K. B. Irani, "Multi-interval discretization of continuous valued attributes for classification learning," in Proceedings of the 13th International Conference on Artificial Intelligence, pp. 1022–1029, 1993.

[20] R. Kohavi, "Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 202–207, 1996.

[21] M. J. Flores, J. A. Gamez, A. M. Martinez, and J. M. Puerta, "Handling numeric attributes when comparing Bayesian network classifiers: does the discretization method matter?" Applied Intelligence, vol. 34, no. 3, pp. 372–385, 2011.

[22] P. A. D. Castro and F. J. von Zuben, "Learning Bayesian networks to perform feature selection," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '09), pp. 467–473, June 2009.
