Page 1: Bayesian networks for supporting query processing over incomplete autonomous databases (rakaposhi.eas.asu.edu/rohit-sushovan-jiis.pdf, 2013-09-03)

J Intell Inf Syst
DOI 10.1007/s10844-013-0277-0

Bayesian networks for supporting query processing over incomplete autonomous databases

Rohit Raghunathan · Sushovan De · Subbarao Kambhampati

Received: 28 August 2012 / Revised: 15 August 2013 / Accepted: 19 August 2013
© Springer Science+Business Media New York 2013

Abstract As the information available to naïve users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as QPIAD aim to mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when there are tuples containing missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves mining/"learning" Bayesian networks from a sample of the database and using them to do both imputation (predict a missing value) and query rewriting (retrieve relevant results with incompleteness on the query-constrained attributes, when the data sources are autonomous). We present empirical studies demonstrating that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide a significantly higher classification accuracy, and (ii) the relevant possible answers retrieved by the queries reformulated using Bayesian networks provide higher precision and recall than AFDs while keeping query processing costs manageable.

This research is supported by ONR grant N000140910032 and two Google research awards.

R. Raghunathan
Amazon, Seattle WA, USA
e-mail: [email protected]

S. De (B) · S. Kambhampati
Computer Science and Engineering, Arizona State University, Tempe AZ, USA
e-mail: [email protected]

S. Kambhampati
e-mail: [email protected]

Keywords Data cleaning · Bayesian networks · Query rewriting · Autonomous database

1 Introduction

As the popularity of the World Wide Web continues to increase, naïve users have access to more and more information in autonomous databases. Incompleteness in these autonomous sources is extremely commonplace. Such incompleteness mainly arises from the way in which these databases are populated: by naïve users, or through (inaccurate) automatic extraction. Dealing with incompleteness in the databases requires tools for dealing with uncertainty. Previous attempts at dealing with this uncertainty by systems like QPIAD (Wolf et al. 2009) have mainly focused on rule-based approaches, popularly known in the database community as Approximate Functional Dependencies (AFDs). The appeal of AFDs lies in the ease of specifying the dependencies and of learning and reasoning with uncertainty. However, uncertain reasoning using AFDs adopts the certainty factors model, which assumes that the principles of locality and detachment (Russell and Norvig 2010) hold. These principles do not hold for uncertain reasoning, and relying on them can lead to erroneous conclusions. As the level of incompleteness in the information sources increases, the need for more scalable and accurate reasoning becomes paramount.

Full probabilistic reasoning avoids the traps of AFDs. Graphical models are an efficient way of doing full probabilistic reasoning. A Bayesian network is such a model, in which direct dependencies between the variables in a problem are modeled as a directed acyclic graph, and indirect dependencies can be inferred. As desired, Bayesian networks can model both causal and diagnostic dependencies. Using Bayesian networks for uncertain reasoning has largely replaced rule-based approaches in Artificial Intelligence. However, learning and inference on Bayesian networks can be computationally expensive, which might inhibit their application to handling incompleteness in autonomous data sources. In this paper, we consider whether these costs can be kept manageable without compromising the improved accuracy offered by Bayesian networks, in the context of incompleteness in autonomous databases.

1.1 Incompleteness in autonomous databases

Increasingly, many autonomous web databases are populated by automated techniques or by naïve users, with very little curation. For example, databases like autotrader.com are populated using automated extraction techniques by crawling text classifieds, and by car owners entering data through forms. Scientific databases such as CBioC (2013) also use similar techniques for populating the database. However, Gupta and Sarawagi (2006) have shown that these techniques are error prone and lead to incompleteness in the database, in the sense that many of the attributes have missing values. Wolf et al. (2009) report that 99 % of the 35,000 tuples extracted from Cars Direct were incomplete. When the mediator has privileges to modify the data sources, the missing values in these data sources can be completed using "imputation", which attempts to fill in the missing values with the most likely values. As the levels of incompleteness in these data sources increase, it is not uncommon to come across tuples with multiple missing values. Effectively finding the most likely completions for these multiple missing values requires capturing the dependencies between them. A second challenge arises when the underlying data sources are autonomous: since access to these databases is through forms, the mediator cannot complete the missing values with the most likely values. Instead, mediators need to generate and issue a set of reformulated queries in order to retrieve the relevant answers with missing values. Efficiency considerations dictate that the number of reformulations be kept low. In such scenarios, it becomes very important for mediators to send queries that retrieve not only a large fraction of relevant results (precision), but also a large number of relevant results (recall).

QPIAD & AFDs The QPIAD system addresses the challenges in retrieving relevant incomplete answers by learning the correlations between the attributes in the database as AFDs, and the value distributions as Naïve Bayesian Classifiers. AFDs are rule-based methods for dealing with uncertainty. They adopt the certainty factors model (Shortliffe 1976), which makes two strong assumptions:

1. Principle of Locality: Whenever there is a rule A → B, given evidence of A, we can conclude B, regardless of the other rules and evidence.

2. Principle of Detachment: Whenever a proposition B is found to be true, the truth of B can be used regardless of how it was found to be true.

However, these two assumptions do not hold in the presence of uncertainty. When propagating beliefs, it is important to consider not only all the evidence but also its sources. Using AFDs for reasoning with uncertainty can therefore lead to cyclic reasoning and fail to capture the correlations between multiple missing values. In addition to these shortcomings, the beliefs are represented using a Naïve Bayesian Classifier, which makes strong conditional independence assumptions, often leading to inaccurate values.

1.2 Overview of our approach

Given the advantages of Bayesian networks over AFDs, we investigate whether replacing AFDs with Bayesian networks in the QPIAD system provides higher accuracy while keeping the costs manageable. Learning and inference with Bayesian networks are computationally harder than with AFDs. The challenge in replacing AFDs with Bayesian networks therefore lies in learning them, and in using them for both imputation and query rewriting, while keeping costs manageable. We use the BANJO software package (Hartemink et al. 2005) to learn the topology of the Bayesian network, and the BNT (Murphy et al. 2001) and INFER.NET (Minka et al. 2010) software packages to do inference on it. Even though learning the topology of the Bayesian network from a sample of the database involves searching over the possible topologies, we found that high-fidelity Bayesian networks could be learned from a small fraction of the database while keeping costs manageable (in terms of time spent searching). Inference in Bayesian networks is intractable in the worst case if the network is multiply connected, i.e., there is more than one undirected path between some pair of nodes in the network. We handle this challenge by using approximate inference techniques, which are able to retain the accuracy of exact inference techniques while keeping the cost of inference manageable. We compare the cost and accuracy of using AFDs and Bayesian networks for imputing single and multiple missing values at different levels of incompleteness in the test data.

We also develop new techniques for generating rewritten queries using Bayesian networks. The three challenges involved in generating rewritten queries are:

1. Selecting the attributes on which the new queries will be formulated. Selecting these attributes by searching over all the attributes becomes too expensive as the number of attributes in the database increases.

2. Determining the values to which the attributes in the rewritten query will be constrained. The domains of attributes in most autonomous databases are often large, and searching over each and every value can be expensive.

3. Most autonomous data sources limit the number of queries that they will answer. The rewritten queries that we generate should carefully trade off precision against the throughput of the results returned.

We propose techniques to handle these challenges and compare them with AFD-based approaches in terms of the precision and recall of the results returned.

Organization The rest of the paper is organized as follows. We begin with a discussion of related work; in Section 3, we describe the problem setting and background. In Section 4, we discuss how Bayesian network models of autonomous databases can be learned while keeping costs manageable. In Section 5, we compare the prediction accuracy and cost of using Bayesian networks and AFDs for imputing missing values. Next, in Section 6, we discuss how rewritten queries are generated using Bayesian networks and compare them with AFD-based approaches for single- and multi-attribute queries. Finally, we conclude in Section 7.

2 Related work

This work is a significant extension of the QPIAD system (Wolf et al. 2007, 2009), which also deals with incompleteness in databases. While the QPIAD system also learns attribute correlations, it does so using Approximate Functional Dependencies (AFDs) and uses Naïve Bayesian Classifiers for representing value distributions and reformulating queries. Additionally, the QPIAD system can only handle missing values on a single attribute. In contrast, we use Bayesian network models learned from a sample of the database to represent attribute correlations and value distributions. We use the methods of the QPIAD system as our baseline approach.

Completing missing values in databases using Bayesian networks has been addressed previously (Ramoni and Sebastiani 1997, 2001; Romero and Salmerón 2004; Fernández et al. 2012). Other methods have also been proposed to address learning from missing data: for example, Batista and Monard (2002) propose a k-Nearest Neighbor approach, and Dempster et al. (1977) propose an EM approach.


But most methods focus on completing the missing values so as to preserve the original data statistics, so that other data mining techniques can be applied afterwards. We concentrate on retrieving relevant possible answers in the presence of missing values on single and multiple attributes. For example, Wu et al. (2004) use association rules to impute the values of the missing attributes, whereas we use Bayesian networks to impute the values as well as retrieve results. In Section 5.2, we will show that using Bayesian networks is clearly superior to using rule-based imputation.

Like QPIAD and other work on querying over incomplete databases, we too assume that the level of incompleteness in the database is small enough that it is possible to get a training sample that is mostly complete. Thus, we use techniques for learning from complete training data. If the training sample itself were incomplete, we would need to employ expectation-maximization techniques during learning (Dempster et al. 1977).

Work on querying inconsistent databases usually focuses on fixing problems with the query itself (Muslea and Lee 2005; Nambiar and Kambhampati 2006): if the query has an empty result set, or if it does not include all relevant keywords, it can be automatically augmented to fix those shortcomings. The objective of this work is instead to deal with shortcomings of the data; our query rewriting algorithms help retrieve useful tuples even in the presence of multiple missing values.

3 Problem setting & background

3.1 Overview of QPIAD

Since our main comparison is with the QPIAD system, we first provide a brief overview of its operation. Given a relation R, a subset X of its attributes, and a single attribute A of R, an approximate functional dependency (AFD), denoted X ⇝ A, holds on R between X and A if the corresponding functional dependency X → A holds on all but a small fraction of the tuples of R.
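To make this concrete, the confidence of an AFD X ⇝ A can be computed as the fraction of tuples consistent with the best per-group mapping from each X-value to a single A-value. The sketch below uses made-up car tuples, and the exact confidence measure used by QPIAD may differ; it illustrates one common way to mine AFD confidence:

```python
from collections import Counter, defaultdict

def afd_confidence(tuples, det_attrs, target):
    """Confidence of det_attrs ~> target: the fraction of tuples consistent
    with the best single mapping from each determining-set value to a target value."""
    groups = defaultdict(Counter)
    for t in tuples:
        key = tuple(t[a] for a in det_attrs)
        groups[key][t[target]] += 1
    consistent = sum(c.most_common(1)[0][1] for c in groups.values())
    return consistent / len(tuples)

# Hypothetical tuples in the spirit of the cars database.
cars = [
    {"Model": "A8",    "Body": "Sedan"},
    {"Model": "A8",    "Body": "Sedan"},
    {"Model": "745",   "Body": "Sedan"},
    {"Model": "645",   "Body": "Convt"},
    {"Model": "Santa", "Body": "SUV"},
    {"Model": "Santa", "Body": "Wagon"},  # the one exception to Model ~> Body
]
print(afd_confidence(cars, ["Model"], "Body"))  # 5 of 6 tuples fit -> ~0.83
```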

To illustrate how QPIAD works, consider the query Q : Body = SUV issued to Table 1. Traditional query processors will only retrieve tuples t7 and t9. However, the entities represented by tuples t8 and t10 are also likely to be relevant. The QPIAD system's aim is to retrieve tuples t8 and t10 in addition to t7 and t9. In order to retrieve tuples t8 and t10, it uses AFDs mined from a sample of the database. For example, an AFD Model ⇝ Body may be mined from the fragment of the cars database shown in Table 1. This indicates that the value of a car's Model attribute often (but not always) determines the value of its Body attribute. These rules are used to retrieve relevant incomplete answers.

When the mediator has access privileges to modify the database, AFDs are used along with Naïve Bayesian Classifiers to fill in the missing values as a simple classification task, and traditional query processing then suffices to retrieve relevant answers with missing values. In more realistic scenarios, when such privileges are not provided, mediators generate a set of rewritten queries and send them to the database in addition to the original user query. According to the AFD mentioned above and tuple t7 retrieved by traditional query processors, a rewritten query Q′1 : σModel=Santa may be generated to retrieve t8. Similarly, Q′2 : σModel=MDX may be generated, which will retrieve t10.
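A minimal sketch of this rewriting step follows. It constrains the determining-set attributes of the AFD to the values seen in the certain answers; the actual QPIAD implementation additionally ranks and prunes the rewritten queries, which is omitted here:

```python
def rewrite_queries(certain_answers, det_attrs):
    """Generate rewritten queries by constraining the determining-set
    attributes to the distinct values seen in the certain answers."""
    seen, queries = set(), []
    for t in certain_answers:
        binding = tuple((a, t[a]) for a in det_attrs)
        if binding not in seen:
            seen.add(binding)
            queries.append(dict(binding))
    return queries

# Certain answers to Q : Body = SUV from Table 1 (tuples t7 and t9),
# with the AFD Model ~> Body supplying the determining set.
answers = [{"Model": "Santa", "Body": "SUV"}, {"Model": "MDX", "Body": "SUV"}]
print(rewrite_queries(answers, ["Model"]))
# -> [{'Model': 'Santa'}, {'Model': 'MDX'}]
```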


Table 1 A fragment of a car database

ID   Make     Model  Year  Body   Mileage
1    Audi     Null   Null  Sedan  20000
2    Audi     A8     Null  Sedan  15000
3    BMW      745    2002  Sedan  40000
4    Audi     Null   2005  Sedan  20000
5    Audi     A8     2005  Sedan  20000
6    BMW      645    1999  Convt  Null
7    Hyundai  Santa  1990  SUV    45000
8    Hyundai  Santa  1993  Null   40000
9    Acura    MDX    1990  SUV    30000
10   Acura    MDX    1990  Null   12000

Multiple rules can be mined for each attribute; for example, the mileage and year of a car might determine its body style, so a rule {Year, Mileage} ⇝ Body could be mined. Each rule has an associated confidence, which specifies how accurately the determining set of an attribute's AFD predicts it. The current QPIAD system uses only the highest-confidence AFD¹ of each attribute for imputation and query rewriting. In addition, it only aims to retrieve relevant incomplete answers with at most one missing value on the query-constrained attributes.

To illustrate the shortcomings of AFDs, consider a query Q : σModel=A8 ∧ Year=2005 issued to the fragment of the car database shown in Table 1. When the mediator has modification privileges, the missing values for the attributes Model and Year can be completed with the most likely values before returning the answer set. Using AFDs to predict the missing values in tuple t1 ignores the correlation between Model and Year, predicting them independently. Substituting the value of the missing attribute Year in tuple t2 using just the highest-confidence rule, as is done in QPIAD, often leads to inaccurate propagation of beliefs, since the other rules are ignored. When the mediator does not have privileges to modify the database, a set of rewritten queries is generated and issued to the database to retrieve the relevant uncertain answers. Issuing Q to the database fragment in Table 1 retrieves t5. The rewritten queries generated by the methods discussed in QPIAD retrieve tuples t2 and t4. However, they do not retrieve tuple t1, even though the entity it represents may well be relevant to the user's query.

3.2 Bayesian networks

A Bayesian network (Pearl 1988) is a graphical representation of the probabilistic dependencies between the variables in a domain. The generative model of a relational database can be represented using a Bayesian network, where each node in the network represents an attribute in the database. The edges between the nodes represent direct probabilistic dependencies between the attributes. The strength of these probabilistic dependencies is modeled by associating a conditional probability distribution (CPD) with each node, representing the conditional probability of a variable given the combination of values of its immediate parents. A Bayesian network is a compact representation of the full joint probability distribution of the nodes in the network. The full joint distribution can be constructed from the CPDs in the Bayesian network. Given the full joint distribution, any probabilistic query can be answered. In particular, the probability of any set of hypotheses can be computed, given any set of observations, by conditioning and marginalizing over the joint distribution. Since the semantics of Bayesian networks are in terms of the full joint probability distribution, inference using them considers the influence of all variables in the network. Therefore, Bayesian networks, unlike AFDs, do not make the Locality and Detachment assumptions.

¹The actual implementation of QPIAD uses a variant of the highest-confidence AFD for some of the attributes. For details we refer the reader to Wolf et al. (2009).
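To illustrate, the following sketch builds the full joint distribution for a toy two-node network with hypothetical, made-up CPDs, and answers a query by conditioning and marginalizing over the joint, exactly as described above:

```python
from itertools import product

# Toy two-node network Make -> Model with hypothetical CPDs.
p_make = {"Audi": 0.5, "BMW": 0.5}
p_model_given_make = {("Audi", "A8"): 0.8, ("Audi", "745"): 0.2,
                      ("BMW", "A8"): 0.1, ("BMW", "745"): 0.9}

# The full joint distribution is the product of the CPDs.
joint = {(mk, md): p_make[mk] * p_model_given_make[(mk, md)]
         for mk, md in product(p_make, ["A8", "745"])}

# Any probabilistic query is answered by conditioning and marginalizing,
# e.g. P(Make | Model = "A8"):
consistent = {k: p for k, p in joint.items() if k[1] == "A8"}
z = sum(consistent.values())
posterior = {mk: p / z for (mk, _), p in consistent.items()}
print(posterior)  # {'Audi': 0.888..., 'BMW': 0.111...}
```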

The Markov blanket of a node X in a Bayesian network is the set of nodes comprising its parents, its children, and its children's other parents. If the values of all the nodes in the Markov blanket of X are given as evidence, then the value of X is independent of all other nodes in the network (Pearl 1988). The junction-tree algorithm (Jensen et al. 2006), also known as the clique-tree algorithm, is a message-passing algorithm that efficiently evaluates the posterior distribution on a Bayesian network. Gibbs sampling (Bishop et al. 2006) is a method of approximate inference on a Bayesian network that relies on Markov chain sampling of the network.
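The Markov blanket definition translates directly into code. The edge list below is a hypothetical network fragment chosen for illustration, not the learned structure of Fig. 1:

```python
def markov_blanket(edges, node):
    """Parents, children, and children's other parents of `node`
    in a DAG given as a list of (parent, child) edges."""
    parents  = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    spouses  = {p for p, c in edges if c in children and p != node}
    return parents | children | spouses

# Hypothetical edges over the car-database attributes.
edges = [("Model", "Make"), ("Model", "Body"), ("Year", "Price"),
         ("Model", "Price"), ("Mileage", "Price")]
print(sorted(markov_blanket(edges, "Model")))
# -> ['Body', 'Make', 'Mileage', 'Price', 'Year']
```

Here Year and Mileage enter Model's blanket only as co-parents of Price, matching the definition above.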

4 Learning Bayesian network models

In this section we discuss how we learn the topology and parameters of the Bayesian network while keeping costs manageable. We learn the generative models of two databases. The first dataset is extracted from Cars.com (Cars.com 2013) with the schema Cars(Model, Year, Body, Make, Price, Mileage); we call it the 'car database' in this paper. It has 55,000 tuples, from which we extract a fragment of 8,000 tuples to learn the generative model. The second is the adult database, consisting of 15,000 tuples obtained from the UCI data repository (Frank and Asuncion 2010) with the schema Adult(WorkClass, Occupation, Education, Sex, HoursPerWeek, Race, Relationship, NativeCountry, MaritalStatus, Age); we use the entire dataset for learning. Tables 2 and 3 describe the schemas and the domain sizes of the attributes in the two databases. The first row in each table corresponds to the scenario where a mediator system is accessing the database; in this scenario, the system does not have complete knowledge of the domain, and is therefore able to access only a fraction of the data. Attributes with continuous values are discretized and used as categorical attributes: the Price and Mileage attributes in the car database are discretized by rounding off to the nearest five thousand, and in the adult database the attributes Age and HoursPerWeek are discretized to the nearest multiple of five.
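The discretization described above amounts to rounding to the nearest multiple of a step size, e.g.:

```python
def discretize(value, step):
    """Round a continuous value to the nearest multiple of `step`."""
    return int(round(value / step) * step)

print(discretize(43750, 5000))  # Price/Mileage, nearest 5,000 -> 45000
print(discretize(37, 5))        # Age/HoursPerWeek, nearest 5  -> 35
```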

The structure of the Bayesian network is learned from a complete sample of the autonomous database. We use the BANJO package (Hartemink et al. 2005) as a black box for learning the structure of the Bayesian network. To keep the learning

Table 2 Domain size of attributes in car database

Database                   Year  Model  Make  Price  Mileage  Body
Cars-8000-20 (Mediator)       9     38     6     19       17     5
Cars-8000-100 (Complete)     12     41     6     30       20     7


Table 3 Domain size of attributes in adult database

Database                     Age  WorkClass  Education  MaritalStatus  Occupation  Relationship  Race  Sex  HoursPerWeek  NativeCountry
Adult-15000-20 (Mediator)      8          7         16              7          14             6     5    2            10             37
Adult-15000-100 (Complete)     8          7         16              7          14             6     5    2            10             40

costs manageable, we constrain nodes to have at most two parents. In cases where more than two attributes are directly correlated with an attribute, the additional attributes can be modeled as children; there is no limit on the number of children a node can have. Figure 1a shows the structure of the Bayesian network learned for the car database, and Fig. 1b the one learned for the adult database. We used samples of sizes varying from 5–20 % of the database and found that the structure of the highest-scoring network remained the same.

The BANJO settings used were: searcher choice = Simulated Annealing; proposer choice = Random Local Move (addition, deletion, or reversal of an edge in the current network, selected at random); evaluator choice = BDe (Heckerman et al. 1995); decider choice = Metropolis (a Metropolis-Hastings stochastic decision mechanism, where any network with a higher score is accepted, and any with a lower score is accepted with a probability based on a system parameter known as the "temperature").

Fig. 1 Bayesian networks learned from a sample of the data. The size of the sample was varied from 5 to 20 %; the same structure was observed in all cases. (a) Car database. (b) Adult database

We also experimented with different time limits for the search, ranging from 5 to 30 min. We did not see any change in the structure of the highest-confidence network.

Bayesian network inference is used in both the imputation and query rewriting tasks. Imputation substitutes the missing values with the most likely values, which requires inference. Exact inference in Bayesian networks is NP-hard in the worst case (Cooper 1990) if the network is multiply connected. Therefore, to keep query processing costs manageable, we use approximate inference techniques. In our experiments, described in Section 5, we found that approximate inference techniques retain the accuracy edge of exact inference while keeping prediction costs manageable. We use the BNT package (Murphy et al. 2001) for doing inference on the Bayesian network² for the imputation task. We experimented with the various exact inference engines that BNT offers and found the junction-tree engine to be the fastest. When querying multiple variables, the junction-tree inference engine can be used only when all the queried variables form a clique; when they do not, we use the variable elimination algorithm (Jensen and Nielsen 2007).

5 Imputation using Bayesian networks

In this section we compare the prediction accuracy and cost of Bayesian networks versus AFDs for imputing single and multiple missing values when there is incompleteness in the test data. Prediction accuracy is measured in controlled experiments where the ground truth is known, and cost is measured as the time taken by the algorithm to generate the result. When the mediator has privileges to modify the underlying autonomous database, the missing values can be substituted with the most probable values. Imputation using Bayesian networks first computes the posterior of the attribute to be predicted given the values present in the tuple, and then completes the missing value with the most likely value given the evidence. When predicting multiple missing values, the joint posterior distribution over the missing attributes is computed, and the values with the highest joint probability are used to substitute the missing values. Computing the joint probability over multiple missing values captures the correlations between them, which gives Bayesian networks a clear edge over AFDs. In contrast, imputation using AFDs uses the highest-confidence AFD of each attribute for prediction. If an attribute in the determining set of an AFD is missing, that attribute is first predicted using other AFDs (chaining) before the original attribute can be predicted. The most likely value of each attribute is used to complete the missing value. When multiple missing values need to be predicted, each value is predicted independently.
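The difference between joint and independent imputation can be seen on a small illustrative posterior (the numbers below are made up for exposition): the joint MAP assignment can differ from, and be far more probable than, the tuple assembled from the independent marginal winners:

```python
# Illustrative joint posterior over two missing attributes (Model, Year),
# already conditioned on the values present in the tuple.
posterior = {("A8", "2005"): 0.30, ("A8", "2002"): 0.25, ("745", "2002"): 0.45}

# Joint (Bayesian-network style) imputation: argmax over the joint posterior.
joint_map = max(posterior, key=posterior.get)

# Independent (per-attribute) imputation: argmax of each marginal separately.
def marginal_winner(i):
    m = {}
    for combo, p in posterior.items():
        m[combo[i]] = m.get(combo[i], 0.0) + p
    return max(m, key=m.get)

independent = (marginal_winner(0), marginal_winner(1))
print(joint_map)    # ('745', '2002'), with joint probability 0.45
print(independent)  # ('A8', '2002'), with joint probability only 0.25
```

Here the marginals favor Model = A8 (0.55) and Year = 2002 (0.70), yet that combination is less probable than the jointly most likely one, which is exactly the correlation effect described above.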

²In this prototype, we manually transferred the output of the BANJO module to the BNT module. In future systems, we will integrate them programmatically.


Before discussing the experimental results, we note that an implicit assumption of our method is that the incompleteness in the data is random; in more precise statistical terms, we assume that our data attributes are missing at random (Heitjan and Basu 1996). While this assumption does hold by and large, it is certainly not guaranteed to hold in real-world data. In used car data, for example, the owners of cars with known defects (e.g. Yugo) may well opt not to specify the make. If this happens systematically and completely, then our system will not be able to learn how the make being Yugo is correlated with any other attribute/value combination.

We use the car and adult databases described in the previous section. We compare the AFD approach used in QPIAD, which uses Naïve Bayesian Classifiers to represent value distributions, with exact and approximate inference in Bayesian networks. We call exact inference in a Bayesian network BN-Exact. We use Gibbs sampling as the approximate inference technique, which we call BN-Gibbs; for BN-Gibbs, the probabilities are computed using 250 samples. For the imputation experiments, the data was divided into a training set and a test set, and randomly chosen values of the attribute being tested were removed from the test set.
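As a sketch of how BN-Gibbs estimates a posterior, the following toy three-node chain (with hypothetical CPDs, not our learned networks) repeatedly resamples each hidden variable from its distribution given its Markov blanket, then reads the query probability off the sample counts:

```python
import random

random.seed(0)

# Toy chain Make -> Model -> Body with hypothetical CPDs.
p_make = {"Audi": 0.6, "BMW": 0.4}
p_model = {("Audi", "A8"): 0.9, ("Audi", "745"): 0.1,
           ("BMW", "A8"): 0.2, ("BMW", "745"): 0.8}
p_body = {("A8", "Sedan"): 0.95, ("A8", "Convt"): 0.05,
          ("745", "Sedan"): 0.7, ("745", "Convt"): 0.3}

def sample(dist):
    r, acc = random.random(), 0.0
    for v, p in dist.items():
        acc += p
        if r <= acc:
            return v
    return v

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Gibbs sampling for P(Make | Body = "Sedan"): resample each hidden
# variable given its Markov blanket, discarding a burn-in prefix.
body = "Sedan"
make, model = "Audi", "A8"   # arbitrary initial state
counts = {"Audi": 0, "BMW": 0}
for i in range(2000):
    make = sample(normalize({m: p_make[m] * p_model[(m, model)]
                             for m in p_make}))
    model = sample(normalize({md: p_model[(make, md)] * p_body[(md, body)]
                              for md in ["A8", "745"]}))
    if i >= 250:             # burn-in
        counts[make] += 1
print(normalize(counts))     # estimate of P(Make | Body = Sedan), ~0.65 Audi
```

The exact posterior here is P(Make = Audi | Body = Sedan) ≈ 0.649, which the sample counts approach as more samples are drawn.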

5.1 Imputing single missing values

Our experiments show that prediction accuracy using Bayesian networks is higher than with AFDs for attributes that have multiple high-confidence rules. Approaches for combining multiple rules for classification have been shown to be ineffective by Khatri (2006). Since there is no straightforward way to propagate beliefs using multiple AFDs, only the AFD with the highest confidence is used for propagating beliefs. This method, however, fails to take advantage of the additional information that the other rules provide. Bayesian networks, on the other hand, systematically combine evidence from multiple sources. Figure 2 shows the prediction accuracy in the presence of a single missing value for each attribute in the car database. We notice a statistically significant difference between the prediction accuracies of the two approaches for the attributes Model and Year: multiple rules are mined for these two attributes, and using just the rule with the highest confidence ignores the influence of the other available evidence, which hurts prediction accuracy.

Fig. 2 Single attribute prediction accuracy (cars)

Page 11: Bayesian networks for supporting query processing over incomplete autonomous databasesrakaposhi.eas.asu.edu/rohit-sushovan-jiis.pdf · 2013-09-03 · J Intell Inf Syst DOI 10.1007/s10844-013-0277-0

J Intell Inf Syst

5.2 Imputing multiple missing values

In most real-world scenarios, however, the number of missing values per tuple is likely to be more than one. The advantage of using a more general model like Bayesian networks becomes even more apparent in these cases. Firstly, AFDs cannot be used to impute all combinations of missing values: when the determining set of an AFD contains a missing attribute, that value first needs to be predicted using a different AFD by chaining. While chaining, if we come across an AFD containing the original attribute to be predicted in its determining set, then predicting the missing value becomes impossible. When the missing values are highly correlated, AFDs often get into such cyclic dependencies. In Fig. 3 we can see that the attribute pairs Year-Mileage, Body-Model and Make-Model cannot be predicted by AFDs. As the number of missing values increases, the number of combinations of missing values that can be predicted shrinks. In our experiments with the Cars database, when predicting three missing values, only 9 out of the 20 possible combinations of missing values could be predicted.
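The chaining failure described above can be sketched as a simple reachability check. In this hedged sketch (the determining sets below are hypothetical, chosen only to mirror the Year-Mileage cycle; this is not the paper's implementation), imputing a target attribute fails as soon as the chain revisits an attribute already on it:

```python
# Sketch of why chained AFD imputation can fail (determining sets below are
# hypothetical, chosen to mirror the Year-Mileage cycle described above).
afds = {  # attribute -> determining set of its best AFD
    "Year": {"Mileage"},
    "Mileage": {"Year"},
    "Body": {"Model"},
    "Model": {"Make"},
}

def can_impute(target, missing, afds, visiting=None):
    """True if `target` can be predicted by chaining AFDs when the
    attributes in `missing` are absent from the tuple."""
    visiting = visiting or set()
    if target in visiting:       # chain revisits an attribute: cyclic, fail
        return False
    det = afds.get(target)
    if det is None:              # no AFD determines this attribute
        return False
    return all(a not in missing or
               can_impute(a, missing, afds, visiting | {target})
               for a in det)

print(can_impute("Year", {"Year", "Mileage"}, afds))  # cycle -> False
print(can_impute("Body", {"Body", "Model"}, afds))    # chains via Make -> True
```

Under these assumed rules, the Year-Mileage pair is exactly the cyclic case that AFDs cannot impute, while Body can still be recovered by chaining through Model to the known Make.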

On the other hand, Bayesian networks can predict the missing values regardless of the number and combination of values missing in a tuple. Secondly, while predicting the missing values, Bayesian networks compute the joint probability distribution over the missing attributes, which allows them to capture the correlations between the attributes. In contrast, prediction using AFDs, which use a Naïve Bayes classifier to represent the value distributions, predicts each of the missing attributes independently, ignoring the interactions between them. The attribute pair Year-Model in Fig. 3 shows that the prediction accuracy is significantly higher when correlations between the missing attributes are captured. We also observe that in some cases, when the missing values are d-separated (Geiger et al. 1990) given the values of the other attributes in the tuple, the performance of AFDs and Bayesian networks is comparable. In Fig. 3, we can see that the prediction accuracies for Mileage-Make and Mileage-Model are comparable for all the techniques, since the attributes are d-separated given the other evidence in the tuple. However, the number of attribute pairs that are d-separated is likely to decrease as incompleteness in the databases increases.

There is an interesting trade-off between the time taken and the accuracy of imputation when dealing with tuples with two or more missing attributes. Suppose

Fig. 3 Multiple attribute prediction accuracy (cars)


that the attributes that are missing are X = {x1, x2, ..., xn} and the attributes that are known are A = {a1, a2, ..., am}. Then one approach would be to find the value for each xi that maximizes P(xi|A) individually. This is the "most likely estimation" approach, and the fastest method; however, it is also the least accurate, since it ignores all interactions among the attributes in X. A slightly more accurate approach is a "greedy search" approach, where the value of one of the variables xi in X is (i) found using arg max over xi of P(xi|A), and then (ii) set as evidence. This process is repeated for the remaining unknown attributes, so the imputation of those attributes is informed by A ∪ {xi}. An even more accurate approach is to use the "most probable explanation", where we find the assignment to the entire set of attributes X that maximizes the joint conditional probability P(X|A). This is the approach that we use in this paper.
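The three strategies can be contrasted on a toy example. The sketch below is illustrative only (the joint distribution is made up, not learned from the car data); it shows that maximizing marginals independently can select a jointly improbable combination, whereas the most-probable-explanation approach maximizes the joint directly:

```python
# Toy comparison of the three imputation strategies (the joint distribution
# P(x1, x2 | A) below is made up for illustration, not learned from data).
p = {
    ("2005", "A8"):  0.05,
    ("2005", "745"): 0.40,
    ("2002", "A8"):  0.42,
    ("2002", "745"): 0.13,
}

def marginal(joint, axis):
    m = {}
    for combo, prob in joint.items():
        m[combo[axis]] = m.get(combo[axis], 0.0) + prob
    return m

m0, m1 = marginal(p, 0), marginal(p, 1)

# 1. Most-likely estimation: maximize each marginal P(xi | A) independently.
mle = (max(m0, key=m0.get), max(m1, key=m1.get))

# 2. Greedy: fix x1 to its marginal argmax, then maximize P(x2 | A, x1).
x1 = max(m0, key=m0.get)
x2 = max((v for (u, v) in p if u == x1), key=lambda v: p[(x1, v)])
greedy = (x1, x2)

# 3. Most probable explanation: maximize the joint P(x1, x2 | A) directly.
mpe = max(p, key=p.get)

print(mle, greedy, mpe)  # MLE picks a jointly improbable pair here
```

On this distribution, MLE returns ("2002", "745"), whose joint probability is only 0.13, while greedy and MPE both return ("2002", "A8") with joint probability 0.42; in general greedy need not agree with MPE.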

5.3 Prediction accuracy with increase in incompleteness in test data

As the incompleteness in the database increases, not only does the number of values that need to be predicted increase, but the evidence available for predicting these missing values also shrinks. We compared the performance of Bayesian networks and AFDs as the incompleteness in the autonomous databases increases. We see that the prediction accuracy of AFDs drops faster with increasing incompleteness. This is because the chaining required for predicting missing values using AFDs increases, which in turn increases the chances of getting into cyclic dependencies. Also, when an attribute has multiple AFDs, propagating beliefs using just one rule and ignoring the others often violates the principles of detachment and locality (Heckerman 1992), impacting the prediction accuracy.

On the other hand, Bayesian networks, being a generative model, can systematically infer the values of any set of attributes given evidence on any other set. Therefore, as the incompleteness of the database increases, the prediction accuracy of Bayesian networks remains significantly higher than that of AFDs. Figure 4 shows the prediction accuracy of AFDs and Bayesian networks when single and multiple attributes need to be predicted in the car and adult databases. We see that both Bayesian network approaches have higher prediction accuracy than AFDs at all levels of incompleteness.

5.4 Time taken for imputation

We now compare the time taken to impute the missing values using AFDs, exact inference (junction tree) and Gibbs sampling (250 samples) as the number of missing values in the autonomous database increases. Table 4 reports the time taken to impute a Cars database with 5,479 tuples, and Fig. 4 shows the accuracy. In Fig. 4 we can clearly see that the AFD-based method is outperformed by both the Gibbs-sampling-based inference method and the exact inference method. Graphs (a–b) use the used car database and graphs (c–d) use the census dataset. We can see from the curves that approximate inference is not much less accurate than exact inference, both in the case of single-attribute imputation (a, c) and two-attribute



Fig. 4 Prediction accuracy with increasing percentage of incompleteness in test data for various missing attributes. a–c Car database, d adult database

imputation (b, d). The difference is about 4 percentage points in the worst case, so we can say that the preferred method for most applications is Gibbs sampling.

Table 4 shows the time taken by each of these methods as the percentage of incompleteness varies. Note that the time taken by the AFD method decreases with higher incompleteness because fewer AFDs are learned from the data.

Table 4 Time taken for predicting 5,479 tuples by AFDs, BN-Gibbs (250 samples) and BN-Exact, in seconds

Percentage of          Time taken    Time taken       Time taken
incompleteness (%)     for AFD (s)   by BN-Gibbs (s)  by BN-Exact (s)
 0                     0.271         44.46             16.23
10                     0.267         47.15             44.88
20                     0.205         52.02             82.52
30                     0.232         54.86            128.26
40                     0.231         56.19            182.33
50                     0.234         58.12            248.75
60                     0.232         60.09            323.78
70                     0.235         61.52            402.13
80                     0.262         63.69            490.31
90                     0.219         66.19            609.65


6 Query rewriting with Bayesian networks

In information integration scenarios where the underlying data sources are autonomous, missing values cannot be completed using imputation. Our goal is to retrieve all answers relevant to the user's query, including tuples which are relevant but have missing values on the attributes constrained in the query. Since query processors are allowed only read-only access to these databases, the only way to retrieve the relevant answers with missing values on query-constrained attributes is to generate and send a set of reformulated queries that constrain other relevant attributes. We describe two techniques, BN-All-MB and BN-Beam, for retrieving such relevant incomplete results using Bayesian networks.

6.1 Generating rewritten queries

Table 5 shows a different fragment of the database shown in Table 1; we will use this fragment to explain our approach. Notice that tuples t2 and t3 have one missing (null) value and tuples t5, t6, t7, t8 have two missing values. To illustrate query rewriting when a single attribute is constrained in the query, consider a query Q: σBody=Sedan. First, the query Q is issued to the autonomous database and all the certain answers, which correspond to tuples t1, t3, t4 and t5 in Table 5, are retrieved. This set of certain answers forms the base result set. However, tuple t2, which has a missing value for Body (possibly due to incomplete extraction or an entry error), is likely to be relevant, since the value of Body would probably have been Sedan had it not been missing. To determine the attributes and their values on which the rewritten queries should be generated, we use the Bayesian network learned from the sample of the autonomous database.

Using the same example, we now illustrate how rewritten queries are generated. First, the set of certain answers forming the base result set is retrieved and returned to the user. The attributes on which the new queries are reformulated consist of all attributes in the Markov blanket of the original query-constrained attribute. We consider all attributes in the Markov blanket while reformulating queries because, given the values of these attributes, the original query-constrained attribute depends on no other attribute in the Bayesian network. From the learned Bayesian network shown in Fig. 1a, the Markov blanket of the attribute Body consists of the attributes {Year, Model}. The values that each of the attributes in the rewritten query can be constrained to are limited to the distinct value combinations

Table 5 A fragment of a car database

ID  Make   Model  Year  Body   Mileage

1   Audi   A8     2005  Sedan  20000
2   Audi   A8     2005  Null   15000
3   Acura  tl     2003  Sedan  Null
4   BMW    745    2002  Sedan  40000
5   Null   745    2002  Sedan  Null
6   Null   645    1999  Convt  Null
7   Null   645    1999  Coupe  Null
8   Null   645    1999  Convt  Null
9   BMW    645    1999  Coupe  40000
10  BMW    645    1999  Convt  40000


Algorithm 1 Algorithm for BN-All-MB
Let R(A1, A2, ..., An) be a database relation. Suppose MB(Ap) is the set of attributes in the Markov blanket of attribute Ap. A query Q: σAp=vp is processed as follows.

1. Send Q to the database to retrieve the base result set RS(Q). Show RS(Q), the set of certain answers, to the user.

2. Generate a set of new queries Q’, order them, and send the most relevant ones to the database to retrieve the extended result set R̂S(Q) as relevant possible answers of Q. This step contains the following tasks.

(a) Generate Rewritten Queries. Let πMB(Ap)(RS(Q)) be the projection of RS(Q) onto MB(Ap). For each distinct tuple ti in πMB(Ap)(RS(Q)), create a selection query Q’i in the following way. For each attribute Ax in MB(Ap), create a selection predicate Ax = ti.vx. The selection predicates of Q’i consist of the conjunction of all these predicates.

(b) Select the Rewritten Queries. For each rewritten query Q’i, compute the estimated precision and estimated recall using the Bayesian network, as explained earlier. Then order all the Q’i by their F-measure scores and choose the top K to issue to the database.

(c) Order the Rewritten Queries. The top-K Q’i are issued to the database in decreasing order of expected precision.

(d) Retrieve the extended result set. Given the ordered top-K queries {Q’1, Q’2, ..., Q’K}, issue them to the database and retrieve their result sets. The union of the result sets RS(Q’1), RS(Q’2), ..., RS(Q’K) is the extended result set R̂S(Q).

for each attribute in the base result set. This is because the values that the other attributes take in the base result set are highly likely to also appear in relevant incomplete tuples. This tremendously reduces the search effort required in generating rewritten queries without affecting the recall too much. However, at higher levels of incompleteness this might have a notable impact on recall, in which case we would search over the values in the entire domain of each attribute. Continuing our example, when the query Q is sent to the database fragment shown in Table 5, tuples t1, t3, t4 and t5 are retrieved. The values over which the search is performed are {A8, tl, 745} for Model and {2002, 2003, 2005} for Year.

Some of the rewritten queries that can be generated by this process are

Q’1: σModel=A8∧Year=2005, Q’2: σModel=tl∧Year=2003 and Q’3: σModel=745∧Year=2002.
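This generation step can be sketched in a few lines. The rendering below is illustrative (tuple values are hard-coded from the Table 5 fragment; a real implementation would issue Q to the database and project its results):

```python
# Illustrative rendering of BN-All-MB step 2(a) over the Table 5 fragment:
# project the base result set for Q: Body=Sedan onto the Markov blanket
# {Model, Year} and emit one conjunctive query per distinct combination.
base_result_set = [            # certain answers t1, t3, t4, t5
    {"Model": "A8",  "Year": "2005"},
    {"Model": "tl",  "Year": "2003"},
    {"Model": "745", "Year": "2002"},
    {"Model": "745", "Year": "2002"},
]
markov_blanket = ["Model", "Year"]

def rewritten_queries(results, mb):
    """Distinct projections of the base result set onto the Markov blanket."""
    seen = []
    for t in results:
        q = tuple((a, t[a]) for a in mb)
        if q not in seen:
            seen.append(q)
    return seen

for q in rewritten_queries(base_result_set, markov_blanket):
    print(" AND ".join(f"{a}={v}" for a, v in q))
```

Note that t4 and t5 share the combination (745, 2002), so only three distinct rewritten queries result, matching Q’1 through Q’3 above.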

Each of these queries differs in the number of results it retrieves and in the fraction of retrieved results that are relevant. An important issue here is to decide which of these queries should be issued to the autonomous database, and in which order. If we are allowed to send as many queries as we want, ordering the queries by their expected precision would obviate the need for ranking the relevant possible results once they are retrieved. This is because the probability that the missing value in a tuple is exactly the value the user is looking for is the same as the expected precision of the query that retrieves the tuple. However, limits are often


imposed on the number of queries that can be issued to the autonomous database. These limits could be due to network or processing resources of the autonomous data sources. Given such limits, the precision of the answers needs to be carefully traded off against the selectivity (the number of results returned) of the queries. One way to address this challenge is to pick the top-K queries based on the F-measure metric (Manning et al. 2008), as pointed out by Wolf et al. F-measure is defined as the weighted harmonic mean of the precision (P) and recall (R) measures:

F = (1 + α) · P · R / (α · P + R)

For each rewritten query, the F-measure metric is evaluated in terms of its expected precision and expected recall; the latter can be computed from the expected selectivity. Expected precision is computed from the Bayesian network, and expected selectivity is computed the same way as in the QPIAD system, by issuing the query to the sample of the autonomous database. For our example, the expected precision of the rewritten query σModel=A8∧Year=2005 is P(Body=Sedan | Model=A8 ∧ Year=2005), which is evaluated by inference on the learned Bayesian network. Expected selectivity is computed as SmplSel(Q)·SmplRatio(R), where SmplSel(Q) is the sample selectivity of the query Q, i.e., the fraction of the tuples in the sample returned by the query, and SmplRatio(R) is the ratio of the original database size to the size of the sample. For example, if the original database has 1 million tuples and our sample database has 10,000 tuples, then SmplRatio(R) is 10^6/10^4 = 100. If the query returns 1,000 tuples on the sample, then SmplSel(Q) is 10^3/10^4 = 0.1. We send queries to the original database and its sample offline and use the cardinalities of the result sets to estimate the ratio.
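The ranking step can be sketched as follows. All numbers are hypothetical: the precisions stand in for Bayesian-network estimates of P(Body=Sedan | constraints), the hit counts stand in for the queries' selectivities on a 10,000-tuple sample, and recall is crudely approximated as each query's share of the estimated relevant results:

```python
# Sketch of ranking rewritten queries by F-measure. All numbers are
# hypothetical: precisions stand in for Bayesian-network estimates of
# P(Body=Sedan | constraints); hit counts stand in for sample selectivities;
# recall is approximated as each query's share of estimated relevant results.
def f_measure(p, r, alpha):
    return (1 + alpha) * p * r / (alpha * p + r) if (alpha * p + r) else 0.0

estimates = {  # query -> (expected precision, hits on the sample)
    "Model=A8 AND Year=2005":  (0.92, 300),
    "Model=tl AND Year=2003":  (0.85, 120),
    "Model=745 AND Year=2002": (0.70, 900),
}
total_hits = sum(h for _, h in estimates.values())

def score(q, alpha=0.3):
    p, hits = estimates[q]
    return f_measure(p, hits / total_hits, alpha)

top_k = sorted(estimates, key=score, reverse=True)[:2]
print(top_k)
```

With these made-up numbers the high-selectivity query outranks the highest-precision one, which is exactly the precision/selectivity trade-off the F-measure weighting is meant to control.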

We refer to this technique for generating rewritten queries by constraining all attributes in the Markov blanket as BN-All-MB. In Section 6.2.1 we compare the performance of the BN-All-MB and AFD approaches in retrieving uncertain relevant tuples. However, the issue with constraining all attributes in the Markov blanket is that its size could be arbitrarily large. Since the Markov blanket comprises the children, the parents, and the children's other parents, the number of attributes constrained in the rewritten queries can become very large. This reduces the throughput of the queries significantly. As we mentioned earlier, in cases where the autonomous database has a limit on the number of queries to which it will respond, we need to carefully trade off the precision of the rewritten queries against their throughput. The BN-All-MB and AFD approaches decide upfront which attributes to constrain and search only over the values to which those attributes will be constrained. Both techniques try to address this issue by using the F-measure metric to pick the top-K queries for issuing to the database, all of which have the same number of attributes constrained. A more effective way to trade off precision against the throughput of the rewritten queries is to make an "online" decision on the number of attributes to be constrained. We propose a technique, BN-Beam, which searches over the Markov blanket of the original query-constrained attribute and picks the best subset of the attributes with high precision and throughput.


6.1.1 Generating rewritten queries using BN-Beam

We now describe BN-Beam, our technique for generating rewritten queries, which finds a subset of the attributes in the Markov blanket of the query-constrained attribute with high precision and throughput. To illustrate how rewritten queries are generated using BN-Beam, consider the same query Q: σBody=Sedan. First, Q is sent to the database to retrieve the base result set. We consider the attributes in the Markov blanket of the query-constrained attribute to be the potential attributes on which the new queries will be formulated. We call this set the candidate attribute set.

For query Q, the candidate attribute set consists of the attributes in the Markov blanket of attribute Body, namely {Model, Year} for the Bayesian network in Fig. 1a. Once the candidate attribute set is determined, a beam search with beam width K and level L is performed over the distinct value combinations, in the base result set, of the attributes in the candidate attribute set. For example, when the query Q is sent to the database fragment shown in Table 5, tuples t1, t3, t4 and t5 are retrieved. The values over which the search is performed are {A8, tl, 745} for Model and {2002, 2003, 2005} for Year. Starting from an empty rewritten query, the beam search proceeds over multiple levels, looking to expand each partial query from the previous level by adding an attribute-value pair to it. For example, at the first level of the search five partial rewritten queries, σModel=745, σModel=A8, σModel=tl, σYear=2002 and σYear=2003, may be generated. An important issue here is to decide which of these queries should be carried over to the next level of the search. Since there is a limit on the number of queries that can be issued to the autonomous database, and we want to generate rewritten queries with high precision and throughput while keeping query-processing costs low, we pick the top-K queries based on the F-measure metric, as described earlier. The advantage of searching over both attributes and values when generating rewritten queries is that it gives much more control over the throughput of the rewritten queries, since we can decide how many attributes will be constrained.

The top-K queries at each level are carried over to the next level for further expansion. For example, consider the query σModel=745, which was generated at level one. At level two, we try to create a conjunctive query of size two by constraining the other attributes in the candidate attribute set. Say we try to add the attribute Year: we search over the distinct values of Year in the base result set with attribute Model taking the value 745. At each level i, we keep the top-K queries with the highest F-measure values having i or fewer attributes constrained. The top-K queries generated at level L are sorted by expected precision and sent to the autonomous database in that order to retrieve the relevant possible answers. We now describe the BN-Beam algorithm for generating rewritten queries for single-attribute queries.

In step 2(d), it is important to remove duplicates from R̂S(Q). Since the rewritten queries may constrain different attributes, the same tuple might be retrieved by different rewritten queries. For example, consider two rewritten queries, Q’1: σModel=A8 and Q’2: σYear=2005, that can be generated at level one for the same user query Q, which aims to retrieve all Sedan cars. All A8 cars manufactured in 2005 will be returned in the answer sets of both queries. Therefore, we need to remove all duplicate tuples.


Algorithm 2 Algorithm for BN-Beam
Let R(A1, A2, ..., An) be a database relation. Suppose MB(Ap) is the set of attributes in the Markov blanket of attribute Ap. All the steps in processing a query Q: σAp=vp are the same as described for BN-All-MB except steps 2(a) and 2(d).

2(a) Generate Rewritten Queries. A beam search is performed over the attributes in MB(Ap), with the values for each attribute limited to its distinct values in RS(Q). Starting from an empty rewritten query, a partial rewritten query (PRQ) is expanded at each level by adding an attribute-value pair drawn from the attributes present in MB(Ap) but not already in the PRQ. The queries with the top-K F-measure scores, computed from the estimated precision and the estimated recall (the latter estimated from the sample), are carried over to the next level of the search. The search is repeated over L levels.

2(d) Post-filtering. Remove the duplicates in R̂S(Q).
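The level-wise search of step 2(a) can be sketched as follows. This is a hedged sketch, not the paper's implementation: the scorer is a stand-in for the F-measure estimate, queries are represented as sets of attribute-value pairs, and shorter queries stay in contention at every level (matching the "i or fewer attributes" behavior described above):

```python
# Sketch of BN-Beam's level-wise search (step 2(a)). The scorer is a
# stand-in for the F-measure estimate; queries are sets of attribute-value
# pairs, and shorter queries stay in contention at every level.
def bn_beam(candidate_values, score, K=2, L=2):
    beam = [frozenset()]                    # level 0: the empty query
    for _ in range(L):
        expansions = set(beam)              # keep i-or-fewer-attribute queries
        for partial in beam:
            used = {a for a, _ in partial}
            for attr, values in candidate_values.items():
                if attr in used:
                    continue
                for v in values:
                    expansions.add(partial | {(attr, v)})
        beam = sorted(expansions, key=score, reverse=True)[:K]
    return [dict(q) for q in beam]

# Hypothetical scores; every unlisted query defaults to a low score.
scores = {
    frozenset({("Model", "745")}): 0.5,
    frozenset({("Model", "745"), ("Year", "2002")}): 0.8,
}
candidates = {"Model": ["A8", "tl", "745"], "Year": ["2002", "2003", "2005"]}
best = bn_beam(candidates, lambda q: scores.get(q, 0.1), K=2, L=2)
print(best)
```

With these assumed scores, the two-attribute query σModel=745∧Year=2002 wins, and the one-attribute query σModel=745 remains the runner-up, illustrating how the search decides online how many attributes to constrain.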

6.1.2 Handling multi-attribute queries

Retrieving relevant uncertain answers for multi-attribute queries has only been superficially addressed in the QPIAD system: it attempts to retrieve only uncertain answers with missing values on any one of the multiple query-constrained attributes. Here we describe how BN-All-MB and BN-Beam can be extended to retrieve tuples with missing values on multiple query-constrained attributes.

BN-All-MB: The method described for handling single-attribute queries using BN-All-MB can easily be generalized to multi-attribute queries. The rewritten queries generated will constrain every attribute in the union of the Markov blankets of the constrained attributes.

BN-Beam: Similarly, using BN-Beam to handle multi-attribute queries is a simple extension of the method described for single-attribute queries. The candidate attribute set consists of the union of the attributes in the Markov blankets of the query-constrained attributes.

To illustrate how new queries are reformulated to retrieve possibly relevant answers with multiple missing values on query-constrained attributes, consider an example query σMake=BMW∧Mileage=40000 sent to the database fragment in Table 5. First, this query retrieves the base result set, which consists of tuples t4, t9 and t10. The set of candidate attributes on which the new queries will be formulated is obtained as the union of the attributes in the Markov blankets of the query-constrained attributes. For the learned Bayesian network shown in Fig. 1a, this set consists of {Model, Year}. Once the candidate attribute set is determined, a beam search with beam width K is performed, similar to the case when a single attribute is constrained. At the first level of the search, some of the partial rewritten queries that can be generated are σModel=745, σModel=645 and σYear=1999. The top-K queries with the highest F-measure values are carried over to the next level of the search. The top-K queries generated at level L are sent to the autonomous database to retrieve the relevant possible answers.


6.2 Empirical evaluation of query rewriting

The aim of the experiments reported in this section is to compare the precision and recall of the relevant uncertain results returned by rewritten queries generated by AFDs and Bayesian networks for single- and multi-attribute queries. We use the datasets described in Section 4 and partition them into test and training sets; 15 % of the tuples are used as the training set. The training set is used for learning the topology and parameters of the Bayesian network and the AFDs. It is also used for estimating the expected selectivity of the rewritten queries. We use the Expectation Propagation inference algorithm (Minka 2001) (with 10 samples) available in the Infer.NET software package (Minka et al. 2010) for carrying out inference on the Bayesian network.

In order to evaluate the relevance of the answers returned, we create a copy of the test dataset which serves as the ground-truth dataset. We further partition the test data into two halves. One half is used for returning the certain answers, and in the other half all the values of the constrained attribute(s) are set to null. Note that this is an aggressive setup for evaluating our system, since typical databases may have less than 50 % incompleteness, and the incompleteness may not be on the query-constrained attribute(s). The tuples retrieved by the rewritten queries from the test dataset are compared with the ground-truth dataset to compute precision and recall. Since the certain result set is the same for all techniques, we consider only uncertain answers while computing precision and recall.
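This setup can be sketched as follows. The helper names and toy data are ours, and for brevity the sketch scores all retrieved tuples, whereas the paper scores only the uncertain answers:

```python
import random

# Sketch of the evaluation setup described above (helper names are ours):
# copy the test set as ground truth, null out the constrained attribute in
# a random half of the test copy, and score retrieved tuples against the
# ground truth.
def make_test_db(tuples, constrained_attr, seed=0):
    ground_truth = [dict(t) for t in tuples]
    test_db = [dict(t) for t in tuples]
    rng = random.Random(seed)
    for i in rng.sample(range(len(test_db)), len(test_db) // 2):
        test_db[i][constrained_attr] = None      # the incomplete half
    return ground_truth, test_db

def precision_recall(retrieved_ids, ground_truth, attr, value):
    relevant = {i for i, t in enumerate(ground_truth) if t[attr] == value}
    hits = len(retrieved_ids & relevant)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

cars = [{"Body": "Sedan"}, {"Body": "Convt"},
        {"Body": "Sedan"}, {"Body": "Coupe"}]
truth, test_db = make_test_db(cars, "Body")
print(precision_recall({0, 1, 2}, truth, "Body", "Sedan"))
```

Here retrieving tuples {0, 1, 2} against two relevant Sedans yields precision 2/3 and recall 1.0, independent of which half was nulled out.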

6.2.1 Comparison of rewritten queries generated by AFDs and BN-All-MB

Figure 5 shows the precision-recall curve for queries on attribute Make in the car database. The size of the Markov blanket and of the determining set is one for attribute Make. We note that there is no difference in the quality of the results returned by AFDs and BN-All-MB in this case (see Fig. 5). Next, we compare the quality of the results returned by AFDs and BN-All-MB when the size of the Markov blanket and of the determining set of the AFD of the constrained attribute is greater than one. Figure 5 shows the precision and recall curves for the queries issued to the car and adult databases. For the query on the adult database, we found that the order in which the rewritten queries were ranked was exactly the same, so the precision-recall curves of the two approaches lie one on top of the other. For the queries issued to the car database, we find that there are differences in the order in which the rewritten queries are issued to the database. However, there is no clear winner: the curves lie very close to each other, alternating as the number of results returned increases. Therefore the performance of AFDs and BN-All-MB is comparable for single-attribute queries.

6.2.2 Comparison of rewritten queries generated by BN-All-MB and BN-Beam

Figure 6a shows the increase in recall of the results for three different values of α in the F-measure metric when ten queries can be issued to the database. We refer to the results for different values of α for BN-Beam as BN-Beam-α (substituting α with its value) and show a single curve for BN-All-MB, since the curves with different



Fig. 5 Precision-recall curve for different queries

α values were coincident. This is because there are no rewritten queries with high throughput, so just increasing the α value does not increase recall. For BN-Beam, the level of search, L, is set to two. We see that the recall increases as the value of α increases. Figure 6b shows the change in precision for different values of α as the number of queries sent to the database increases. As expected, the precision of BN-Beam-0.30 is higher than that of BN-Beam-0.35 and BN-Beam-0.40. In particular, we point out that the precision of BN-Beam-0.30 remains competitive with BN-All-MB in all cases while providing significantly higher recall. BN-Beam


Fig. 6 Change in recall and precision for different values of α in the F-measure metric for the top-10 rewritten queries for σYear=2002. The curves for all values of α for BN-All-MB are the same, so they are represented as a single curve


is able to retrieve relevant incomplete answers with high recall without any large decrease in precision: the average difference is only 0.05.

6.2.3 Comparison of multi-attribute queries

We now compare the Bayesian network and AFD approaches for retrieving relevant uncertain answers with multiple missing values when multi-attribute queries are issued by the user. We note that the current QPIAD system retrieves only uncertain answers with at most one missing value on query-constrained attributes. We compare BN-Beam with two baseline AFD approaches.

1. AFD-All-Attributes: This approach creates a conjunctive query by combining the best rewritten queries for each of the constrained attributes. The best rewritten queries for each attribute constrained in the original query are computed independently, and new rewritten queries are generated by combining one rewritten query for each of the constrained attributes. The new queries are sent to the autonomous database in decreasing order of the product of the expected precisions of the individual rewritten queries that were combined to form them. The AFD-All-Attributes technique is only applicable to multi-attribute queries where the determining sets of the constrained attributes are disjoint.

2. AFD-Highest-Confidence: This approach uses only the AFD of the query-constrained attribute with the highest confidence for generating rewritten queries, ignoring the other attributes.

We evaluate these methods on selection queries with two constrained attributes. For BN-Beam, the level of search is set to 2 and the value of α in the F-measure metric is set to zero.
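The AFD-All-Attributes combination step can be sketched as follows (the per-attribute rewritten queries and their expected precisions below are hypothetical):

```python
from itertools import product

# Sketch of the AFD-All-Attributes baseline (queries and precisions are
# hypothetical): per-attribute rewritten queries are combined into
# conjunctions, ranked by the product of their expected precisions.
per_attr = {
    "Make":    [("Model=745", 0.9), ("Model=645", 0.6)],
    "Mileage": [("Year=2002", 0.8), ("Year=1999", 0.5)],
}

combined = []
for combo in product(*per_attr.values()):
    query = " AND ".join(q for q, _ in combo)
    score = 1.0
    for _, prec in combo:
        score *= prec                 # product of individual precisions
    combined.append((query, score))
combined.sort(key=lambda qs: qs[1], reverse=True)
print(combined[0][0])  # best conjunction: Model=745 AND Year=2002
```

Because each factor is computed independently, a top-ranked conjunction can still have very low (even zero) selectivity on the actual database, which is the failure mode discussed in Section 6.2.4.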


Fig. 7 Precision-recall curve for the results returned by top-10 rewritten queries for various queries


6.2.4 Comparison of AFD-All-Attributes and BN-Beam

Figure 7a shows the precision-recall curve for the results returned by the top ten rewritten queries of AFD-All-Attributes and BN-Beam for the query σMake=bmw∧Mileage=15000 issued to the car database. Figure 7b shows a similar curve for the query σEducation=HS-grad∧Relationship=Husband issued to the adult database. We note that the recall of the results returned by AFD-All-Attributes is significantly lower than that of BN-Beam in both cases (see Fig. 8). This is because the new queries generated by conjoining the rewritten queries of each constrained attribute do not capture the joint distribution of the multi-attribute query. Therefore, the throughput of these queries is often very low; in the extreme case they even generate empty queries. The precision of the results returned by AFD-All-Attributes is only slightly higher than that of BN-Beam (see Fig. 7). By retrieving answers with slightly lower precision and much higher recall than AFD-All-Attributes, the BN-Beam technique becomes very effective in scenarios where the autonomous database limits the number of queries that it will respond to.

6.2.5 Comparison of AFD-highest-confidence and BN-beam

Figure 7 shows the precision-recall curves for the results returned by the top ten queries for multi-attribute queries issued to the car and adult databases. Figure 8 shows the change in recall with each of the top ten rewritten queries issued to the autonomous database. We note that the recall of the results returned by AFD-Highest-Confidence is much higher than that of BN-Beam. However, this increase in


Fig. 8 Change in recall as the number of queries sent to the autonomous database increases for various queries


recall is accompanied by a drastic fall in precision. This is because the AFD-Highest-Confidence approach is oblivious to the values of the other constrained attributes. Thus, this approach too is not very effective for retrieving relevant possible answers with multiple missing values for multi-attribute queries.
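The trade-off underlying both comparisons can be quantified with precision, recall, and an α-weighted F-measure. The sketch below is illustrative only; the F-measure form F_α = (1 + α)·P·R / (α·P + R) is an assumption consistent with the α = 0 setting used for BN-Beam in these experiments, under which the metric reduces to pure precision:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved answer set against
    the set of truly relevant answers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_alpha(precision, recall, alpha=0.0):
    """Alpha-weighted F-measure: (1 + alpha) * P * R / (alpha * P + R).
    With alpha = 0 this reduces to precision alone; larger alpha
    gives recall more weight."""
    denom = alpha * precision + recall
    return (1 + alpha) * precision * recall / denom if denom else 0.0

# Toy example: 4 answers retrieved, 2 of them among the 3 relevant ones
p, r = precision_recall({"t1", "t2", "t3", "t4"}, {"t2", "t3", "t5"})
```

With α = 0, a rewriting scheme is rewarded only for the precision of what it retrieves, which is why the recall differences in Fig. 8 must be read alongside the precision curves in Fig. 7.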

7 Conclusion

We presented a comparison of the cost and accuracy trade-offs of using Bayesian network models and Approximate Functional Dependencies (AFDs) for handling incompleteness in autonomous databases. We showed how a generative model of an autonomous database can be learned and used by query processors while keeping costs manageable.

We compared Bayesian networks and AFDs for imputing single and multiple missing values. We showed that Bayesian networks have a significant edge over AFDs in dealing with missing values on multiple correlated attributes and at high levels of incompleteness in the test data.

Further, we presented a technique, BN-All-MB, for generating rewritten queries using Bayesian networks. We then proposed a technique, BN-Beam, to generate rewritten queries that retrieve relevant uncertain results with high precision and throughput, which becomes very important when there are limits on the number of queries that autonomous databases respond to. We showed that BN-Beam trumps AFD-based approaches for handling multi-attribute queries. BN-Beam contributes to the QPIAD system by retrieving relevant uncertain answers, with multiple missing values on query-constrained attributes, at high precision and recall.

References

Batista, G., & Monard, M. (2002). A study of k-nearest neighbour as an imputation method. In Soft computing systems: design, management and applications (pp. 251–260). Santiago, Chile.

Bishop, C., et al. (2006). Pattern recognition and machine learning (Vol. 4). New York: Springer.

Cars.com (2013). http://www.cars.com. Accessed 1 Feb 2013.

CBioC (2013). http://cbioc.eas.asu.edu/. Accessed 1 Feb 2013.

Cooper, G. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2), 393–405.

Dempster, A., Laird, N., Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1–38.

Fernández, A., Rumí, R., Salmerón, A. (2012). Answering queries in hybrid Bayesian networks using importance sampling. Decision Support Systems, 53(3), 580–590.

Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Feb 2013.

Geiger, D., Verma, T., Pearl, J. (1990). Identifying independence in Bayesian networks. Networks, 20(5), 507–534.

Gupta, R., & Sarawagi, S. (2006). Creating probabilistic databases from information extraction models. In VLDB (pp. 965–976).

Hartemink, A., et al. (2005). Banjo: Bayesian network inference with Java objects. http://www.cs.duke.edu/~amink/software/banjo/. Accessed 1 Feb 2013.

Heckerman, D. (1992). The certainty-factor model (2nd edn). Encyclopedia of Artificial Intelligence.

Heckerman, D., Geiger, D., Chickering, D.M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.

Heitjan, D.F., & Basu, S. (1996). Distinguishing missing at random and missing completely at random. The American Statistician, 50(3), 207–213.


Jensen, F., Olesen, K., Andersen, S. (2006). An algebra of Bayesian belief universes for knowledge-based systems. Networks, 20(5), 637–659.

Jensen, F.V., & Nielsen, T.D. (2007). Bayesian networks and decision graphs. Springer.

Khatri, H. (2006). Query processing over incomplete autonomous web databases. Master's thesis, Arizona State University, Tempe, USA.

Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

Minka, T., Winn, J., Guiver, J., Knowles, D. (2010). Infer.NET 2.4. Microsoft Research Cambridge. http://research.microsoft.com/infernet. Accessed 1 Feb 2013.

Minka, T.P. (2001). Expectation propagation for approximate Bayesian inference. In UAI (pp. 362–369).

Murphy, K., et al. (2001). The Bayes net toolbox for Matlab. Computing Science and Statistics, 33(2), 1024–1034.

Muslea, I., & Lee, T. (2005). Online query relaxation via Bayesian causal structures discovery. In Proceedings of the national conference on artificial intelligence (Vol. 20, p. 831). Menlo Park, CA/Cambridge, MA, London: AAAI Press/MIT Press.

Nambiar, U., & Kambhampati, S. (2006). Answering imprecise queries over autonomous web databases. In Proceedings of the 22nd International Conference on Data Engineering (ICDE) (pp. 45–45). IEEE.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann.

Ramoni, M., & Sebastiani, P. (1997). Learning Bayesian networks from incomplete databases. In Proceedings of the 13th conference on uncertainty in artificial intelligence (pp. 401–408). Morgan Kaufmann Publishers Inc.

Ramoni, M., & Sebastiani, P. (2001). Robust learning with missing data. Machine Learning, 45(2), 147–170.

Romero, V., & Salmerón, A. (2004). Multivariate imputation of qualitative missing data using Bayesian networks. In Soft methodology and random information systems (pp. 605–612). Springer.

Russell, S.J., & Norvig, P. (2010). Artificial intelligence: a modern approach. Pearson Education.

Shortliffe, E. (1976). Computer-based medical consultations: MYCIN (Vol. 388). New York: Elsevier.

Wolf, G., Kalavagattu, A., Khatri, H., Balakrishnan, R., Chokshi, B., Fan, J., Chen, Y., Kambhampati, S. (2009). Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. Very Large Data Bases Journal, 18(5), 1167–1190.

Wolf, G., Khatri, H., Chokshi, B., Fan, J., Chen, Y., Kambhampati, S. (2007). Query processing over incomplete autonomous databases. In Proceedings of the 33rd international conference on very large data bases (pp. 651–662). VLDB Endowment.

Wu, C., Wun, C., Chou, H. (2004). Using association rules for completing missing data. In Hybrid Intelligent Systems (HIS) (pp. 236–241). IEEE.