Using Taxonomies to Perform Aggregated Querying over Imprecise Data

Atanu Roy, Chandrima Sarkar, Rafal A. Angryk
Presented by: Rafal A. Angryk
Date: 2010-12-14

Roy, Sarkar, Angryk. Using Taxonomies to Perform Aggregated Querying over Imprecise Data

Outline of the Presentation

Idea
Imprecision
Motivation
Limitations of Previous Work
Definitions
Approach
Experimental Setup & Results
Conclusion and Future Work

Idea of the Project

This paper provides a framework for answering queries over imprecise data found in common databases.

We propose to solve this by classifying the data into taxonomical hierarchies and then capturing it in a weighted hierarchical hypergraph.

Imprecision in Databases: An Example

ID  Germination Time  Stem Cankers
R1  August            Above-sec-node
R2  September         Absent
R3  Fall              Above-sec-node
R4  July              Absent

The imprecise value Fall in R3 can be resolved to August:

ID  Germination Time  Stem Cankers
R1  August            Above-sec-node
R2  September         Absent
R3  August            Above-sec-node
R4  July              Absent

or to September:

ID  Germination Time  Stem Cankers
R1  August            Above-sec-node
R2  September         Absent
R3  September         Above-sec-node
R4  July              Absent

Constraint: All soybean seeds with the same kind of stem canker should germinate in the same month of the season.

Resulting table:

ID  Germination Time  Stem Cankers
R1  August            Above-sec-node
R2  September         Absent
R3  September         Above-sec-node
R4  July              Absent

Motivation

Several recent papers have focused on retrieval of imprecise data, where every fact can be a region, instead of a point, in a multi-dimensional space.

The most prominent one is [BDRV07].

They solved it by constructing marginal databases (MDBs) from extended databases (EDBs) with the help of a constraint hypergraph.

Limitations of Previous Work

Creating marginal databases using a weighted hierarchical hypergraph employs a brute-force method for retrieving connected facts (tuples).

This increases the overall time complexity and processing time of the queries.

[BDRV07] follows a data-specific technique, whereas we propose to rely on domain-specific knowledge.

Definitions

Background knowledge: Knowledge required to generate taxonomies.
Expert knowledge: Domain-specific human expertise.
Data-derived knowledge: Derived from a historic precise database and used to generate mutually exclusive probabilities.

Possible worlds: All the possible combinations that an imprecise record can assume.

Valid worlds: All the possible worlds that satisfy a given set of constraints.
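The two definitions above can be illustrated with a minimal sketch. It assumes a toy taxonomy in which Fall covers August and September (as in the soybean example), and the constraint shown is a hypothetical single-record predicate chosen only for illustration; the names TAXONOMY, possible_worlds and valid_worlds are not from the paper.

```python
from itertools import product

# Hypothetical taxonomy fragment: an imprecise value maps to the precise
# values it may stand for (assumed from the Fall -> August/September example).
TAXONOMY = {"Fall": ["August", "September"]}


def possible_worlds(record):
    """Enumerate every precise completion of a possibly imprecise record."""
    attrs = list(record)
    # A precise value has exactly one choice: itself.
    choices = [TAXONOMY.get(record[a], [record[a]]) for a in attrs]
    return [dict(zip(attrs, combo)) for combo in product(*choices)]


def valid_worlds(worlds, constraints):
    """Keep only the worlds that satisfy every constraint predicate."""
    return [w for w in worlds if all(check(w) for check in constraints)]


def above_sec_node_in_september(world):
    # Illustrative constraint only, not the constraint used in the paper.
    return (world["stem_canker"] != "Above-sec-node"
            or world["germination_time"] == "September")


r3 = {"germination_time": "Fall", "stem_canker": "Above-sec-node"}
worlds = possible_worlds(r3)
valid = valid_worlds(worlds, [above_sec_node_in_september])
print(worlds)
print(valid)
```

Representing constraints as plain predicates keeps the sketch small; constraints that relate several records to each other would need the whole table as input rather than a single world.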

Assignment of Probabilities

EDB Creation

The probability of a possible world is the product of the unconditional probabilities of occurrence of all its imprecise attribute values.

The sum of the probabilities of all possible worlds of an imprecise record is 1.
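A minimal sketch of the product rule, assuming the unconditional probabilities behind record R3 in the EDB example of this deck (0.8/0.2 for germination time, 0.6/0.4 for stem canker). The function name edb_tuples and the dictionary layout are illustrative assumptions, not the authors' implementation.

```python
from itertools import product
from math import prod

# Assumed unconditional probabilities for the two imprecise attributes of R3.
unconditional = {
    "germination_time": {"August": 0.8, "September": 0.2},
    "stem_canker": {"Above-sec-node": 0.6, "Absent": 0.4},
}


def edb_tuples(dists):
    """Return (values, probability) for every possible world of one record."""
    attrs = list(dists)
    rows = []
    for combo in product(*(dists[a].items() for a in attrs)):
        values = {a: value for a, (value, _) in zip(attrs, combo)}
        # Probability of the world = product of the per-attribute probabilities.
        rows.append((values, prod(p for _, p in combo)))
    return rows


rows = edb_tuples(unconditional)
total = sum(p for _, p in rows)
for values, p in rows:
    print(values, round(p, 2))
print("sum of probabilities:", round(total, 2))
```

Because each attribute's distribution sums to 1, the products over all combinations also sum to 1, which is the second property stated above.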

Probability assignment rule creates a set of tuples using

Hyperedge Creation

MDB Creation

A weighted hierarchical hypergraph is defined as H(L, E), where L represents the nodes and E is the set of hyperedges between different taxonomies. Each hyperedge signifies a distinct combination of attribute values. The weight of a possible world assigned to a hyperedge [AC10] needs to preserve a few properties. All t-norms [AC10] (e.g. minimum, product) fulfill these requirements. We choose the product for the purposes of our preliminary investigation.
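To make the choice of t-norm concrete, here is a small sketch of how per-attribute weights might be combined into a single hyperedge weight. The weights 0.8 and 0.6 and the function names are hypothetical; the slides do not specify how the authors store hyperedges or weights.

```python
from functools import reduce


def t_norm_product(a, b):
    """Product t-norm, the one chosen in the presentation."""
    return a * b


def t_norm_minimum(a, b):
    """Minimum t-norm, the other example mentioned."""
    return min(a, b)


def hyperedge_weight(weights, t_norm):
    """Fold a t-norm over the weights; 1.0 is the identity of every t-norm."""
    return reduce(t_norm, weights, 1.0)


# Hypothetical hyperedge joining one node from each of two taxonomies.
hyperedge = ("August", "Above-sec-node")
weights = [0.8, 0.6]

w_product = hyperedge_weight(weights, t_norm_product)
w_minimum = hyperedge_weight(weights, t_norm_minimum)
print(hyperedge, round(w_product, 2), w_minimum)
```

Passing the t-norm as a parameter makes it easy to compare the product against the minimum without touching the rest of the code.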

EDB to MDB

ID  TempID  Germination Time  Stem Canker     Prob EDB
R1  T1      August            Above-sec-node  0.80
R1  T2      September         Above-sec-node  0.20
R2  T3      August            Above-sec-node  0.60
R2  T4      August            Absent          0.40
R3  T5      August            Above-sec-node  0.48
R3  T6      August            Absent          0.32
R3  T7      September         Above-sec-node  0.12
R3  T8      September         Absent          0.08
R4  T9      September         Absent          1.00

ID  TempID  Germination Time  Stem Canker     Prob MDB
R1  T1      August            Above-sec-node  0.9057
R1  T2      September         Above-sec-node  0.0943
R2  T3      August            Above-sec-node  0.6429
R2  T4      August            Absent          0.3571
R3  T5      August            Above-sec-node  0.4983
R3  T6      August            Absent          0.2768
R3  T7      September         Above-sec-node  0.0519
R3  T8      September         Absent          0.1730
R4  T9      September         Absent          1.0000

Aggregated Querying

We aggregate tuples for aggregated querying based on their uniqueness. Two tuples are grouped only when all their attribute values and the corresponding probabilities are the same.
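The grouping rule can be sketched as follows, together with the expected count it leads to. The probabilities and group sizes (44 plants at 0.9057, 25 plants at 0.6429) are the ones used in the deck's example query; splitting the 44 plants into two input rows is a hypothetical detail added only to show the grouping step.

```python
from collections import defaultdict

# (germination_time, stem_canker, marginal_probability, number_of_plants)
# The 20 + 24 split is an assumption; only the totals come from the slides.
mdb_rows = [
    ("August", "Above-sec-node", 0.9057, 20),
    ("August", "Above-sec-node", 0.9057, 24),
    ("August", "Above-sec-node", 0.6429, 25),
]

# Group rows only when attribute values AND probability are identical.
groups = defaultdict(int)
for month, canker, prob, plants in mdb_rows:
    groups[(month, canker, prob)] += plants

# Expected number of plants matching the query: sum of count * probability.
expected = sum(
    plants * prob
    for (month, canker, prob), plants in groups.items()
    if month == "August" and canker == "Above-sec-node"
)
print(dict(groups))
print(round(expected))
```

Keeping the probability in the grouping key means tuples with equal attribute values but different marginal probabilities stay in separate groups, as the rule above requires.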

Find the total no. of plants grown in August which have a Stem Canker above-sec-node:

(44 * 0.9057) + (25 * 0.6429) ≈ 56

GID  Germination Time  Stem Canker     Marginal probability  No. of plants
G1   August            Above-sec-node  0.9057                44
G1   August            Above-sec-node  0.0943                44

Experimental Setup

Census-Income dataset from the UCI Machine Learning Repository.
Finally used 7 dimensions.
Precise database has 191239 records.
Test dataset has 99762 records.
Randomly inserted imprecision into the test dataset to make it imprecise.

Distribution of Imprecision

Attributes      age         mic         edu         cbf         cbm         cbs         wwy
Level 1 (Root)  150 (1)     107 (1)     136 (1)     100 (1)     100 (1)     100 (1)     375 (1)
Level 2         450 (3)     1393 (16)   409 (3)     144 (2)     144 (2)     144 (2)     1125 (3)
Level 3         900 (6)     8500 (24)   955 (7)     289 (4)     289 (4)     289 (4)     8500 (9)
Level 4         8500 (12)   -           8500 (17)   867 (12)    867 (12)    867 (12)    -
Level 5         -           -           -           8500 (42)   8500 (42)   8500 (42)   -
Total           10000 (22)  10000 (41)  10000 (28)  10000 (61)  10000 (61)  10000 (61)  10000 (13)

Imprecision Characteristics

Scalability Test

Extended Database Analysis

Influence of Imprecision

Absolute Percentage Error

Conclusion and Future Work

In this research we present a framework for efficient querying over imprecise data with an average accuracy of 94%.
We intend to extend this research to include ontologies in place of taxonomies.
We also intend to use Associative Weight Mining to assign weights to hyperedges.

Questions?

References

[BDRV07]: Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50

[AC10]: Rafal A. Angryk, Jacek Czerniak: Heuristic Algorithm for Interpretation of Multi-Valued Attributes in Similarity-based Fuzzy Relational Databases. International Journal of Approximate Reasoning 51: 895-911 (2010)
