Using Taxonomies to Perform Aggregated Querying over Imprecise Data

Atanu Roy, Chandrima Sarkar, Rafal A. Angryk

Presented by: Rafal A. Angryk
Date: 2010-12-14


Roy, Sarkar, Angryk. Using Taxonomies to Perform Aggregated Querying over Imprecise Data

Outline of the Presentation

Idea
Imprecision
Motivation
Limitations of Previous Work
Definitions
Approach
Experimental Setup & Results
Conclusion and Future Work

Idea of the Project

This paper provides a framework for answering queries over imprecise data found in common databases.

We propose to solve this by classifying the data into taxonomical hierarchies and then capturing it in a weighted hierarchical hypergraph.

Imprecision in Databases: An Example

ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   Fall               Above-sec-node
R4   July               Absent

Taxonomy of Germination Time:

Summer: June, July
Fall: August, September

Possible world with R3 = August:

ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   August             Above-sec-node
R4   July               Absent

Possible world with R3 = September:

ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   September          Above-sec-node
R4   July               Absent

Constraint: All soybean seeds with the same kind of stem canker should germinate in the same month of the season.

Motivation

Several recent papers have focused on retrieval of imprecise data, where every fact can be a region, instead of a point, in a multi-dimensional space.

The most prominent one is [BDRV07].

They solve the problem by constructing marginal databases (MDBs) from extended databases (EDBs) with the help of a constraint hypergraph.

Limitations of Previous Work

Creating marginal databases using a weighted hierarchical hypergraph employs a brute-force method for retrieving connected facts (tuples).

This increases the overall time complexity and processing time of the queries.

[BDRV07] follows a data-specific technique, whereas we propose to use domain-specific knowledge.

Definitions

Background knowledge: Knowledge required to generate taxonomies.

Expert knowledge: Domain-specific human expertise.

Data-derived knowledge: Derived from a historic precise database and used to generate mutually exclusive probabilities.

Possible worlds: All the possible combinations that an imprecise record can assume.

Valid worlds: All the possible worlds that satisfy a given set of constraints.
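The difference between possible and valid worlds can be sketched on the soybean example from the earlier slides. This is only an illustration, not the authors' implementation: the names `SEASONS`, `possible_worlds`, and `is_valid` are made up here, and the constraint is encoded under one assumed reading, namely that seeds with the same stem canker germinate in the same position (first or second month) of their season.

```python
from itertools import product

# Taxonomy from the Germination Time example: each season lists its months in order.
SEASONS = {"Summer": ["June", "July"], "Fall": ["August", "September"]}

# The four example records; only R3 is imprecise ("Fall").
RECORDS = {
    "R1": ("August", "Above-sec-node"),
    "R2": ("September", "Absent"),
    "R3": ("Fall", "Above-sec-node"),
    "R4": ("July", "Absent"),
}


def candidates(value):
    """A season expands to its months; a precise month stands for itself."""
    return SEASONS.get(value, [value])


def possible_worlds(records):
    """Enumerate every way of replacing imprecise values with precise ones."""
    ids = list(records)
    options = [
        [(month, canker) for month in candidates(time)]
        for time, canker in (records[i] for i in ids)
    ]
    return [dict(zip(ids, combo)) for combo in product(*options)]


def month_position(month):
    """Index of a month inside its season (0 = first month of the season)."""
    for months in SEASONS.values():
        if month in months:
            return months.index(month)
    raise ValueError(f"unknown month: {month}")


def is_valid(world):
    """Assumed reading of the constraint: same canker implies same month position."""
    seen = {}
    for month, canker in world.values():
        pos = month_position(month)
        if seen.setdefault(canker, pos) != pos:
            return False
    return True


worlds = possible_worlds(RECORDS)
valid = [w for w in worlds if is_valid(w)]
print(len(worlds), "possible worlds,", len(valid), "valid")
for w in valid:
    print("R3 ->", w["R3"])
```

Enumerating whole-database worlds like this grows exponentially with the number of imprecise values, so the sketch is only meant to make the two definitions concrete.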

Assignment of Probabilities

EDB Creation

The probability of a possible world is the product of the unconditional occurrence probabilities of all its imprecise attributes.

The probabilities of all possible worlds of an imprecise record sum to 1.

Probability assignment rule creates a set of tuples using
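The product rule can be illustrated with a short sketch. The per-attribute probabilities below are assumptions, chosen because their products reproduce the four ProbEDB values listed for record R3 in the EDB table of this deck (0.48, 0.32, 0.12, 0.08); the function name `edb_tuples` is hypothetical.

```python
from itertools import product

# Assumed unconditional probabilities for the two imprecise attributes of one record.
P_TIME = {"August": 0.80, "September": 0.20}
P_CANKER = {"Above-sec-node": 0.60, "Absent": 0.40}


def edb_tuples(*attribute_dists):
    """Build one tuple per possible world; its probability is the product
    of the unconditional probabilities of the chosen attribute values."""
    rows = []
    for combo in product(*(dist.items() for dist in attribute_dists)):
        values = tuple(value for value, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        rows.append((values, prob))
    return rows


rows = edb_tuples(P_TIME, P_CANKER)
for values, prob in rows:
    print(values, round(prob, 2))

# The possible worlds of a single imprecise record should sum to 1.
total = sum(prob for _, prob in rows)
print("sum of probabilities:", round(total, 10))
```

Because each attribute distribution sums to 1, the products over all combinations also sum to 1, which is the second property stated on the slide.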

Hyperedge Creation

MDB Creation

A weighted hierarchical hypergraph is defined as H(L, E), where L represents the nodes and E is the set of hyperedges between different taxonomies.

Each hyperedge signifies a distinct combination of attribute values. The weight of a possible world assigned to a hyperedge [AC10] needs to preserve a few properties.

All t-norms [AC10] (e.g., minimum, product) fulfill these requirements. We choose the product for the purposes of our preliminary investigation.
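A minimal sketch of how a t-norm could combine node weights into a hyperedge weight is shown below. The function names and the example weights are assumptions made for illustration; the slides do not give the authors' actual data structures.

```python
from functools import reduce


def t_product(a, b):
    """Product t-norm."""
    return a * b


def t_minimum(a, b):
    """Minimum t-norm."""
    return min(a, b)


def hyperedge_weight(node_weights, t_norm=t_product):
    """Fold a t-norm over the weights of the nodes joined by one hyperedge.
    1.0 is the neutral element of every t-norm, so it is a safe start value."""
    return reduce(t_norm, node_weights, 1.0)


# Hypothetical weights of two taxonomy nodes connected by one hyperedge.
weights = [0.8, 0.6]
print("product:", round(hyperedge_weight(weights), 2))
print("minimum:", hyperedge_weight(weights, t_minimum))
```

With the product t-norm the hyperedge weight coincides with the product-of-probabilities rule used for the EDB, which may be one reason it is a convenient first choice.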

EDB and MDB

EDB:

ID   TempID   Germination Time   Stem Canker      ProbEDB
R1   T1       August             Above-sec-node   0.80
R1   T2       September          Above-sec-node   0.20
R2   T3       August             Above-sec-node   0.60
R2   T4       August             Absent           0.40
R3   T5       August             Above-sec-node   0.48
R3   T6       August             Absent           0.32
R3   T7       September          Above-sec-node   0.12
R3   T8       September          Absent           0.08
R4   T9       September          Absent           1.00

MDB:

ID   TempID   Germination Time   Stem Canker      ProbMDB
R1   T1       August             Above-sec-node   0.9057
R1   T2       September          Above-sec-node   0.0943
R2   T3       August             Above-sec-node   0.6429
R2   T4       August             Absent           0.3571
R3   T5       August             Above-sec-node   0.4983
R3   T6       August             Absent           0.2768
R3   T7       September          Above-sec-node   0.0519
R3   T8       September          Absent           0.1730
R4   T9       September          Absent           1.0000

Aggregated Querying

We aggregate tuples for aggregated querying based on their uniqueness.

Two tuples are grouped only when all their attribute values and the corresponding probabilities are the same.

Example query: find the total number of plants grown in August that have the stem canker Above-sec-node.

(44*0.9057) + (25*0.6429) ≈ 56

GID   Germination Time   Stem Canker      Marginal probability   No. of plants
G1    August             Above-sec-node   0.9057                 44
G1    September          Above-sec-node   0.0943                 44
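The arithmetic behind the example query can be sketched as an expected count over grouped tuples. The row layout and the function name `expected_count` are assumptions; the 25-plant group with probability 0.6429 is taken from the formula on the slide, and the 0.0943 tuple is treated as the September world, following the MDB table.

```python
# Grouped MDB tuples: (germination time, stem canker, marginal probability, no. of plants).
GROUPS = [
    ("August", "Above-sec-node", 0.9057, 44),
    ("September", "Above-sec-node", 0.0943, 44),
    ("August", "Above-sec-node", 0.6429, 25),  # second group, taken from the slide's formula
]


def expected_count(groups, time, canker):
    """Sum plants * marginal probability over the groups matching the query."""
    return sum(
        prob * plants
        for g_time, g_canker, prob, plants in groups
        if g_time == time and g_canker == canker
    )


total = expected_count(GROUPS, "August", "Above-sec-node")
print(round(total, 4))  # 44*0.9057 + 25*0.6429
print(round(total))     # rounds to 56, as on the slide
```

Grouping identical tuples first means each distinct (attribute values, probability) combination is multiplied only once, rather than once per plant.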

Experimental Setup

Census-Income dataset from the UCI Machine Learning Repository.

7 dimensions were used in the final experiments.

The precise database has 191239 records.

The test dataset has 99762 records.

Imprecision was randomly inserted into the test dataset to make it imprecise.
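One way such random insertion of imprecision could be done is to replace a randomly chosen share of precise values with one of their taxonomy ancestors. The sketch below is an assumption about the general idea only: the slides do not describe the actual procedure, and the toy taxonomy `PARENT`, the function names, and the 25% fraction are all hypothetical.

```python
import random

# Hypothetical child -> parent taxonomy: months generalize to seasons, seasons to a root.
PARENT = {
    "June": "Summer", "July": "Summer",
    "August": "Fall", "September": "Fall",
    "Summer": "Any", "Fall": "Any",
}


def generalize(value, levels):
    """Climb the taxonomy by the given number of levels (stops at the root)."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value


def inject_imprecision(values, fraction, rng):
    """Replace a random fraction of precise values with one of their ancestors."""
    out = list(values)
    k = round(len(out) * fraction)
    for i in rng.sample(range(len(out)), k):
        out[i] = generalize(out[i], levels=rng.randint(1, 2))
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = ["June", "July", "August", "September"] * 5
noisy = inject_imprecision(data, 0.25, rng)
print(sum(a != b for a, b in zip(data, noisy)), "of", len(data), "values made imprecise")
```

Choosing the number of levels to climb controls how coarse the inserted imprecision is, which is the kind of per-level breakdown a distribution table of imprecision would report.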

Distribution of Imprecision

Attributes       age          mic          edu          cbf          cbm          cbs          wwy
Level 1 (Root)   150 (1)      107 (1)      136 (1)      100 (1)      100 (1)      100 (1)      375 (1)
Level 2          450 (3)      1393 (16)    409 (3)      144 (2)      144 (2)      144 (2)      1125 (3)
Level 3          900 (6)      8500 (24)    955 (7)      289 (4)      289 (4)      289 (4)      8500 (9)
Level 4          8500 (12)    -            8500 (17)    867 (12)     867 (12)     867 (12)     -
Level 5          -            -            -            8500 (42)    8500 (42)    8500 (42)    -
Total            10000 (22)   10000 (41)   10000 (28)   10000 (61)   10000 (61)   10000 (61)   10000 (13)

Imprecision Characteristics

[Figure: bar chart. Y-axis: No. of Tuples (0 to 45000). X-axis bins: 1, 2-10, 11-50, 51-100, 101-250, 251-500, >500. Bar labels in order: 34701, 38485, 16661, 2979, 2306, 978, 3652.]

Scalability Test

[Figure: Runtime (in sec., 0 to 40000) for dataset sizes from 10k to 99.7k, with a linear trendline, R² ≈ 0.9957.]

Extended Database Analysis

[Figure: number of tuples on a logarithmic axis (100 to 10000000). Series: Tuples with a linear trendline, and Outliers with a linear trendline.]

Influence of Imprecision

[Figure: No. of MDB tuples (100000 to 500000) with a degree 4 polynomial trendline, R² ≈ 0.9976.]

Absolute Percentage Error

[Figure: Absolute Percentage Error (0 to 9) at x-axis values 150, 300, 600, 900, 1200, 1500.]
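The slides do not spell out the error formula, so the sketch below assumes the usual definition of absolute percentage error: the absolute difference between the exact and the estimated query answer, divided by the exact answer, times 100. The function name and the example numbers are hypothetical and are not results from the experiments.

```python
def absolute_percentage_error(exact, estimate):
    """|exact - estimate| / |exact| * 100, assuming the usual APE definition."""
    if exact == 0:
        raise ValueError("APE is undefined when the exact answer is 0")
    return abs(exact - estimate) / abs(exact) * 100.0


# Hypothetical query: exact answer on precise data vs. estimate on imprecise data.
exact_answer = 50
estimated_answer = 47
print(absolute_percentage_error(exact_answer, estimated_answer))
```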

Conclusion and Future Work

In this research we present a framework for efficient querying over imprecise data, with an average accuracy of ≈ 94%.

We intend to extend this research to use ontologies in place of taxonomies.

We also intend to use Associative Weight Mining to assign weights to hyperedges.

Questions?

References

[BDRV07]: Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50

[AC10]: Rafal A. Angryk, Jacek Czerniak: Heuristic Algorithm for Interpretation of Multi-Valued Attributes in Similarity-based Fuzzy Relational Databases. International Journal of Approximate Reasoning 51: 895-911 (2010)