Multidimensional Data

• Many applications of databases are "geographic", i.e. they handle 2-dimensional data.
• Others involve large numbers of dimensions.
• Example: data about sales.
  - A sale is described by (store, day, item, color, size, etc.).
    • Sale = point in 5-dim space.
  - A customer is described by (age, salary, pcode, marital status, etc.).

Typical Queries
• Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
• Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?"
• They are expressible in SQL. Do you see how?

SQL
• Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"

SELECT COUNT(*)
FROM Customers
WHERE age >= 45 AND age <= 55 AND sal < 100;

• Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?" Suppose we have a relation Points(x,y,name).

SELECT *
FROM Points p
WHERE p.name = 'McDonalds' AND NOT EXISTS (
    SELECT *
    FROM Points q
    WHERE (q.x-a)*(q.x-a) + (q.y-b)*(q.y-b) < (p.x-a)*(p.x-a) + (p.y-b)*(p.y-b)
      AND q.name = 'McDonalds'
);

Big Impediment
• For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE clause.

An Approach for range queries
• Index on attributes independently.
  - Intersect pointers in main memory to save disk I/O.

Attempt at using B-trees for MD-queries
• Database = 1,000,000 points evenly distributed in a 1000 × 1000 square. Stored in 10,000 blocks (100 recs per block).
• B-tree secondary indexes on x and on y.

Range query {(x,y) : 450 <= x <= 550, 450 <= y <= 550}
• 100,000 pointers (i.e. 1,000,000/10) for the x range, and the same for y.
• 10,000 pointers for the answer (found by pointer intersection).
• Retrieve 10,000 records. If they are stored randomly we need to do 10,000 I/O's.

Add here the cost of the B-trees:
• Root of each B-tree is in main memory.
• Suppose leaves have avg. 200 keys, so 500 disk I/O's in each B-tree to get the pointer lists, i.e. 1000 + 2 (for the intermediate B-tree level) disk I/O's.

Total
• 11,002 disk I/O's, more than a sequential scan of the file = 10,000 I/O's.
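
The arithmetic above can be re-checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the figures are the ones assumed in the example (even distribution, 200 keys per leaf, one intermediate node per index).

```python
# Cost of the range query 450 <= x <= 550, 450 <= y <= 550 answered with
# two independent B-tree secondary indexes (figures from the example).
N_POINTS = 1_000_000      # points, evenly distributed
SIDE = 1000               # the square is SIDE x SIDE
RANGE_WIDTH = 100         # each range covers 1/10 of its axis
RECS_PER_BLOCK = 100      # data records per block
KEYS_PER_LEAF = 200       # average keys per B-tree leaf

pointers_per_index = N_POINTS * RANGE_WIDTH // SIDE          # pointers per range
answer_points = N_POINTS * RANGE_WIDTH**2 // SIDE**2         # after intersection

leaf_ios_per_index = pointers_per_index // KEYS_PER_LEAF     # leaves read per index
index_ios = 2 * leaf_ios_per_index + 2    # two indexes, +1 intermediate node each
data_ios = answer_points                  # worst case: one I/O per record
total_ios = index_ios + data_ios
scan_ios = N_POINTS // RECS_PER_BLOCK     # sequential scan of the whole file

print(total_ios, scan_ios)
```

Changing RANGE_WIDTH shows how quickly the index-based plan becomes cheaper than the scan once the ranges get narrow.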

Nearest Neighbor query using B-trees
• Turn NN to (10,20) into a range query {(x,y) : 10-d <= x <= 10+d, 20-d <= y <= 20+d}.
• Possible problems:
  1. No point in the selected range.
  2. The closest point inside may not be the answer.
• Solution: re-execute the range query with a slightly larger d.

NN-queries, example
• Same relation Points and its indexes on x and y as before. Query: NN to (10,20).
• Choose d = 1, so the range query is {(x,y) : 9 <= x <= 11, 19 <= y <= 21}.
• 2000 points in [9,11] and 2000 points in [19,21].
• For each dimension, we pay 10 + 1 I/O's to get the pointers from the B-tree leaves (2000 pointers at 200 per leaf); the +1 is because points with x = 9 may not start just at the beginning of a leaf.
• Add an extra I/O for the intermediate node when finding the start of the range for each index.
• Total 24 + 1 disk I/O's to get the answer,
  - assuming 1 of the 4 points in the intersection is the answer, which we can determine by their coordinates, prior to getting the data blocks holding the points.
• However, if d is too small, we have to run another range query with a larger d.
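
The retry strategy can be sketched in Python as follows. This is an illustration under assumptions, not the method of any particular system: the function name, the doubling factor and the linear scan are made up here, and a real system would get the candidates from the B-tree indexes on x and y instead of scanning a list.

```python
import math

def nearest_by_growing_range(points, target, d=1.0, growth=2.0):
    """Find the point nearest to target by repeated square range queries.

    points: list of (x, y) tuples (scanned linearly in this sketch).
    d: initial half-width of the square; growth: factor applied on each retry.
    """
    if not points:
        return None
    tx, ty = target
    while True:
        # Range query: the square of half-width d centred on the target.
        inside = [(x, y) for (x, y) in points
                  if abs(x - tx) <= d and abs(y - ty) <= d]
        if inside:
            best = min(inside, key=lambda p: math.dist(p, target))
            # Problem 2: a point just outside the square can be closer than
            # a point in a corner of it. Every point outside the square is
            # farther than d, so best is safe only if it lies within d.
            if math.dist(best, target) <= d:
                return best
        d *= growth  # problem 1 or 2 occurred: retry with a larger d


print(nearest_by_growing_range([(10.9, 20.9), (10, 21.2), (500, 500)], (10, 20)))
```

In the call above the first query (d = 1) finds only the corner point (10.9, 20.9), which is farther than d from the target, so the query is repeated with d = 2 and the closer point (10, 21.2) is returned.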

Grid files (hash-like structure)

Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

• Divide data into stripes in each dimension

• Each rectangle is a bucket

• Example: database records (age,salary) for people who buy gold jewelry.

Grid file

Operations

Lookup
• Find the coordinates of the point in each dimension; this gives you a bucket to search.

Nearest Neighbor
• Look up point P. Consider the points in its bucket.
• Problem: there could be points in adjacent buckets that are closer.
• Problem: there could be no points at all in the bucket: widen the search?

Range Queries
• Ranges define a region of buckets.
• Buckets on the border may contain points not in the range.
• Example: 35 < age <= 45; 50 < salary <= 100.

Queries Specifying Only One Attribute
• Must search a whole row or column of buckets.

Insertion
• Use overflow buckets, or split stripes in one or more dimensions.
• Example: insert (52,200). Split the central bucket, for instance by splitting the central salary stripe (one possibility).
• Blocks of 3 buckets are to be processed.
• In general, blocks of n buckets are to be processed during a split.
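
A toy in-memory version of the grid file operations might look like the Python sketch below. The class and method names are invented for illustration, and the stripe boundaries (40 and 55 for age, 90 and 225 for salary) are assumptions, since the slides do not list them. Overflow handling and stripe splitting are left out.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class GridFile:
    """Toy in-memory grid file over (age, salary) points."""

    def __init__(self, age_bounds, salary_bounds):
        # Sorted stripe boundaries. Stripe i holds values v with
        # bounds[i-1] < v <= bounds[i], matching ranges like 35 < age <= 45.
        self.age_bounds = age_bounds
        self.salary_bounds = salary_bounds
        self.buckets = defaultdict(list)  # (age stripe, salary stripe) -> points

    def cell(self, age, salary):
        """Lookup step: find the stripe of the point in each dimension."""
        return (bisect_left(self.age_bounds, age),
                bisect_left(self.salary_bounds, salary))

    def insert(self, age, salary):
        self.buckets[self.cell(age, salary)].append((age, salary))

    def lookup(self, age, salary):
        return (age, salary) in self.buckets.get(self.cell(age, salary), [])

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        """Points with age_lo < age <= age_hi and sal_lo < salary <= sal_hi."""
        a0 = bisect_right(self.age_bounds, age_lo)
        a1 = bisect_left(self.age_bounds, age_hi)
        s0 = bisect_right(self.salary_bounds, sal_lo)
        s1 = bisect_left(self.salary_bounds, sal_hi)
        hits = []
        for i in range(a0, a1 + 1):
            for j in range(s0, s1 + 1):
                # Border buckets may hold points outside the range: filter.
                hits += [(a, s) for (a, s) in self.buckets.get((i, j), [])
                         if age_lo < a <= age_hi and sal_lo < s <= sal_hi]
        return sorted(hits)


POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

grid = GridFile(age_bounds=[40, 55], salary_bounds=[90, 225])  # assumed stripes
for age, salary in POINTS:
    grid.insert(age, salary)

print(grid.range_query(35, 45, 50, 100))
```

With these assumed stripes, the range query touches four buckets but only one of the points in them actually lies in the range, which is the "border bucket" effect mentioned above.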

Grid files

Advantages
• Good for multiple-key search
• Supports partial match, range and NN queries

Disadvantages
• Space management overhead
• Need partitioning ranges that evenly split keys
• Possibility of overflow buckets for insertion

Partitioned hashing I
• If we hash the concatenation of several keys, then such a hash table cannot be used in queries specifying only one dimension (key).
• Instead, create the hash function h as a concatenation of n hash functions, one for each dimensional attribute.
• h = (h1, …, hn)
• The bucket in which to put a tuple (v1, …, vn) is computed by concatenating the bit sequences h1(v1)…hn(vn).

Partitioned hashing II
• Example: gold jewelry with
  - first bit: age mod 2
  - bits 2 and 3: salary mod 4
• Partial match?
• Range?
• NN?

Partitioned hashing III
• Partial match query
  - specifying only the value of age:
    • compute h_age(a), which could be, say, 1.
    • Then, locate all the relevant buckets, which are from 100 to 111.
  - specifying only the value of salary:
    • compute h_salary(s), which could be, say, 10.
    • Then, locate the relevant buckets, which are 010 and 110.
• Bad for:
  - range queries
  - nearest neighbor queries
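
The gold jewelry hash function and the two partial match cases can be sketched in Python. The function names are hypothetical, and bucket numbers are kept as 3-character bit strings to match the slides.

```python
def bucket(age, salary):
    """3-bit bucket number: first bit from age mod 2, bits 2-3 from salary mod 4."""
    return format(age % 2, "01b") + format(salary % 4, "02b")

def buckets_for_age(age):
    """Partial match on age: the first bit is fixed, the salary bits are not."""
    first = format(age % 2, "01b")
    return [first + format(s, "02b") for s in range(4)]

def buckets_for_salary(salary):
    """Partial match on salary: the last two bits are fixed, the age bit is not."""
    last = format(salary % 4, "02b")
    return [format(a, "01b") + last for a in range(2)]


print(bucket(50, 75))           # age bit 0, salary bits 11
print(buckets_for_age(25))      # h_age = 1: buckets 100 to 111
print(buckets_for_salary(110))  # h_salary = 10: buckets 010 and 110
```

Nearby values such as salaries 99, 100 and 101 land in unrelated buckets, which is why this scheme gives no help for range or nearest neighbor queries.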

Grid files vs. partitioned hashing

• If there are many dimensions, a grid file has many empty cells, while partitioned hashing is OK.

• Both support exact and partial match queries.

• Grid files are good for range and NN queries, while partitioned hashing is not good for them at all.

Multiple-key indexes
• Index on one attribute provides a pointer to an index on the other.
• Let V be a value of the first attribute.
• Then the index we reach by following the pointer for V is an index into the set of points that have V as their value in the first attribute and any value for the second attribute.
• "Who buys gold jewelry" (age and salary only). Raw data in (age; salary) pairs:

(25; 60) (45; 60) (50; 75) (50; 100)
(50; 120) (70; 110) (85; 140) (30; 260)
(25; 400) (45; 350) (50; 275) (60; 260)

• Question: For what kinds of queries will a multiple-key index (age first) significantly reduce the number of disk I/O's?

Example

The indexes can be organized as B-Trees.

Operations

Partial match queries
• If the first attribute is specified, then the access is quite efficient.
• If the first attribute isn't specified, then we have to search every sub-index.

Range queries
• Work quite well, provided the individual indexes themselves support range queries on their attribute (e.g. they are B-trees).
  - Example: the range query 35 <= age <= 55 AND 100 <= sal <= 200.
• Also, the indexes should be "primary" ones if we want to support range queries efficiently.

NN queries
• Similar to range queries.
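
A minimal in-memory sketch of a multiple-key index (age first) is given below in Python. Sorted lists with binary search stand in for the B-trees, and the class and method names are made up for illustration.

```python
from bisect import bisect_left, bisect_right, insort

class MultiKeyIndex:
    """Index on age whose entries lead to a sorted sub-index on salary."""

    def __init__(self, points):
        self.sub = {}                     # age -> sorted list of salaries
        for age, salary in points:
            insort(self.sub.setdefault(age, []), salary)
        self.ages = sorted(self.sub)      # sorted keys of the first attribute

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        """Points with age_lo <= age <= age_hi and sal_lo <= salary <= sal_hi."""
        hits = []
        lo = bisect_left(self.ages, age_lo)
        hi = bisect_right(self.ages, age_hi)
        for age in self.ages[lo:hi]:      # only sub-indexes in the age range
            sals = self.sub[age]
            for s in sals[bisect_left(sals, sal_lo):bisect_right(sals, sal_hi)]:
                hits.append((age, s))
        return hits

    def by_salary(self, salary):
        """Partial match on salary only: every sub-index must be searched."""
        hits = []
        for age in self.ages:
            sals = self.sub[age]
            i = bisect_left(sals, salary)
            if i < len(sals) and sals[i] == salary:
                hits.append((age, salary))
        return hits


POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
index = MultiKeyIndex(POINTS)

print(index.range_query(35, 55, 100, 200))
print(index.by_salary(260))
```

The range query visits only the sub-indexes for ages 45 and 50, while the salary-only query has to loop over all seven sub-indexes.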

KD-Trees
• Levels rotate among the dimensions, partitioning the points by comparison with a value for that dimension.
• Leaves are blocks holding the data records.

Geometrically…
• Remember we didn't want the stripes in grid files to continue all along the vertical or horizontal direction.
• Here they don't.

Operations

Lookup in KD-trees
• Find the appropriate leaf by binary search. Is the record there?

Insert into KD-trees
• Look up the record to be inserted, reaching the appropriate leaf.
• If there is room, put the record in that block.
• If not, find a suitable value for the appropriate dimension and split the leaf block using that dimension.

Example
• Someone 35 years old with a salary of $500K buys gold jewelry.
• Belongs in the leaf with (25; 400) and (45; 350).
• Too full: split on age. See figure next.
• It is age's turn to be used for the split. Split at 35; it's the median.

Queries

Partial match queries
• When we don't know the value of the attribute at the node, we must explore both of its children.
  - E.g. find points with age = 50.

Range queries
• Sometimes a range will allow us to move to only one child of a node.
• But if the range straddles the splitting value then we must explore both children.
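
Lookup and range search in a KD-tree can be sketched in Python as below. The details are assumptions made for illustration: age is used at the root, leaves hold at most 2 points, and the split value is the middle one of the distinct coordinates, which need not match the tree drawn in the slides.

```python
class Leaf:
    def __init__(self, points):
        self.points = points

class Node:
    def __init__(self, dim, value, left, right):
        self.dim, self.value = dim, value    # split dimension and value
        self.left, self.right = left, right  # left: coord < value, right: >= value

def build(points, depth=0, capacity=2):
    """Build a KD-tree over 2-d points; levels rotate among the dimensions."""
    if len(points) <= capacity:
        return Leaf(points)
    dim = depth % 2
    coords = sorted({p[dim] for p in points})
    if len(coords) == 1:
        # Cannot split on this dimension; keep an oversized leaf. A fuller
        # version would try the other dimension or use overflow blocks.
        return Leaf(points)
    value = coords[len(coords) // 2]         # middle distinct coordinate
    left = [p for p in points if p[dim] < value]
    right = [p for p in points if p[dim] >= value]
    return Node(dim, value,
                build(left, depth + 1, capacity),
                build(right, depth + 1, capacity))

def lookup(node, point):
    """Follow one path to a leaf, then check the leaf block."""
    while isinstance(node, Node):
        node = node.left if point[node.dim] < node.value else node.right
    return point in node.points

def range_query(node, lo, hi):
    """Points p with lo[d] <= p[d] <= hi[d] in both dimensions."""
    if isinstance(node, Leaf):
        return [p for p in node.points
                if all(lo[d] <= p[d] <= hi[d] for d in range(2))]
    hits = []
    if lo[node.dim] < node.value:    # range reaches below the split value
        hits += range_query(node.left, lo, hi)
    if hi[node.dim] >= node.value:   # range reaches the split value or above
        hits += range_query(node.right, lo, hi)
    return hits


POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = build(POINTS)
INF = float("inf")

print(sorted(range_query(tree, (35, 100), (55, 200))))
print(sorted(range_query(tree, (50, -INF), (50, INF))))  # partial match: age = 50
```

The partial match query is written as a range that is unbounded in salary, so at every node that splits on salary both children are explored, as described above.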

KD-trees in secondary storage
• If internal nodes don't fit in main memory, group them into blocks.

Quad trees
• Nodes split along all dimensions at once.
• For a quad tree of k dimensions, each interior node has 2^k children.

[Figure: the gold-jewelry points a to l plotted on Age and Sal axes, with the corresponding quad tree; interior nodes split at (Age 50, Sal 200), (Age 25, Sal 300) and (Age 75, Sal 100).]

Why quad trees?
• A node in k dimensions has 2^k children, e.g. k = 7 gives 128 children.
• If 128, or 2^7, pointers can fit in a block, then k = 7 is a convenient number of dimensions.

Quad Tree Insert and Queries

Insert
• Find the leaf node in which the new point belongs.
• If there is room, put it there.
• If not, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
• Problem: may make lots of null pointers, especially in high dimensions.

Quad Tree Queries
• Single point queries: easy; just go down the tree to the proper leaf.
• Range queries: varies by position of range.
  - Example: a range like 45 < age < 55; 180 < salary < 220 requires search of four leaves.
• Nearest neighbor: problems and strategies similar to grid files.
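
Insert and range search for a 2-dimensional quad tree can be sketched in Python as follows. The sketch assumes that each node splits its region at the centre, that a leaf holds at most 2 points, and that the root region (age 0 to 100, salary 0 to 400) contains every inserted point; the class name and these parameters are illustrative only.

```python
class QuadTree:
    """Quad tree over the rectangle x0..x1 by y0..y1, split at the centre."""

    def __init__(self, x0, x1, y0, y1, capacity=2):
        self.box = (x0, x1, y0, y1)
        self.capacity = capacity
        self.points = []       # used while this node is a leaf
        self.children = None   # four subtrees once this node is interior

    def _child(self, x, y):
        x0, x1, y0, y1 = self.box
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(x >= xm) + 2 * (y >= ym)]

    def insert(self, x, y):
        if self.children is not None:
            self._child(x, y).insert(x, y)
            return
        self.points.append((x, y))
        # Split an overfull leaf, unless all its points coincide (they
        # could never be separated, so the leaf is left oversized).
        if len(self.points) > self.capacity and len(set(self.points)) > 1:
            self._split()

    def _split(self):
        x0, x1, y0, y1 = self.box
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        # Child order matches _child: index = (x >= xm) + 2 * (y >= ym).
        self.children = [QuadTree(x0, xm, y0, ym, self.capacity),
                         QuadTree(xm, x1, y0, ym, self.capacity),
                         QuadTree(x0, xm, ym, y1, self.capacity),
                         QuadTree(xm, x1, ym, y1, self.capacity)]
        pts, self.points = self.points, []
        for px, py in pts:
            self._child(px, py).insert(px, py)

    def range_query(self, xlo, xhi, ylo, yhi):
        """Points with xlo < x < xhi and ylo < y < yhi."""
        x0, x1, y0, y1 = self.box
        if xhi <= x0 or xlo >= x1 or yhi <= y0 or ylo >= y1:
            return []          # the query rectangle misses this region
        if self.children is None:
            return [(x, y) for (x, y) in self.points
                    if xlo < x < xhi and ylo < y < yhi]
        return [p for c in self.children
                for p in c.range_query(xlo, xhi, ylo, yhi)]


POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = QuadTree(0, 100, 0, 400)   # assumed root region: age by salary
for age, sal in POINTS:
    tree.insert(age, sal)

print(sorted(tree.range_query(40, 60, 50, 130)))
```

In 2 dimensions each split creates four children; the same scheme in k dimensions would create 2^k children per split, many of them empty, which is the null pointer problem noted above.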
