Post on 18-Dec-2015
Multidimensional Data
• Many applications of databases are "geographic", i.e. they hold 2-dimensional data.
• Others involve large numbers of dimensions.
• Example: data about sales.
  - A sale is described by (store, day, item, color, size, etc.).
    • Sale = point in a 5-dimensional space.
  - A customer is described by (age, salary, pcode, marital status, etc.).
Typical Queries
• Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
• Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?"
• They are expressible in SQL. Do you see how?
SQL
• Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
  SELECT COUNT(*) FROM Customers
  WHERE age >= 45 AND age <= 55 AND sal < 100;
• Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?" Suppose we have a relation Points(x,y,name).
  SELECT * FROM Points p
  WHERE p.name = 'McDonalds' AND NOT EXISTS (
      SELECT * FROM Points q
      WHERE (q.x-a)*(q.x-a) + (q.y-b)*(q.y-b) < (p.x-a)*(p.x-a) + (p.y-b)*(p.y-b)
        AND q.name = 'McDonalds'
  );
Big Impediment
• For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE clause.
An Approach for range queries
• Index on attributes independently.
  - Intersect pointers in main memory to save disk I/O.
Attempt at using B-trees for MD-queries
• Database = 1,000,000 points evenly distributed in a 1000×1000 square. Stored in 10,000 blocks (100 records per block).
• B-tree secondary indexes on x and on y.
• Range query: {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}
• 100,000 pointers (i.e. 1,000,000/10) for the x range, and the same for y.
• 10,000 pointers for the answer (found by pointer intersection).
• Retrieve 10,000 records. If they are stored randomly we need to do 10,000 I/O's.
Add here the cost of the B-trees:
• The root of each B-tree is in main memory.
• Suppose leaves have avg. 200 keys, so 500 disk I/O's in each B-tree to get the pointer lists, i.e. 1000 + 2 (for the intermediate B-tree level) disk I/O's.
Total
• 11,002 disk I/O's, more than a sequential scan of the file = 10,000 I/O's.
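
The arithmetic on this slide can be re-derived with a few lines of Python. This is only a sketch that restates the slide's own assumptions (uniformly distributed points, 200 keys per leaf, cached roots, one I/O per retrieved record); the function name `btree_range_cost` is made up for illustration.

```python
def btree_range_cost(n_points, side, lo, hi, keys_per_leaf):
    """Estimate disk I/Os for a square range query [lo, hi] x [lo, hi]
    answered with two independent B-tree indexes (one on x, one on y)."""
    # Pointers read from ONE index: the fraction of the axis that is covered.
    ptrs_per_dim = n_points * (hi - lo) // side
    # Leaves scanned in one index to collect those pointers.
    leaf_ios = ptrs_per_dim // keys_per_leaf
    # Root is assumed cached; one intermediate node is read per index.
    index_ios = 2 * (leaf_ios + 1)
    # Records in the intersection; stored randomly, so one I/O each.
    record_ios = n_points * (hi - lo) ** 2 // side ** 2
    return {
        "pointers_per_dim": ptrs_per_dim,
        "index_ios": index_ios,
        "record_ios": record_ios,
        "total": index_ios + record_ios,
    }

cost = btree_range_cost(1_000_000, 1000, 450, 550, 200)
print(cost)
```

With the slide's numbers the total comes out above the 10,000 I/O's of a full sequential scan, which is the point being made.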
Nearest Neighbor query using B-trees
• Turn NN to (10,20) into a range query: {(x,y) : 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d}
• Possible problems:
  1. No point in the selected range.
  2. The closest point inside may not be the answer.
• Solution: re-execute the range query with a slightly larger d.
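
The retry loop described above might look like the following Python sketch. A linear scan stands in for the two B-tree lookups and the pointer intersection, and the doubling of d on each retry is an assumption made here for simplicity (the slide only says "slightly larger"). The names `range_query` and `nearest_neighbor` are illustrative.

```python
import math

def range_query(points, cx, cy, d):
    """Points inside the square [cx-d, cx+d] x [cy-d, cy+d].
    A linear scan stands in for the index lookups."""
    return [(x, y) for (x, y) in points
            if cx - d <= x <= cx + d and cy - d <= y <= cy + d]

def nearest_neighbor(points, cx, cy, d=1.0, growth=2.0):
    """Repeat the range query with a larger d until the answer is certain.
    Assumes d > 0 and growth > 1."""
    if not points:
        return None
    while True:
        candidates = range_query(points, cx, cy, d)
        if candidates:
            best = min(candidates,
                       key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
            # Problem 2: a point near a corner of the square can be farther
            # away than a point just outside an edge. The candidate is safe
            # only if it lies within distance d of the query point.
            if math.hypot(best[0] - cx, best[1] - cy) <= d:
                return best
        # Problem 1 (empty range) or problem 2: widen the square and retry.
        d *= growth

# (10.9, 20.9) is inside the d=1 square but (10, 21.2), just outside, is closer.
print(nearest_neighbor([(10.9, 20.9), (10, 21.2)], 10, 20))
```

The acceptance test (distance at most d) is what makes the result correct: any point outside the square is farther than d from the query point along at least one axis.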
NN-queries, example
• Same relation Points and its indexes on x and y as before, and query: NN to (10,20).
• Choose d = 1, so the range query is {(x,y) : 9 ≤ x ≤ 11, 19 ≤ y ≤ 21}.
• 2000 points with x in [9,11], and 2000 points with y in [19,21].
• For each dimension, we pay 10+1 I/O's to get pointers from the B-tree leaves.
  - The +1 is because points with x=9 may not start just at the beginning of a leaf.
• Add an extra I/O for the intermediate node when finding the start of the range in each index.
• Total 24 + 1 disk I/O's to get the answer,
  - assuming 1 of the 4 points in the intersection is the answer, which we can determine by their coordinates, prior to getting the data blocks holding the points.
• However, if d is too small, we have to run another range query with a larger d.
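
As a check on the numbers in this example, the same counting can be written out in Python. The sketch only restates the slide's assumptions (uniform data, 200 keys per leaf, cached roots, a single data block read for the answer); `nn_range_cost` is a hypothetical helper name.

```python
def nn_range_cost(n_points=1_000_000, side=1000, d=1, keys_per_leaf=200):
    """I/O estimate for one range query of half-width d around the query point."""
    width = 2 * d
    # Pointers in one index for a strip of the given width.
    ptrs_per_dim = n_points * width // side
    # Leaves needed to hold them (ceiling division), plus one because the
    # range may not start at the beginning of a leaf.
    leaf_ios = -(-ptrs_per_dim // keys_per_leaf) + 1
    # One intermediate node per index to find the start of the range.
    index_ios = 2 * (leaf_ios + 1)
    # Expected number of points in the width x width square.
    candidates = n_points * width * width // (side * side)
    # One more I/O to fetch the data block of the chosen point.
    return {"pointers_per_dim": ptrs_per_dim, "candidates": candidates,
            "index_ios": index_ios, "total": index_ios + 1}

print(nn_range_cost())
```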
Grid files (hash-like structure)
Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)
• Divide data into stripes in each dimension.
• Each rectangle is a bucket.
• Example: database records (age, salary) for people who buy gold jewelry.
Grid file
Operations
Lookup
• Find the coordinates of the point in each dimension; this gives you a bucket to search.
Nearest Neighbor
• Lookup point P. Consider points in its bucket.
  - Problem: there could be points in adjacent buckets that are closer.
  - Problem: there could be no points at all in the bucket: widen the search?
Range Queries
• Ranges define a region of buckets.
  - Buckets on the border may contain points not in the range.
  - Example: 35 < age <= 45; 50 < salary <= 100.
Queries Specifying Only One Attribute
• Must search a whole row or column of buckets.
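
A minimal in-memory sketch of these grid file operations follows, in Python. The stripe boundaries (age at 40 and 55, salary at 90 and 225) are assumptions chosen for illustration, since the transcript does not preserve the figure; so is the convention that a value on a boundary belongs to the upper stripe. All function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

AGE_CUTS = [40, 55]      # assumed stripe boundaries (not given in the text)
SALARY_CUTS = [90, 225]  # assumed stripe boundaries (not given in the text)

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def stripe(cuts, v):
    """Index of the stripe holding v; a value on a cut goes to the upper stripe."""
    return bisect_right(cuts, v)

def build_grid(points):
    buckets = defaultdict(list)
    for age, sal in points:
        buckets[(stripe(AGE_CUTS, age), stripe(SALARY_CUTS, sal))].append((age, sal))
    return buckets

def lookup(buckets, age, sal):
    """Lookup: the coordinates pick exactly one bucket to search."""
    cell = (stripe(AGE_CUTS, age), stripe(SALARY_CUTS, sal))
    return (age, sal) in buckets.get(cell, [])

def range_query(buckets, age_lo, age_hi, sal_lo, sal_hi):
    """Points with age_lo < age <= age_hi and sal_lo < sal <= sal_hi."""
    hits = []
    for i in range(stripe(AGE_CUTS, age_lo), stripe(AGE_CUTS, age_hi) + 1):
        for j in range(stripe(SALARY_CUTS, sal_lo), stripe(SALARY_CUTS, sal_hi) + 1):
            # Border buckets may hold points outside the range, so filter.
            for age, sal in buckets.get((i, j), []):
                if age_lo < age <= age_hi and sal_lo < sal <= sal_hi:
                    hits.append((age, sal))
    return hits

def match_age(buckets, age):
    """Only age is specified: search the whole column of buckets."""
    i = stripe(AGE_CUTS, age)
    return [p for j in range(len(SALARY_CUTS) + 1)
            for p in buckets.get((i, j), []) if p[0] == age]

grid = build_grid(DATA)
print(range_query(grid, 35, 45, 50, 100))
print(sorted(match_age(grid, 50)))
```

The range query from the slide visits four buckets under these assumed boundaries, and the final filter is what discards border-bucket points that fall outside the range.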
Insertion
• Use overflow buckets, or split stripes in one or more dimensions.
• Example: insert (52,200). Split the central bucket, for instance by splitting the central salary stripe (one possibility).
• Splitting a stripe affects every bucket in it: here blocks of 3 buckets are to be processed.
• In general, blocks of n buckets are to be processed during a split.
Grid files
Advantages
• Good for multiple-key search.
• Supports partial match, range, and NN queries.
Disadvantages
• Space management overhead.
• Need partitioning ranges that evenly split keys.
• Possibility of overflow buckets for insertion.
Partitioned hashing I
• If we hash the concatenation of several keys, then such a hash table cannot be used in queries specifying only one dimension (key).
• Instead, create the hash function h as a concatenation of n hash functions, one for each dimensional attribute.
• h = (h1, …, hn)
• The bucket in which to put a tuple (v1, …, vn) is computed by concatenating the bit sequences h1(v1)…hn(vn).
Partitioned hashing II
• Example: gold jewelry with
  - first bit: age mod 2
  - bits 2 and 3: salary mod 4
• Partial match?
• Range?
• NN?
Partitioned hashing III
• Partial match query
  - Specifying only the value of age:
    • Compute h_age(a), which could be, say, 1.
    • Then locate all the relevant buckets, which are those from 100 to 111.
  - Specifying only the value of salary:
    • Compute h_salary(s), which could be, say, 10.
    • Then locate the relevant buckets, which are 010 and 110.
• Bad for:
  - range queries
  - nearest neighbor queries
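
The bucket computation and the two partial match cases can be sketched in Python as follows, using the gold jewelry hash functions from the previous slide (age mod 2 for the first bit, salary mod 4 for bits 2 and 3). The function names are illustrative only.

```python
def h_age(age):
    return age % 2        # contributes 1 bit

def h_salary(salary):
    return salary % 4     # contributes 2 bits

def bucket(age, salary):
    """Concatenate the bit strings: the age bit first, then the two salary bits."""
    return format((h_age(age) << 2) | h_salary(salary), "03b")

def buckets_for_age(age):
    """Only age is known: the age bit is fixed, the salary bits are free."""
    return [format((h_age(age) << 2) | s, "03b") for s in range(4)]

def buckets_for_salary(salary):
    """Only salary is known: the salary bits are fixed, the age bit is free."""
    return [format((a << 2) | h_salary(salary), "03b") for a in range(2)]

print(bucket(25, 60))            # age bit 1, salary bits 00
print(buckets_for_age(25))       # the four buckets 100 to 111
print(buckets_for_salary(110))   # salary bits 10: buckets 010 and 110
```

Because mod-based hashing scatters nearby values across buckets, nothing in this scheme helps with ranges or nearest neighbors, which is why the slide lists both as weak points.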
Grid files vs. partitioned hashing
• If there are many dimensions, a grid file has many empty cells, while partitioned hashing is OK.
• Both support exact and partial match queries.
• Grid files are good for range and NN queries, while partitioned hashing is not suited to them at all.
Multiple-key indexes
• An index on one attribute provides a pointer to an index on the other.
• Let V be a value of the first attribute.
• Then the index we reach by following the pointer for V is an index into the set of points that have V for their value in the first attribute and any value for the second attribute.
• "Who buys gold jewelry" (age and salary only). Raw data in (age; salary) pairs:
  (25; 60) (45; 60) (50; 75) (50; 100)
  (50; 120) (70; 110) (85; 140) (30; 260)
  (25; 400) (45; 350) (50; 275) (60; 260)
• Question: For what kinds of queries will a multiple-key index (age first) significantly reduce the number of disk I/O's?
Example
The indexes can be organized as B-Trees.
Operations
Partial match queries
• If the first attribute is specified, then the access is quite efficient.
• If the first attribute isn't specified, then we have to search every sub-index.
Range queries
• Work quite well, provided the individual indexes themselves support range queries on their attribute (e.g. they are B-Trees).
  - Example: the range query is 35 ≤ age ≤ 55 AND 100 ≤ sal ≤ 200.
NN queries
• Similar to range queries.
Also, the indexes should be "primary" ones if we want to support range queries efficiently.
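
The behavior of a multiple-key index (age first) can be imitated in memory with sorted lists, as in this Python sketch. Sorted lists searched with `bisect` stand in for B-Trees here, which is a simplification; the function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build_index(points):
    """Outer index: sorted distinct ages. Each age points to a sub-index,
    here a sorted list of the salaries that occur with that age."""
    sub = {}
    for age, sal in points:
        sub.setdefault(age, []).append(sal)
    for sals in sub.values():
        sals.sort()
    return sorted(sub), sub

def range_query(index, age_lo, age_hi, sal_lo, sal_hi):
    """age_lo <= age <= age_hi AND sal_lo <= sal <= sal_hi."""
    ages, sub = index
    out = []
    # Range scan on the outer index, then a range scan in each sub-index reached.
    for age in ages[bisect_left(ages, age_lo):bisect_right(ages, age_hi)]:
        sals = sub[age]
        for sal in sals[bisect_left(sals, sal_lo):bisect_right(sals, sal_hi)]:
            out.append((age, sal))
    return out

def match_salary(index, sal):
    """First attribute not specified: every sub-index has to be searched."""
    ages, sub = index
    out = []
    for age in ages:
        sals = sub[age]
        i = bisect_left(sals, sal)
        if i < len(sals) and sals[i] == sal:
            out.append((age, sal))
    return out

index = build_index(DATA)
print(range_query(index, 35, 55, 100, 200))
print(match_salary(index, 260))
```

The slide's range query only touches the sub-indexes for ages 45 and 50, whereas the salary-only query walks all of them.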
KD-Trees
• Levels rotate among the dimensions, partitioning the points by comparison with a value for that dimension.
• Leaves are blocks holding the data records.
Geometrically…
• Remember we didn't want the stripes in grid files to continue all along the vertical or horizontal direction.
• Here they don't.
Operations
Lookup in KD Trees
• Find the appropriate leaf by binary search. Is the record there?
Insert Into KD Trees
• Lookup the record to be inserted, reaching the appropriate leaf.
• If there is room, put the record in that block.
• If not, find a suitable value for the appropriate dimension and split the leaf block using that dimension.
Example
• Someone 35 years old with a salary of $500K buys gold jewelry.
• Belongs in the leaf with (25; 400) and (45; 350).
• Too full: split on age. See figure next.
• It's the turn of "age" to be used for the split. Split at 35; it's the median.
Queries
Partial match queries
• When we don't know the value of the attribute at the node, we must explore both of its children.
  - E.g. find points with age=50.
Range Queries
• Sometimes a range will allow us to move to only one child of a node.
• But if the range straddles the splitting value then we must explore both children.
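
The insert and query rules for KD-trees can be put together in a short Python sketch. It assumes a leaf block holds at most 2 records (consistent with the example leaf overflowing on its third record), that dimension 0 is age and dimension 1 is salary, and that a leaf whose records all agree on the split dimension is simply left oversized. Class and function names are illustrative.

```python
BLOCK_CAPACITY = 2  # assumed records per leaf block

class Leaf:
    def __init__(self, points):
        self.points = list(points)

class Node:
    def __init__(self, dim, value, left, right):
        self.dim, self.value, self.left, self.right = dim, value, left, right

def insert(tree, point, depth=0):
    """Insert point and return the (possibly new) root of this subtree."""
    if isinstance(tree, Node):
        if point[tree.dim] < tree.value:
            tree.left = insert(tree.left, point, depth + 1)
        else:
            tree.right = insert(tree.right, point, depth + 1)
        return tree
    tree.points.append(point)
    if len(tree.points) <= BLOCK_CAPACITY:
        return tree                          # there was room in the block
    dim = depth % 2                          # levels rotate among dimensions
    values = sorted(p[dim] for p in tree.points)
    split = values[len(values) // 2]         # median value on that dimension
    left = [p for p in tree.points if p[dim] < split]
    right = [p for p in tree.points if p[dim] >= split]
    if not left:                             # no usable split on this dimension
        return tree                          # keep an oversized leaf
    return Node(dim, split, Leaf(left), Leaf(right))

def range_query(tree, lo, hi):
    """Points p with lo[d] <= p[d] <= hi[d] for every dimension d."""
    if isinstance(tree, Leaf):
        return [p for p in tree.points
                if all(lo[d] <= p[d] <= hi[d] for d in range(len(p)))]
    out = []
    if lo[tree.dim] < tree.value:            # range reaches below the split
        out += range_query(tree.left, lo, hi)
    if hi[tree.dim] >= tree.value:           # range reaches the split or above
        out += range_query(tree.right, lo, hi)
    return out

# The slide's example: the leaf is full and it is the turn of age (dim 0).
leaf = insert(Leaf([(25, 400), (45, 350)]), (35, 500), depth=0)
print(leaf.dim, leaf.value, leaf.left.points, leaf.right.points)
```

A partial match such as age=50 is the range query with `lo=(50, float("-inf"))` and `hi=(50, float("inf"))`: at salary nodes the range covers everything, so both children are explored, exactly as the slide describes.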
KD-trees in secondary storage
• If internal nodes don't fit in main memory, group them into blocks.
Quad trees
• Nodes split at all dimensions at once.
• For a quad tree of k dimensions, each interior node has 2^k children.
[Figure: quad tree over the (Age, Sal) plane holding the points a through l; interior nodes are labeled (Age 50, Sal 200), (Age 25, Sal 300) and (Age 75, Sal 100).]
Why quad trees?
• A node in k dimensions has 2^k children, e.g. k=7 gives 128 children.
• If 128, or 2^7, pointers can fit in a block, then k=7 is a convenient number of dimensions.
Quad Tree Insert and Queries
Insert
• Find the leaf node in which the new point belongs.
• If there is room, put it there.
• If not, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
• Problem: may make lots of null pointers, especially in high dimensions.
Quad Tree Queries
• Single point queries: easy; just go down the tree to the proper leaf.
• Range queries: varies by position of range.
  - Example: a range like 45<age<55; 180<salary<220 requires search of four leaves.
• Nearest neighbor: problems and strategies similar to grid files.
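
A two-dimensional quad tree with the insert and range query rules above could be sketched in Python like this. The sketch assumes a leaf holds at most 2 points, that the root covers ages 0 to 100 and salaries 0 to 400, that regions split at their midpoint, and that points on a split line go to the upper quadrant; none of these details are fixed by the transcript. `QuadTree` is a hypothetical class name.

```python
CAPACITY = 2  # assumed number of points per leaf block

class QuadTree:
    def __init__(self, x0, y0, x1, y1):
        self.box = (x0, y0, x1, y1)  # region of the plane covered by this node
        self.points = []             # used while the node is a leaf
        self.children = None         # four subtrees once the node is interior

    def _child(self, x, y):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def insert(self, x, y):
        if self.children is not None:
            self._child(x, y).insert(x, y)
            return
        self.points.append((x, y))
        # Split when the leaf overflows (identical points cannot be separated).
        if len(self.points) > CAPACITY and len(set(self.points)) > 1:
            x0, y0, x1, y1 = self.box
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                             QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
            pts, self.points = self.points, []
            for px, py in pts:
                self._child(px, py).insert(px, py)

    def range_query(self, qx0, qx1, qy0, qy1, visited=None):
        """Points with qx0 < x < qx1 and qy0 < y < qy1. If a list is passed
        as visited, the box of every leaf searched is appended to it."""
        x0, y0, x1, y1 = self.box
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []                # the range misses this region entirely
        if self.children is None:
            if visited is not None:
                visited.append(self.box)
            return [(x, y) for x, y in self.points
                    if qx0 < x < qx1 and qy0 < y < qy1]
        out = []
        for child in self.children:
            out += child.range_query(qx0, qx1, qy0, qy1, visited)
        return out

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = QuadTree(0, 0, 100, 400)
for age, sal in DATA:
    tree.insert(age, sal)

leaves = []
print(tree.range_query(45, 55, 180, 220, leaves), len(leaves))
```

Under these assumptions the root splits at (Age 50, Sal 200), so the example range straddles the center in both dimensions and has to look into all four quadrants.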