i/o-algorithms lars arge aarhus university march 5, 2008

43
I/O-Algorithms Lars Arge Aarhus University March 5, 2008

Post on 21-Dec-2015

219 views

Category:

Documents


1 download

TRANSCRIPT

I/O-Algorithms

Lars Arge

Aarhus University

March 5, 2008

Lars Arge

I/O-algorithms

2

I/O-Model

• Parameters

N = # elements in problem instance

B = # elements that fits in disk block

M = # elements that fits in main memory

T = # output size in searching problem

• We often assume that M>B2

• I/O: Movement of block between memory and disk

D

P

M

Block I/O

Lars Arge

I/O-algorithms

3

Fundamental Bounds Internal External

• Scanning: N

• Sorting: N log N

• Permuting

• Searching: NBlog

BN

BN

BMlog

BN

log,minBN

BN

BMNN

N2log

Lars Arge

I/O-algorithms

4

Fundamental Data Structures

• B-trees: Node degree (B) queries in

– Rebalancing using split/fuse updates in

• Weight-balanced B-tress: Weight rather than degree constraint

Ω(w(v)) updates below v between rebalancing operations on v

• Persistent B-trees:

– Update in current version in

– Search in all previous versions in

• Buffer trees

– Batching of operations to obtain bounds

construction algorithms

)(log BT

B NO )(log NO B

)(log BT

B NO )(log NO B

)log( 1BN

BMBO

)log(BN

BN

BMO

Lars Arge

I/O-algorithms

5

Last time: Interval management• Maintain N intervals with unique endpoints dynamically such that

stabbing query with point x can be answered efficiently

• Static solution: Persistent B-tree

– Linear space and query

• Dynamic solution: External interval tree

– update

x

)(log BT

B NO

)(log NO B

Lars Arge

I/O-algorithms

6

• Base tree on endpoints – “slab” Xv associated with each node v

• Interval stored in highest node v where it contains midpoint of Xv

• Intervals Iv associated with v stored in

– Left slab list sorted by left endpoint (search tree)

– Right slab list sorted by right endpoint (search tree)

Linear space and O(log N) update

Internal Interval Tree

Lars Arge

I/O-algorithms

7

• Query with x on left side of midpoint of Xroot

– Search left slab list left-right until finding non-stabbed interval

– Recurse in left child

O(log N+T) query bound

x

Internal Interval Tree

Lars Arge

I/O-algorithms

8

Externalizing Interval Tree

• Natural idea:

– Block tree

– Use B-tree for slab lists

• Number of stabbed intervals in large slab list may be small (or zero)

– We can be forced to do I/O in each of O(log N) nodes

Lars Arge

I/O-algorithms

9

Externalizing Interval Tree

• Idea:

– Decrease fan-out to height remains

– slabs define multislabs

– Interval stored in two slab lists (as before) and one multislab list

– Intervals in small multislab lists collected in underflow structure

– Query answered in v by looking at 2 slab lists and not O(log N)

)( B )(log NO B

)( B )(B

)( B

multislab

Lars Arge

I/O-algorithms

10

External Interval Tree• Linear space, query, update

• General solution techniques:

– Filtering: Charge part of query cost to output

– Bootstrapping:

* Use O(B2) size structure in each internal node

* Constructed using persistence

* Dynamic using global rebuilding

– Weight-balanced B-tree: Split/fuse in amortized O(1)

)(log BT

B NO )(log NO B

Lars Arge

I/O-algorithms

11

Last time: Three-Sided Range Queries• Interval management: “1.5 dimensional” search

• More general 2d problem: Dynamic 3-sidede range searching

– Maintain set of points in plane such

that given query (q1, q2, q3), all points

(x,y) with q1 x q2 and y q3 can

be found efficiently

• Linear space, O(logB N+T/B) query static solution using persistence

– Dynamic: External priority search tree

(x,x)

(x1,x2)

x

x1 x2

q3

q2q1

Lars Arge

I/O-algorithms

12

• Base tree on x-coordinates with nodes augmented with points

• Heap on y-coordinates

– Decreasing y values on root-leaf path

– (x,y) on path from root to leaf holding x

– If v holds point then parent(v) holds point

Linear space and O(log N) update

Internal Priority Search Tree9

16.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

Lars Arge

I/O-algorithms

13

Internal Priority Search Tree

• Query with (q1, q2, q3) starting at root v:

– Report point in v if satisfying query

– Visit both children of v if point reported

– Always visit child(s) of v on path(s) to q1 and q2

O(log N+T) query

916.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

4

194

Lars Arge

I/O-algorithms

14

• Natural idea: Block tree

• Problem:

– I/Os to follow paths to to q1 and q2

– But O(T) I/Os may be used to visit other nodes (“overshooting”)

query

Externalizing Priority Search Tree9

16.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

)(log NO B

)(log TNO B

Lars Arge

I/O-algorithms

15

Externalizing Priority Search Tree

• Solution idea:

– Store B points in each node * O(B2) points stored in each supernode

* B output points can pay for “overshooting”

– Bootstrapping:

* Store O(B2) points in each supernode in static structure

916.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

Lars Arge

I/O-algorithms

16

Summary/Conclusion: Priority Search Tree• We have now discussed structures for special cases of two-

dimensional range searching

– Space: O(N/B)

– Query:

– Updates:

• Cannot be obtained for general (4-sided) 2d range searching:

– query requires space

– space requires query

q3

q2q1q

q

q3

q2q1

q4

)(loglog

logN

NBN

BB

BΩ)(log NO cB

)(BNO )( B

N

)(log NO B

)(log BT

B NO

Lars Arge

I/O-algorithms

17

• Base tree: Weight balanced tree with branching parameter

and leaf parameter B on x-coordinates

height

• Points below each node stored in 4 linear space secondary structures:

– “Right” priority search tree

– “Left” priority search tree

– B-tree on y-coordinates

– Interval (priority search) tree

space

External Range Tree

)()(logloglog

loglog N

NN

BB

B

BONO

)(log NΘ B

)(loglog

logN

NBN

BB

BO

)(log NB

Lars Arge

I/O-algorithms

18

• Secondary interval tree:

– Connect points in each slab in y-order

– Project obtained segments in y-axis

– Intervals stored in interval tree

* Interval augmented with pointer to corresponding points in y-coordinate B-tree in corresponding child node

External Range Tree

)(log NB

Lars Arge

I/O-algorithms

19

• Query with (q1, q2, q3 , q4) answered in top node with q1 and q2 in different slabs v1 and v2

• Points in slab v1

– Found with 3-sided query in v1

using right priority search tree

• Points in slab v2

– Found with 3-sided query in v2

using left priority search tree

• Points in slabs between v1 and v2

– Answer stabbing query with q3 using interval tree

first point above q3 in each of the slabs

– Find points using y-coordinate B-tree in slabs

External Range Tree

)(log NO B

)(log NO B

)(log NB

v1 v2

Lars Arge

I/O-algorithms

20

External Range Tree• Query analysis:

– I/Os to find relevant node

– I/Os to answer two 3-sided queries

– I/Os to query interval tree

– I/Os to traverse B-trees

I/Os

)(log NO B

)(log BT

B NO )(log)(log log NONO BB

NB

B )(log B

TB NO )(log NO B

)(log BT

B NO )(log NB

v1 v2

Lars Arge

I/O-algorithms

21

External Range Tree• Insert:

– Insert x-coordinate in weight-balanced B-tree

* Split of v can be performed in I/Os

I/Os

– Update secondary structures in all nodes on one root-leaf path

* Update priority search trees

* Update interval tree

* Update B-tree

I/Os

• Delete:

– Similar and using global rebuilding

)(loglog

logN

N

BB

BO)(

logloglog2

NN

BB

BO

))(log)(( vwvwO B

)(loglog

log2

NN

BB

BO

)(log NB

v1 v2

Lars Arge

I/O-algorithms

22

Summary: External Range Tree• 2d range searching in space

– I/O query

– I/O update

• Optimal among query structures

)(loglog

log2

NN

BB

BO

)(log BT

B NO )(

logloglog

NN

BN

BB

BO

)(log BT

B NO

q3

q2q1

q4

Lars Arge

I/O-algorithms

23

kdB-tree

• kd-tree:

– Recursive subdivision of point-set into two half using vertical/horizontal line

– Horizontal line on even levels, vertical on uneven levels

– One point in each leaf

Linear space and logarithmic height

Lars Arge

I/O-algorithms

24

kd-Tree: Query

• Query

– Recursively visit nodes corresponding to regions intersecting query

– Report point in trees/nodes completely contained in query

• Query analysis

– Horizontal line intersect Q(N) = 2+2Q(N/4) = regions

– Query covers T regions I/Os worst-case

)( NO

)( TNO

Lars Arge

I/O-algorithms

25

kdB-tree

• kdB-tree:

– Stop subdivision when leaf contains between B/2 and B points

– BFS-blocking of internal nodes

• Query as before

– Analysis as before but each region now contains Θ(B) points

I/O query)( B

TB

NO

Lars Arge

I/O-algorithms

26

Construction of kdB-tree

• Simple algorithm

– Find median of y-coordinates (construct root)

– Distribute point based on median

– Recursively build subtrees

– Construct BFS-blocking top-down

• Idea in improved algorithm

– Construct levels at a time using O(N/B) I/Os

)log( 2 BN

BNO

)log(BN

BN

BMO

BMlog

Lars Arge

I/O-algorithms

27

Construction of kdB-tree• Sort N points by x- and by y-coordinates using I/Os

• Building levels ( nodes) in O(N/B) I/Os:

1. Construct by grid

with points in each slab

2. Count number of points in each

grid cell and store in memory

3. Find slab s with median x-coordinate

4. Scan slab s to find median x-coordinate and construct node

5. Split slab containing median x-coordinate and update counts

6. Recurse on each side of median x-coordinate using grid (step 3) Grid grows to during algorithm Each node constructed in I/Os

BMlog

)log(BN

BN

BMO

BM

BM

BM

N

BM

)( BM

BM

BM

BM

))/(( BNO BM

Lars Arge

I/O-algorithms

28

kdB-tree

• kdB-tree:

– Linear space

– Query in I/Os

– Construction in I/Os

– Point search in I/Os

• Dynamic?

– Deletions relatively easily in I/Os (partial rebuilding)

)( BT

BNO

)(log NO B

)log( / BN

BMBNO

)(log2 NO B

Lars Arge

I/O-algorithms

29

kdB-tree Insertion using Logarithmic Method

• Partition pointset S into subsets S0, S1, … Slog N, |Si| = 2i or |Si| = 0

• Build kdB-tree Di on Si

• Query: Query each Di

• Insert: Find first empty Di and construct Di out of

elements in S0,S1, … Si-1

– I/Os per moved point

– Point moved O(log N) times

I/Os amortized

..................................

0 2222 1 2 log N

iij

j 221 10

)()(log

02

BT

BN

BTN

i B OO ii

)log( 2/

2BBMB

ii

O )log( /1

BN

BMBO

)(log)loglog( 2/

1 NONO BBN

BMB

Lars Arge

I/O-algorithms

30

kdB-tree Insertion and Deletion• Insert: Use logarithmic method ignoring deletes

• Delete: Simply delete point p from relevant Di

– i can be calculated based on # insertions since p was inserted

– # insertions calculated by storing insertion number of each point in separate B-tree

extra update cost

• To maintain O(log N) structures Di

– Perform global rebuild after every Θ(N) updates

extra update cost

..................................

0 2222 1 2 log N

)(log NO B

)(log)log( /1 NOO BB

NBMB

Lars Arge

I/O-algorithms

31

Summary: kdB-tree

• 2d range searching in O(N/B) space

– Query in I/Os

– Construction in I/Os

– Updates in I/Os

• Optimal query among linear space structures

)( BT

BNO

)log( / BN

BMBNO

)(log2 NO B

q3

q2q1

q4

Lars Arge

I/O-algorithms

32

O-Tree Structure• O-tree:

– B-tree on vertical slabs

– B-tree on horizontal slabs in each vertical slab

– kdB-tree on points in each leaf

NBBN log

)log( NBBN

)log()( 2)log( 2 NB BN

NBB

N

NBBN 2log

)log( NBBN

NB B2log

Lars Arge

I/O-algorithms

33

O-Tree Query• Perform rangesearch with q1 and q2 in vertical B-tree

– Query all kdB-trees in leaves of two horizontal B-trees with x-interval intersected but not spanned by query

– Perform rangesearch with q3 and q4 horizontal B-trees with x-interval spanned by query

* Query all kdB-trees with range intersected by query

NBBN log

NBBN 2log

NB B2log

Lars Arge

I/O-algorithms

34

O-Tree Query Analysis• Vertical B-tree query:

• Query of all kdB-trees in leaves of two horizontal B-trees:

• Query horizontal B-trees:

• Query kdB-trees not completely in query

• Query in kdB-trees completely

contained in query:

I/Os

)())log((log BN

BBN

B ONO

)()()log()log( 2BT

BN

BT

BBBN OOBNBONO

)log( NO BBN

)())log((log)log( BN

BBN

BBBN ONONO

)()()log()log(2 2BT

BN

BT

BBBN OOBNBONO

)(BTO

)(BT

BNO

)log(2 NO BBN

Lars Arge

I/O-algorithms

35

O-Tree Update• Insert:

– Search in vertical B-tree: I/Os

– Search in horizontal B-tree: I/Os

– Insert in kdB-tree: I/Os

• Use global rebuilding when structures grow too big/small

– B-trees not contain elements

– kdB-trees not contain elements

I/Os

• Deletes can be handled

in I/Os similarly

)log( NBBN

)log( 2 NB B

)(log NO B

)(log NO B

)(log))log((log 22 NONBO BBB

)(log NO B

)(log NO B

Lars Arge

I/O-algorithms

36

Summary: O-Tree• 2d range searching in linear space

– I/O query

– I/O update

• Optimal among structures

using linear space

• Can be extended to work in d-dimensions

with optimal query bound

q3

q2q1

q4

)(log NO B

)(BT

BNO

))((11

BT

BN dO

Lars Arge

I/O-algorithms

37

Summary/Conclusion: 3 and 4-sided Queries• 3-sided 2d range searching: External priority search tree

– query, space, update

• General (4-sided) 2d range searching:

– External range tree: query, space,

update

– O-tree: query, space, update

q3

q2q1

q3

q2q1

q4

)(loglog

logN

NBN

BB

BO)(log BT

B NO

)(BNO)( B

TB

NO

)(log NO B)(log BT

B NO

)(log NO B

)(loglog

log2

NN

BB

BO

)(BNO

Lars Arge

I/O-algorithms

38

Summary/Conclusion: Tools and Techniques• Tools:

– B-trees

– Persistent B-trees

– Buffer trees

– Logarithmic method

– Weight-balanced B-trees

– Global rebuilding

• Techniques:

– Bootstrapping

– Filtering

q3

q2q1

q3

q2q1

q4

(x,x)

Lars Arge

I/O-algorithms

39

Other results• Many other results for e.g.

– Higher dimensional range searching

– Range counting, range/stabbing max, and stabbing queries

– Halfspace (and other special cases) of range searching

– Queries on moving objects

– Proximity queries (closest pair, nearest neighbor, point location)

– Structures for objects other than points (bounding rectangles)

• Many heuristic structures in database community

Lars Arge

I/O-algorithms

40

Point Enclosure Queries• Dual of planar range searching problem

– Report all rectangles containing query point (x,y)

• Internal memory:

– Can be solved in O(N) space and O(log N + T) time

x

y

Lars Arge

I/O-algorithms

41

Point Enclosure Queries• Similarity between internal and external results (space, query)

– in general tradeoff between space and query I/O

Internal External

1d range search (N, log N + T) (N/B, logB N + T/B)

3-sided 2d range search (N, log N + T) (N/B, logB N + T/B)

2d range search

2d point enclosure (N, log N + T)

TNN NN log,loglog

log BT

BNN

BN N

BB

B log,loglog

log

TNN , BT

BN

BN ,

(N/B, log N + T/B)?2B

(N/B, log N+T/B)

(N/B1-ε, logB N+T/B)

Lars Arge

I/O-algorithms

42

Next Time: Rectangle Range Searching• Report all rectangles intersecting query rectangle Q

Q

Lars Arge

I/O-algorithms

43

References

• External Memory Geometric Data Structures

Lecture notes by Lars Arge.

– Section 8-9