data structures & algorithms binary search trees

Data Structures & Algorithms

Binary Search Trees

Long studied problemIntrinsic problem in much of CSBusiness applicationsInformation retrievalCachingMatching problems

SearchSearch

Goal: given a key, find an item(s) that is associated with that key

If search is successful, called a hit.If not, called a miss.

May also extend to approximate search, range queries, etc.

SearchSearch

Defn. 12.1: A symbol table is a data structure of items with keys that supports two basic operations:

• insert an new item, and • return an item with a given key.

Symbol table is otherwise known as:• Dictionary• Associative memory (contrast with

retrieving the item stored at a particular address)

SearchSearch

Examples of symbol tables:• Telephone book• Encyclopedia• Index (vs. table of contents)• Compiler tables of identifiers

Huge advantage of computer-based symbol tables:

• Easy to be dynamic

SearchSearch

Symbol table ADT interface: Insert <new item>Search <key>Remove <item>Select <int>SortJoin <Symbol table>

May also want construct, destroy,empty test, copy, search&insert

SearchSearch

Symbol table ADTtemplate <class Item, class Key>class ST { Private: // impl dependent stuff Public: ST(int); int count(); Item search(Key); void insert(Item); void remove(Item); Item select(int); void show(ostream&);}

SearchSearch

Obligations imposed by symbol table ADT on Key:

key()null()operator==operator<Mods to rand, scan, show

Also – missing join()Interpretations – e.g., remove()

SearchSearch

Key values distinct, small numbersCan use an array implementationSimply put item with key k in st[k]To remove, replace with null itemTime = ?

ConstantSelect, sort, count all linear

Skip over null itemsClient has to deal with duplicate keys

Key-indexed SearchKey-indexed Search

If there are no items, just keysUse table of bits

Existence tableBit map

Applications:Allocation tables for memoryBlum filters


Prop 12.1: If key values are positive integers less than M and items have distinct keys, then the symbol-table data type can be implemented with key-indexed arrays of items such that insert, search, and remove require constant time, and initialize, select, and sort require time proportional to M, whenever any of the operations are performed on an N-item table.


If keys are over too large a range to used key-indexed approach, can still use a simple approach:Store contiguously in an array, either

• In order• Unordered

Trade-offs?• Insertion time• Search time• Removal time• Sort time

Sequential SearchSequential Search

If keys are over too large a range to used key-indexed approach, can still use a simple approach:Could also store in linked list, either

In orderUnordered

Trade-offs?Aside from those depending on

ordering, those from arrays vs. linked lists: insert/remove/join times vs. size


Prop 12.2: Sequential search in a symbol table with N items uses about N/2 comparisons for search hits (on average).

True for any of the four forms (array or linked list, ordered or unordered).


Prop 12.2: Sequential search in a symbol table with N unordered items uses a constant number of steps for inserts and N comparisons for search misses (always).

True for array or linked list.


Prop 12.3: Sequential search in a symbol table with N ordered items uses about N/2 comparisons for insertions, search hits, and search misses (on average).

True for array or linked list.


N = number of items, M = size of container

Costs of Insertion and SearchCosts of Insertion and SearchWorst Case

Worst Case

Worst Case

Avg Case

Avg Case

Avg Case

Insert Search Select Insert Search Hit

Search Miss

Key-Indexed Array

1 1 M 1 1 1

Ordered Array

N N 1 N/2 N/2 N/2

Ordered Linked List

N N N N/2 N/2 N/2

Unordered Array

1 N N lg N 1 N/2 N

Unordered Linked List

1 N N lg N 1 N/2 N

N = number of items, M = size of container

Costs of Insertion and SearchCosts of Insertion and SearchWorst Case

Worst Case

Worst Case

Avg Case

Avg Case

Avg Case

Insert Search Select Insert Search Hit

Search Miss

Binary Search N lg N 1 N/2 lg N lg N

Binary Search Tree

N N N lg N lg N lg N

Red-Black Tree

lg N lg N lg N lg N lg N lg N

Randomized Tree

N* N* N* lg N lg N lg N

Hashing 1 N* N lg N 1 1 1

Binary search – compare to middle item in sorted search table, then search appropriate half.

Binary SearchBinary Search

a c f j k m n p q r s t v x y





k?

hit

Prop 12.4: Binary search in a symbol table with N ordered items never uses more than [lg N] + 1 comparisons for a search (hit or miss).

Note that this is the same as the number of bits in the representation of N.Why?


Keeping table sorted•Expensive using insertion sort•May not matter if table is mostly static, •or if the number of searches is huge (relative to insertions).Improvements:•Interpolation search (try to guess where key will fall in ST)•assumes key values are evenly distributed


Use an explicit tree structure

Defn 12.2: A binary search tree (BST) is a binary tree that has a key associated with each internal node, and the key in a node is >= the keys in its left subtree and <= the keys in its right subtree

Binary Search TreesBinary Search Trees

Defn 12.2: A binary search tree (BST) is a binary tree that has a key associated with each internal node, and the key in a node is >= the keys in its left subtree and <= the keys in its right subtree


k

<=k >=k

BST search algorithm sketch

To search a BST T with root h for key k:If h is null (T is empty) – return miss If h.key == k – return hitIf k < h.key search left subtree h.leftelse search right subtree h.right


BST insert algorithm

To insert a new item X into a BST:Essentially search, then attach new node (and just as efficient as search): If tree empty set X as root and returnIf X.key < root.key insert X into left subtreeElse insert X into right subtree


BST sort algorithm

To sort a BST (i.e., output items stored in BST in order of keys):Inorder traversal of BST

If tree empty returnShow left subtreeShow rootShow right subtree


Prop. 12.6: Search hits require about 2 ln N ≈ 1.39 lg N comparisons on average, in a BST built from random keys. (Treating successive < and = operations as one comparison)

Prf: Search hit follows a path from root to the node with the key; average is internal path length over all nodes, + 1.


Prop. 12.7: Insertions and search misses require about 2 ln N ≈ 1.39 lg N comparisons on average, in a BST built from random keys. (Treating successive < and = operations as one comparison)

Prf: Search hit follows a path from root to an external node; average is external path length over all nodes.


Prop. 12.8: In the worst case, a search in a BST with N keys can take N comparisons.

Prf: The tree may only have one non-empty subtree per internal node.


May not want to move actual items around, and may want flexibility in item contents.Here, may use items stored in an array (or parallel arrays), with key extraction, then use parallel arrays for link (i.e., have left[i] and right[i] give indexes for left and right child, resp.)Appl: large or overlapping items (e.g., text strings in corpus)

Index ImplementationsIndex Implementations

Usually, insertion into BST at bottomReplace external node

What if need to insert at root, so new nodes are near the top?

Examine this now, as it will come in handy soon…..

BST InsertionBST Insertion

Wlog, assume new item has a key larger than that of the current root.

BST Insertion at RootBST Insertion at Root

H

D

B

C

F

E

L

M

J

I K

N

O

Tempting to make new node root, make old root left child, and old right child the right child of new root:


H

D

B

C

F

E

L

M

J

I K

N

O

But if there are items in right subtree with smaller keys, then this violates the BST requirements!


H

D

B

C

F

E

L

M

J

I K

N

O

Must somehow rearrange tree to re-establish BST property… moving all nodes with smaller key to left subtree looks costly!


H

D

B

C

F

E

L

M

J

I K

N

O

Solution: ROTATION


H

D

B

C

F

E

L

M

J

I K

N

O

>=N

Solution: ROTATION (here in BST)


L

N

R

A

<=L >=L &<=N

K M

>=N

Right Rotation in BST: make left child new root, make old root right child


L

N

R

A

<=L >=L &<=N

K M

>=N

Right Rotation in BST: patch up links from A, to “middle” subtree, betw/L & N


L

N

R

A

<=L >=L &<=N

K M

>=N

Result is still BST, and requires a constant number of local changes.


L

N

R

A

<=L >=L &<=N

K M

>=N

Left rotation is mirror image of right rotation (reverse the Rrot steps!)


L

N

R

A

<=L >=L &<=N

K M

>=N

Right rotation is mirror image of left rotation (reverse the Lrot steps!)


L

N

R

A

<=L >=L &<=N

K M

Now we can insert at root by inserting at leaf, …


H

D

B

C

F

E

M

J

I K

N

O

L

Now we can insert at root by inserting at leaf, then rotating up


H

D

B

C

F

E

L

M

J

I K

N

O

Now we can insert at root by inserting at leaf, then rotating up, and up


H

D

B

C

F

E L

M

J

I

K

N

O

Now we can insert at root by inserting at leaf, then rotating up, and up, and up


H

D

B

C

F

E

L

M

J

I K

N

O

Now we can insert at root by inserting at leaf, then rotating up, and up, and up, and up


H

D

B

C

F

E

L

MJ

I KN

O

Q: How much does it cost?A: path lengthQ: Why do it?


H

D

B

C

F

E

L

M

J

I KN

O

Q: Why do it?A: Often do searches for more recent items – faster if near the root!


H

D

B

C

F

E

L

M

J

I KN

O

Select k – like quickSort version: Keep count field with size of subtreeRecursively select in subtrees, adjusting k if right subtree

Other BST FunctionsOther BST Functions

H9

D5

B2

C1

F2

E1

L13

M3

J3

I1

K1

N2

O1

Partition: put kth smallest node at rootSelect, then rotate up

Remove: delete node and patch up treeSearch (which subtree?)Replace subtree with one minus nodeIf root of subtree, delete root and

combine the two subsubtreesJoin: many possible ways, none all that attractive in general


Remove: a lot of work, disrupts structureNo longer random, even if data wereHeight tends to sqrt(N) instead of lg N

May just mark deleted nodesSkip them on searchUse space to insert new itemsPeriodically remove/restructure to avoid

using up too much space/timeThis lazy approch applies to any structure, not just BSTs or symbol tables!


Balanced trees perform welllg N timeNear-balanced is good enough

How to provide performance guarantees?RandomizeAmortizeOptimize

Balancing BSTsBalancing BSTs

RandomizeIntroduce random decision makingReduces chance of worst-case … a lot!

ExamplesQuickSort – random pivotRandomized BSTsSkip Lists

PerformancePerformance

AmortizeDo extra (expensive) work from time to timeTo avoid much more work laterWith “cost” spread over all operationsA particular operation may take more timeBut time for op sequence is always good

ExamplesArray resizingSplay BSTs


OptimizeDo work every timeTo guarantee performance every timeUsually cumbersome to implement

ExamplesAVL treesTop-down 2-3-4 treesRed-Black Trees


Recall analysis of average case for BSTsAssumed items inserted in random orderThus each node equally likely to be rootBut Hey!!Don’t have to rely on random insertions!Can get this by randomly choosing to make new node the root with prob 1/(N+1)

Randomized BSTsRandomized BSTs

Prop 13.1: Building a randomized BST is equivalent to building a standard BST from a random initial permutation of the keys. We use about 2 N ln N comparisons to construct a rBST with N items, and about 2 ln N comparisons for search.

Note distinction in average-case performance in rBSTs and BSTs


Note distinction in average-case performance in rBSTs and BSTs

Average case is about same (constant is a little higher in rBSTs)

But assumption in BSTs that items are inserted in random order

Which is often NOT the caseWith rBSTs, this does not matter!


data structures & algorithms binary search trees

Documents