15-211 Fundamental Data Structures and Algorithms
Margaret Reid-Miller
1 March 2005
More LZW / Midterm Review
2
Midterm
Thursday, 12:00 noon, 3 March 2005
WeH 7500
Worth a total of 125 points
Closed book, but you may have one page of notes.
If you have a question, raise your hand and stay in your seat
Last Time…
4
Last Time: Lempel & Ziv
5
Reminder: Compressing
We scan a sequence of symbols
A = a1 a2 a3 … ak
where each prefix is in the dictionary.
We stop when we fall out of the dictionary:
A b
6
Reminder: Compressing
Then send the code for
A = a1 a2 a3 …. ak
This is the classical algorithm.
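As a sketch of this scan-and-fall-out loop, here is a minimal LZW compressor in Python. The symbol alphabet and the starting code number 3 for new entries are assumptions chosen to match the worked example later in these slides; a real implementation would also bound the dictionary size.

```python
def lzw_compress(text, alphabet, first_new_code=3):
    # Dictionary maps words to code numbers; initially one code per symbol.
    d = {ch: i for i, ch in enumerate(alphabet)}
    nxt = first_new_code
    out = []
    i = 0
    while i < len(text):
        # Scan the longest prefix A = a1 a2 ... ak that is in the dictionary.
        j = i + 1
        while j < len(text) and text[i:j + 1] in d:
            j += 1
        out.append(d[text[i:j]])      # send the code for A
        if j < len(text):             # we fell out of the dictionary at A b
            d[text[i:j + 1]] = nxt    # enter A b as a new dictionary word
            nxt += 1
        i = j
    return out
```

On the example input used later, `lzw_compress("aabbbaabbaaabaababb", "ab")` produces the code sequence 0 0 1 5 3 6 7 9 5.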
7
LZW: Compress / Uncompress bad case
[Slides 7-14: diagrams stepping through the bad case at times t, t+1, and t+2, alternating between the compressor's and the decompressor's view of the input …s…sssb…, the dictionary, and the output. At time t+2 the decompressor receives a code that is not yet in its dictionary.]
What does the unknown code mean?
It codes for ss!
15
Example

Input: aabbbaabbaaabaababb
Output codes: 0 0 1 5 3 6 7 9 5

Decompression trace (+ = code already in the dictionary, - = bad case):

Input   Output   Add to D
0       a
0   +   a        3:aa
1   +   b        4:ab
5   -   bb       5:bb
3   +   aa       6:bba
6   +   bba      7:aab
7   +   aab      8:bbaa
9   -   aaba     9:aaba
5   +   bb       10:aabab

(In the bad case the decoder outputs the previous word plus its first symbol: for code 5 the previous word is b, giving bb; for code 9 it is aab with first symbol s = a, giving aaba.)
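The trace above can be reproduced with a small Python decompressor; the only subtlety is the bad case, where the received code is the one the compressor has just created. The alphabet and the starting code 3 mirror the example and are assumptions of this sketch.

```python
def lzw_decompress(codes, alphabet, first_new_code=3):
    d = {i: ch for i, ch in enumerate(alphabet)}
    nxt = first_new_code
    prev = d[codes[0]]                # first code is always a single symbol
    out = [prev]
    for c in codes[1:]:
        if c in d:
            cur = d[c]                # the + rows of the trace
        else:
            # Bad case (the - rows): c must be the entry the compressor
            # just made, which is prev plus its own first symbol (s W s).
            cur = prev + prev[0]
        out.append(cur)
        d[nxt] = prev + cur[0]        # add exactly what the compressor added
        nxt += 1
        prev = cur
    return "".join(out)
```

Running it on the codes from the table, `lzw_decompress([0, 0, 1, 5, 3, 6, 7, 9, 5], "ab")` reconstructs the original string.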
16
LZW Correctness
So we know that when this case occurs, decompression works.
Is this the only bad case? How do we know that decompression always works? (Note that compression is not an issue here).
Formally, we have two maps:
comp : texts → int seq.
decomp : int seq. → texts
We need for all texts T:
decomp(comp(T)) = T
17
Getting Personal
Think about
Ann: compresses T, sends int sequence
Bob: decompresses int sequence,
tries to reconstruct T
Question: Can Bob always succeed?
Assuming of course the int sequence is valid
(the map decomp() is not total).
18
How?
How do we prove that Bob can always succeed?
Think of Ann and Bob working in parallel.
Time 0: both initialize their dictionaries.
Time t: Ann determines next code number c,
sends it to Bob.
Bob must be able to convert c back into the corresponding word.
19
Induction
We can use induction on t.
The problem is:
What property should we establish by induction?
It has to be a claim about Bob’s dictionary.
How do the two dictionaries compare over time?
20
The Claim
At time t = 0 both Ann and Bob have the same dictionary.
But at any time t > 0 we have
Claim: Bob’s dictionary misses exactly the last entry in Ann’s dictionary after processing the last code Ann sends.
(Ann can add Wx to the dictionary, but Bob won’t know x until the next message he receives.)
21
The Easy Case
Suppose at time t Ann enters A b with code number C and sends c = code(A).
Easy case: c < C-1
By the inductive hypothesis, Bob has codes up to and including C-2 in his dictionary. That is, c is already in Bob's dictionary. So Bob can decode and now knows A.
But then Bob can update his dictionary: all he needs is the first letter of A.
22
The Easy Case
Suppose at time t Ann enters A b with code number C and sends c = code(A).
Easy case: c < C-1
[Diagram: Ann's dictionary with A at code c and A b at code C; since c < C-1, the entry at C-1 was made earlier.]
23
The Hard Case
Now suppose c = C-1.
Recall, at time t Ann had entered A b with code number C and sent c = code(A).
[Diagram: now the code sent, c = code(A), is C-1, the entry made just before A b.]
24
The Hard Case
Now suppose c = C-1.
Recall, at time t Ann had entered A b with code number C and sent c = code(A).
… A’ s’ … b …
c
C
cEntered:
Sent:
A = A’ s’
a1 = s’
25
The Hard Case
Now suppose c = C-1.
Recall, at time t Ann had entered A b with code number C and sent c = code(A).
… s’ W s’ … b…
c
C
cEntered:
Sent:
A’ = s’ W
26
The Hard Case
Now suppose c = C-1.
Recall, at time t Ann had entered A b with code number C and sent c = code(A).
… s’ W s’ W s’ b …
c
C
cEntered:
Sent:
27
The Hard Case
Now suppose c = C-1.
Recall, at time t Ann had entered A b with code number C and sent c = code(A).
So we have
Time t-1: entered c = code(A),
sent code(A’), where A = A’ s’
Time t: entered C = code(A b),
sent c = code(A), where a1 = s’
But then A’ = s’ W.
28
The Hard Case
In other words, the text must have looked like this:
…. s’ W s’ W s’ b ….
But Bob already knows A’ and thus can reconstruct A.
QED
Midterm Review
30
Basic Data Structures
List: persistence
Tree: height of tree, depth of node, level; perfect, complete, full; min & max number of nodes
31
Recurrence Relations
E.g., T(n) = T(n-1) + n/2
Solve by repeated substitution
Solve the resulting series
Prove by guessing and substitution
Master Theorem:
T(N) = aT(N/b) + f(N)
32
Solving recurrence equations
Repeated substitution:
t(n) = n + t(n-1)
     = n + (n-1) + t(n-2)
     = n + (n-1) + (n-2) + t(n-3)
and so on…
     = n + (n-1) + (n-2) + (n-3) + … + 1
33
Incrementing series
This is an arithmetic series that comes up over and over again, because it characterizes many nested loops:

for (i = 1; i < n; i++) {
    for (j = 1; j < i; j++) {
        f();
    }
}
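To connect the loop to the series, this small Python check (function name is illustrative) counts how often the inner body runs:

```python
def call_count(n):
    # Counts the calls to f() in:
    #   for (i = 1; i < n; i++) for (j = 1; j < i; j++) f();
    count = 0
    for i in range(1, n):
        for j in range(1, i):
            count += 1
    return count

# The count is 0 + 1 + ... + (n-2) = (n-1)(n-2)/2, an arithmetic series.
```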
34
“Big-Oh” notation
[Figure: plot of running time vs. N, showing T(N) bounded above by c·f(N) for N > n0.]
T(N) = O(f(N)): "T(N) is order f(N)"
35
Upper And Lower Bounds
f(n) = O( g(n) )  Big-Oh
f(n) ≤ c g(n) for some constant c and n > n0

f(n) = Ω( g(n) )  Big-Omega
f(n) ≥ c g(n) for some constant c and n > n0

f(n) = Θ( g(n) )  Theta
f(n) = O( g(n) ) and f(n) = Ω( g(n) )
36
Upper And Lower Bounds
f(n) = O( g(n) )  Big-Oh: can only be used for upper bounds.
f(n) = Ω( g(n) )  Big-Omega: can only be used for lower bounds.
f(n) = Θ( g(n) )  Theta: pins down the running time exactly (up to a multiplicative constant).
37
Big-O characteristic
Low-order terms "don't matter": suppose T(N) = 20n^3 + 10n log n + 5. Then T(N) = O(n^3).
Question: what constants c and n0 can be used to show that the above is true?
Answer: c = 35, n0 = 1
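The claimed constants can be sanity-checked numerically; this is only a sketch (the tested range is arbitrary, and log base 2 is an assumption, since the slide does not fix the base):

```python
import math

def bound_holds(n, c=35):
    # Check 20n^3 + 10 n log2(n) + 5 <= c * n^3 for this n.
    t = 20 * n**3 + 10 * n * math.log2(n) + 5
    return t <= c * n**3
```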
38
Big-O characteristic
The bigger task always dominates eventually: if T1(N) = O(f(N)) and T2(N) = O(g(N)), then T1(N) + T2(N) = max( O(f(N)), O(g(N)) ).
Also: T1(N) · T2(N) = O( f(N) g(N) ).
39
Dictionary
Operations: Insert, Delete, Find
Implementations: Binary Search Tree, AVL Tree, Splay Tree, Trie, Hash Table
40
Binary search trees
Simple binary search trees can have bad behavior for some insertion sequences. Average case O(log N), worst case O(N).
AVL trees maintain a balance invariant to prevent this bad behavior. Accomplished via rotations during insert.
Splay trees achieve amortized running time of O(log N). Accomplished via rotations during find.
41
AVL trees
Definition
Min number of nodes of height H: F(H+3) - 1, where Fn is the nth Fibonacci number
Insert - single & double rotations. How many?
Delete - lazy. How bad?
42
Single rotation
For the case of insertion into left subtree of left child:
[Diagram: before and after trees over subtrees X, Y, Z. The deepest node of X has depth 2 greater than the deepest node of Z; after the rotation, X's depth is reduced by 1 and Z's depth is increased by 1.]
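The left-left single rotation can be sketched on a bare node structure (Python; the class and function names are illustrative, not from the slides):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(z):
    # Single rotation for insertion into the left subtree of the left
    # child: the left child y becomes the new subtree root, and y's old
    # right subtree (the middle subtree) moves under z.
    y = z.left
    z.left = y.right
    y.right = z
    return y
```

For the mirror case (insertion into the right subtree of the right child), swap left and right.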
43
Double rotation
For the case of insertion into the right subtree of the left child.
[Diagram: before and after trees over subtrees X, Y1, Y2, Z.]
44
Splay trees
Splay trees provide a guarantee that any sequence of M operations (starting from an empty tree) will require O(M log N) time.
Hence, each operation has amortized cost of O(log N).
It is possible that a single operation requires O(N) time.
But there are no bad sequences of operations on a splay tree.
45
Splaying, case 3
Case 3: Zig-zag (left). Perform an AVL double rotation.
[Diagram: nodes a, b and subtrees X, Y1, Y2, Z before and after the double rotation.]
46
Splaying, case 4
Case 4: Zig-zig (left). Special rotation.
[Diagram: nodes a, b and subtrees W, X, Y, Z before and after the zig-zig rotation.]
47
Tries
Good for unequal length keys or sequences
Find: O(m), where m is the sequence length
But: few to many children
[Diagram: a trie storing the words "I", "like", "love", "you", "lovely".]
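A dictionary-of-dictionaries trie sketch in Python shows the O(m) find; the "$" end-of-word marker is an arbitrary choice of this sketch:

```python
def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})   # one trie level per symbol
    node["$"] = True                     # end-of-word marker

def trie_find(root, word):
    node = root
    for ch in word:                      # O(m) for a key of length m
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```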
48
Hash Tables
Hash function h: h(key) = index
Desirable properties: approximately random distribution; easy to calculate.
E.g., division: h(k) = k mod m
Perfect hashing: possible when all input keys are known in advance
49
Collisions
Separate chaining: linked list, ordered vs. unordered
Open addressing:
Linear probing - clustering; very bad with high load factor
Quadratic probing - secondary clustering; table size must be prime
Double hashing - table size must be prime; too complex
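A minimal open-addressing sketch with linear probing and the division hash h(k) = k mod m; it assumes the table never fills (a real implementation tracks the load factor and rehashes):

```python
def lp_insert(table, key):
    # Assumes at least one empty (None) slot remains.
    m = len(table)
    i = key % m
    while table[i] is not None:     # collision: probe the next slot
        i = (i + 1) % m
    table[i] = key

def lp_find(table, key):
    m = len(table)
    i = key % m
    while table[i] is not None:
        if table[i] == key:
            return True
        i = (i + 1) % m             # keep probing past other keys
    return False                    # hit an empty slot: not present
```

With m = 7, the keys 10, 17, 24 all hash to slot 3 and form exactly the kind of cluster the slide warns about.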
50
Hash Tables
Delete?
Rehash when load factor is high - double the table size (amortized cost is constant)
Find & insert are near constant time!
But: no min, max, next, … operations
Trade space for time: keep load factor below 75%
Priority Queues
52
Priority Queues
Operations: Insert, FindMin, DeleteMin
Implementations: linked list, search tree, heap
53
Possible priority queue implementations

Linked list: deleteMin O(1) with insert O(N), or deleteMin O(N) with insert O(1)
Search trees: all operations O(log N)
Heaps: deleteMin O(log N), insert O(log N), buildheap O(N) for N inserts
54
Heaps
Properties:
1. Complete binary tree in an array
2. Heap order property
Insert: push up
DeleteMin: push down
Heapify: starting at the bottom, push down
Heapsort: BuildHeap + repeated DeleteMin
55
Insert - Push up
Insert a leaf to establish the complete tree property. Bubble the inserted leaf up the tree until the heap order property is satisfied.
[Diagrams: a heap with root 13; the value 14 is inserted as the last leaf and bubbles up until heap order holds.]
56
DeleteMin - Push down
Move the last leaf to the root to restore the complete tree property. Bubble the transplanted leaf value down the tree until the heap order property is satisfied.
[Diagrams, steps 1-2: after the minimum is removed, the last leaf (31) is moved to the root and pushed down; 14 becomes the new root.]
57
Heapify - Push down
Start at the bottom subtrees. Bubble each subtree root down until the heap order property is satisfied.
[Diagram: heapifying a tree with root 24; subtree roots are pushed down bottom-up until 14 reaches the root.]
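The push-down step and bottom-up heapify can be sketched for a min-heap stored in an array (Python, 0-based indexing; the slides' trees correspond to the array in level order):

```python
def push_down(h, i):
    # Bubble h[i] down until the min-heap order property holds below i.
    n = len(h)
    while 2 * i + 1 < n:
        c = 2 * i + 1                       # left child
        if c + 1 < n and h[c + 1] < h[c]:   # pick the smaller child
            c += 1
        if h[i] <= h[c]:
            break
        h[i], h[c] = h[c], h[i]
        i = c

def heapify(h):
    # Start at the last internal node and push each subtree root down: O(N).
    for i in range(len(h) // 2 - 1, -1, -1):
        push_down(h, i)
```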
Sorting
59
Simple sorting algorithms
Several simple, quadratic algorithms (worst case and average).
- Bubble Sort
- Insertion Sort
- Selection Sort

Only Insertion Sort is of practical interest: its running time is linear in the number of inversions of the input sequence.
Constants are small. Stable?
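A sketch of Insertion Sort in Python: the shifting loop runs once per inversion, which is where the inversion-count bound comes from, and the strict comparison keeps equal elements in order (stable):

```python
def insertion_sort(a):
    for i in range(1, len(a)):
        x = a[i]
        j = i
        # Shift larger elements one slot right; each shift
        # removes exactly one inversion from the input.
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
        a[j] = x
    return a
```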
60
Sorting Review
Asymptotically optimal O(n log n) algorithms (worst case and average).
- Merge Sort
- Heap Sort
Merge Sort purely sequential and stable.
But requires extra memory: 2n + O(log n).
61
Quick Sort
Overall fastest. In place.
BUT:
Worst case quadratic. Average case O(n log n).
Not stable.
Implementation details tricky.
62
Radix Sort
Used by old computer-card-sorting machines.
Linear time:
- b passes on b-bit elements, or
- b/m passes with m bits per pass
Each pass must be stable.
BUT:
Uses 2n + 2^m space.
May only beat Quick Sort for very large arrays.
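A least-significant-digit radix sort sketch with m bits per pass; the word size b = 16 is an assumption of this sketch, and each pass is a stable bucket distribution, as the slide requires:

```python
def radix_sort(a, b=16, m=4):
    # b/m stable passes, m bits per pass, on non-negative b-bit ints.
    mask = (1 << m) - 1
    for shift in range(0, b, m):
        buckets = [[] for _ in range(1 << m)]   # 2^m buckets
        for x in a:                             # appending preserves order,
            buckets[(x >> shift) & mask].append(x)  # so each pass is stable
        a = [x for bucket in buckets for x in bucket]
    return a
```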
Data Compression
64
Data Compression
Huffman: optimal prefix-free codes; priority queue on tree frequency
LZW: dictionary of codes for previously seen patterns; when a pattern is found again, its length is increased by one (trie)
65
Huffman
Full tree: every node is a leaf, or has exactly 2 children.
Build the tree bottom up:
Use a priority queue of trees; weight = sum of frequencies.
Make a new tree from the two lowest-weight trees.
[Diagram: Huffman tree over leaves a, b, c, d with 0/1 edge labels.]
a=1, b=001, c=000, d=01
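The bottom-up build can be sketched with Python's heapq as the priority queue. Tie-breaking by an insertion counter is an implementation choice; the exact 0/1 labels can come out differently from the slide's tree, but the code lengths are forced by the frequencies (here the hypothetical frequencies a=5, b=1, c=1, d=2 give the same lengths as the slide's codes).

```python
import heapq

def huffman_codes(freq):
    # Priority queue of (weight, tiebreak, tree); leaves are symbols,
    # internal nodes are (left, right) pairs.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two lowest-weight trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, str):
            codes[tree] = path or "0"     # single-symbol edge case
        else:
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
    walk(heap[0][2], "")
    return codes
```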
66
Summary of LZW
LZW is an adaptive, dictionary based compression method.
Incrementally builds the dictionary (trie) as it encodes the data.
Building the dictionary while decoding is slightly more complicated, but requires no special data structures.