Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th


Page 1: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Running Time of Kruskal’s Algorithm

Huffman Codes

Monday, July 14th

Page 2: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Outline For Today

1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)

2. Data Encodings & Finding An Optimal Prefix-free Encoding

3. Prefix-free Encodings ↔ Binary Trees

4. Huffman Codes

Page 3: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Outline For Today

1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)

2. Data Encodings & Finding An Optimal Prefix-free Encoding

3. Prefix-free Encodings ↔ Binary Trees

4. Huffman Codes

Page 4: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: vertices A-H with edge weights 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9; the edges are considered in this sorted order]

Page 5: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Page 6: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Page 7: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Page 8: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Page 9: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Creates a cycle

Page 10: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Creates a cycle

Page 11: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Page 12: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Page 13: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Creates a cycle

Page 14: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the same graph; edges considered in sorted order 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9]

Creates a cycle

Page 15: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Simulation

[Graph diagram: the final spanning tree on A-H; the 7 selected edges have weights 1, 2, 2.5, 3, 4, 7, 7.5]

Final Tree! Same as T_Prim, the tree produced by Prim’s algorithm.

Page 16: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: Kruskal’s Algorithm Pseudocode

procedure kruskal(G(V, E)):
    sort E in order of increasing weights
    rename E so w(e1) < w(e2) < … < w(em)
    T = {}  // final tree edges
    for i = 1 to m:
        if T ∪ ei, where ei = (u, v), doesn’t create a cycle:
            add ei to T
    return T

Page 17: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Recap: For Correctness We Proved 2 Things

1. Kruskal outputs a spanning tree T_krsk

2. T_krsk is a minimum spanning tree

Page 18: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

1: Kruskal Outputs a Spanning Tree

Need to prove T_krsk is spanning AND acyclic.

Acyclic is by definition of the algorithm.

Why is T_krsk spanning (i.e., connected)?

Recall the Empty Cut Lemma: A graph is not connected iff ∃ a cut (X, Y) with no crossing edges.

If all cuts have a crossing edge => the graph is connected!

Page 19: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

2: Kruskal is Optimal (by the Cut Property)

Let (u, v) be any edge added by Kruskal’s Algorithm.

u and v are in different components (because Kruskal checks for cycles).

[Diagram: the cut between u’s component {u, x, y, …} and the remaining vertices {v, t, z, w, …}, with (u, v) crossing it]

Claim: (u, v) is the minimum-weight edge crossing this cut!

Page 20: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s Runtime

procedure kruskal(G(V, E)):
    sort E in order of increasing weights       // O(m log(n))
    rename E so w(e1) < w(e2) < … < w(em)
    T = {}  // final tree edges
    for i = 1 to m:                             // m iterations
        if T ∪ ei, where ei = (u, v), doesn’t create a cycle:   // cost of this check?
            add ei to T
    return T

Can we speed up cycle checking?

Option 1: check if a u ⤳ v path already exists in T!
Run a BFS/DFS from u or v => O(|T| + n) = O(n) per check

***BFS/DFS Total Runtime: O(mn)***
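To make Option 1 concrete, here is a minimal Python sketch (not from the slides; function names like creates_cycle and kruskal_naive are illustrative) of Kruskal’s with the per-edge BFS/DFS cycle check:

from collections import defaultdict

def creates_cycle(tree_adj, u, v):
    # Does adding edge (u, v) to the forest T create a cycle?
    # Equivalently: is v already reachable from u inside T? (DFS over T: O(|T| + n))
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in tree_adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def kruskal_naive(edges):
    # edges: list of (weight, u, v). Sorting costs O(m log n);
    # the m cycle checks cost O(n) each, so they dominate at O(mn).
    tree_adj = defaultdict(list)
    T = []
    for w, u, v in sorted(edges):
        if not creates_cycle(tree_adj, u, v):
            T.append((w, u, v))
            tree_adj[u].append(v)
            tree_adj[v].append(u)
    return T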

Page 21: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Speeding Kruskal’s Algorithm

Goal: Check for cycles in log(n) time.

Observation: (u, v) creates a cycle iff u and v are in the same connected component.

Option 2: check if u’s component = v’s component.

More Specific Goal: check the component of each vertex in log(n) time.

Page 22: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Data Structure

Operation 1 (Union): Maintain the component structure of T as we add new edges to it.

Operation 2 (Find): Query the component of each vertex v.

Page 23: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: the same weighted graph on vertices A-H; initially every vertex is its own component]

Page 24: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(A) = A, Find(D) = D => Union(A, D)

Page 25: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(D) = A, Find(E) = E => Union(A, E)

Page 26: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(C) = C, Find(F) = F => Union(C, F)

Page 27: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(E) = A, Find(F) = C => Union(A, C)

Page 28: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(A) = A, Find(B) = B => Union(A, B)

Page 29: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(D) = A, Find(C) = A => skip edge (D, C)

Page 30: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(A) = A, Find(C) = A => skip edge (A, C)

Page 31: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(C) = A, Find(H) = H => Union(A, H)

Page 32: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(F) = A, Find(G) = G => Union(A, G)

Page 33: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(B) = A, Find(C) = A => skip edge (B, C)

Page 34: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s With Union-Find (Conceptually)

[Graph diagram: each vertex labeled with its current component leader]

Find(H) = A, Find(G) = A => skip edge (H, G)

Page 35: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: eight singleton components A-H, each of size 1 and pointing to itself]

Page 36: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: eight singleton components A-H, each of size 1 and pointing to itself]

Page 37: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, D), D points to leader A (component size 2); all other vertices are still singletons]

Page 38: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, D), D points to leader A (component size 2); all other vertices are still singletons]

Page 39: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, E), A leads {A, D, E} (size 3); B, C, F, G, H are still singletons]

Page 40: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, E), A leads {A, D, E} (size 3); B, C, F, G, H are still singletons]

Page 41: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(C, F), F points to leader C (size 2); A still leads {A, D, E} (size 3); B, G, H are singletons]

Page 42: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(C, F), F points to leader C (size 2); A still leads {A, D, E} (size 3); B, G, H are singletons]

Page 43: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, C), C’s old component joins A, whose size becomes 5; B, G, H are still singletons]

Page 44: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, C), C’s old component joins A, whose size becomes 5; B, G, H are still singletons]

Page 45: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, B), B points to leader A (size 6); G and H are still singletons]

Page 46: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, B), B points to leader A (size 6); G and H are still singletons]

Page 47: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, H), H points to leader A (size 7); G is still a singleton]

Page 48: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, H), H points to leader A (size 7); G is still a singleton]

Page 49: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union-Find Implementation Simulation

[Diagram: after Union(A, G), all eight vertices are in one component with leader A (size 8)]

Page 50: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Linked Structure Per Connected Component

[Diagram: vertices C, A, W, Z, Y, T point (directly or through other vertices) to their leader X, which stores the component size 7]

Page 51: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union Operation

[Diagram: a component with leader X (size 7, members C, A, W, Z, Y, T) and a component with leader E (size 3, members F and G)]

Union: **Make the leader of the smaller component point to the leader of the larger component**

Page 52: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union Operation

[Diagram: after the union, E points to X, whose size is updated to 10]

Cost: O(1) (1 pointer update, 1 size increment)

Union: **Make the leader of the smaller component point to the leader of the larger component**

Page 53: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Union Operation

[Diagram: the merged component with leader X (size 10)]

Page 54: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Find Operation

[Diagram: the merged component with leader X (size 10)]

Find: “pointer chase” until the leader is reached.

Cost: # of pointers followed to reach the leader. How large can this be?

Page 55: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Cost of Find Operation

Claim: For any v, #-pointers to leader(v) ≤ log2(|component(v)|) ≤ log2(n)

Proof: Each time v’s path to the leader increases by 1, the size of its component at least doubles!

|component(v)| starts at 1 and can grow to at most n, so it can double at most log2(n) times!

Page 56: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Summary of Union-Find

Initialization: Each v is a component of size 1 and points to itself.

When we union two components, we make the leader of the smaller one point to the leader of the larger one (break ties arbitrarily).

Find(v): pointer chasing to the leader. Cost: O(log2(|component|)) = O(log2(n))

Union(u, v): 1 pointer update, 1 increment => O(1)
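As a concrete reference, here is a minimal Python sketch (not from the slides; the class and method names are illustrative) of exactly this structure: union by size, find by pointer chasing, and no path compression.

class UnionFind:
    def __init__(self, vertices):
        # Initialization: each vertex is its own leader with component size 1.
        self.parent = {v: v for v in vertices}
        self.size = {v: 1 for v in vertices}

    def find(self, v):
        # Pointer-chase until we reach a vertex that points to itself (the leader).
        while self.parent[v] != v:
            v = self.parent[v]
        return v

    def union(self, leader_u, leader_v):
        # Make the leader of the smaller component point to the leader of the larger one.
        if self.size[leader_u] < self.size[leader_v]:
            leader_u, leader_v = leader_v, leader_u
        self.parent[leader_v] = leader_u
        self.size[leader_u] += self.size[leader_v]

With this representation union is O(1) and find is O(log n), matching the costs above.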

Page 57: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Kruskal’s Runtime With Union-Find

procedure kruskal(G(V, E)):
    sort E in order of increasing weights       // O(m log(n))
    rename E so w(e1) < w(e2) < … < w(em)
    init Union-Find                             // O(n)
    T = {}  // final tree edges
    for i = 1 to m:                             // m iterations
        ei = (u, v)
        if find(u) != find(v):                  // O(log(n)) per find
            add ei to T
            Union(find(u), find(v))             // O(1)
    return T

***Total Runtime: O(m log(n))*** (same as Prim’s with heaps)
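Putting the pieces together, a minimal Python sketch (again illustrative, not from the slides) of Kruskal’s algorithm on top of the UnionFind class sketched above:

def kruskal(vertices, edges):
    # edges: list of (weight, u, v) tuples.
    uf = UnionFind(vertices)                         # O(n) initialization
    T = []
    for w, u, v in sorted(edges):                    # O(m log n) sort
        leader_u, leader_v = uf.find(u), uf.find(v)  # two O(log n) finds
        if leader_u != leader_v:                     # same leader => (u, v) would create a cycle
            T.append((w, u, v))
            uf.union(leader_u, leader_v)             # O(1)
    return T

On a connected graph this returns the n - 1 tree edges in O(m log n) total time.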

Page 58: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Outline For Today

1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)

2. Data Encodings & Finding An Optimal Prefix-free Encoding

3. Prefix-free Encodings ↔ Binary Trees

4. Huffman Codes

Page 59: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Data Encodings and Compression

All data in the digital world gets represented as 0s and 1s.

[Slide shows a long string of 0s and 1s]

Page 60: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Encoding-Decoding Protocol

Goal of Data Compression: Make the binary blob as small as possible, satisfying the protocol.

[Diagram: document -> encoder -> binary string -> decoder -> document]

Page 61: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Option 1: Fixed Length Codes

Alphabet A = {a, b, c, …, z}, assume |A| = 32

a -> 00000, b -> 00001, …, z -> 11111

Each letter is mapped to exactly 5 bits.

Example: ASCII encoding

Page 62: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Example: Fixed Length Codes

A = {a, b, c, …, z}, with a -> 00000, b -> 00001, …, z -> 11111

encoder: "cat" -> 000110000010100

decoder: 000110000010100 -> "cat"

Page 63: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Output Size of Fixed Length Codes

Input: Alphabet A, text document of length n

Each letter is mapped to log2(|A|) bits.

Output Size: n·log2(|A|) bits

This is optimal if all letters appear with the same frequency in the text. In practice, letters appear with different frequencies.

Ex: In English, the letters a, t, e are much more frequent than q, z, x.

Question: Can we do better?

Page 64: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Option 2: Variable Length Binary Codes

Goal is to assign:

Frequently appearing letters -> short bit strings
Infrequently appearing letters -> long bit strings

Hope: On average use ≤ n·log2(|A|) encoded bits for documents of size n (i.e., ≤ log2(|A|) bits per letter).

Page 65: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Example 1: Morse Code (not binary)

Two symbols: dot (●) and dash (−), or light and dark. But the end of a letter is indicated with a pause (effectively a third symbol).

Frequent letters: e => ●, t => −, a => ●−
Infrequent letters: c => −●−●, j => ●−−−

encoder: "cat" -> −●−●(pause) ●−(pause) −(pause)

decoder: −●−●(pause) ●−(pause) −(pause) -> "cat"

Page 66: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Can We Have a Morse Code with Only 2 Symbols?

Goal: The same idea as Morse code, but with only 2 symbols.

Frequent letters: e => 0, t => 1, a => 01
Infrequent letters: c => 1010, j => 0111

encoder: "cat" -> 1010 01 1 = 1010011

decoder: 1010011 -> taeett? teteat? cat?

**Decoding is Ambiguous**
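To see the ambiguity concretely, here is a small Python check (not from the slides; all names are illustrative) that enumerates every way of splitting a bit string into codewords of this non-prefix-free code:

CODE = {"e": "0", "t": "1", "a": "01", "c": "1010", "j": "0111"}

def all_decodings(bits, prefix=""):
    # Return every letter sequence whose concatenated codewords equal `bits`.
    if not bits:
        return [prefix]
    results = []
    for letter, word in CODE.items():
        if bits.startswith(word):
            results += all_decodings(bits[len(word):], prefix + letter)
    return results

print(all_decodings("1010011"))  # includes 'cat', 'taeett', 'teteat', and several more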

Page 67: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Why Was There Ambiguity?

The encoding of one letter was a prefix of another letter’s encoding.

Ex: e => 0 is a prefix of a => 01

Goal: Use a “prefix-free” encoding, i.e., no letter’s encoding is a prefix of another’s!

Note: The fixed-length encoding was naturally “prefix-free”.

Page 68: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(110010) = ?

Page 69: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(110010) so far: c

Page 70: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(110010) so far: ca

Page 71: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(110010) = cab

Page 72: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(11101101100) = ?

Page 73: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(11101101100) so far: d

Page 74: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(11101101100) so far: da

Page 75: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(11101101100) so far: dac

Page 76: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(11101101100) so far: dacc

Page 77: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Ex: Variable Length Prefix-free Encoding

Ex: A = {a, b, c, d} with code: a -> 0, b -> 10, c -> 110, d -> 111

decode(11101101100) = dacca
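The step-by-step decoding above is just greedy codeword matching; a minimal Python sketch (not from the slides; names are illustrative):

CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def decode(bits):
    out = []
    while bits:
        # Prefix-freeness guarantees that exactly one codeword matches here.
        for letter, word in CODE.items():
            if bits.startswith(word):
                out.append(letter)
                bits = bits[len(word):]
                break
        else:
            raise ValueError("not a valid encoding")
    return "".join(out)

print(decode("110010"))       # -> cab
print(decode("11101101100"))  # -> dacca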

Page 78: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Benefits of Variable Length Codes

Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%

Variable Length Code: a -> 0, b -> 10, c -> 110, d -> 111
Fixed Length Code: a -> 00, b -> 01, c -> 10, d -> 11

For a document of length 100K letters:

Fixed Length Code: 200K bits (2 bits/letter)

Variable Length Code: a: 45K bits, b: 80K bits, c: 30K bits, d: 15K bits
Total: 170K bits (1.7 bits/letter)
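A quick Python check of these numbers (illustrative, not from the slides): weight each codeword length by how many times the letter appears in the 100K-letter document.

counts = {"a": 45_000, "b": 40_000, "c": 10_000, "d": 5_000}   # 45%, 40%, 10%, 5% of 100K
var_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

total_bits = sum(counts[x] * len(var_code[x]) for x in counts)
print(total_bits)            # 170000 bits
print(total_bits / 100_000)  # 1.7 bits per letter on average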

Page 79: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Formal Problem Statement

Input: An alphabet A, and frequencies 𝓕 of the letters in A

Output: A prefix-free encoding Ɣ, i.e., a mapping A -> {0,1}*, that minimizes the average number of bits per letter

Page 80: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Outline For Today

1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)

2. Data Encodings & Finding An Optimal Prefix-free Encoding

3. Prefix-free Encodings ↔ Binary Trees

4. Huffman Codes

Page 81: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Prefix-free Encodings ↔ Binary Trees

We can represent each prefix-free code Ɣ as a binary tree T as follows:

Code 1: a -> 0, b -> 10, c -> 110, d -> 111

[Tree diagram: the root’s 0-branch is leaf a; its 1-branch leads to a node whose 0-branch is leaf b and whose 1-branch leads to a node with leaves c (0) and d (1)]

Encoding of letter x = the path from the root to the leaf labeled x
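Since codewords are root-to-leaf paths, decoding is a walk down the tree: follow the 0/1 child for each bit and emit a letter whenever a leaf is reached. A minimal Python sketch (not from the slides; the nested-tuple tree representation is an illustrative choice):

TREE = ("a", ("b", ("c", "d")))  # a pair is an internal node (0-child, 1-child); a string is a leaf

def decode_with_tree(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[int(bit)]        # follow the 0- or 1-branch
        if isinstance(node, str):    # reached a leaf: emit its letter
            out.append(node)
            node = root              # restart at the root for the next letter
    return "".join(out)

print(decode_with_tree("110010", TREE))  # -> cab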

Page 82: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Prefix-free Encodings ↔ Binary Trees

We can represent each prefix-free code Ɣ as a binary tree T as follows:

Code 2: a -> 00, b -> 01, c -> 10, d -> 11

[Tree diagram: the root’s 0-branch leads to a node with leaves a (0) and b (1); its 1-branch leads to a node with leaves c (0) and d (1)]

Page 83: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Reverse is Also True

Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T

[Tree diagram with 5 leaves giving the code a -> 01, b -> 10, c -> 000, d -> 001, e -> 11]

Why is this code prefix-free?

Page 84: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Reverse is Also True

Claim: Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T

Proof: Take the path P ∈ {0,1}* from the root to leaf x as x’s encoding.

Since each letter x is at a leaf, the path from the root to x is a dead end and cannot be part of the path to another letter y.

Page 85: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Number of Bits for Letter x?

Let A be an alphabet, and T be a binary tree where the letters of A are the leaves of T.

[Tree diagram for Code 1: a -> 0, b -> 10, c -> 110, d -> 111]

Question: What’s the number of bits for each letter x in the encoding corresponding to T?

Answer: depth_T(x)

Page 86: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Formal Problem Statement Restated

Input: An alphabet A, and frequencies 𝓕 of the letters in A

Output: A binary tree T, where the letters of A are the leaves of T, that minimizes the average bit length (ABL):

ABL(T) = Σ_{x ∈ A} f(x) · depth_T(x)
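For example, with the earlier frequencies (a: 45%, b: 40%, c: 10%, d: 5%) and the code a -> 0, b -> 10, c -> 110, d -> 111:

ABL(T) = 0.45·1 + 0.40·2 + 0.10·3 + 0.05·3 = 1.7 bits/letter,

which is the 1.7 bits/letter (170K bits for 100K letters) computed earlier.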

Page 87: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Outline For Today

1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)

2. Data Encodings & Finding An Optimal Prefix-free Encoding

3. Prefix-free Encodings ↔ Binary Trees

4. Huffman Codes

Page 88: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Observation 1 About Optimal T

Claim: The optimal binary tree T is full, i.e., each non-leaf vertex u has exactly 2 children.

[Diagram: a tree T containing an internal node with only one child, next to the tree T` obtained by splicing that node out]

Why?

Page 89: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Observation 1 About Optimal T

Claim: The optimal binary tree T is full, i.e., each non-leaf vertex u has exactly 2 children.

[Diagram: T with a one-child internal node, next to the improved tree T`]

Exchange Argument: Replace u with its only child; this decreases the depths of some leaves, giving a better tree T`.

Page 90: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Observation 1 About Optimal T

Claim: The optimal binary tree T is full, i.e., each non-leaf vertex has exactly 2 children.

[Diagram: T with leaf c hanging below a one-child internal node, next to T` where c is promoted one level up]

Page 91: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

First Algorithm: Shannon-Fano Codes

From 1948. A top-down, divide-and-conquer type approach:

1. Divide the alphabet into A0 and A1 s.t. the total frequencies of the letters in A0 and in A1 are each roughly 50%

2. Find an encoding Ɣ0 for A0, and Ɣ1 for A1

3. Prepend 0 to the encodings in Ɣ0 and 1 to those in Ɣ1

Page 92: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

First Algorithm: Shannon-Fano Codes

Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%

A0 = {a, d} (50%), A1 = {b, c} (50%)

[Tree diagram: a and d under the 0-branch, b and c under the 1-branch, so every letter gets 2 bits]

This is the fixed-length encoding, which we saw was suboptimal!

Page 93: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Observation 2 About Optimal T

Claim: In any optimal tree T, if leaf x has depth i and leaf y has depth j with i < j, then f(x) ≥ f(y).

Why?

Exchange Argument: If f(x) < f(y), swapping x and y would give a better tree T`.

Page 94: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Observation 2 About Optimal T

Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%

[Diagram: T places c at depth 1, b at depth 2, and a, d at depth 3; T` places a at depth 1, b at depth 2, and c, d at depth 3]

T => 2.4 bits/letter
T` => 1.7 bits/letter

Page 95: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Corollary

In any optimal tree T, the two lowest-frequency letters are both in the lowest level of the tree!

Page 96: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Huffman’s Key Insight

Observation 1 => optimal Ts are full => each leaf has a sibling

Corollary => the 2 lowest-frequency letters x, y are at the same (lowest) level

Swapping letters within the same level does not change the cost of T

[Tree diagram: the Code 1 tree, with c and d as sibling leaves at the lowest level]

Therefore, there is an optimal tree T in which the two lowest-frequency letters are siblings (in the lowest level of the tree).

Page 97: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Possible Greedy Algorithm

Possible greedy algorithm:

1. Treat the two lowest-frequency letters x, y (siblings in some optimal tree) as a single meta-letter xy

2. Find an optimal tree T* for A - {x, y} + {xy}

3. Expand xy back into x and y in T*

Page 98: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Possible Greedy Algorithm (Example)

Ex: A = {x, y, z, t}, and let x, y be the two lowest-frequency letters. Let A` = {xy, z, t}.

[Diagram: T* is an optimal tree over {xy, z, t}; T is obtained from T* by expanding the leaf xy into an internal node with children x and y]

Page 99: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

The Weight of the Meta-letter?

Q: What weight should be attached to the meta-letter xy?

A: f(x) + f(y)

procedure Huffman(A, 𝓕):
    if (|A| = 2): return the tree T whose branches 0, 1 point to A[0] and A[1], respectively
    let x, y be the two lowest-frequency letters
    let A` = A - {x, y} + {xy}
    let 𝓕` = 𝓕 - {x, y} + {xy: f(x) + f(y)}
    T* = Huffman(A`, 𝓕`)
    expand x, y in T* to get T
    return T

Page 100: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Huffman’s Algorithm (1951)

procedure Huffman(A, 𝓕):
    if (|A| = 2): return the tree T whose branches 0, 1 point to A[0] and A[1], respectively
    let x, y be the two lowest-frequency letters
    let A` = A - {x, y} + {xy}
    let 𝓕` = 𝓕 - {x, y} + {xy: f(x) + f(y)}
    T* = Huffman(A`, 𝓕`)
    expand x, y in T* to get T
    return T

Page 101: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Huffman’s Algorithm Correctness (1)

By induction on |A|.

Base case: |A| = 2 => return the simple full tree with 2 leaves.

IH: Assume the claim holds for all alphabets of size k-1.

Inductive step: For an alphabet of size k, Huffman recursively obtains an optimal tree T* for the size-(k-1) alphabet A` (which contains the meta-letter xy) and then expands xy.

Page 102: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Huffman’s Algorithm Correctness (2)

[Diagram: T* with a leaf for the meta-letter xy, and T obtained by expanding xy into the two sibling leaves x and y]

In T*, the meta-letter contributes f(xy)·depth(xy) = (f(x) + f(y))·depth(xy).

In T, the leaves x and y contribute (f(x) + f(y))·(depth(xy) + 1).

Total difference: ABL(T) - ABL(T*) = f(x) + f(y)

Page 103: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Huffman’s Algorithm Correctness (3)

Take any optimal tree Z; we’ll argue ABL(T) ≤ ABL(Z).

By the corollary, we can assume that in Z, x and y are also siblings at the lowest level.

Consider Z` obtained by merging them => Z` is a valid prefix-free code for A`, an alphabet of size k-1.

ABL(Z) = ABL(Z`) + f(x) + f(y)
ABL(T) = ABL(T*) + f(x) + f(y)

By IH: ABL(T*) ≤ ABL(Z`) => ABL(T) ≤ ABL(Z)

Q.E.D.

Page 104: Running Time of Kruskal’s Algorithm Huffman Codes Monday, July 14th

Huffman’s Algorithm Runtime

Exercise: Make Huffman run in O(|A|log(|A|))?
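One way to approach the exercise (a sketch under assumptions, not the official solution): keep the letters and meta-letters in a min-heap keyed on frequency, so each of the |A| - 1 merges costs O(log |A|). A minimal Python sketch with illustrative names:

import heapq

def huffman(freq):
    # freq: dict letter -> frequency. Returns dict letter -> codeword (a bit string).
    # Heap entries are (frequency, tie-breaker, tree); a tree is a letter or a (left, right) pair.
    heap = [(f, i, letter) for i, (letter, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # two lowest-frequency (meta-)letters
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # merge into a meta-letter
        counter += 1
    _, _, root = heap[0]

    code = {}
    def walk(node, path):
        if isinstance(node, tuple):
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            code[node] = path or "0"        # degenerate single-letter alphabet
        return code
    return walk(root, "")

print(huffman({"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05}))
# e.g. {'a': '0', 'd': '100', 'c': '101', 'b': '11'}: 1.7 bits/letter, matching the earlier example

Each heappop/heappush is O(log |A|) and there are O(|A|) of them, giving O(|A| log(|A|)) overall.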