Huffman code and ID3

CS 157B Lecture 18
Prof. Sin-Min Lee, Department of Computer Science


TRANSCRIPT

Page 1: Huffman code and ID3

Huffman code and ID3

Prof. Sin-Min Lee

Department of Computer Science

Page 2: Huffman code and ID3

Data Compression

• Data discussed so far have used FIXED length for representation

• For data transfer (in particular), this method is inefficient.

• For speed and storage efficiencies, data symbols should use the minimum number of bits possible for representation.

Page 3: Huffman code and ID3

Data Compression

Methods Used For Compression:

– Encode high probability symbols with fewer bits
  • Shannon-Fano, Huffman, UNIX compact

– Encode sequences of symbols with the location of the sequence in a dictionary
  • PKZIP, ARC, GIF, UNIX compress, V.42bis

– Lossy compression
  • JPEG and MPEG

Page 4: Huffman code and ID3

Data Compression

Average code length

Instead of the length of individual code symbols or words, we want to know the behavior of the complete information source

Page 5: Huffman code and ID3

Data Compression
Average code length

• Assume that symbols of a source alphabet {a1, a2, …, aM} are generated with probabilities p1, p2, …, pM

P(ai) = pi (i = 1, 2, …, M)

• Assume that each symbol of the source alphabet is encoded with codes of lengths l1,l2,…,lM

Page 6: Huffman code and ID3

Data Compression

Average code length

Then the Average code length, L, of an information source is given by:

L = l1 p1 + l2 p2 + … + lM pM
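For example (illustrative numbers, not from the slides): with probabilities p1 = 0.5, p2 = 0.25, p3 = 0.25 and code lengths l1 = 1, l2 = 2, l3 = 2, the average code length is L = 0.5×1 + 0.25×2 + 0.25×2 = 1.5 bits per symbol.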

Page 7: Huffman code and ID3

Data Compression
Variable Length Bit Codings

Rules:

1. Use minimum number of bits, AND
2. No code is the prefix of another code, AND
3. Enables left-to-right, unambiguous decoding

Page 8: Huffman code and ID3

Data Compression
Variable Length Bit Codings

• No code is a prefix of another

– For example, you can’t have ‘A’ map to 10 and ‘B’ map to 100, because 10 is a prefix (the start) of 100.
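As a further illustration (my example, not the slide's): with codes ‘A’ = 10, ‘B’ = 100 and ‘C’ = 0, the bit string 1000 could be decoded either as A C C or as B C, which is exactly the ambiguity the prefix rule prevents.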

Page 9: Huffman code and ID3

Data Compression
Variable Length Bit Codings

• Enables left-to-right, unambiguous decoding

– That is, if you see 10, you know it’s ‘A’, not the start of another character.

Page 10: Huffman code and ID3

Data Compression
Variable Length Bit Codings

• Suppose ‘A’ appears 50 times in text, but ‘B’ appears only 10 times

• ASCII coding assigns 8 bits per character, so total bits for ‘A’ and ‘B’ is 60 * 8 = 480

• If ‘A’ gets a 4-bit code and ‘B’ gets a 12-bit code, total is 50 * 4 + 10 * 12 = 320

Page 11: Huffman code and ID3

Data Compression
Variable Length Bit Codings

Example:

Source Symbol   P      C1   C2     C3    C4    C5   C6
A               0.6    00   0      0     0     0    0
B               0.25   01   10     10    01    10   10
C               0.1    10   110    110   011   11   11
D               0.05   11   1110   111   111   01   0

Average code length = 1.75

Page 12: Huffman code and ID3

Data Compression
Variable Length Bit Codings

Question:

Is this the best that we can get?

Page 13: Huffman code and ID3

Data Compression
Huffman code

– Constructed by using a code tree, but starting at the leaves

– A compact code constructed using the binary Huffman code construction method

Page 14: Huffman code and ID3

Data Compression
Huffman code Algorithm

① Make a leaf node for each code symbol. Add the generation probability of each symbol to its leaf node.

② Take the two nodes with the smallest probabilities and connect them into a new node. Add 1 or 0 to each of the two branches. The probability of the new node is the sum of the probabilities of the two connected nodes.

③ If there is only one node left, the code construction is complete. If not, go back to ②.
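As a rough illustration of steps ①–③, here is a minimal Python sketch (my own, not the lecture's code) that builds a Huffman code with a priority queue; the symbol names and probabilities are simply the example values used later in this deck.

import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a binary Huffman code for a {symbol: probability} dict."""
    # Step 1: one leaf per symbol, carrying its probability.
    tie = count()  # tie-breaker so the heap never compares the dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    # Steps 2-3: repeatedly merge the two smallest nodes until one is left.
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        # Prefix 0 onto one branch and 1 onto the other.
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

# Example with the frequencies used later in the deck.
freqs = {"A": 0.20, "B": 0.09, "C": 0.15, "D": 0.11, "E": 0.40, "F": 0.05}
codes = huffman_code(freqs)
print(codes)  # one valid Huffman code; 0/1 branch labels may differ from the slides
print(sum(freqs[s] * len(c) for s, c in codes.items()))  # average code length (2.34 here)

The tie-breaking counter only keeps the heap comparisons well defined; any consistent choice of which equal-probability nodes to merge yields a code with the same average length.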

Page 15: Huffman code and ID3

Data Compression
Huffman code Example

Character (or symbol) frequencies
– A : 20% (.20)  e.g., ‘A’ occurs 20 times in a 100 character document, 1000 times in a 5000 character document, etc.
– B : 9% (.09)
– C : 15% (.15)
– D : 11% (.11)
– E : 40% (.40)
– F : 5% (.05)

• Also works if you use character counts
• Must know frequency of every character in the document

Page 16: Huffman code and ID3

Data Compression
Huffman code Example

[Figure: the six symbols with their frequencies: E .40, A .20, C .15, D .11, B .09, F .05]

• Symbols and their associated frequencies.

• Now we combine the two least common symbols (those with the smallest frequencies) to make a new symbol string and corresponding frequency.

Page 17: Huffman code and ID3

Data Compression
Huffman code Example

[Figure: B .09 and F .05 combined into BF .14; the remaining symbols are E .40, A .20, C .15, D .11]

• Here’s the result of combining symbols once.

• Now repeat until you’ve combined all the symbols into a single string.

Page 18: Huffman code and ID3

Data Compression
Huffman code Example

[Figure: the full combining sequence: B .09 + F .05 = BF .14; BF .14 + D .11 = BFD .25; A .20 + C .15 = AC .35; BFD .25 + AC .35 = ABCDF .60; ABCDF .60 + E .40 = ABCDEF 1.0]

Page 19: Huffman code and ID3

Data Compression

• Now assign 0s/1s to each branch

• Codes (reading from top to bottom)
– A: 010
– B: 0000
– C: 011
– D: 001
– E: 1
– F: 0001

• Note
– None are prefixes of another

[Figure: the Huffman tree from the previous slide, with a 0 or 1 label on each branch]

Average Code Length = ?
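As a quick check (computed from the frequencies and codes above, not stated on the slide): L = 0.40×1 (E) + 0.20×3 (A) + 0.15×3 (C) + 0.11×3 (D) + 0.09×4 (B) + 0.05×4 (F) = 2.34 bits per symbol.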

Page 20: Huffman code and ID3

Data Compression
Huffman code

• There is no unique Huffman code
– Assigning 0 and 1 to the branches is arbitrary
– If there are several nodes with the same probability, it doesn’t matter how they are connected

• Every Huffman code for the same source has the same average code length!

Page 21: Huffman code and ID3

Data Compression
Huffman code

Quiz:
• Symbols A, B, C, D, E, F are being produced by the information source with probabilities 0.3, 0.4, 0.06, 0.1, 0.1, 0.04 respectively.

What is the binary Huffman code?
1) A = 00, B = 1, C = 0110, D = 0100, E = 0101, F = 0111
2) A = 00, B = 1, C = 01000, D = 011, E = 0101, F = 01001
3) A = 11, B = 0, C = 10111, D = 100, E = 1010, F = 10110
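One way to inspect the three options is the small Python sketch below (my own check under the stated probabilities; it tests the prefix property and the average code length rather than giving away the answer).

probs = {"A": 0.3, "B": 0.4, "C": 0.06, "D": 0.1, "E": 0.1, "F": 0.04}

options = {
    1: {"A": "00", "B": "1", "C": "0110", "D": "0100", "E": "0101", "F": "0111"},
    2: {"A": "00", "B": "1", "C": "01000", "D": "011", "E": "0101", "F": "01001"},
    3: {"A": "11", "B": "0", "C": "10111", "D": "100", "E": "1010", "F": "10110"},
}

def is_prefix_free(code):
    """True if no codeword is a prefix of another codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

for name, code in options.items():
    avg = sum(probs[s] * len(w) for s, w in code.items())
    print(name, "prefix-free:", is_prefix_free(code), "average length:", round(avg, 2))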

Page 22: Huffman code and ID3

Data Compression
Huffman code

Applied extensively:

• Network data transfer

• MP3 audio format

• GIF image format

• HDTV

• Modelling algorithms

Page 23: Huffman code and ID3
Page 24: Huffman code and ID3

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.

Page 25: Huffman code and ID3
Page 26: Huffman code and ID3
Page 27: Huffman code and ID3
Page 28: Huffman code and ID3

• One measure is the amount of information provided by the attribute.

• Example: Suppose you are going to bet $1 on the flip of a coin. With a normal coin, you might be willing to pay up to $1 for advance knowledge of the coin flip. However, if the coin were rigged so that heads comes up 99% of the time, you would bet heads with an expected value of $0.98. So, in the case of the rigged coin, you would be willing to pay less than $0.02 for advance information about the result.

• The less you know, the more valuable the information is.

• Information theory uses this intuition.

Page 29: Huffman code and ID3
Page 30: Huffman code and ID3
Page 31: Huffman code and ID3
Page 32: Huffman code and ID3

We measure the entropy of a dataset, S, with respect to one attribute, in this case the target attribute, with the following calculation:

Entropy(S) = – Σ Pi log2 Pi   (summed over the values i of the target attribute)

where Pi is the proportion of instances in the dataset that take the ith value of the target attribute.

This probability measure gives us an indication of how uncertain we are about the data. We use a log2 measure because it represents how many bits we would need in order to specify what the class (value of the target attribute) of a random instance is.

Page 33: Huffman code and ID3

Entropy Calculations

If we have a set with k different values in it, we can calculate the entropy as follows:

Entropy(Set) = – Σ (i = 1..k) P(valuei) log2 P(valuei)

where P(valuei) is the probability of getting the ith value when randomly selecting one from the set.

So, for the set R = {a,a,a,b,b,b,b,b}:

Entropy(R) = – (3/8) log2 (3/8) – (5/8) log2 (5/8) ≈ 0.954 bits
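A minimal Python version of this calculation (my own sketch, not from the slides), applied to the set R above and to the 9-versus-5 class split used on the next slide:

from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # 0 log 0 is taken as 0
    return -sum(p * log2(p) for p in probs)

print(entropy([3, 5]))   # set R = {a,a,a,b,b,b,b,b}  -> about 0.954
print(entropy([9, 5]))   # 9 responses vs 5 no-responses -> about 0.940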

Page 34: Huffman code and ID3

Using the example of the marketing data, we know that there are two classes in the data and so we use the fractions that each class represents in an entropy calculation:

Entropy(S = [9/14 responses, 5/14 no responses])

= – 9/14 log2 (9/14) – 5/14 log2 (5/14) ≈ 0.940 bits

Page 35: Huffman code and ID3

•Initial decision tree is one node with all examples.

•There are 4 +ve examples and 3 -ve examples

•i.e. probability of +ve is 4/7 = 0.57; probability of -ve is 3/7 = 0.43

•Entropy is: - (0.57 * log2 0.57) - (0.43 * log2 0.43) = 0.99

Page 36: Huffman code and ID3

•Evaluate possible ways of splitting.

•Try split on size which has three values: large, medium and small.

•There are four instances with size = large.

•There are two large positive examples and two large negative examples.

•The probability of +ve is 0.5

•The entropy is: - (0.5 * log2 0.5) - (0.5 * log2 0.5) = 1

Page 37: Huffman code and ID3

•There is one small +ve and one small -ve

•Entropy is: - (0.5 * log2 0.5) - (0.5 * log2 0.5) = 1

•There is only one medium +ve and no medium -ves, so entropy is 0.

•Expected information for a split on size is: (4/7 × 1) + (2/7 × 1) + (1/7 × 0) = 6/7 ≈ 0.86

•The expected information gain is: 0.99 - 0.86 = 0.13
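Reusing the same entropy idea, a short Python sketch (assuming the class counts stated above: 4 positive and 3 negative overall; large = 2+/2-, small = 1+/1-, medium = 1+/0-) that reproduces the 0.99, 0.86 and 0.13 figures:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Entropy before the split minus the weighted entropy after it."""
    n = sum(parent_counts)
    after = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - after

size_split = [[2, 2], [1, 1], [1, 0]]   # large, small, medium: [+ve, -ve]
print(information_gain([4, 3], size_split))  # about 0.13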

Page 38: Huffman code and ID3

•Now try splitting on colour and shape.

•Colour has an information gain of 0.52

•Shape has an information gain of 0.7

•Therefore split on shape.

•Repeat for all subtrees

Page 39: Huffman code and ID3
Page 40: Huffman code and ID3
Page 41: Huffman code and ID3
Page 42: Huffman code and ID3
Page 43: Huffman code and ID3
Page 44: Huffman code and ID3
Page 45: Huffman code and ID3
Page 46: Huffman code and ID3
Page 47: Huffman code and ID3
Page 48: Huffman code and ID3

How do we construct the decision tree?

•Basic algorithm (a greedy algorithm)
–Tree is constructed in a top-down recursive divide-and-conquer manner
–At start, all the training examples are at the root
–Attributes are categorical (if continuous-valued, they can be discretized in advance)
–Examples are partitioned recursively based on selected attributes
–Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Information gain: (information before split) – (information after split)
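Written out in the usual ID3 notation (a standard formula, stated here for completeness rather than taken from the slide), for a dataset S and an attribute A with values v:

Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) × Entropy(Sv)

where Sv is the subset of S for which attribute A takes value v.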

Page 49: Huffman code and ID3

•Conditions for stopping partitioning
–All samples for a given node belong to the same class
–There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
–There are no samples left

Page 50: Huffman code and ID3
Page 51: Huffman code and ID3
Page 52: Huffman code and ID3
Page 53: Huffman code and ID3

•The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.
•The basic strategy is as follows.
•The tree STARTS as a single node representing the whole training dataset (all samples).
•IF the samples are ALL in the same class, THEN the node becomes a LEAF and is labeled with that class.
•OTHERWISE, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the ATTRIBUTE that will best separate the samples into individual classes. This attribute becomes the node name (the test, or tree-split, decision attribute).
•Select the attribute with the highest information gain (information gain is the expected reduction in entropy).

Page 54: Huffman code and ID3

•A branch is created for each value of the node attribute (and is labeled by this value; this is syntax) and the samples are partitioned accordingly (this is semantics; see the example which follows).
•The algorithm uses the same process recursively to form a decision tree at each partition. Once an attribute has occurred at a node, it need not be considered in any other of the node’s descendants.
•The recursive partitioning STOPS only when any one of the following conditions is true:
•All samples for the given node belong to the same class, or
•There are no remaining attributes on which the samples may be further partitioned. In this case we convert the given node into a LEAF and label it with the class in the majority among the samples (majority voting), or
•There are no samples left – a leaf is created and labeled with the majority vote of the training samples.
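A compact Python sketch of this recursive procedure (my own illustration of the strategy described above, not the lecture's code; each example is assumed to be a dict of attribute values plus a "class" key):

from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(examples, attr):
    labels = [e["class"] for e in examples]
    after = 0.0
    for value in set(e[attr] for e in examples):
        subset = [e["class"] for e in examples if e[attr] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - after

def id3(examples, attributes):
    labels = [e["class"] for e in examples]
    # Stop: all samples belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -> leaf by majority vote.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, a))
    tree = {}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[value] = id3(subset, rest)
    return (best, tree)

Branches are only created for attribute values that actually occur in the current subset, so the "no samples left" case does not arise in this simplified version.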

Page 55: Huffman code and ID3
Page 56: Huffman code and ID3

Information Gain as a Splitting Criterion

Page 57: Huffman code and ID3

Information Gain Computation (ID3/C4.5): Case of Two Classes
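In the usual ID3/C4.5 two-class formulation (stated here for completeness, not taken from the slide), with p positive and n negative examples:

I(p, n) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))

and the information gain of an attribute A is I(p, n) minus the weighted sum of I over the subsets induced by the values of A.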

Page 58: Huffman code and ID3