comp527-08


  • Slide 1

    Dr Robert Sanderson

    ([email protected])

    Dept. of Computer Science

    University of Liverpool

    2008

    COMP527: Data Mining

    Classification: Trees (January 18, 2008)


  • Slide 2: COMP527: Data Mining

    Introduction to the Course

    Introduction to Data Mining

    Introduction to Text Mining

    General Data Mining Issues

    Data Warehousing

    Classification: Challenges, Basics

    Classification: Rules

    Classification: Trees

    Classification: Trees 2

    Classification: Bayes

    Classification: Neural Networks

    Classification: SVM

    Classification: Evaluation

    Classification: Evaluation 2

    Regression, Prediction


    Input Preprocessing

    Attribute Selection

    Association Rule Mining

    ARM: A Priori and Data Structures

    ARM: Improvements

    ARM: Advanced Techniques

    Clustering: Challenges, Basics

    Clustering: Improvements

    Clustering: Advanced Algorithms

    Hybrid Approaches

    Graph Mining, Web Mining

    Text Mining: Challenges, Basics

    Text Mining: Text-as-Data

    Text Mining: Text-as-Language

    Revision for Exam

  • Slide 3: Today's Topics

    Trees

    Tree Learning Algorithm

    Attribute Splitting Decisions

    Random

    'Purity Count'

    Entropy (aka ID3)

    Information Gain Ratio


  • Slide 4: Trees

    Anything can be made better by storing it in a tree structure! (Not really!)

    Instead of having lists or sets of rules, why not have a tree of rules? Then there's no problem with order, or repeating the same test over and over again in different conjunctive rules.

    So each node in the tree is an attribute test, and the branches from that node are the different outcomes.

    Instead of 'separate and conquer', Decision Trees are the more typical 'divide and conquer' approach. Once the tree is built, new instances can be tested by simply stepping through each test.


  • Slide 5: Example Data Again

    Here's our example data again:

    How to construct a tree from it, instead of rules?


    Outlook Temperature Humidity Windy Play?

    sunny hot high false no

    sunny hot high true no

    overcast hot high false yes

    rainy mild high false yes

    rainy cool normal false yes

    rainy cool normal true no

    overcast cool normal true yes

    sunny mild high false no

    sunny cool normal false yes

    rainy mild normal false yes

    sunny mild normal true yes

    overcast mild high true yes

    overcast hot normal false yes

    rainy mild high true no

  • Slide 6: Tree Learning Algorithm

    Trivial Tree Learner:

    create empty tree T
    select attribute A
    create branches in T for each value v of A
    for each branch,
        recurse with instances where A=v
        add tree as branch node

    The most interesting part of this algorithm is line 2, the attribute selection. Let's start with a Random selection, then look at how it might be improved.
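    Below is a minimal Python sketch of this trivial learner (an illustration, not code from the slides). Instances are assumed to be dicts of categorical attribute values, and the attribute-selection step is pluggable so that the random choice used here can later be swapped for the purity- and entropy-based selectors on the following slides.

    import random

    def build_tree(instances, attributes, target, select_attribute=None):
        """Recursively build a decision tree as nested dicts of attribute tests."""
        classes = [inst[target] for inst in instances]
        # Stop when the node is pure or no attributes remain: return a majority-class leaf.
        if len(set(classes)) == 1 or not attributes:
            return max(set(classes), key=classes.count)
        # 'select attribute A' -- random by default, as on this slide.
        select = select_attribute or (lambda insts, attrs: random.choice(attrs))
        attr = select(instances, attributes)
        tree = {attr: {}}
        remaining = [a for a in attributes if a != attr]
        # Create a branch for each observed value v of A, recursing on the instances where A = v.
        for value in set(inst[attr] for inst in instances):
            subset = [inst for inst in instances if inst[attr] == value]
            tree[attr][value] = build_tree(subset, remaining, target, select_attribute)
        return tree

    Called on the 14 weather instances (as dicts) with target 'play', this returns a randomly ordered tree such as the windy-first split on the next slide.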


  • Slide 7

    Random method: Let's pick 'windy'

    Need to split again, looking at only the 8 and 6 instances respectively.

    For windy=false, we'll randomly select outlook:

    sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes

    As all instances of overcast and rainy are yes, those branches stop; sunny continues.


    [Tree diagram: Windy at the root; windy=false branch: 6 yes, 2 no; windy=true branch: 3 yes, 3 no]

  • Slide 8: Attribute Selection

    As we may have thousands of attributes and/or values to test, we want to construct small decision trees. Think back to RIPPER's description length ... the smallest decision tree will have the smallest description length. So how can we reduce the number of nodes in the tree?

    We want all paths through the tree to be as short as possible. Nodes with one class stop a path, so we want those to appear early in the tree, otherwise they'll occur in multiple branches.

    Think back: the first rule we generated was outlook=overcast, because it was pure.


  • Slide 9: Attribute Selection: Purity

    'Purity' count: Select the attribute that has the most 'pure' nodes, randomising between equal counts.

    Still mediocre. Most data sets won't have pure nodes for several levels. We need a measure of the purity instead of the simple count.


    [Tree diagram: Outlook at the root; sunny branch: 2 yes, 3 no; overcast branch: 4 yes; rainy branch: 3 yes, 2 no]
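    As a rough illustration (not from the slides), this heuristic can be plugged into the build_tree sketch above as a selection function, assuming the class attribute is called 'play' as in the example data: count, for each attribute, how many of its values give a single-class subset, and break ties randomly.

    import random

    def purity_count_select(instances, attributes, target="play"):
        """Pick the attribute whose values yield the most single-class ('pure') subsets."""
        def pure_values(attr):
            pure = 0
            for value in set(inst[attr] for inst in instances):
                classes = {inst[target] for inst in instances if inst[attr] == value}
                if len(classes) == 1:
                    pure += 1
            return pure
        best = max(pure_values(a) for a in attributes)
        return random.choice([a for a in attributes if pure_values(a) == best])

    On the full weather data this picks outlook (its overcast value is pure, as in the diagram above); every other attribute scores zero.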

  • Slide 10: Attribute Selection: Entropy

    For each test:

    Maximal purity: All values are the same

    Minimal purity: Equal number of each value

    Find a scale between maximal and minimal, and then merge across all of the attribute tests.

    One function that calculates this is the Entropy function:

    entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)

    p1 ... pn are the number of instances of each class, expressed as a fraction of the total number of instances at that point in the tree. log is base 2.
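    A small Python sketch of this function follows (an illustration, not code from the slides). Counts are turned into fractions of the total, logs are base 2, and a p*log(p) term is taken as 0 when p = 0 -- the log(0) case discussed on the next slide.

    from math import log2

    def entropy(*counts):
        """Entropy over the class counts at a node, in bits."""
        total = sum(counts)
        fractions = [c / total for c in counts]
        return -sum(p * log2(p) for p in fractions if p > 0)

    # entropy(2, 3) ≈ 0.971 and entropy(4, 0) == 0.0, as worked through on the next slides.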


  • Slide 11: Attribute Selection: Entropy

    entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)

    This is to calculate one test. For outlook there are three tests:

    sunny: info(2,3) = -2/5 log(2/5) - 3/5 log(3/5) = 0.5287 + 0.4421 = 0.971

    overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))

    Ohoh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, the final result will be 0.


  • Slide 12: Attribute Selection: Entropy

    sunny: info(2,3) = 0.971

    overcast: info(4,0) = 0.0

    rainy: info(3,2) = 0.971

    But we have 14 instances to divide down those paths...

    So the total for outlook is: (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693

    Now to calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook: info(9,5) = 0.940

    gain(outlook) = 0.940 - 0.693 = 0.247
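    A short sketch (an illustration, not code from the slides) reproduces these numbers on the example data:

    from collections import Counter
    from math import log2

    def entropy(*counts):                      # same entropy() as in the earlier sketch
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    COLS = ("outlook", "temperature", "humidity", "windy", "play")
    ROWS = [("sunny", "hot", "high", "false", "no"), ("sunny", "hot", "high", "true", "no"),
            ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
            ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
            ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
            ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
            ("sunny", "mild", "normal", "true", "yes"), ("overcast", "mild", "high", "true", "yes"),
            ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no")]
    weather = [dict(zip(COLS, row)) for row in ROWS]

    def class_entropy(instances):
        return entropy(*Counter(inst["play"] for inst in instances).values())

    def gain(instances, attr):
        before = class_entropy(instances)      # info(9,5) = 0.940 at the root
        after = 0.0
        for value in set(inst[attr] for inst in instances):
            subset = [inst for inst in instances if inst[attr] == value]
            after += len(subset) / len(instances) * class_entropy(subset)
        return before - after

    for attr in ("outlook", "humidity", "windy", "temperature"):
        print(attr, round(gain(weather, attr), 3))   # 0.247, 0.152, 0.048, 0.029

    The printed values match the per-attribute gains listed on the next slide.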


  • Slide 13: Attribute Selection: Entropy

    Now to calculate the gain for all of the attributes:

    gain(outlook) = 0.247

    gain(humidity) = 0.152

    gain(windy) = 0.048

    gain(temperature) = 0.029

    And select the maximum ... which is outlook. This is (also!) called information gain. The total is the information, measured in 'bits'.

    Equally we could select the minimum amount of information needed -- the minimum description length issue in RIPPER.

    Let's do the next level, where outlook=sunny.


  • Slide 14: Attribute Selection: Entropy

    Now to calculate the gain for all of the attributes on the outlook=sunny subset (shown in the table below):

    Temp: hot info(0,2) mild info(1,1) cool info(1,0)

    Humidity: high info(0,3) normal info(2,0)

    Windy: false info(1,2) true info(1,1)

    Don't even need to do the math. Humidity is the obvious choice as it predicts all 5 instances correctly. Thus the information will be 0, and the gain will be maximal.


    Outlook Temperature Humidity Windy Play?

    sunny hot high false no

    sunny hot high true no

    sunny mild high false no

    sunny cool normal false yes

    sunny mild normal true yes

  • Slide 15: Attribute Selection: Entropy

    Now our tree looks like:

    This algorithm is called ID3, developed by Quinlan.


    [Tree diagram: Outlook at the root; sunny branch: Humidity (normal: yes, high: no); overcast branch: yes; rainy branch: ? (still to be split)]

  • Slide 16: Entropy: Issues

    Nasty side effect of Entropy: it prefers attributes with a large number of branches.

    Eg, if there was an 'identifier' attribute with a unique value for each instance, this would uniquely determine the class, but be useless for classifying new instances. (over-fitting!)

    Eg: info(0,1) info(0,1) info(1,0) ...

    It doesn't need to be unique. If we assign 1 to the first two instances, 2 to the next two, and so forth, we still get a 'better' split.


  • Slide 17: Entropy: Issues

    Half-Identifier 'attribute':

    info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
    = 0 0 0.5 0.5 0 0 0.5

    2/14 down each route, so:
    = 0*2/14 + 0*2/14 + 0.5*2/14 + 0.5*2/14 + ...
    = 3 * (2/14 * 0.5)
    = 3/14
    = 0.214

    Gain is:

    0.940 - 0.214 = 0.726

    Remember that the gain for Outlook was only 0.247!

    Urgh. Once more we run into over-fitting.


  • Slide 18: Gain Ratio

    Solution: Use a gain ratio. Calculate the entropy disregarding classes for all of the daughter nodes:

    eg info(2,2,2,2,2,2,2) for half-identifier, and info(5,4,5) for outlook

    identifier = -1/14 * log(1/14) * 14 = 3.807

    half-identifier = -1/7 * log(1/7) * 7 = 2.807

    outlook = 1.577

    Ratios:

    identifier = 0.940 / 3.807 = 0.247

    half-identifier = 0.726 / 2.807 = 0.259

    outlook = 0.247 / 1.577 = 0.157
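    A sketch of the ratio for outlook (an illustration, not from the slides), assuming the weather list and the entropy()/gain() helpers from the earlier sketch:

    def intrinsic_info(instances, attr):
        """Entropy of the split itself, ignoring the class -- e.g. info(5,4,5) for outlook."""
        sizes = Counter(inst[attr] for inst in instances).values()
        return entropy(*sizes)

    def gain_ratio(instances, attr):
        return gain(instances, attr) / intrinsic_info(instances, attr)

    print(round(intrinsic_info(weather, "outlook"), 3))   # 1.577
    print(round(gain_ratio(weather, "outlook"), 3))       # 0.156 (the 0.247 / 1.577 ≈ 0.157 above, at full precision)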


  • Slide 19: Gain Ratio

    Close to success: Picks half-identifier (only accurate in 4/7 branches) over identifier (accurate in all 14 branches)!

    half-identifier = 0.259

    identifier = 0.247

    outlook = 0.157

    humidity = 0.152

    windy = 0.049

    temperature = 0.019

    Humidity is now also very close to outlook, whereas before they were separated.


  • Slide 20: Gain Ratio

    We can simply check for identifier-like attributes and ignore them. Actually, they should be removed from the data before the data mining begins.

    However, the ratio can also over-compensate. It might pick an attribute just because its entropy is low. Note how close humidity and outlook became ... maybe that's not such a good thing?

    Possible Fix: First generate the information gain. Throw away any attributes with less than the average gain. Then compare using the ratio.


  • Slide 21: Alternative: Gini

    An alternative method to Information Gain is called the Gini Index.

    The total for node D is: gini(D) = 1 - sum(p1^2, p2^2, ..., pn^2)

    Where p1..pn are the frequency ratios of class 1..n in D.

    So the Gini Index for the entire set:
    = 1 - ((9/14)^2 + (5/14)^2)
    = 1 - (0.413 + 0.127)
    = 0.459
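    A small sketch of the node Gini value (an illustration, not from the slides), assuming the weather list from the earlier sketch:

    from collections import Counter

    def gini(instances):
        """gini(D) = 1 minus the sum of squared class frequency ratios in D."""
        total = len(instances)
        fractions = [c / total for c in Counter(inst["play"] for inst in instances).values()]
        return 1 - sum(p * p for p in fractions)

    print(round(gini(weather), 3))   # 0.459 for the full 14 instances (9 yes, 5 no)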


  • Slide 22: Gini

    The gini value of a split of D into subsets is:

    Split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)

    Where Ni is the size of subset Di, and N is the size of D.

    eg: Outlook splits into 5, 4, 5:

    split = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)

    sunny = 1 - sum((2/5)^2, (3/5)^2) = 1 - 0.52 = 0.48
    overcast = 1 - sum((4/4)^2, (0/4)^2) = 0.0
    rainy = sunny

    split = (5/14 * 0.48) * 2 = 0.343
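    A matching sketch of the split value (an illustration, not from the slides), reusing gini() and the weather list from the sketches above:

    def gini_split(instances, attr):
        """Size-weighted sum of the gini values of the subsets produced by splitting on attr."""
        split = 0.0
        for value in set(inst[attr] for inst in instances):
            subset = [inst for inst in instances if inst[attr] == value]
            split += len(subset) / len(instances) * gini(subset)
        return split

    print(round(gini_split(weather, "outlook"), 3))   # 0.343 = (5/14 * 0.48) * 2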


  • Slide 23: Gini

    The attribute that generates the smallest gini split value is chosen to split the node on. (Left as an exercise for you to do!)

    Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It comes from the Italian statistician Corrado Gini, who used it to measure income inequality.


  • Slide 24: Decision Tree Issues

    The various problems that a good DT builder needs to address:

    Ordering of Attribute Splits

    As seen, we need to build the tree picking the best attribute to split on first.

    Numeric/Missing Data

    Dividing numeric data is more complicated. How?

    Tree Structure

    A balanced tree with the fewest levels is preferable.

    Stopping Criteria

    Like with rules, we need to stop adding nodes at some point. When?

    Pruning

    It may be beneficial to prune the tree once created? Or incrementally?


  • Slide 25: Further Reading

    Introductory statistical text books

    Witten, 3.2, 4.3

    Dunham, 4.4

    Han, 6.3

    Berry and Browne, Chapter 4

    Berry and Linoff, Chapter 6
