Algorithms: The basic methods

Page 1

Algorithms: The basic methods

Page 2

Inferring rudimentary rules

Simplicity first
• Simple algorithms often work surprisingly well
• Many different kinds of simple structure exist:
  – One attribute might do all the work
• 1R: learns a 1-level decision tree
  – In other words, it generates a set of rules that all test on one particular attribute
• Basic version (assuming nominal attributes):
  – One branch for each of the attribute's values
  – Each branch assigns the most frequent class
  – Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  – Choose the attribute with the lowest error rate

Page 3

Pseudo-code for 1R (a Python sketch follows below):

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

• Note: “missing” is always treated as a separate attribute value
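As an illustration, here is a minimal Python sketch of this pseudo-code for nominal attributes. The function name, the list-of-dicts data layout and the class-attribute argument are assumptions of mine, not something the slides specify; per the note above, a missing value can simply be passed through as its own nominal value such as the string "missing".

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """1R: pick the single attribute whose one-level rules make the fewest errors.

    instances: list of dicts mapping attribute name -> nominal value
    attributes: candidate attribute names (excluding the class attribute)
    class_attr: name of the class attribute
    Returns (best_attribute, rules, error_count).
    """
    best = None
    for attr in attributes:
        # For each value of the attribute, count how often each class appears
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # Each value's rule assigns the most frequent class for that value
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        # Errors: instances that do not belong to the majority class of their branch
        errors = sum(sum(cls.values()) - cls.most_common(1)[0][1]
                     for cls in counts.values())
        # Keep the attribute whose rule set has the smallest error count
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best
```

On ties it keeps the attribute it saw first, which mirrors the arbitrary tie-break mentioned on the next slide.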

Page 4

Evaluating the weather attributes

Arbitrarily breaking the tie between the first (outlook) and third (humidity) rule sets, we pick the first. Oddly enough, the game is played when it's overcast and rainy but not when it's sunny; perhaps it's an indoor pursuit.

Page 5

Dealing with numeric attributes
• Numeric attributes are discretized: the range of the attribute is divided into a set of intervals
  – Instances are sorted according to the attribute's values
  – Breakpoints are placed where the (majority) class changes (so that the total error is minimized)

outlook    temp.  humidity  windy  play
sunny       85       85     FALSE   no
sunny       80       90     TRUE    no
overcast    83       86     FALSE   yes
rainy       70       96     FALSE   yes
rainy       68       80     FALSE   yes
rainy       65       70     TRUE    no
overcast    64       65     TRUE    yes
sunny       72       95     FALSE   no
sunny       69       70     FALSE   yes
rainy       75       80     FALSE   yes
sunny       75       70     TRUE    yes
overcast    72       90     TRUE    yes
overcast    81       75     FALSE   yes
rainy       71       91     TRUE    no

Page 6

The problem of overfitting
• The discretization procedure is very sensitive to noise
  – A single instance with an incorrect class label will most likely result in a separate interval
  – Also: e.g. a time stamp attribute will have zero errors, because each partition contains just one instance
  – However, highly branching attributes do not usually perform well on future examples
• Simple solution: enforce a minimum number of instances in the majority class per interval
  – Weather data example (with the minimum set to 3):

Page 7

Result of overfitting avoidance
• Whenever adjacent partitions have the same majority class, as the first two in this example do, we merge them.
• Final result for the temperature attribute:

Resulting rule set

• The same procedure leads to the rule for humidity, which turns out to be the best (a Python sketch of the discretization procedure follows below).
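A rough Python sketch of this discretization under the assumptions above: sort by value, break where the majority class changes, require at least three majority-class instances per interval, then merge adjacent intervals that share a majority class. The function name and the exact breakpoint-placement details are my own simplifications; a full implementation would typically place the breakpoint halfway between adjacent values, whereas this sketch simply records the last value of each interval.

```python
from collections import Counter

def discretize_1r(pairs, min_majority=3):
    """1R-style discretization of a single numeric attribute.

    pairs: list of (value, class) tuples.
    Returns a list of (upper_value_or_None, majority_class) intervals;
    None marks the open-ended last interval.
    """
    pairs = sorted(pairs)                  # sort instances by attribute value
    partitions, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        majority_cls, majority_n = Counter(c for _, c in current).most_common(1)[0]
        nxt = pairs[i + 1] if i + 1 < len(pairs) else None
        # Close the partition once the majority class has enough members,
        # but only where the class changes and we are not splitting
        # between two identical attribute values.
        if (majority_n >= min_majority and nxt is not None
                and nxt[1] != majority_cls and nxt[0] != value):
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)

    # Merge adjacent partitions whose majority class is the same.
    intervals = []
    for part in partitions:
        majority = Counter(c for _, c in part).most_common(1)[0][0]
        upper = part[-1][0]
        if intervals and intervals[-1][1] == majority:
            intervals[-1] = (upper, majority)      # extend the previous interval
        else:
            intervals.append((upper, majority))
    if intervals:
        intervals[-1] = (None, intervals[-1][1])   # last interval is open-ended
    return intervals

# Temperature column of the weather data on Page 5, paired with the play class:
temperature = [(85, "no"), (80, "no"), (83, "yes"), (70, "yes"), (68, "yes"),
               (65, "no"), (64, "yes"), (72, "no"), (69, "yes"), (75, "yes"),
               (75, "yes"), (72, "yes"), (81, "yes"), (71, "no")]
print(discretize_1r(temperature))   # e.g. [(75, 'yes'), (None, 'no')]
```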

Page 8

Discussion of 1R
• 1R was described in a paper by Holte (1993)
• It contains an experimental evaluation on 16 datasets
• The minimum number of instances was set to 6 after some experimentation (instead of 3, as we did)
• 1R's simple rules performed not much worse than much more complex decision trees
• Simplicity first pays off!

Page 9

Statistical modeling
• "Opposite" of 1R: use all the attributes
• Two assumptions: attributes are
  – equally important
  – statistically conditionally independent (given the class value)
• This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)
• Although based on assumptions that are almost never correct, this scheme works well in practice!

Page 10

Probabilities for the weather data

Page 11

Bayes's rule
• Probability of event H given evidence E:

P(H | E) = P(E | H) P(H) / P(E)

• A priori probability of H: P(H)
  – Probability of the event before the evidence has been seen
• A posteriori probability of H: P(H | E)
  – Probability of the event after the evidence has been seen

Page 12

Naïve Bayes for classification
• Classification learning: what's the probability of the class given an instance?
  – Evidence E = the instance
  – Event H = class value for the instance
• Naïve Bayes assumption: the evidence can be split into independent parts (i.e. the attributes of the instance!)

P(H | E) = P(E | H) P(H) / P(E)
         = P(E1 | H) P(E2 | H) … P(En | H) P(H) / P(E)

Page 13

The weather data example

P(play=yes | E) =

P(Outlook=Sunny | play=yes) *

P(Temp=Cool | play=yes) *

P(Humidity=High | play=yes) *

P(Windy=True | play=yes) *

P(play=yes) / P(E) =

= (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E)

Don't worry about the 1/P(E); it's alpha, the normalization constant.

Page 14

The weather data example

P(play=no | E) =

P(Outlook=Sunny | play=no) *

P(Temp=Cool | play=no) *

P(Humidity=High | play=no) *

P(Windy=True | play=no) *

P(play=no) / P(E) =

= (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0206 / P(E)

Page 15

Normalization constant

P(play=yes, E) + P(play=no, E) = P(E) i.e.

P(play=yes, E)/P(E) + P(play=no, E)/P(E) = 1 i.e.

P(play=yes | E) + P(play=no | E) = 1 i.e.

0.0053 / P(E) + 0.0206 / P(E) = 1 i.e.

1/P(E) = 1/(0.0053 + 0.0206)

So,

P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5%

P(play=no | E) = 0.0206 / (0.0053 + 0.0206) = 79.5%
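For concreteness, here is a minimal Python sketch of this calculation using the fractions quoted on the two preceding slides; the dictionary layout and variable names are my own.

```python
# Class priors and per-attribute conditional probabilities for the
# instance E = (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True),
# taken from the fractions on the two preceding slides.
prior = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": [2 / 9, 3 / 9, 3 / 9, 3 / 9],   # P(Sunny|yes), P(Cool|yes), P(High|yes), P(True|yes)
    "no":  [3 / 5, 1 / 5, 4 / 5, 3 / 5],   # the same factors given play=no
}

# Unnormalized scores: product of the likelihoods times the prior
scores = {}
for cls in prior:
    score = prior[cls]
    for p in likelihoods[cls]:
        score *= p
    scores[cls] = score                     # ~0.0053 for yes, ~0.0206 for no

# Normalize so the two posteriors sum to 1 (this is the 1/P(E) factor)
total = sum(scores.values())
posterior = {cls: score / total for cls, score in scores.items()}
print(posterior)                            # {'yes': ~0.205, 'no': ~0.795}
```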


Page 16

The "zero-frequency problem"
• What if an attribute value doesn't occur with every class value (e.g. "Humidity = High" for class "Play = Yes")?
  – The probability P(Humidity=High | play=yes) will be zero!
• The a posteriori probability will also be zero!
  – No matter how likely the other values are!
• Remedy: add 1 to the count for every attribute value–class combination (the Laplace estimator); see the sketch below
  – I.e. initialize the counters to 1 instead of 0
• Result: probabilities will never be zero! (This also stabilizes the probability estimates.)
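A hedged sketch of the Laplace estimator in Python: counts start at 1 rather than 0, so every attribute value–class probability stays strictly positive. The function and variable names are mine, not from the slides.

```python
from collections import Counter

def laplace_conditional_probs(instances, attr, attr_values, class_attr):
    """P(attr=value | class) with Laplace smoothing (add-one counts).

    attr_values must list every possible value of the attribute, so that
    values unseen for a class still receive a non-zero probability.
    """
    probs = {}
    classes = {inst[class_attr] for inst in instances}
    for cls in classes:
        # Initialize every counter to 1 instead of 0 (Laplace estimator)
        counts = Counter({value: 1 for value in attr_values})
        for inst in instances:
            if inst[class_attr] == cls:
                counts[inst[attr]] += 1
        total = sum(counts.values())
        probs[cls] = {value: counts[value] / total for value in attr_values}
    return probs
```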

Page 17

Missing values
• Training: the instance is not included in the frequency count for the attribute value–class combination
• Classification: the attribute is omitted from the calculation
• Example (here the value of Outlook is missing):

P(play=yes | E) =

P(Temp=Cool | play=yes) *

P(Humidity=High | play=yes) *

P(Windy=True | play=yes) *

P(play=yes) / P(E) =

= (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0238 / P(E)

P(play=no | E) =

P(Temp=Cool | play=no) *

P(Humidity=High | play=no) *

P(Windy=True | play=no) *

P(play=no) / P(E) =

= (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0343 / P(E)

After normalization: P(play=yes | E) = 41%, P(play=no | E) = 59%
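As a small illustration of the classification-time rule, here is a sketch that simply skips missing attributes, assuming missing values are represented by Python's None; the function and argument names are my own.

```python
def naive_bayes_score(instance, prior, cond_probs):
    """Unnormalized Naive Bayes score for one class.

    instance: dict attribute -> value, with None marking a missing value
    prior: P(class)
    cond_probs: dict attribute -> {value: P(value | class)}
    """
    score = prior
    for attr, value in instance.items():
        if value is None:
            continue                    # missing: omit this attribute from the product
        score *= cond_probs[attr][value]
    return score
```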

Page 18

Dealing with numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
• The sample mean: μ = (x1 + x2 + … + xn) / n
• The standard deviation: σ = √( Σ (xi − μ)² / (n − 1) )
• The density function f(x):

  f(x) = (1 / (√(2π) σ)) · e^( −(x − μ)² / (2σ²) )
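Under this assumption, here is a minimal Python sketch that estimates the mean and standard deviation from the weather table on Page 5 and evaluates the density. The function names are mine; the printed values match the 0.0340 and 0.0221 that appear on the following slides.

```python
import math

def gaussian_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-(x - mean) ** 2 / (2 * std ** 2))

def mean_and_std(values):
    """Sample mean and standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

# Temperature and humidity values for the play=yes days in the weather table
temp_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
humidity_yes = [86, 96, 80, 65, 70, 80, 70, 90, 75]

mu_t, sigma_t = mean_and_std(temp_yes)          # roughly 73 and 6.2
mu_h, sigma_h = mean_and_std(humidity_yes)      # roughly 79.1 and 10.2

print(gaussian_density(66, mu_t, sigma_t))      # ~0.0340, i.e. P(Temp=66 | yes)
print(gaussian_density(90, mu_h, sigma_h))      # ~0.0221, i.e. P(Humidity=90 | yes)
```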

Page 19

Statistics for the weather data

Page 20

Classifying a new day
• A new day E: Outlook = sunny, Temperature = 66, Humidity = 90, Windy = true

P(play=yes | E) =

P(Outlook=Sunny | play=yes) *

P(Temp=66 | play=yes) *

P(Humidity=90 | play=yes) *

P(Windy=True | play=yes) *

P(play=yes) / P(E) =

= (2/9) * (0.0340) * (0.0221) * (3/9) * (9/14) / P(E) = 0.000036 / P(E)

P(play=no | E) =

P(Outlook=Sunny | play=no) *

P(Temp=66 | play=no) *

P(Humidity=90 | play=no) *

P(Windy=True | play=no) *

P(play=no) / P(E) =

= (3/5) * (0.0279) * (0.0380) * (3/5) * (5/14) / P(E) = 0.000136 / P(E)

After normalization: P(play=yes | E) = 20.9%, P(play=no | E) = 79.1%
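Putting the pieces together, a rough sketch of this classification in Python. The Gaussian helper is repeated so the snippet stands alone; the means and standard deviations (73/6.2 and 79.1/10.2 for yes, 74.6/7.9 and 86.2/9.7 for no) are computed from the weather table on Page 5 and rounded to one decimal place.

```python
import math

def gaussian_density(x, mean, std):
    """Normal density f(x) for the given mean and standard deviation."""
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-(x - mean) ** 2 / (2 * std ** 2))

# New day: Outlook=sunny, Temperature=66, Humidity=90, Windy=true
score_yes = ((2 / 9) * gaussian_density(66, 73.0, 6.2)
             * gaussian_density(90, 79.1, 10.2) * (3 / 9) * (9 / 14))
score_no = ((3 / 5) * gaussian_density(66, 74.6, 7.9)
            * gaussian_density(90, 86.2, 9.7) * (3 / 5) * (5 / 14))

total = score_yes + score_no
print(score_yes / total, score_no / total)   # roughly 0.21 and 0.79, as on the slide
```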

Page 21

Probability densities
• Relationship between probability and density: for a small interval of width ε around a value c,

  P(c − ε/2 ≤ x ≤ c + ε/2) ≈ ε · f(c)

• But this doesn't change the calculation of a posteriori probabilities, because the factor ε cancels out
• Exact relationship:

  P(a ≤ x ≤ b) = ∫ f(t) dt, integrated from t = a to t = b

Page 22

Discussion of Naïve Bayes
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes)
• Note also: many numeric attributes are not normally distributed