CSC 4510 – Machine Learning

DESCRIPTION

Lecture 3: Classification and Decision Trees. CSC 4510 – Machine Learning. Dr. Mary-Angela Papalaskari, Department of Computing Sciences, Villanova University. Course website: www.csc.villanova.edu/~map/4510/. Last time: machine learning overview, supervised learning, classification.

TRANSCRIPT
![Page 1: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/1.jpg)
CSC 4510 – Machine Learning
Dr. Mary-Angela Papalaskari
Department of Computing Sciences
Villanova University

Course website: www.csc.villanova.edu/~map/4510/

Lecture 3: Classification and Decision Trees

CSC 4510 - M.A. Papalaskari - Villanova University
![Page 2: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/2.jpg)
Last time: Machine Learning Overview

• Supervised learning
  – Classification
  – Regression
• Unsupervised learning
• Others: reinforcement learning, recommender systems

Also: practical advice for applying learning algorithms.
![Page 3: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/3.jpg)
Supervised or Unsupervised Learning? Iris Data
![Page 4: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/4.jpg)
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/
![Page 5: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/5.jpg)
Example: adult.data (dataset description from the UCI Repository)
![Page 6: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/6.jpg)
UCI Repository: adult.data
![Page 7: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/7.jpg)
Our Sample Data

```
% Class data
% major: 1=CS; 2=psych; 3=other
% class: 1=freshman; 2=sophomore; ...; =graduate or other
% birthday month (number)
% eye color: 0=blue; 1=brown; 2=other
% Do you prefer apples (1) or oranges (0)?
T: major,class,bmonth,eyecolor,aORo
A: 2,3,6,1,0
A: 1,2,3,1,1
A: 2,3,5,1,1
A: 3,4,7,1,1
A: 1,4,10,1,0
A: 3,4,6,1,0
A: 2,3,10,0,1
A: 1,4,7,1,0
A: 2,3,3,1,1
A: 3,3,7,1,1
A: 1,4,8,2,1
A: 1,4,4,1,0
A: 3,4,3,0,1
A: 3,4,2,2,1
A: 3,4,8,1,1
A: 1,4,2,2,0
A: 1,5,8,1,1
A: 1,5,4,0,1
A: 2,5,11,2,0
```
![Page 8: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/8.jpg)
Classification (Categorization)

• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
  – If c(x) is a binary function, C = {0, 1} ({true, false}, {positive, negative}), then it is called a concept.
![Page 9: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/9.jpg)
Tiny Example of Category Learning

• Instance attributes: <size, color, shape>
  – size ∈ {small, medium, large}
  – color ∈ {red, blue, green}
  – shape ∈ {square, circle, triangle}
• C = {positive, negative}
• Training data D:

| Example | Size  | Color | Shape    | Category |
|---------|-------|-------|----------|----------|
| 1       | small | red   | circle   | positive |
| 2       | large | red   | circle   | positive |
| 3       | small | red   | triangle | negative |
| 4       | large | blue  | circle   | negative |
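As a concrete illustration (not part of the original slides), a few lines of Python encode D and verify that the hypothesis "red & circle" is consistent with all four training examples:

```python
# Training data D: (size, color, shape) -> category
D = [
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "circle"), "negative"),
]

def h_red_circle(x):
    """Hypothesis: classify as positive iff the instance is red AND a circle."""
    size, color, shape = x
    return "positive" if (color == "red" and shape == "circle") else "negative"

# A hypothesis is consistent with D if it reproduces every training label.
print(all(h_red_circle(x) == label for x, label in D))  # True
```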
![Page 10: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/10.jpg)
Hypothesis Selection

• Many hypotheses are usually consistent with the training data:
  – red & circle
  – (small & circle) or (large & red)
  – (small & red & circle) or (large & red & circle)
  – not [ (red & triangle) or (blue & circle) ]
  – not [ (small & red & triangle) or (large & blue & circle) ]
• Bias: any criterion other than consistency with the training data that is used to select a hypothesis.
![Page 11: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/11.jpg)
Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples is a consistent hypothesis that does not generalize.
• Occam’s razor: finding a simple hypothesis helps ensure generalization.
![Page 12: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/12.jpg)
Hypothesis Space

• Restrict learned functions a priori to a given hypothesis space, H, of functions h(x) that can be considered as definitions of c(x).
• For learning concepts on instances described by n discrete-valued features, consider the space of conjunctive hypotheses represented by a vector of n constraints <c1, c2, …, cn>, where each ci is either:
  – ?, a wildcard indicating no constraint on the ith feature
  – a specific value from the domain of the ith feature
  – Ø, indicating no value is acceptable
• Sample conjunctive hypotheses:
  – <big, red, ?>
  – <?, ?, ?> (most general hypothesis)
  – <Ø, Ø, Ø> (most specific hypothesis)
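A conjunctive hypothesis of this kind can be represented directly as a tuple of constraints. A minimal Python sketch (the names WILDCARD, EMPTY, and matches are illustrative, not from the slides):

```python
WILDCARD = "?"
EMPTY = None  # stands in for the slide's Ø: no value is acceptable

def matches(hypothesis, instance):
    """True iff every constraint is a wildcard or equals the instance's value."""
    return all(
        c != EMPTY and (c == WILDCARD or c == v)
        for c, v in zip(hypothesis, instance)
    )

x = ("big", "red", "circle")
print(matches(("big", "red", WILDCARD), x))   # True: wildcard on shape
print(matches((WILDCARD,) * 3, x))            # True: most general hypothesis
print(matches((EMPTY, EMPTY, EMPTY), x))      # False: most specific hypothesis
```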
![Page 13: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/13.jpg)
Decision Tree Creation
Example: Do We Want to Wait in a Restaurant?
![Page 14: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/14.jpg)
Decision Tree Creation

• One possible decision tree:
![Page 15: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/15.jpg)
Creating Efficient Decision Trees
![Page 16: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/16.jpg)
Decision Tree Induction
Many trees are consistent with the training data; which should we prefer?
Occam’s Razor: The most likely explanation for a set of observations is the simplest explanation.
Assumption: “Smallest Tree” == “Simplest”
![Page 17: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/17.jpg)
Decision Tree Induction Issues
UNFORTUNATELY: finding the smallest tree is intractable!
(what does this mean?)
![Page 18: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/18.jpg)
Heuristics to the Rescue!
• Algorithm:
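The algorithm itself appeared as a figure on the original slide. As a hedged sketch, the usual greedy, top-down scheme looks roughly like the following Python (helper names are mine; the attribute-selection step is a placeholder for the information-gain heuristic developed on the next slides):

```python
from collections import Counter

def majority(examples):
    """Most common label among examples, given as [(attributes_dict, label), ...]."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, default):
    """Greedy top-down decision-tree learner (outline only)."""
    if not examples:
        return default                      # no data: use parent's majority label
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()                 # pure node: predict its single label
    if not attributes:
        return majority(examples)           # no tests left: majority vote
    a = attributes[0]                       # placeholder: info-gain choice goes here
    branches = {}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        rest = [b for b in attributes if b != a]
        branches[v] = learn_tree(subset, rest, majority(examples))
    return (a, branches)

def classify(tree, x):
    """Follow attribute tests until a leaf label is reached."""
    while isinstance(tree, tuple):
        a, branches = tree
        tree = branches[x[a]]
    return tree

# Tiny demo: one attribute separates the labels perfectly.
examples = [({"color": "red"}, "yes"), ({"color": "blue"}, "no")]
tree = learn_tree(examples, ["color"], default="yes")
print(classify(tree, {"color": "blue"}))  # no
```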
![Page 19: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/19.jpg)
Informal Argument: Choosing Attributes
Some attributes just discriminate better than others.
![Page 20: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/20.jpg)
Choosing and Ordering Attribute-Tests
Information Theory: “How many bits is a question’s answer worth?” Coin toss: fair vs. rigged.

Observation: 1 bit is enough to answer a yes/no question about which one has no idea. If the answers vi have probabilities P(vi), then we must weight the number of bits for each answer by its probability to get an overall average number of bits required to represent any answer:
I(P(v1), P(v2), ..., P(vn)) = - Σi P(vi) log2 P(vi)
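This average is easy to compute directly; a minimal Python sketch (not from the slides):

```python
from math import log2

def I(*probs):
    """Average number of bits needed to report an answer with these probabilities."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(I(0.5, 0.5))    # fair coin toss: 1.0 bit
print(I(1.0, 0.0))    # rigged coin: 0.0 bits -- the answer carries no information
print(I(0.99, 0.01))  # nearly rigged: about 0.08 bits
```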
![Page 21: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/21.jpg)
Choosing Attributes
Given p positive examples of a concept F(x) and n negative examples, what is I("correctly identify instances of F")?

I( p/(p+n), n/(p+n) ) = - (p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n))
![Page 22: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/22.jpg)
Choosing/Ordering Decision Tree Attributes
If one knows the answer/value of an attribute A, how much information about the overall concept are we still missing?

[Figure: attribute A splits the examples along its values v1, …, vk; branch i receives pi YES and ni NO examples, so the information still needed within branch i is I( pi/(pi+ni), ni/(pi+ni) ).]

Remainder(A) = Σ(i=1..k) [ (pi + ni)/(p + n) ] · I( pi/(pi+ni), ni/(pi+ni) )
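Numerically, the remainder is a weighted average of the branch entropies. A self-contained Python sketch of that computation (the branch counts below are invented for illustration, not from the slides):

```python
from math import log2

def I(p, n):
    """Bits needed to classify a set with p positives and n negatives."""
    total = p + n
    return sum(-x / total * log2(x / total) for x in (p, n) if x > 0)

def remainder(branches, p, n):
    """branches: list of (pi, ni) counts after splitting on an attribute."""
    return sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)

# Illustrative split: 6 positives and 6 negatives overall.
p, n = 6, 6
perfect = [(4, 0), (0, 4), (2, 2)]  # two pure branches, one mixed
useless = [(3, 3), (3, 3)]          # every branch looks like the whole set
print(remainder(perfect, p, n))     # about 0.33: little information still missing
print(remainder(useless, p, n))     # 1.0: splitting told us nothing
```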
![Page 23: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/23.jpg)
Heuristically Choosing Attributes
When adding tests to a tree, always add the next attribute that gives us the largest information gain:

Gain(A) = I( p/(p+n), n/(p+n) ) - Remainder(A)
What happens when a leaf node is ambiguous (has both + and - examples)?

When our decision path reaches such a node, randomly give a yes/no answer according to the yes/no probabilities at that node.
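That randomized answer can be sketched in a few lines of Python (illustrative, not from the slides):

```python
import random

def answer_at_leaf(p, n, rng=random):
    """Ambiguous leaf holding p positive and n negative training examples:
    answer 'yes' with probability p/(p+n), 'no' otherwise."""
    return "yes" if rng.random() < p / (p + n) else "no"

# A leaf with 3 positives and 1 negative answers "yes" about 75% of the time.
rng = random.Random(0)
sample = [answer_at_leaf(3, 1, rng) for _ in range(10000)]
print(sample.count("yes") / len(sample))  # roughly 0.75
```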
![Page 24: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/24.jpg)
When to Use/Not Use Decision Trees
Expressiveness
• Pro: any Boolean function can be represented.
• Con: many Boolean functions don’t have compact trees.
Overfitting: finding meaningless regularities in data
• Solution 1 (Pruning): don’t use attributes whose G(A) is close to zero; use chi-squared tests for significance.
• Solution 2 (Cross-Validation): prefer trees with higher predictive accuracy on set-aside data.
![Page 25: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/25.jpg)
Our Sample Data
• Let’s revisit the sample data from our class in AIspace:
• Download and save the file with student data
• From the main tools page in AIspace.org select “Decision Trees”
• Launch the decision trees tool using Java Web Start (use the first link on that page)
• Load the example and use the “Step” button to build the tree
• Observe the choice of nodes split by the decision tree algorithm
![Page 26: CSC 4510 – Machine Learning](https://reader035.vdocuments.site/reader035/viewer/2022062517/56813cc9550346895da6726c/html5/thumbnails/26.jpg)
Class Exercise

• Practice using decision tree learning on some of the sample datasets available in AIspace.
Some of the slides in this presentation are adapted from:
• Prof. Frank Klassner’s ML class at Villanova
• the University of Manchester ML course: http://www.cs.manchester.ac.uk/ugt/COMP24111/
• the Stanford online ML course: http://www.ml-class.org/