
Page 1: Introduction to Machine Learning

Introduction to Machine Learning

Ronald J. Williams
CSG220, Spring 2007

Introduction: Slide 2 (CSG220: Machine Learning)

What is learning?
• Learning => improving with experience
• Learning Agent = Performance Element + Learning Element
• Performance element decides what actions to take
  • e.g., identify this image
  • e.g., choose a move in this game
• Learning element modifies the performance element so that it makes better decisions


Introduction: Slide 3 (CSG220: Machine Learning)

Learning agent design
• Which components of the performance element are to be learned?
• What feedback is available to learn these components?
• What representation is to be used for the components?
• How is performance to be measured (i.e., what is meant by better decisions)?

Introduction: Slide 4 (CSG220: Machine Learning)

Wide range of possible goals
• Given a set of data, find potentially predictive patterns in it
  • data mining
  • scientific discovery
• As a result of acquiring new data, gain knowledge allowing an agent to exploit its environment
  • robot navigation
  • acquisition of new knowledge may be passive or active (e.g., exploration or queries)


Introduction: Slide 5 (CSG220: Machine Learning)

• Given experience in some problem domain, improve performance in it
  • game-playing
  • robotics
• Rote learning qualifies, but the more interesting and challenging aspect is being able to generalize successfully beyond actual experiences

Introduction: Slide 6 (CSG220: Machine Learning)

Learning vs. programming
• Learning is essential for unknown environments, i.e., when the designer lacks omniscience
• Learning is essential in changing environments
• Learning is useful as a system construction method
  • expose the agent to reality rather than trying to write it down


Introduction: Slide 7 (CSG220: Machine Learning)

Application examples
• Robot control
• Playing a game
• Recognizing handwritten digits
• Various bioinformatics applications
• Filtering email (e.g., spam detection)
• Intelligent user interfaces

Introduction: Slide 8 (CSG220: Machine Learning)

Relevant Disciplines
• Artificial intelligence
• Probability & statistics
• Control theory
• Computational complexity
• Philosophy
• Psychology
• Neurobiology


Introduction: Slide 9 (CSG220: Machine Learning)

3 categories of learning problem
• Supervised learning
• Unsupervised learning
• Reinforcement learning

Not an exhaustive list

Not necessarily mutually exclusive

Introduction: Slide 10 (CSG220: Machine Learning)

Supervised Learning
• Also called inductive inference
• Given training data {(x_i, y_i)}, where y_i = f(x_i) for some unknown function f: X -> Y, find an approximation h to f
  • called a classification problem if Y is a small discrete set (e.g., {+, -})
  • called a regression problem if Y is a continuous set (e.g., a subset of R)
• More realistic, but harder: each observed y_i is a noise-corrupted approximation to f(x_i)
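The supervised setup above can be sketched in a few lines of code. This is a minimal illustration, not from the slides: the target f, the noise level, and the choice of a least-squares straight-line fit are all assumptions made for the example.

```python
import random

random.seed(0)

def f(x):
    # The unknown target function f: X -> Y (assumed linear here,
    # purely for illustration -- the learner never sees this).
    return 2.0 * x + 1.0

# Training data {(x_i, y_i)}: each observed y_i is a
# noise-corrupted approximation to f(x_i).
xs = [i / 10.0 for i in range(20)]
ys = [f(x) + random.gauss(0.0, 0.1) for x in xs]

# Regression: find an approximation h to f, here via a
# least-squares straight-line fit to the training data.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def h(x):
    return slope * x + intercept
```

Despite the noise, h recovers f closely; had Y been a small discrete set such as {+, -}, the same data shape would describe a classification problem instead.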


Introduction: Slide 11 (CSG220: Machine Learning)

• X called the instance space
• Construct/adjust h to agree with f on the training set
• h is consistent if it agrees with f on all training examples
  • inappropriate if noise assumed present
• If Y = {+, -}, define {x ∈ X | f(x) = +}, the set of all positive instances, to be a concept
• Thus 2-class classification problems may also be called concept learning problems
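Consistency and concepts are simple to state in code. The tiny training set and candidate hypothesis below are made up for illustration only.

```python
# h is consistent if it agrees with the labels on every
# training example.
def is_consistent(h, training_set):
    return all(h(x) == y for x, y in training_set)

# Y = {+, -}: a toy labeled training set (hypothetical data).
training_set = [((0.0, 0.0), '-'), ((1.0, 1.0), '+'), ((2.0, 2.0), '+')]

def h(x):
    # A candidate hypothesis: positive iff the first coordinate >= 1.
    return '+' if x[0] >= 1.0 else '-'

# The concept is the set of all positive instances {x in X | f(x) = +};
# the training set shows us only a sample of its members.
positives = {x for x, y in training_set if y == '+'}
```

Note that many different consistent hypotheses can exist for one training set; choosing among them is where inductive bias (Slide 31) enters.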

Introduction: Slide 12 (CSG220: Machine Learning)

Classification (supervised)

[Figure: scatter of labeled + and - points in the plane]
Instance Space X = R^2
Target Space Y = {+, -}
Labeled Data


Introduction: Slide 13 (CSG220: Machine Learning)

Regression (supervised)

[Figure: curve fitted through data points]
Instance Space X = R
Target Space Y = R
Curve Fitting

Introduction: Slide 14 (CSG220: Machine Learning)

Regression


Introduction: Slide 15 (CSG220: Machine Learning)

Regression

Introduction: Slide 16 (CSG220: Machine Learning)

Regression


Introduction: Slide 17 (CSG220: Machine Learning)

Regression
Which curve is "best"?
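One way to see why "fits the data" alone cannot answer that question: through any n points there is always a degree-(n-1) polynomial that fits perfectly, so zero training error is cheap. The Lagrange interpolation routine and the sample points below are illustrative assumptions, not material from the slides.

```python
def lagrange_interpolate(points):
    # The unique degree-(n-1) polynomial through n points: it fits
    # the training data exactly, yet may generalize poorly between
    # and beyond the data -- a preview of overfitting.
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

# Hypothetical noisy-looking sample points.
points = [(0.0, 1.0), (1.0, 2.1), (2.0, 2.9), (3.0, 4.2)]
p = lagrange_interpolate(points)
```

Since every interpolant is "consistent" with the data, preferring one curve over another requires an extra criterion such as simplicity (Occam's razor, Slide 27).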

Introduction: Slide 18 (CSG220: Machine Learning)

Unsupervised Learning
• Have an instance space X
• Possible objectives
  • clustering
  • characterize distribution
  • principal component analysis
• One possible use: novelty detection ("This newly observed instance is different")
• Also includes such things as association rules in data mining ("People who buy diapers tend to also buy beer")
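The novelty-detection use above can be sketched with no labels at all. The scoring rule (distance from the mean of the unlabeled data) and the toy data are assumptions chosen for simplicity; real systems would model the distribution more carefully.

```python
def novelty_score(x, data):
    # Distance from x to the mean of the unlabeled data; a large
    # score flags "this newly observed instance is different".
    dim = len(data[0])
    mean = [sum(p[i] for p in data) / len(data) for i in range(dim)]
    return sum((a - m) ** 2 for a, m in zip(x, mean)) ** 0.5

# Unlabeled instances clustered near the origin (hypothetical data).
data = [(0.0, 0.1), (0.1, -0.1), (-0.1, 0.0), (0.0, 0.0)]
```

A far-away instance such as (5.0, 5.0) scores much higher than one near the cluster, so a threshold on the score separates "familiar" from "novel" without ever seeing a label.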


Introduction: Slide 19 (CSG220: Machine Learning)

Unsupervised

[Figure: scatter of unlabeled points in the plane]
Instance Space X = R^2
Unlabeled Data

Introduction: Slide 20 (CSG220: Machine Learning)

But also ...

[Figure: mostly unlabeled points with a few labeled + and - points]
Instance Space X = R^2
Target Space Y = {+, -}
Mix of labeled and unlabeled data


Introduction: Slide 21 (CSG220: Machine Learning)

Overfitting
• Especially applicable in supervised learning, but may also appear in other types of learning problems
• Often manifests itself by having the learner perform worse on test data even as it gets better at fitting the training data
• There are practical techniques as well as theoretical approaches for trying to avoid this problem
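An extreme case makes the train/test gap vivid: a rote learner (Slide 5) that simply memorizes. The learner class and the toy data below are hypothetical, constructed only to show perfect training accuracy coexisting with poor test accuracy.

```python
class RoteLearner:
    # Memorizes the training pairs exactly: perfect on the training
    # data, but it cannot generalize -- an extreme form of overfitting.
    def fit(self, pairs):
        self.table = dict(pairs)

    def predict(self, x, default='-'):
        return self.table.get(x, default)

train = [(1, '+'), (2, '+'), (3, '-')]
test = [(4, '+'), (5, '+'), (6, '-')]

learner = RoteLearner()
learner.fit(train)
train_acc = sum(learner.predict(x) == y for x, y in train) / len(train)
test_acc = sum(learner.predict(x) == y for x, y in test) / len(test)
```

Training accuracy is 100% by construction, while test accuracy collapses to whatever the default guess happens to get right: fitting the training data better is not the same as learning.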

Introduction: Slide 22 (CSG220: Machine Learning)


Introduction: Slide 23 (CSG220: Machine Learning)

Performance measurement for supervised learning
• How do we know that h ≈ f? (Hume's Problem of Induction)
  • use theorems of computational/statistical learning theory, or
  • try h on a new test set of examples (using the same distribution over the instance space as the training set)
• Learning curve = % correct on test set as a function of training set size
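Computing a learning curve is mechanical once a learner is fixed. Everything below is a deliberately crude, hypothetical example (a learner that always predicts the majority class seen so far), chosen only to show test accuracy improving with training set size.

```python
def majority_label(train):
    # A deliberately crude learner: predict whichever class is in
    # the majority among the training labels.
    pos = sum(1 for _, y in train if y == '+')
    return '+' if 2 * pos >= len(train) else '-'

def learning_curve(data, test_set, sizes):
    # % correct on a fixed held-out test set as a function of
    # training set size.
    curve = []
    for m in sizes:
        label = majority_label(data[:m])
        acc = sum(label == y for _, y in test_set) / len(test_set)
        curve.append(acc)
    return curve

# Hypothetical data: an early misleading example, then mostly '+'.
data = [(0, '-'), (1, '+'), (2, '+'), (3, '+'), (4, '+')]
test_set = [(10, '+'), (11, '+'), (12, '-'), (13, '+')]
curve = learning_curve(data, test_set, [1, 3, 5])
```

With one example the learner guesses '-' and scores 25% on the test set; with more data it flips to '+' and reaches 75%, tracing the rising curve the slide describes.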

Introduction: Slide 24 (CSG220: Machine Learning)


Introduction: Slide 25 (CSG220: Machine Learning)

• Learning curve depends on
  • realizable (can express target function) vs. non-realizable
    • non-realizability can be due to missing attributes or a restricted hypothesis class that excludes the true hypothesis (called selection bias)
  • redundant expressiveness (e.g., many irrelevant attributes)
  • size of hypothesis space

Introduction: Slide 26 (CSG220: Machine Learning)


Introduction: Slide 27 (CSG220: Machine Learning)

• Occam's razor: maximize a combination of consistency and simplicity
  • in this form, just an informal principle
  • involves a trade-off
• Attempts to formalize this
  • penalize "more complex" hypotheses
  • Minimum Description Length
  • Kolmogorov complexity
• Alternative: Bayesian approach
  • start with an a priori distribution over hypotheses

Introduction: Slide 28 (CSG220: Machine Learning)

Reinforcement Learning
• Applies to choosing sequences of actions to obtain a good long-term outcome
  • game-playing
  • controlling dynamical systems
• Key feature is that the system is not told directly how to behave, only given a performance score ("That was good/bad", "That was worth a 9.5")
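That key feature (a score, never an instruction) can be sketched with a tiny two-armed bandit. This toy agent, its epsilon-greedy rule, and the reward values are all assumptions for illustration; full reinforcement learning also handles sequences of actions, which this sketch omits.

```python
import random

random.seed(1)

rewards = {'a': 1.0, 'b': 9.5}   # unknown to the agent
values = {'a': 0.0, 'b': 0.0}    # the agent's running score estimates
counts = {'a': 0, 'b': 0}

for t in range(100):
    untried = [a for a in sorted(values) if counts[a] == 0]
    if untried:
        action = untried[0]                      # try each action once
    elif random.random() < 0.1:
        action = random.choice(sorted(values))   # occasionally explore
    else:
        action = max(values, key=values.get)     # exploit best estimate
    # The only feedback is a numeric score: "That was worth a 9.5".
    score = rewards[action]
    counts[action] += 1
    values[action] += (score - values[action]) / counts[action]
```

Nothing ever tells the agent that 'b' is the right action; it discovers that 'b' scores higher purely from the performance signal.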


Introduction: Slide 29 (CSG220: Machine Learning)

Issues for a learning system designer
• How to represent performance element inputs and outputs
  • symbolic
    • logical expressions
  • numerical
    • attribute vectors
• How to represent the input/output mapping
  • artificial neural network
  • decision tree
  • Bayes network
  • general computer program
• What kind of prior knowledge to use and how to represent it and/or take advantage of it during learning

Introduction: Slide 30 (CSG220: Machine Learning)

• Contrasting representations of X (and Y and h, if applicable)
  • symbolic, with logical rules (e.g., X = shapes with size and color specified)
    • e.g., instances: (Shape=circle)^(Size=large)^(Color=red)
    • e.g., rules:
      IF (Shape=circle)^(Size=large) THEN (Interesting=yes)
      IF (Shape=square)^(Color=green) THEN (Interesting=no)
  • numeric
    • e.g., points in Euclidean space R^n


Introduction: Slide 31 (CSG220: Machine Learning)

Learning as search through a hypothesis space H
• Inductive bias
  • Selection bias: only hypotheses h ∈ H are allowed
  • Preference bias: H includes all possible hypotheses, but if more than one fits the data, choose the "best" among these (e.g., Occam's razor: simpler hypotheses are better)
• Selection bias leads to less of an overfitting problem, but runs the risk of eliminating the true hypothesis (i.e., the true hypothesis is unrealizable in H)

Introduction: Slide 32 (CSG220: Machine Learning)

Selection bias example
• H = pure conjunctive concepts in some attribute/value description language
  • (Shape=square)^(Size=large)
  • (Shape=circle)^(Size=small)^(Color=red)
  • Boolean: A^~B (equivalent to (A=true)^(B=false))
• Description of all positive instances restricted to have this form
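A pure conjunctive hypothesis is easy to represent directly: a dictionary of required attribute/value pairs, with unmentioned attributes ignored. The helper name and the instance dictionaries are hypothetical, but the concept class matches the slide's examples.

```python
def conjunction(required):
    # A pure conjunctive concept: every listed attribute must take
    # its required value; attributes not listed are ignored.
    def h(instance):
        ok = all(instance.get(a) == v for a, v in required.items())
        return '+' if ok else '-'
    return h

# (Shape=circle)^(Size=small)^(Color=red), from the slide.
h = conjunction({'Shape': 'circle', 'Size': 'small', 'Color': 'red'})
```

Because each attribute can be required-true, required-false, or ignored, this representation makes the later 3^n count of conjunctive hypotheses (Slide 38) concrete.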


Introduction: Slide 33 (CSG220: Machine Learning)

Another selection bias example
• H = space of all linear separators in the plane

[Figure: a line separating + points from - points]
Separating line can be anywhere, and either side can be the + side
Such a separator can be specified by 3 real numbers
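Those 3 real numbers can be taken as coefficients (w1, w2, b) of the line w1*x1 + w2*x2 + b = 0; the example values below are assumptions for illustration.

```python
def linear_separator(w1, w2, b):
    # The 3 real numbers (w1, w2, b) determine the line
    # w1*x1 + w2*x2 + b = 0; instances are classified by which
    # side they fall on, and negating all three coefficients
    # swaps which side is the + side.
    def h(x):
        return '+' if w1 * x[0] + w2 * x[1] + b > 0 else '-'
    return h

# Example: the line x1 = x2, with the + side where x1 > x2.
h = linear_separator(1.0, -1.0, 0.0)
```

Since the coefficients range over all of R^3, this hypothesis space is infinite even though each individual hypothesis has a tiny description.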

Introduction: Slide 34 (CSG220: Machine Learning)

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?


Introduction: Slide 35 (CSG220: Machine Learning)

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?
= number of distinct truth tables with 2^n rows = 2^(2^n)

Introduction: Slide 36 (CSG220: Machine Learning)

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 different possible Boolean hypotheses


Introduction: Slide 37 (CSG220: Machine Learning)

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 different possible Boolean hypotheses

How many purely conjunctive concepts over n Boolean attributes?

Introduction: Slide 38 (CSG220: Machine Learning)

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 different possible Boolean hypotheses

How many purely conjunctive concepts over n Boolean attributes?

Each attribute can be required to be true, required to be false, or ignored, so 3^n distinct purely conjunctive hypotheses
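Both counts are one-liners, and checking them confirms the figure quoted above: for n = 6, 2^(2^6) = 2^64 is exactly 18,446,744,073,709,551,616, while the conjunctive space holds only 3^6 = 729 hypotheses.

```python
def num_boolean_functions(n):
    # One Boolean function per truth table with 2**n rows,
    # so 2**(2**n) distinct functions in all.
    return 2 ** (2 ** n)

def num_conjunctions(n):
    # Each attribute is required-true, required-false, or ignored:
    # 3**n distinct purely conjunctive hypotheses.
    return 3 ** n
```

The gulf between 2^64 and 729 at n = 6 is exactly the selection-bias trade-off: the small space is far easier to search and to pin down with data, but most target functions lie outside it.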


Introduction: Slide 39 (CSG220: Machine Learning)

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 different possible Boolean hypotheses

How many purely conjunctive concepts over n Boolean attributes?

Each attribute can be required to be true, required to be false, or ignored, so 3^n distinct purely conjunctive hypotheses

More expressive hypothesis space
• increases chance that the target function can be expressed
• increases number of hypotheses consistent with the training set, so may get worse predictions

Introduction: Slide 40 (CSG220: Machine Learning)

Hypothesis space size (cont.)
How many distinct linear separators in n-dimensional Euclidean space?


Introduction: Slide 41 (CSG220: Machine Learning)

Hypothesis space size (cont.)
How many distinct linear separators in n-dimensional Euclidean space?
Infinitely many

Introduction: Slide 42 (CSG220: Machine Learning)

Hypothesis space size (cont.)
How many distinct linear separators in n-dimensional Euclidean space?
Infinitely many
How many distinct quadratic separators in n-dimensional Euclidean space (e.g., with quadratic curves as separators in R^2)?


Introduction: Slide 43 (CSG220: Machine Learning)

Hypothesis space size (cont.)
How many distinct linear separators in n-dimensional Euclidean space?
Infinitely many
How many distinct quadratic separators in n-dimensional Euclidean space (e.g., with quadratic curves as separators in R^2)?
Infinitely many

Introduction: Slide 44 (CSG220: Machine Learning)

Hypothesis space size (cont.)
How many distinct linear separators in n-dimensional Euclidean space?
Infinitely many
How many distinct quadratic separators in n-dimensional Euclidean space (e.g., with quadratic curves as separators in R^2)?
Infinitely many
It's clear that allowing more complex separators gives rise to a more expressive hypothesis space, but a simple count of hypotheses doesn't measure it


Introduction: Slide 45 (CSG220: Machine Learning)

Hypothesis space size (cont.)
How many distinct linear separators in n-dimensional Euclidean space?
Infinitely many
How many distinct quadratic separators in n-dimensional Euclidean space (e.g., with quadratic curves as separators in R^2)?
Infinitely many
It's clear that allowing more complex separators gives rise to a more expressive hypothesis space, but a simple count of hypotheses doesn't measure it

But there is a measure of hypothesis space size that applies even to these infinite hypothesis spaces: the Vapnik-Chervonenkis (VC) dimension

Preview of coming attractions