Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Reading: Chapter 2

STATS 202: Data mining and analysis

Sergio Bacallado
September 24, 2014



Page 2: Lecture2: Supervisedvs. unsupervised learning,bias ... › class › stats202 › content › ... · Lecture2: Supervisedvs. unsupervised learning,bias-variancetradeoff Reading:

Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix:

Variables or factors

Samples

orun

its

2 / 20

Page 3: Lecture2: Supervisedvs. unsupervised learning,bias ... › class › stats202 › content › ... · Lecture2: Supervisedvs. unsupervised learning,bias-variancetradeoff Reading:

Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix:

Variables or factors

Quantitative, eg. weight, height, number of children, ...

Samples

orun

its

2 / 20

Page 4: Lecture2: Supervisedvs. unsupervised learning,bias ... › class › stats202 › content › ... · Lecture2: Supervisedvs. unsupervised learning,bias-variancetradeoff Reading:

Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix:

Variables or factors

Qualitative, eg. college major, profession, gender, ...

Samples

orun

its

2 / 20

Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix. Our goal is to:

- Find meaningful relationships between the variables or units. Correlation analysis.
- Find low-dimensional representations of the data which make it easy to visualize the variables and units. PCA, ICA, isomap, locally linear embeddings, etc.
- Find meaningful groupings of the data. Clustering.

Unsupervised learning is also known in Statistics as exploratory data analysis.

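The first two goals above can be sketched numerically. The toy data matrix below and the use of NumPy are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix: n = 200 samples (rows), p = 3 variables (columns).
# The third variable is a noisy copy of the first, so the data are
# effectively low-dimensional.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Goal 1: relationships between variables -- correlation analysis.
corr = np.corrcoef(X, rowvar=False)
print("corr(x1, x3) =", round(corr[0, 2], 2))  # close to 1

# Goal 2: low-dimensional representation -- PCA via the SVD of the
# centered data matrix. Two components capture almost all the variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
scores = Xc @ Vt[:2].T          # 2-D representation of each sample
print("variance explained by 2 PCs:", round(explained[:2].sum(), 3))
```

Goal 3 (clustering) would group the rows of the same matrix; it is omitted here for brevity.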

Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables. In the data matrix, the rows are samples or units; some columns are input variables and one column is the output variable.

- If the output is quantitative, we say this is a regression problem.
- If the output is qualitative, we say this is a classification problem.

Supervised vs. unsupervised learning

If X is the vector of inputs for a particular sample, the output variable is modeled by:

Y = f(X) + ε,

where ε is a random error term.

Our goal is to learn the function f, using a set of training samples.
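As a concrete instance of this model, one can simulate training samples from a hand-picked f; the particular f and noise level below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # The (normally unknown) true regression function,
    # chosen here purely for illustration.
    return np.sin(2 * x) + 0.5 * x

# Training samples: y_i = f(x_i) + eps_i with random error eps_i.
n = 100
x = rng.uniform(0, 5, size=n)
eps = rng.normal(scale=0.3, size=n)
y = f(x) + eps

# The error term has mean zero, so y fluctuates around f(x).
print("mean residual:", round(np.mean(y - f(x)), 3))
```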

Supervised vs. unsupervised learning

Y = f(X) + ε

Motivations:

- Prediction: Useful when the input variable is readily available, but the output variable is not.

  Example: Predict stock prices next month using data from last year.

- Inference: A model for f can help us understand the structure of the data: which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g. linear, non-linear?

  Example: What is the influence of genetic variations on the incidence of heart disease?

Parametric and nonparametric methods

There are two kinds of supervised learning method:

- Parametric methods: We assume that f takes a specific form. For example, a linear form:

  f(X) = X1β1 + · · · + Xpβp

  with parameters β1, . . . , βp. Using the training data, we try to fit the parameters.

- Non-parametric methods: We don't make any assumptions on the form of f, but we restrict how "wiggly" or "rough" the function can be.
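A minimal sketch contrasting the two kinds of method, assuming simulated data and a hand-rolled k-nearest-neighbors fit (neither appears in the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from a non-linear f.
n = 200
x = rng.uniform(0, 5, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=n)

# Parametric: assume f(X) = beta0 + beta1 * X and fit the
# two parameters by least squares.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

def linear_fit(x0):
    return beta[0] + beta[1] * x0

# Non-parametric: k-nearest-neighbors regression. No form is assumed
# for f; the choice of k restricts how "wiggly" the fit can be.
def knn_fit(x0, k=10):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

x0 = 2.0
print("truth: ", np.sin(2 * x0))
print("linear:", linear_fit(x0))
print("knn:   ", knn_fit(x0))
```

Because the true f here is far from linear, the parametric fit is badly biased at x0 while the non-parametric fit tracks the truth.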

Parametric vs. nonparametric prediction

[Figures 2.4 and 2.5: surface plots of Income against Years of Education and Seniority, one fit by a parametric model and one by a non-parametric method.]

Parametric methods have a limit of fit quality. Non-parametric methods keep improving as we add more data to fit.

Parametric methods are often simpler to interpret.

Prediction error

Training data: (x1, y1), (x2, y2), . . . , (xn, yn)
Predicted function: f̂.

Our goal in supervised learning is to minimize the prediction error. For regression models, this is typically the Mean Squared Error:

MSE(f̂) = E(y0 − f̂(x0))².

Unfortunately, this quantity cannot be computed, because we don't know the joint distribution of (X, Y). We can compute a sample average using the training data; this is known as the training MSE:

MSE_training(f̂) = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))².

Prediction error

The main challenge of statistical learning is that a low training MSE does not imply a low MSE.

If we have test data {(x′i, y′i); i = 1, . . . , m} which were not used to fit the model, a better measure of quality for f̂ is the test MSE:

MSE_test(f̂) = (1/m) Σ_{i=1}^{m} (y′i − f̂(x′i))².
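The gap between training and test MSE can be seen in a small simulation; the data-generating f and the k-nearest-neighbors estimator below are assumptions for illustration. With k = 1 the estimator interpolates the training set, so its training MSE is zero while its test MSE is not:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(2 * x)

def simulate(n):
    x = rng.uniform(0, 5, size=n)
    return x, f(x) + rng.normal(scale=0.5, size=n)

x_train, y_train = simulate(100)
x_test, y_test = simulate(100)

def knn_predict(x0, k):
    # k-nearest-neighbors prediction, fit on the training data.
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

def mse(xs, ys, k):
    preds = np.array([knn_predict(x0, k) for x0 in xs])
    return np.mean((ys - preds) ** 2)

# With k = 1, every training point is its own nearest neighbor:
# training MSE is exactly 0, but test MSE is not.
print("k=1  train MSE:", mse(x_train, y_train, k=1))
print("k=1  test MSE: ", round(mse(x_test, y_test, k=1), 2))
print("k=10 test MSE: ", round(mse(x_test, y_test, k=10), 2))
```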

Figure 2.9

[Left panel: simulated data (circles) and fits of Y against X. Right panel: Mean Squared Error against flexibility.]

The circles are simulated data from the black curve. In this artificial example, we know what f is.

Three estimates f̂ are shown:

1. Linear regression.
2. Splines (very smooth).
3. Splines (quite rough).

Red line: Test MSE. Gray line: Training MSE.

Figure 2.10

[Left panel: simulated data and fits of Y against X. Right panel: Mean Squared Error against flexibility.]

The function f is now almost linear.

Figure 2.11

[Left panel: simulated data and fits of Y against X. Right panel: Mean Squared Error against flexibility.]

When the noise ε has small variance, the third method does well.

The bias-variance decomposition

Let x0 be a fixed test point, y0 = f(x0) + ε0, and let f̂ be estimated from n training samples (x1, y1), . . . , (xn, yn).

Let E denote the expectation over y0 and the training outputs (y1, . . . , yn). Then, the Mean Squared Error at x0 can be decomposed:

MSE(x0) = E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε0).

- Var(ε0) is the irreducible error.
- Var(f̂(x0)) = E[f̂(x0) − E(f̂(x0))]² is the variance of the estimate of Y. This measures how much the estimate of f at x0 changes when we sample new training data.
- [Bias(f̂(x0))]² = [E(f̂(x0)) − f(x0)]² is the squared bias of the estimate of Y. This measures the deviation of the average prediction E(f̂(x0)) from the truth f(x0).
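The decomposition can be verified by Monte Carlo: fix x0, repeatedly redraw the training set, and compare the empirical MSE with Var + Bias² + Var(ε0). The quadratic f and the deliberately misspecified linear estimator below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return x ** 2

x0, sigma = 0.9, 0.5      # fixed test point and noise standard deviation
n, reps = 30, 5000        # training size and Monte Carlo replications

preds = np.empty(reps)
errors = np.empty(reps)
for r in range(reps):
    # Fresh training set on each replication.
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    # A deliberately biased estimator: a straight-line fit to a quadratic f.
    beta = np.polyfit(x, y, deg=1)
    fhat_x0 = np.polyval(beta, x0)
    y0 = f(x0) + rng.normal(scale=sigma)
    preds[r] = fhat_x0
    errors[r] = (y0 - fhat_x0) ** 2

mse = errors.mean()                       # empirical MSE(x0)
variance = preds.var()                    # Var(fhat(x0))
bias2 = (preds.mean() - f(x0)) ** 2       # [Bias(fhat(x0))]^2
print("MSE                 :", round(mse, 3))
print("Var + Bias^2 + Var_e:", round(variance + bias2 + sigma**2, 3))
```

Up to Monte Carlo error, the two printed numbers agree, and the bias term is nonzero because the linear form cannot match the quadratic truth.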


Implications of the bias-variance decomposition

MSE(x0) = E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε0).

- The MSE is always positive.
- Each element on the right hand side is always positive.
- Therefore, typically when we decrease the bias beyond some point, we increase the variance, and vice-versa.

More flexibility ⇐⇒ Higher variance ⇐⇒ Lower bias.

[Figure 2.12: MSE, squared bias, and variance against flexibility in three scenarios. Panels: squiggly f, high noise; linear f, high noise; squiggly f, low noise.]

Classification problems

In a classification setting, the output takes values in a discrete set.

For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, . . . }.

The model Y = f(X) + ε becomes insufficient, as Y is not necessarily real-valued.

We will use slightly different notation:

P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷi: prediction for xi.

Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss:

E(1(y0 ≠ ŷ0))

Like the MSE, this quantity can be estimated from training and test data by taking a sample average:

(1/n) Σ_{i=1}^{n} 1(yi ≠ ŷi)
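The sample average of the 0-1 loss is just the misclassification rate. A minimal sketch, reusing the car-brand example (the particular labels below are made up for illustration):

```python
import numpy as np

y_true = np.array(["Ford", "Toyota", "Ford", "Mercedes-Benz", "Toyota"])
y_pred = np.array(["Ford", "Ford",   "Ford", "Mercedes-Benz", "Toyota"])

# Sample average of the 0-1 loss: the fraction of misclassified samples.
error_rate = np.mean(y_true != y_pred)
print(error_rate)  # 0.2 (one mistake out of five)
```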

Bayes classifier

[Figure 2.13: simulated data in two classes, plotted against X1 and X2.]

In practice, we never know the joint probability P. However, we can assume that it exists.

The Bayes classifier assigns:

ŷi = argmax_j P(Y = j | X = xi)

It can be shown that this is the best classifier under the 0-1 loss.
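A sketch of the Bayes classifier in a toy setting where, unlike in practice, the joint distribution is known. The Gaussian class-conditionals and equal priors below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy joint distribution: P(Y=0) = P(Y=1) = 1/2,
# and X | Y = j is Normal(mu_j, 1).
mu = {0: -1.0, 1: 1.0}

def pdf(x, m):
    # Density of Normal(m, 1) at x.
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

def bayes_classifier(x):
    # y_hat = argmax_j P(Y = j | X = x); with equal priors this
    # reduces to argmax_j over the class-conditional densities.
    return np.where(pdf(x, mu[1]) > pdf(x, mu[0]), 1, 0)

# Simulate test data and measure the 0-1 error of the Bayes rule.
n = 100_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(y == 1, mu[1], mu[0]), scale=1.0)
error = np.mean(bayes_classifier(x) != y)
print("empirical Bayes error:", round(error, 3))  # about 0.159
```

No classifier built from training data can beat this error rate in expectation; in this Gaussian setting the Bayes error equals P(Z > 1) ≈ 0.159 for a standard normal Z.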