Lecture 3 – Overview of Supervised Learning
Rice ELEC 697, Farinaz Koushanfar, Fall 2006

TRANSCRIPT

Page 1: Lecture 3 – Overview of Supervised Learning

Rice ELEC 697

Farinaz Koushanfar

Fall 2006

Page 2: Summary

• Variable types and terminology
• Two simple approaches to prediction
  – Linear model and least squares
  – Nearest neighbor methods
• Statistical decision theory
• Curse of dimensionality
• Structured regression models
• Classes of restricted estimators
• Reading: Chapter 2, ESL

Page 3: Variable Types and Terminology

• Input/output variables
  – Quantitative
  – Qualitative (categorical, discrete, factors)
  – Ordered categorical
• Regression: quantitative output
• Classification: qualitative output (numeric code)
• Terminology:
  – X: input, Y: regression output, G: classification output
  – x_i: the i-th observed value of X (either a scalar or a vector)
  – Ŷ: prediction of Y, Ĝ: prediction of G

Page 4: Two Approaches to Prediction

(1) Linear model (OLS)

• Given an input vector X = (X_1, ..., X_p), predict Y via

  Ŷ = β̂_0 + Σ_{j=1..p} X_j β̂_j,  or, with the intercept absorbed into X:  Ŷ = X^T β̂

• Least squares method: choose β to minimize

  RSS(β) = Σ_{i=1..N} (y_i − x_i^T β)²

• Differentiating w.r.t. β: X^T (y − Xβ) = 0
• For X^T X nonsingular: β̂ = (X^T X)^{-1} X^T y

(2) Nearest Neighbor (NN)

• The k-NN estimate Ŷ is

  Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
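As a concrete illustration of these two predictors, here is a minimal NumPy sketch (not from the lecture; the function names and the `X_train`/`y_train` arrays are placeholders):

```python
import numpy as np

def ols_fit(X, y):
    # Prepend an intercept column and solve the least squares problem
    # beta_hat = (X^T X)^{-1} X^T y (lstsq is used for numerical stability).
    Xb = np.column_stack([np.ones(len(X)), X])
    beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta_hat

def ols_predict(beta_hat, X):
    # Y_hat = X^T beta_hat, with the intercept absorbed into X.
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ beta_hat

def knn_predict(X_train, y_train, x0, k):
    # Y_hat(x0) = average of y_i over the k nearest training points N_k(x0).
    nearest = np.argsort(np.linalg.norm(X_train - x0, axis=1))[:k]
    return y_train[nearest].mean()
```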

Page 5: Example

Page 6: Example – Linear Model

• Example: the output G is either GREEN or RED

• The linear model fit produces a linear decision boundary separating the two classes

• Possible data scenarios:

  1. Each class is Gaussian, with uncorrelated components, equal variance, and a different mean

  2. Each class is a mixture of 10 different Gaussians

Page 7: Example – 15-Nearest Neighbor

• The k-NN estimate Ŷ is

  Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

• N_k(x) is the neighborhood of x consisting of the k closest training points

• The classification rule is a majority vote among the points in N_k(x)
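For the classification version used here, the k-NN fit becomes a majority vote over the neighbors' labels; a short sketch under the same placeholder naming as above:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, g_train, x0, k=15):
    # Majority vote among the labels of the k training points nearest to x0.
    nearest = np.argsort(np.linalg.norm(X_train - x0, axis=1))[:k]
    return Counter(g_train[i] for i in nearest).most_common(1)[0][0]
```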

Page 8: Example – 1-Nearest Neighbor

• With 1-NN classification, no training points are misclassified
• OLS had 3 parameters; does k-NN have only one (namely k)?
• In fact, k-NN uses roughly N/k effective parameters

Page 9: Example – Data Scenario

• Data scenario in the previous example:
  – Density for each class: a mixture of 10 Gaussians
  – GREEN class: 10 component means drawn from N((1,0)^T, I)
  – RED class: 10 component means drawn from N((0,1)^T, I)
  – The variance of each component was 0.2 for both classes

• See the book website for the actual data

• The Bayes error rate is the best achievable performance
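A sketch of how data like this could be generated, following the recipe above (the 100 points per class and the NumPy generator are this sketch's choices; the component means and the 0.2 variance are as stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class(mean_center, n_points=100, n_components=10, var=0.2):
    # Draw 10 component means from N(mean_center, I); then draw each observation
    # from a randomly chosen component with variance 0.2 per coordinate.
    means = rng.multivariate_normal(mean_center, np.eye(2), size=n_components)
    picks = rng.integers(n_components, size=n_points)
    X = means[picks] + rng.normal(scale=np.sqrt(var), size=(n_points, 2))
    return X, means

X_green, means_green = make_class([1.0, 0.0])   # GREEN class
X_red,   means_red   = make_class([0.0, 1.0])   # RED class
```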

Page 10: From OLS to NN…

• Many modern modeling procedures are variants of OLS or k-NN:

– Kernel smoothers

– Local linear regression

– Local basis expansion

– Projection pursuit and neural networks

Page 11: Statistical Decision Theory

• Case 1 – quantitative output Y
• X ∈ R^p: a real-valued random input vector
• L(Y, f(X)): loss function for penalizing prediction errors
• The most common form of L is squared-error loss:

L(Y, f(X)) = (Y − f(X))²

• Criterion for choosing f: EPE(f) = E(Y − f(X))²

• Conditioning on X and minimizing EPE pointwise gives the solution f(x) = E(Y | X = x)

• This is also known as the regression function

Page 12: Statistical Decision Theory (Cont’d)

• Case 2 – qualitative output G
• Prediction rule is Ĝ(X), where G and Ĝ(X) take values in the set 𝒢, with |𝒢| = K
• L(k, l): the loss for classifying an observation from class 𝒢_k as 𝒢_l
• A single unit charged per misclassification: 0–1 loss
• The expected prediction error is:

EPE = E[L(G, Ĝ(X))]

• The solution is:

Ĝ(x) = argmin_{g ∈ 𝒢} Σ_{k=1..K} L(𝒢_k, g) Pr(𝒢_k | X = x)

Page 13: Statistical Decision Theory (Cont’d)

• Case 2 – qualitative output G (cont’d)
• With the 0–1 loss function, the solution is:

Ĝ(x) = argmin_{g ∈ 𝒢} [1 − Pr(g | X = x)]

• Or, simply:

Ĝ(x) = 𝒢_k if Pr(𝒢_k | X = x) = max_{g ∈ 𝒢} Pr(g | X = x)

• This is the Bayes classifier: pick the class having the maximum probability at point x

Page 14: Further Discussions

• k-NN estimates the conditional expectation directly by:
  – Approximating the expectation by a sample average
  – Relaxing conditioning at a point to conditioning on a region around the point

• As N, k → ∞ such that k/N → 0, the k-NN estimate f̂(x) → E(Y | X = x), and is therefore consistent!

• OLS assumes a linear structural form f(X) ≈ X^T β and minimizes the sample version of EPE under that model

• As the sample size grows, the coefficient estimate converges to the optimal linear coefficients: β_opt = [E(X X^T)]^{-1} E(X Y)

• The model is limited by the linearity assumption

Page 15: Example – Bayes Classifier

• Question: how did we build the classifier for our simulation example?
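Because the generating densities in the simulation are known, the Bayes classifier can be evaluated directly: each class density is an equal-weight mixture of its 10 Gaussian components, and with equal class priors the rule picks the class whose density is larger at x. A hedged sketch (`means_green`/`means_red` are the component means, e.g. as returned by the data-generation sketch earlier):

```python
import numpy as np

def mixture_density(x, means, var=0.2):
    # Equal-weight mixture of 2-D isotropic Gaussians with per-coordinate variance `var`.
    d2 = np.sum((means - np.asarray(x)) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2 * var)) / (2 * np.pi * var))

def bayes_classify(x, means_green, means_red):
    # With equal class priors, the posterior is proportional to the class density,
    # so pick the class whose mixture density is larger at x.
    return "GREEN" if mixture_density(x, means_green) > mixture_density(x, means_red) else "RED"
```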

Page 16: Curse of Dimensionality

• k-NN becomes difficult in higher dimensions:
  – It becomes difficult to gather k points close to x_0
  – Neighborhoods become spatially large and the estimates are biased
  – Reducing the spatial size of the neighborhood means reducing k, and then the variance of the estimate increases

Page 17: Example 1 – Curse of Dimensionality

• Sampling density is proportional to N^{1/p}

• If 100 points are sufficient to estimate a function in one dimension, 100^10 points are needed for the same accuracy in 10 dimensions

Example 1:
• 1000 training points x_i, generated uniformly on [−1, 1]^p
• Y = f(X) = e^{−8||X||²} (no measurement error)
• Training set: T; use 1-NN to predict y_0 at the point x_0 = 0

• The mean squared error (MSE) for estimating f(0) is:

MSE(x_0) = E_T[f(x_0) − ŷ_0]²
         = E_T[ŷ_0 − E_T(ŷ_0)]² + [E_T(ŷ_0) − f(x_0)]²
         = Var_T(ŷ_0) + Bias²(ŷ_0)
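A hedged simulation sketch of this example: estimate the bias and variance of the 1-NN prediction ŷ_0 at x_0 = 0 over repeated training sets as the dimension p grows (the number of repetitions and the grid of dimensions are arbitrary choices of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda X: np.exp(-8.0 * np.sum(X ** 2, axis=-1))   # f(x) = exp(-8 ||x||^2)

def one_nn_at_origin(p, n=1000):
    # One training set of n points uniform on [-1, 1]^p; the 1-NN prediction at
    # x0 = 0 is simply y at the training point closest to the origin.
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    y = f(X)
    return y[np.argmin(np.sum(X ** 2, axis=1))]

for p in (1, 2, 5, 10):
    preds = np.array([one_nn_at_origin(p) for _ in range(200)])
    bias2 = (preds.mean() - 1.0) ** 2      # f(0) = 1
    print(f"p={p:2d}  bias^2={bias2:.4f}  var={preds.var():.4f}")
```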

Page 18: Example 1 – Curse of Dimensionality

• Bias-variance decomposition: MSE(x_0) = Var_T(ŷ_0) + Bias²(ŷ_0)

Page 19: Example 2 – Curse of Dimensionality

• If the linear model is correct, or almost correct, NN will do much worse than OLS

• Assuming we know this is the case, simple OLS is hardly affected by the dimension

Page 20: Statistical Models

• Y = f(X) + ε (X and ε independent)
• The random additive error ε has E(ε) = 0
• Pr(Y|X) depends on X only through the conditional mean f(x) = E(Y | X = x)
• An approximation to the truth; all unmeasured variables are absorbed into ε
• N realizations: y_i = f(x_i) + ε_i (ε_i and ε_j independent)
• Generally things can be more complicated, e.g. Var(Y|X) = σ²(X)

• Additive error models are typically not used with qualitative responses
• E.g. for binary trials, E(Y | X = x) = p(x) and Var(Y | X = x) = p(x)[1 − p(x)]
• For qualitative outputs, model the conditional probabilities directly:

Pr(G = 𝒢_k | X = x) = p_k(X), k = 1, ..., K
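A tiny illustration of the additive-error model for generating regression data (the particular f and σ are arbitrary choices of this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 1.0 + 2.0 * x                     # an arbitrary "true" f
x = rng.uniform(0.0, 1.0, size=50)
y = f(x) + rng.normal(scale=0.5, size=50)       # y_i = f(x_i) + eps_i, with E(eps) = 0
```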

Page 21: Function Approximation

• The approximation f_θ(x) has a set of parameters θ
• E.g. f_θ(x) = x^T θ (θ = β), or a linear basis expansion f_θ(x) = Σ_k h_k(x) θ_k

• Estimate θ by minimizing RSS(θ) = Σ_i (y_i − f_θ(x_i))²

• This assumes a parametric form for f and a squared-error loss
• A more general principle: Maximum Likelihood (ML)
• E.g. for a random sample y_i, i = 1, ..., N, from a density Pr_θ(y), the log-probability of the sample is:

L(θ) = Σ_{i=1..N} log Pr_θ(y_i)

• E.g. the multinomial likelihood for a qualitative output G with Pr(G = 𝒢_k | X = x) = p_{k,θ}(x):

L(θ) = Σ_{i=1..N} log p_{g_i,θ}(x_i)
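A short sketch of least-squares function approximation with a linear basis expansion (the polynomial basis h_k(x) = x^k is this sketch's choice; the slide does not fix a particular basis):

```python
import numpy as np

def fit_basis_expansion(x, y, degree=3):
    # Basis h_k(x) = x^k, k = 0..degree; minimize
    # RSS(theta) = sum_i (y_i - sum_k h_k(x_i) theta_k)^2 by least squares.
    H = np.vander(x, N=degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta

def predict_basis_expansion(theta, x):
    H = np.vander(x, N=len(theta), increasing=True)
    return H @ theta
```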

Page 22: Example – Least Squares Function Approximation

Page 23: Structured Regression Models

• Any function f passing through all the training points (x_i, y_i) has

  RSS(f) = Σ_{i=1..N} (y_i − f(x_i))² = 0

• Need to restrict the class of eligible functions
• Usually the restrictions impose some form of local behavior
• Any method that attempts to approximate a locally varying function in high dimensions is “cursed”
• Any method that overcomes the curse assumes an implicit metric that does not allow the neighborhood to be simultaneously small in all directions

Page 24: Classes of Restricted Estimators

Some of the classes of restricted methods that we cover:

• Roughness penalty and Bayesian methods

PRSS(f; λ) = RSS(f) + λ J(f)

• Kernel methods and local regression

RSS(f_θ, x_0) = Σ_{i=1..N} K_λ(x_0, x_i) (y_i − f_θ(x_i))²

– E.g. for k-NN, K_k(x, x_0) = I(||x − x_0|| ≤ ||x_(k) − x_0||), where x_(k) is the k-th closest training point to x_0
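One simple instance of the kernel methods named above is a kernel-weighted local average (a Nadaraya–Watson-style fit of a constant at x_0); a hedged one-dimensional sketch with a Gaussian kernel and an arbitrary bandwidth:

```python
import numpy as np

def kernel_smooth(x_train, y_train, x0, bandwidth=0.5):
    # Weight each training response by K_lambda(x0, x_i) = exp(-(x_i - x0)^2 / (2*lambda^2))
    # and return the weighted average: a locally weighted fit of a constant at x0.
    w = np.exp(-((x_train - x0) ** 2) / (2 * bandwidth ** 2))
    return np.sum(w * y_train) / np.sum(w)
```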

Page 25: Model Selection and Bias-Variance Trade-offs

• Many of the flexible methods have a smoothing or complexity parameter:
  – The multiplier of the penalty term
  – The width of the kernel
  – Or the number of basis functions

• We cannot use RSS on the training data to determine this parameter – it would always pick a model that interpolates the training data

• Use prediction error on unseen test cases to guide the choice

• Generally, as model complexity increases, the variance increases and the squared bias decreases (and vice versa)

• Choose the model complexity to trade off bias against variance so as to minimize the test error
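A hedged sketch of selecting the complexity parameter (here k in k-NN regression) by prediction error on held-out data rather than training RSS (the candidate grid and the validation split are arbitrary choices):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    # k-NN regression estimate: average y over the k nearest training points.
    nearest = np.argsort(np.linalg.norm(X_train - x0, axis=1))[:k]
    return y_train[nearest].mean()

def choose_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 15, 25, 51)):
    # Pick the k with the smallest squared prediction error on unseen validation
    # points; training RSS would always favor the most complex fit (small k).
    errors = {k: np.mean([(knn_predict(X_train, y_train, x0, k) - y0) ** 2
                          for x0, y0 in zip(X_val, y_val)])
              for k in candidates}
    return min(errors, key=errors.get), errors
```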

Page 26: Example – Bias-Variance Trade-offs

• k-NN on data Y = f(X) + ε, with E(ε) = 0, Var(ε) = σ²

• For nonrandom (fixed) samples x_i, the test (generalization) error at x_0 is:

EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
           = σ² + [Bias²(f̂_k(x_0)) + Var_T(f̂_k(x_0))]
           = σ² + [f(x_0) − (1/k) Σ_{l=1..k} f(x_(l))]² + σ²/k

  where x_(l) denotes the l-th nearest neighbor of x_0
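A simulation sketch of this decomposition: fix the design, repeatedly redraw the noise, recompute the k-NN fit at x_0, and compare the empirical squared bias and variance with the formula above (the particular f, σ, and design are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * x)            # illustrative true function
sigma = 0.3
x_train = np.linspace(-1, 1, 201)      # fixed (nonrandom) design
x0 = 0.0

for k in (1, 5, 25, 75):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    preds = []
    for _ in range(2000):                       # redraw the noise, refit
        y = f(x_train) + rng.normal(scale=sigma, size=x_train.shape)
        preds.append(y[nearest].mean())         # k-NN estimate at x0
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    print(f"k={k:3d}  bias^2={bias2:.4f}  var={preds.var():.4f}  sigma^2/k={sigma**2 / k:.4f}")
```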

Page 27: Bias-Variance Trade-offs

• More generally, as the model complexity of our procedure increases, the variance tends to increase and the squared bias tends to decrease

• The opposite behavior occurs as model complexity is decreased

• In k-NN, the model complexity is controlled by k (smaller k means higher complexity)
• Choose your model complexity to trade off variance against bias