Elements of Statistical Learning Reading Group, Chapter 2


Page 1: Elements of Statistical Learning Reading Group, Chapter 2

The Elements of Statistical Learning, Ch.2: Overview of Supervised Learning (4/13/2017, 坂間 毅)

Page 2: Elements of Statistical Learning Reading Group, Chapter 2

• Supervised learning
 • Predict outputs from inputs

• Other names for the inputs
 • Predictors
 • Independent variables
 • Features

• Other names for the outputs
 • Responses
 • Dependent variables

2.1 Introduction

Page 3: Elements of Statistical Learning Reading Group, Chapter 2

• Outputs
 1. Quantitative variables
  • Continuous values, e.g. atmospheric measurements
  • Quantitative prediction = regression
 2. Qualitative variables
  • Also called categorical or discrete variables
  • Values from a finite set, e.g. species of iris
  • Qualitative prediction = classification

• Types of inputs
 1. Quantitative variables
 2. Qualitative variables
 3. Ordered categorical variables (e.g. small, medium, large)

Note: Are interval and ratio scales both lumped into "quantitative variables" here?

2.2 Variable Types and Terminology

Page 4: Elements of Statistical Learning Reading Group, Chapter 2

• Notation
 • Input
  • Vector: $X$
  • Component of the vector: $X_j$
  • i-th observation: $x_i$ (lowercase)
  • Matrix of all observations: $\mathbf{X}$ ($N \times p$, bold)
  • All N observations on the j-th variable: $\mathbf{x}_j$ (bold)

 • Output
  • Quantitative output: $Y$
  • Prediction of $Y$: $\hat{Y}$
  • Qualitative output: $G$
  • Prediction of $G$: $\hat{G}$

2.2 Variable Types and Terminology (contd.)

Page 5: Elements of Statistical Learning Reading Group, Chapter 2

• Linear model: $\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$
 • With the bias (intercept) term folded into the coefficient vector, $\hat{Y} = X^T \hat\beta$

• Most popular fitting method: least squares

 $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$

(RSS: residual sum of squared errors)

• Differentiating RSS w.r.t. $\beta$ and setting it to 0: $\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0$

• If $\mathbf{X}^T \mathbf{X}$ is nonsingular (regular, i.e. invertible), the inverse exists and $\hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$

2.3.1 Linear Models and Least Squares
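
A minimal Python/NumPy sketch of the least-squares fit above; the function names and the use of np.linalg.lstsq (instead of forming the inverse explicitly) are my own choices, not from the slides.

import numpy as np

# Least-squares fit via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
# X is an N x p input matrix, y an N-vector of responses.
def fit_least_squares(X, y):
    Xb = np.column_stack([np.ones(len(X)), X])         # prepend a 1s column: intercept lives in beta
    beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # numerically stable least-squares solve
    return beta_hat

def predict(beta_hat, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ beta_hat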

Page 6: Elements of Statistical Learning Reading Group, Chapter 2

• Linear model (classification)
 • Fit $\hat{Y}$ to a 0/1-coded response and assign class 1 when $\hat{Y} > 0.5$
 • The two classes are separated by the linear decision boundary $\{x : x^T \hat\beta = 0.5\}$

• Two scenarios for generating the 2-class data
 1. Each class is generated from a bivariate Gaussian with uncorrelated components and a different mean
   ⇒ a linear decision boundary is best (see Chapter 4)
 2. Each class is a mixture of 10 low-variance Gaussians, whose means are themselves drawn from a Gaussian (see the sketch after this slide)
   ⇒ a nonlinear decision boundary is best (the example in this chapter is of this kind)

2.3.1 Linear Models and Least Squares (contd.)
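
A sketch (my own code, with illustrative parameters) of generating data as in scenario 2: draw 10 class means from a Gaussian, then draw each observation from a low-variance Gaussian centered at one of those means.

import numpy as np

rng = np.random.default_rng(0)

# Scenario 2: a mixture of 10 tight Gaussians per class; the class means are
# themselves Gaussian. The centers and the spread value are illustrative choices.
def make_mixture_class(center, n=100, n_means=10, spread=0.2):
    means = rng.normal(loc=center, scale=1.0, size=(n_means, 2))  # 10 cluster means
    picks = rng.integers(0, n_means, size=n)                      # pick a mean per observation
    return means[picks] + rng.normal(scale=spread, size=(n, 2))   # low-variance noise around it

X0 = make_mixture_class(center=(1.0, 0.0))   # class 0
X1 = make_mixture_class(center=(0.0, 1.0))   # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(len(X0)), np.ones(len(X1))])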

Page 7: Elements of Statistical Learning Reading Group, Chapter 2

• k-nearest neighbors: $\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$
 • $N_k(x)$ is the set of the k (Euclidean-)closest points to $x$ in the training set

• $k = 1$: Voronoi tessellation (each training point owns the region of points closest to it)

• Notice
 • Effective number of parameters of k-NN = N/k
 • "we will see" (the book returns to this point later)
 • RSS on the training data is useless for choosing k
  • With $k = 1$ the training data are classified with zero error, so $k = 1$ always attains the smallest RSS

2.3.2 Nearest-Neighbor Methods
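
A minimal k-NN sketch in Python/NumPy; the function name and the 0.5 threshold for a two-class decision are my own illustrative choices.

import numpy as np

# k-nearest-neighbor prediction: average the responses of the k training
# points closest (in Euclidean distance) to the query point x.
def knn_predict(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
    nearest = np.argsort(dists)[:k]               # indices of N_k(x)
    y_hat = y_train[nearest].mean()               # averaged response
    return y_hat, int(y_hat > 0.5)                # regression value and 2-class decision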

Page 8: Elements of Statistical Learning Reading Group, Chapter 2

• Today's popular techniques are variants of the linear model or of k-nearest neighbors (or of both)

2.3.3 From Least Squares to Nearest Neighbors

                       Variance   Bias
  Linear model         low        high
  k-Nearest neighbors  high       low

Page 9: Elements of Statistical Learning Reading Group, Chapter 2

• Theoretical framework
 • Joint distribution $\Pr(X, Y)$

 • Squared error loss function: $L(Y, f(X)) = (Y - f(X))^2$

 • Expected (squared) prediction error: $\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2 = \int [y - f(x)]^2 \Pr(dx, dy)$

 • By conditioning on $X$: $\mathrm{EPE}(f) = \mathrm{E}_X \, \mathrm{E}_{Y|X}\big([Y - f(X)]^2 \mid X\big)$

2.4 Statistical Decision Theory

Page 10: Elements of Statistical Learning Reading Group, Chapter 2

• The minimizer $f(x) = \mathrm{E}(Y \mid X = x)$ is the regression function
 • The best prediction of $Y$ at any point $X = x$ is the conditional mean,

when "best" is measured by average squared error (a short worked derivation follows this slide).

2.4 Statistical Decision Theory (contd.)
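
The pointwise argument behind the arrows on this slide, written out as the standard derivation:

\begin{align*}
\mathrm{EPE}(f) &= \mathrm{E}_X \, \mathrm{E}_{Y|X}\big([Y - f(X)]^2 \mid X\big) \\
f(x) &= \operatorname*{argmin}_c \; \mathrm{E}_{Y|X}\big([Y - c]^2 \mid X = x\big) \\
0 &= \frac{\partial}{\partial c}\,\mathrm{E}_{Y|X}\big[(Y - c)^2 \mid X = x\big] = -2\big(\mathrm{E}(Y \mid X = x) - c\big) \\
\Rightarrow \; f(x) &= \mathrm{E}(Y \mid X = x).
\end{align*}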

Page 11: Elements of Statistical Learning Reading Group, Chapter 2

• How to estimate the conditional mean
 • k-nearest neighbors: $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$

 • Two approximations:
  • expectation is approximated by averaging over sample data
  • conditioning at a point is relaxed to conditioning on a neighborhood of the point

 • Under mild regularity conditions on $\Pr(X, Y)$,
  • if $N, k \to \infty$ with $k/N \to 0$, then $\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$
  • however, in high dimensions the curse of dimensionality becomes severe

2.4 Statistical Decision Theory (contd.)

Page 12: Elements of Statistical Learning Reading Group, Chapter 2

• How to estimate the conditional mean
 • Linear regression: assume $f(x) \approx x^T \beta$ (or should this be read as the model $\mathrm{E}(Y \mid X) = X^T \beta$?)
 • Then, plugging this form into the EPE and solving,

  $\beta = \big[\mathrm{E}(X X^T)\big]^{-1} \mathrm{E}(X Y)$

 • This is not conditioned on $X$: the assumed functional form lets us pool over all values of $X$; least squares replaces the expectations by averages over the training data.

 • If the $L_1$ loss $\mathrm{E}\,|Y - f(X)|$ is used instead, the solution is the conditional median, $\hat{f}(x) = \mathrm{median}(Y \mid X = x)$.

2.4 Statistical Decision Theory (contd.)
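
For reference, the population solution above next to its sample (plug-in) counterpart, which is exactly the least-squares estimate from 2.3.1:

\begin{align*}
\beta &= \big[\mathrm{E}(X X^T)\big]^{-1} \mathrm{E}(X Y),
&
\hat\beta &= \Big(\sum_{i=1}^{N} x_i x_i^T\Big)^{-1} \sum_{i=1}^{N} x_i y_i
          = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
\end{align*}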

Page 13: Elements of Statistical Learning Reading Group, Chapter 2

• In classification
 • The loss function is represented by a $K \times K$ matrix $\mathbf{L}$, zero on the diagonal and nonnegative elsewhere
  • $L(k, \ell)$ is the price paid for classifying an observation from class $\mathcal{G}_k$ as $\mathcal{G}_\ell$
  • With zero-one loss, every misclassification is charged exactly one unit

 • The expected prediction error: $\mathrm{EPE} = \mathrm{E}\big[L(G, \hat{G}(X))\big] = \mathrm{E}_X \sum_{k=1}^{K} L(\mathcal{G}_k, \hat{G}(X)) \, \Pr(\mathcal{G}_k \mid X)$

2.4 Statistical Decision Theory (contd.)

Page 14: Elements of Statistical Learning Reading Group, Chapter 2

• In classification

 • The pointwise minimizer (at a point $X = x$) is the Bayes classifier:

  $\hat{G}(x) = \mathcal{G}_k$ if $\Pr(\mathcal{G}_k \mid X = x) = \max_{g} \Pr(g \mid X = x)$

 • This classifies to the most probable class, using the conditional distribution $\Pr(G \mid X)$.

 • Many approaches to modeling $\Pr(G \mid X)$ are discussed in Ch.4.

2.4 Statistical Decision Theory (contd.)
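
A minimal sketch of a Bayes classifier when the class-conditional densities are known; the two spherical Gaussians and equal priors below are illustrative assumptions, not taken from the slides.

import numpy as np

# Density of a spherical Gaussian N(mean, var * I), used as a known
# class-conditional density Pr(X | G = k).
def gaussian_density(x, mean, var=1.0):
    d = len(mean)
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return np.exp(-diff @ diff / (2 * var)) / (2 * np.pi * var) ** (d / 2)

# Bayes classifier: pick the class with the larger posterior Pr(G = k | X = x).
def bayes_classify(x, prior1=0.5):
    p0 = (1 - prior1) * gaussian_density(x, mean=[0.0, 0.0])
    p1 = prior1 * gaussian_density(x, mean=[1.0, 1.0])
    return int(p1 > p0)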

Page 15: Elements of Statistical Learning Reading Group, Chapter 2

• The curse of dimensionality
 1. To capture 10% of the data in a hypercubical neighborhood in 10 dimensions, the expected edge length is $e_{10}(0.1) = 0.1^{1/10} \approx 0.80$, i.e. 80% of the range of each input

 2. Consider a nearest-neighbor estimate at the origin, with $N$ data points uniformly distributed in a $p$-dimensional unit ball

  • The median distance from the origin to the closest data point is $d(p, N) = \big(1 - \tfrac{1}{2}^{1/N}\big)^{1/p}$

  • If $N = 500$ and $p = 10$, then $d(p, N) \approx 0.52$: the nearest point is typically more than halfway to the boundary, and most data points are closer to the boundary of the sample space than to any other data point

2.5 Local Methods in High Dimensions
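
A quick numerical check of the two formulas above (my own throwaway script):

# Edge length of a hypercube capturing a fraction r of uniformly distributed
# data in p dimensions, and the median distance from the origin to the nearest
# of N points uniformly distributed in the p-dimensional unit ball.
def edge_length(r, p):
    return r ** (1.0 / p)

def median_nn_distance(p, N):
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.1, 10))         # ~0.80
print(median_nn_distance(10, 500))  # ~0.52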

Page 16: Elements of Statistical Learning Reading Group, Chapter 2

• The curse of dimensionality
 3. The sampling density is proportional to $N^{1/p}$
  • Data are sparse in high dimensions: matching the density of $N_1 = 100$ points in one dimension requires $N_{10} = 100^{10}$ points in ten dimensions

 4. 1000 training examples $x_i$ generated uniformly on $[-1, 1]^p$ (a small simulation sketch follows this slide)
  • Assume the true relationship is $Y = f(X) = e^{-8\|X\|^2}$, with no measurement error
  • Use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$
  • As the dimension $p$ increases, the nearest neighbor tends to lie farther from the target point, and the estimate is biased toward 0

2.5 Local Methods in High Dimensions (contd.)
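
A small simulation of item 4 (my own code; the sample size and target function are the slide's, the seed is arbitrary), showing how far the nearest neighbor of the origin drifts as p grows:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed target function from the slide: f(x) = exp(-8 * ||x||^2).
    return np.exp(-8.0 * np.sum(np.square(x), axis=-1))

def one_nn_at_origin(p, n=1000):
    X = rng.uniform(-1.0, 1.0, size=(n, p))      # uniform training inputs on [-1, 1]^p
    i = np.argmin(np.linalg.norm(X, axis=1))     # index of the nearest neighbor of x0 = 0
    return np.linalg.norm(X[i]), f(X[i])         # its distance, and the 1-NN prediction

for p in (1, 2, 5, 10):
    dist, y_hat = one_nn_at_origin(p)
    print(p, round(dist, 3), round(y_hat, 3))    # the true value at the origin is f(0) = 1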

Page 17: Elements of Statistical Learning Reading Group, Chapter 2

• The curse of dimensionality
 5. In the linear model $Y = X^T \beta + \varepsilon$, with $\mathrm{Var}(\varepsilon) = \sigma^2$,

  • For an arbitrary test point $x_0$, $\mathrm{EPE}(x_0) = \sigma^2 + \mathrm{E}_{\mathcal{T}}\big[x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0\big]\sigma^2 + 0^2$ (the fit is unbiased; the extra variance comes from estimating $\beta$)

  • If $N$ is large, $\mathcal{T}$ was selected at random, and $\mathrm{E}(X) = 0$, then $\mathbf{X}^T \mathbf{X} \approx N \,\mathrm{Cov}(X)$ and $\mathrm{E}_{x_0} \mathrm{EPE}(x_0) \approx \sigma^2 \, p/N + \sigma^2$

  • So the expected EPE grows only linearly in $p$, with slope $\sigma^2 / N$; if $N$ is large or $\sigma^2$ is small, this growth is negligible

  ⇒ Under this restriction to a (correct) linear model, the curse of dimensionality is avoided

2.5 Local Methods in High Dimensions (contd.)
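
The decomposition behind item 5, written out (the standard bias-variance argument for a correctly specified linear model):

\begin{align*}
\mathrm{EPE}(x_0) &= \mathrm{E}_{y_0 \mid x_0} \mathrm{E}_{\mathcal{T}} \,(y_0 - \hat{y}_0)^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + \mathrm{Var}_{\mathcal{T}}(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0) \\
&= \sigma^2 + \mathrm{E}_{\mathcal{T}} \, x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0 \, \sigma^2 + 0^2, \\
\mathrm{E}_{x_0} \mathrm{EPE}(x_0) &\approx \sigma^2 \frac{p}{N} + \sigma^2
\qquad (N \text{ large},\; \mathrm{E}(X) = 0).
\end{align*}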

Page 18: Elements of Statistical Learning Reading Group, Chapter 2

• Additive error model: $Y = f(X) + \varepsilon$, with $\mathrm{E}(\varepsilon) = 0$

 • A purely deterministic relationship would be $Y = f(X)$; everything non-deterministic (unmeasured variables, measurement error) is absorbed into the random error $\varepsilon$

 • $\varepsilon$ is assumed independent of $X$

• The additive error model is typically not used for qualitative outputs (classification)
 • There the target function is $p(X) = \Pr(G \mid X)$, the conditional density, which is modeled directly

2.6.1 A Statistical Model for the Joint Distribution

Page 19: Elements of Statistical Learning Reading Group, Chapter 2

• Learning by example, through a "teacher"

 • The training set is a set of input-output pairs $\mathcal{T} = \{(x_i, y_i)\}$ for $i = 1, \dots, N$

 • Learning by example
  1. The learner produces outputs $\hat{f}(x_i)$ for the training inputs
  2. The differences $y_i - \hat{f}(x_i)$ are computed
  3. The learner modifies $\hat{f}$ to reduce those differences

Note: We have effectively been using this idea all along, so why is it only spelled out here?

2.6.2 Supervised Learning

Page 20: Elements of Statistical Learning Reading Group, Chapter 2

• Each data point $(x_i, y_i)$ is viewed as a point in a $(p+1)$-dimensional Euclidean space

• The approximating function $f_\theta(x)$ has parameters $\theta$
 • Linear model: $f_\theta(x) = x^T \beta$, with $\theta = \beta$
 • Linear basis expansions: $f_\theta(x) = \sum_{k=1}^{K} h_k(x) \, \theta_k$

• Criteria for approximation
 1. The residual sum-of-squares, $\mathrm{RSS}(\theta) = \sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2$

  • For the linear model we get a simple closed-form solution

2.6.3 Function Approximation

Page 21: Elements of Statistical Learning Reading Group, Chapter 2

• Criteria for approximation
 2. Maximum likelihood estimation: $L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i)$

 • The principle of maximum likelihood: the most reasonable values of $\theta$ are those for which the probability of the observed sample is largest

 • In classification, with $\Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x)$, use the cross-entropy (log-likelihood) $L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i)$

2.6.3 Function Approximation (contd.)
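
A small sketch (my own, assuming an additive Gaussian error model) of the connection between the two criteria: maximizing this log-likelihood over theta is equivalent to minimizing RSS(theta).

import numpy as np

# Log-likelihood of the data under Y = f_theta(X) + eps, eps ~ N(0, sigma^2).
# Only the residual term depends on theta, so argmax over theta of this quantity
# equals argmin over theta of RSS(theta).
def gaussian_log_likelihood(theta, X, y, f, sigma=1.0):
    resid = y - f(X, theta)
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma ** 2) - np.sum(resid ** 2) / (2 * sigma ** 2)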

Page 22: Elements of Statistical Learning Reading Group, Chapter 2

• Infinitely many functions fit the training data
 • The training set is finite, so infinitely many functions pass through (or near) the training points

• Constraints must come from considerations outside the data

• The strength of the constraint (the complexity restriction) can be viewed in terms of neighborhood size

• Constraints also come from the metric that defines the neighborhoods
 • In particular, to overcome the curse of dimensionality we need non-isotropic (direction-dependent) neighborhoods

2.7.1 Difficulty of the Problem

Page 23: Elements of Statistical Learning Reading Group, Chapter 2

• A variety of nonparametric regression techniques

• Add a roughness penalty (regularization) term to the RSS: $\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f)$

• The penalty functional $J$ can be used to impose special structure
 • Additive models with smooth coordinate (feature) functions
 • Projection pursuit regression

• For more on penalties, see Ch.5
• For the Bayesian approach (the penalty as a log-prior), see Ch.8

2.8.1 Roughness Penalty and Bayesian methods
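
A concrete instance of the penalized RSS: the cubic smoothing spline, which penalizes curvature through the second derivative (a standard example of $J(f)$):

\[
\mathrm{PRSS}(f; \lambda) \;=\; \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 \;+\; \lambda \int \big[f''(x)\big]^2 \, dx .
\]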

Page 24: Elements of Statistical Learning Reading Group, Chapter 2

• Kernel methods specify the nature of the local neighborhood
 • The local neighborhood is specified by a kernel function $K_\lambda(x_0, x)$, which assigns weights to points $x$ in a region around $x_0$

 • The Gaussian kernel is based on the Gaussian density: $K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\!\big(-\|x - x_0\|^2 / (2\lambda)\big)$

 • In general, a local regression estimate is $\hat{f}_{\hat\theta}(x_0)$, where $\hat\theta$ minimizes the kernel-weighted residual sum-of-squares $\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\big(y_i - f_\theta(x_i)\big)^2$

• For more on this, see Ch.6

2.8.2 Kernel Methods and Local Regression
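
A minimal sketch of a kernel-weighted local estimate in the simplest case of a locally constant $f_\theta$ (the Nadaraya-Watson form); the bandwidth value is an illustrative choice of mine.

import numpy as np

# Gaussian kernel weight of each training point relative to the query point x0,
# with bandwidth lam, matching the form quoted on the slide.
def gaussian_kernel(x0, X, lam):
    sq_dist = np.sum((X - x0) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

# Locally constant fit at x0: the kernel-weighted average of the responses.
def local_constant_estimate(X_train, y_train, x0, lam=0.2):
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)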

Page 25: Elements of Statistical Learning Reading Group, Chapter 2

• This class includes a wide variety of methods

1. The model for $f$ is a linear expansion of basis functions: $f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)$

• For more, see Sec.5.2, Ch.9

2. Radial basis functions are symmetric $p$-dimensional kernels located at particular centroids: $f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x) \, \theta_m$

• For more, see Sec.6.7

3. A single-hidden-layer feed-forward neural network: $f_\theta(x) = \sum_{m=1}^{M} \beta_m \sigma(\alpha_m^T x + b_m)$, where $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function

• For more, see Ch.11

• Dictionary methods choose the basis functions adaptively from a large candidate set of basis functions (a "dictionary")

2.8.3 Basis Functions and Dictionary methods
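
A small sketch of fitting a linear basis expansion by least squares; the polynomial basis is purely an illustration of the $h_m$.

import numpy as np

# Basis matrix with columns h_0(x), ..., h_M(x); here h_m(x) = x^m (illustrative choice).
def polynomial_basis(x, M):
    return np.vander(x, N=M + 1, increasing=True)

# Fit f_theta(x) = sum_m theta_m h_m(x) by minimizing RSS(theta).
def fit_basis_expansion(x, y, M=3):
    H = polynomial_basis(x, M)
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta

def predict_basis_expansion(theta, x):
    return polynomial_basis(x, len(theta) - 1) @ theta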

Page 26: Elements of Statistical Learning Reading Group, Chapter 2

• Many models have a smoothing or complexity parameter (a penalty weight, a kernel width, the number of neighbors k, ...)

• We cannot choose it by minimizing the residual sum-of-squares on the training data
 • The residuals would be driven to zero and the model would overfit

• The expected prediction error at a test point $x_0$ (test or generalization error), here for k-NN regression under $Y = f(X) + \varepsilon$:
  $\mathrm{EPE}_k(x_0) = \sigma^2 + \Big[f(x_0) - \tfrac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \tfrac{\sigma^2}{k}$

 • $\sigma^2$: the irreducible error, beyond our control
 • The (squared) bias term of the mean squared error typically increases with k
 • The variance term, $\sigma^2 / k$, decreases with k

2.9 Model Selection and the Bias-Variance Tradeoff
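
A small simulation (my own setup: an illustrative target function, noise level, and seed) of how the squared bias and variance of a k-NN estimate at a single point move in opposite directions as k grows.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-8.0 * x ** 2)   # illustrative true function, as in the earlier example
sigma = 0.1                            # noise standard deviation
x0 = 0.0                               # test point; the true value is f(0) = 1

def knn_estimate_at_x0(k, n=50):
    x = rng.uniform(-1.0, 1.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

for k in (1, 5, 25):
    estimates = np.array([knn_estimate_at_x0(k) for _ in range(2000)])
    bias_sq = (estimates.mean() - f(x0)) ** 2
    variance = estimates.var()
    print(k, round(bias_sq, 4), round(variance, 4))  # bias^2 grows, variance shrinks with k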

Page 27: Elements of Statistical Learning Reading Group, Chapter 2

• Model complexity
 • As model complexity increases,
  • the (squared) bias term decreases
  • the variance term increases

• There is a trade-off between bias and variance

• The training error is not a good estimate of the test error
 • For more, see Ch.7.

2.9 Model Selection and the Bias-Variance Tradeoff (contd.)