
1 Multiple regression analysis (重回帰分析)

1.1 Objectives

• Estimate one continuous value from a linear combination of multiple (more than one) types of variables.

• E.g., baseball:

$y = a_1 x_1 + a_2 x_2 + a_0$ (1.1)

$y$: Runs (score), $x_1$: Batting average, $x_2$: Number of home runs

• The runs that a team scores in a year can likely be predicted from the team's batting average and the number of home runs it hits that year.

Club      Run    Batting ave.    Home runs
Tigers    597    .262            145
Giants    531    .255            82
…         …      …               …
Eagles    534    .256            105

This data is available in baseball201x.mat.

1.2 Theory

1.2.1 Model

$y_i = a_1 x_{i1} + a_2 x_{i2} + \dots + a_p x_{ip} + a_0 + \epsilon_i$ (1.2)

$y_i$: Objective variable (目的変数)/Dependent variable (従属変数). Values to be estimated.

$x_{ij}$: Explanatory variable (説明変数)/Independent variable (独立変数). Values that explain the objective variable.

$i$: Sample index, $i = 1, \dots, n$. Specifies the baseball club.

$n$: Number of samples. Twelve clubs: $n = 12$.

$a_j$: (Partial) regression coefficient (偏回帰係数). Weights of the explanatory variables.

$p$: Number of explanatory variables.

$\epsilon_i$: Error (誤差) or residual (残差). Difference between the observed (観測値) and predicted (推定値) values. $\epsilon$ is a random variable with mean 0: $\bar{\epsilon} = 0$. Errors vary randomly around zero.

$\bar{\epsilon}$: Mean of $\epsilon$.

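• Determine the partial regression coefficients that minimize the sum of squared errors.

• For all $n$ samples, (1.2) holds and can be written with matrices and vectors as follows:

$\begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1j} & \dots & x_{1p} & 1 \\ \vdots & \ddots & \vdots & \ddots & \vdots & \vdots \\ x_{i1} & \dots & x_{ij} & \dots & x_{ip} & 1 \\ \vdots & \ddots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & \dots & x_{nj} & \dots & x_{np} & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_j \\ \vdots \\ a_p \\ a_0 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_i \\ \vdots \\ \epsilon_n \end{pmatrix}$ (1.3)

$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{a} + \boldsymbol{\epsilon}$ (1.4)

Sizes: $\boldsymbol{y}$ is $n \times 1$, $\boldsymbol{X}$ is $n \times (p+1)$, $\boldsymbol{a}$ is $(p+1) \times 1$, and $\boldsymbol{\epsilon}$ is $n \times 1$.

• Scalar variable … Italic font
• Vector … Italic and bold font
• Matrix … Capital letter in italic and bold font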
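The matrix layout of (1.3) can be sketched in code. This is a minimal illustration, assuming the three clubs listed in the table of Section 1.1 (the remaining nine clubs are elided in the text and are not filled in here); the function name `build_design_matrix` is hypothetical.

```python
# Rows shown in the table of Section 1.1: (club, runs, batting average, home runs).
# Only three of the twelve clubs are listed in the text, so this illustrates
# the layout only; it is not the full data set.
clubs = [
    ("Tigers", 597, 0.262, 145),
    ("Giants", 531, 0.255, 82),
    ("Eagles", 534, 0.256, 105),
]

def build_design_matrix(rows):
    """Return (y, X), where each row of X is [x_1, ..., x_p, 1] as in (1.3)."""
    y = [float(r[1]) for r in rows]
    # The trailing 1.0 is the column that multiplies the constant term a_0.
    X = [[float(v) for v in r[2:]] + [1.0] for r in rows]
    return y, X

y, X = build_design_matrix(clubs)
n, p_plus_1 = len(X), len(X[0])
print(n, p_plus_1)  # number of samples and p + 1 columns
print(X[0])
```

Placing the column of ones last matches the ordering of the coefficient vector, whose final element is the constant term.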

1.2.2 Mathematical principles

• The least squares estimate of $\boldsymbol{a}$ is the one that minimizes the sum of squared errors. The sum is given by

$S = \sum_{i=1}^{n} \epsilon_i^2 = \boldsymbol{\epsilon}^{T}\boldsymbol{\epsilon}$ (1.5)

• From (1.4), the error vector is

$\boldsymbol{\epsilon} = \boldsymbol{y} - \boldsymbol{X}\boldsymbol{a}$ (1.6)

• Using (1.5) and (1.6), the sum of squared errors is

$S = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{a})^{T}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{a})$ (1.7)

$\;\; = \boldsymbol{y}^{T}\boldsymbol{y} - 2\boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{y} + \boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{a}$ (1.8)

• For the derivation, you may use the following equation, which holds because both sides are scalars:

$\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{a} = \boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{y}$

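• The coefficients $\boldsymbol{a}$ that minimize the sum of squared errors $S = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{a})^{T}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{a})$ are given by solving the following for $\boldsymbol{a}$.

$\frac{\partial S}{\partial \boldsymbol{a}} = -2\boldsymbol{X}^{T}\boldsymbol{y} + 2\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{a} = \boldsymbol{0}$ (1.9)

• Then, the least squares estimate of $\boldsymbol{a}$ is given by

$\hat{\boldsymbol{a}} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$ (1.10)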
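The least squares estimate can be sketched in plain Python by solving the normal equations $(\boldsymbol{X}^{T}\boldsymbol{X})\boldsymbol{a} = \boldsymbol{X}^{T}\boldsymbol{y}$ directly. This is an illustrative sketch, not the course's sample code: the helper names `solve` and `least_squares` are hypothetical, and the data are made up so that the true coefficients are known.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def least_squares(X, y):
    """Estimate a from the normal equations (X^T X) a = X^T y."""
    cols = list(zip(*X))  # columns of X
    XtX = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)

# Made-up data generated exactly from y = 2*x1 + 3*x2 + 1 (no noise).
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
X = [[x1, x2, 1.0] for x1, x2 in x_rows]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in x_rows]

a_hat = least_squares(X, y)  # ordered as [a_1, a_2, a_0]
print([round(v, 6) for v in a_hat])
```

Solving the linear system is used here instead of forming the inverse explicitly; both give the same estimate when $\boldsymbol{X}^{T}\boldsymbol{X}$ is invertible.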

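Mathematical review I: Derivative of a scalar with respect to a vector

$f$: Scalar

$\frac{\partial f}{\partial \boldsymbol{a}} = \left( \frac{\partial f}{\partial a_1}, \dots, \frac{\partial f}{\partial a_m} \right)^{T}$ for $\boldsymbol{a} = (a_1, \dots, a_m)^{T}$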
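Two standard identities are enough to differentiate a sum of squared errors of the form $\boldsymbol{y}^{T}\boldsymbol{y} - 2\boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{y} + \boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{a}$. The symbols $\boldsymbol{b}$ and $\boldsymbol{C}$ below are introduced only for this illustration, and these may not be the examples shown on the original slide.

```latex
\frac{\partial}{\partial \boldsymbol{a}} \left( \boldsymbol{b}^{T}\boldsymbol{a} \right) = \boldsymbol{b},
\qquad
\frac{\partial}{\partial \boldsymbol{a}} \left( \boldsymbol{a}^{T}\boldsymbol{C}\boldsymbol{a} \right)
  = \left( \boldsymbol{C} + \boldsymbol{C}^{T} \right) \boldsymbol{a}

% With b = X^T y and the symmetric matrix C = X^T X:
\frac{\partial}{\partial \boldsymbol{a}}
\left( \boldsymbol{y}^{T}\boldsymbol{y}
     - 2\,\boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{y}
     + \boldsymbol{a}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{a} \right)
  = -2\,\boldsymbol{X}^{T}\boldsymbol{y} + 2\,\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{a}
```

Setting the last expression to zero gives the normal equations, from which the least squares estimate follows.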


1.3 Example of estimation

• Using (1.10), the partial regression coefficients of (1.1) are solved, which gives the estimation equation (1.11).

(1.11)

[Figure: Estimated runs versus actual runs (both axes from 500 to 600), labeled Eq. (1.11). The Tigers are marked: actual runs 597, estimate 621, error 24.]

• We may say that the Tigers scored unexpectedly few runs considering their batting average and number of home runs.
  • Estimated runs: 621.
  • Actual runs: 597.


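• The runs that baseball clubs score in a year are estimated from the numbers of hits of each type, such as single hits and home runs.

$R = a_1 H + a_2 S + a_3 T_2 + a_4 T_3 + a_0 = 1.82\,H + 0.75\,S + 1.13\,T_2 + 2.24\,T_3 - 577$

$H$: home runs, $S$: single hits, $T_2$: two-base hits, $T_3$: three-base hits

• We expect that the team earns
  • 1.82 runs from a home run
  • 1.13 runs from a two-base hit
  • 0.75 runs from a single hit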


• Homework (optional, for your own study)
  • Compute the $a$ values for the following model.

• Next week
  • How can we improve the estimation and reduce the error?

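Mathematical review II: Generalized inverse matrix

• An inverse matrix is defined for a square matrix (正方行列). For rectangular (non-square) matrices, generalized inverse matrices are defined.

• For a vertically long matrix $\boldsymbol{A}$ ($n \times m$, $n > m$), the generalized inverse matrix is

$\boldsymbol{A}^{+} = (\boldsymbol{A}^{T}\boldsymbol{A})^{-1}\boldsymbol{A}^{T}$

• The product of $\boldsymbol{A}^{+}$ and $\boldsymbol{A}$ is the unit matrix of size $m \times m$:

$\boldsymbol{A}^{+}\boldsymbol{A} = \boldsymbol{I}_m$

• However, unlike the case of square matrices,

$\boldsymbol{A}\boldsymbol{A}^{+} \neq \boldsymbol{I}_n$

• For a horizontally long matrix $\boldsymbol{A}$ ($n \times m$, $n < m$), the generalized inverse matrix is

$\boldsymbol{A}^{+} = \boldsymbol{A}^{T}(\boldsymbol{A}\boldsymbol{A}^{T})^{-1}$

• In terms of the product,

$\boldsymbol{A}\boldsymbol{A}^{+} = \boldsymbol{I}_n$

• Vertically and horizontally long matrices follow different formulas.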
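The asymmetry between the two products can be checked numerically. The sketch below assumes an arbitrary vertically long $3 \times 2$ matrix chosen for illustration and uses a closed-form $2 \times 2$ inverse; the helper names are hypothetical.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# A vertically long 3 x 2 example matrix (values are arbitrary).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

At = transpose(A)
A_plus = matmul(inverse_2x2(matmul(At, A)), At)  # (A^T A)^{-1} A^T, size 2 x 3

left = matmul(A_plus, A)    # 2 x 2: the unit matrix up to rounding
right = matmul(A, A_plus)   # 3 x 3: not the unit matrix

print(left)
print(right)
```

For this matrix, `left` is the $2 \times 2$ unit matrix up to rounding, while `right` has diagonal entries of 2/3, so it cannot be the unit matrix.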

1.4 Overdetermined problem

• When the number of equations or restrictions that the system should satisfy is greater than the number of unknown variables, we cannot find answers that fully satisfy the given restrictions. In other words, we cannot avoid some errors. This type of problem is called an overdetermined problem (過決定問題), or the system is said to be over-constrained.

• If the errors are small, they are considered accidental measurement errors or noise. If the errors are large, some variables needed for prediction are probably missing.
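A very small overdetermined system makes the unavoidable error concrete. The numbers below are made up for illustration: three equations constrain a single unknown $a$, and the least squares answer $\hat{\boldsymbol{a}} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$ is the compromise that leaves the smallest sum of squared errors.

```latex
% Three equations, one unknown: a = 1, a = 2, a = 3 cannot all hold.
\boldsymbol{X} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \quad
\boldsymbol{y} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad
\hat{a} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{y}
        = \frac{1 + 2 + 3}{3} = 2

% The residuals do not vanish:
\boldsymbol{\epsilon} = \boldsymbol{y} - \boldsymbol{X}\hat{a}
  = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}
```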

1.5 Goodness of fit （適合度 or 当てはまりの良さ）

• How can the goodness of fit be defined? One idea is the correlation coefficient ($r$) between the observed and estimated values.

1.5.1 Coefficient of determination: $R^2$ (R-square, 決定係数)

[Figure: Estimated runs versus actual runs, both axes from 500 to 600.]

In our baseball example, the correlation coefficient between the actual runs ($y$) and the estimated runs ($\hat{y}$) is 0.89, which is considered very high.

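• The correlation coefficient between two variables, here the observed values ($y$) and the estimated values ($f$), is defined by

$r = \frac{\text{Covariance between } y \text{ and } f}{(\text{Standard deviation of } y)(\text{Standard deviation of } f)} = \frac{\frac{1}{n-1}\sum_{i}(y_i - \bar{y})(f_i - \bar{f})}{\sqrt{\frac{1}{n-1}\sum_{i}(y_i - \bar{y})^2}\,\sqrt{\frac{1}{n-1}\sum_{i}(f_i - \bar{f})^2}}$

$\;\; = \frac{\sum_{i}(y_i - \bar{y})(f_i - \bar{f})}{\sqrt{\sum_{i}(y_i - \bar{y})^2}\,\sqrt{\sum_{i}(f_i - \bar{f})^2}} \in [-1, 1]$ (1.5.1)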

• A correlation coefficient can be an index of goodness of fit.

• However, an index that ranges from 0 to 1 is usually preferred, so we use the square of $r$. The square of $r$ is called R-square and its reserved symbol is $R^2$, with 1 indicating a perfect fit.

$R^2 = 1 - \frac{\text{Sum of squared residuals}}{\text{Total sum of squares}} = 1 - \frac{\sum_{i}(y_i - f_i)^2}{\sum_{i}(y_i - \bar{y})^2}$ (1.5.2)

where the residual means the error of estimation.

• (1.5.2) implies that $R^2$ is determined by comparing the estimation errors with the variation of the sample data.

• It is not determined by the magnitude of the errors alone, but by the ratio of the prediction error to the variation of the data.

• When $R^2 = .78$, we may say that 78% of the variation of the sample data is explained.
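The two indices can be computed side by side. The sketch below uses made-up data and assumes that the estimates $f$ come from a least squares fit with a constant term ($f = 0.6x + 2.2$ is the least squares line for these five points); under that assumption the squared correlation coefficient and $1 - \text{SSR}/\text{SST}$ agree. Function names are hypothetical.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def correlation(y, f):
    """Pearson correlation coefficient between observed y and estimated f."""
    ybar, fbar = mean(y), mean(f)
    cov = sum((yi - ybar) * (fi - fbar) for yi, fi in zip(y, f))
    var_y = sum((yi - ybar) ** 2 for yi in y)
    var_f = sum((fi - fbar) ** 2 for fi in f)
    return cov / sqrt(var_y * var_f)

def r_squared(y, f):
    """Coefficient of determination: 1 - (sum of squared residuals) / (total sum of squares)."""
    ybar = mean(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ssr / sst

# Made-up observations and their least squares estimates f = 0.6*x + 2.2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
f = [0.6 * xi + 2.2 for xi in x]

r = correlation(y, f)
R2 = r_squared(y, f)
print(round(r, 4), round(R2, 4), round(r * r, 4))
```

If $f$ is not a least squares fit, the two quantities can differ, which is one reason to compute $R^2$ from the residuals rather than by squaring $r$.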

1.5.2 Examples of $R^2$

• Compare the $R^2$ values of some regression equations.
  • The yearly runs ($y$) of all the clubs are predicted by using 3 types of variables: $x_1$, $x_2$, and $x_3$.
  • $x_3$ is the number of stolen bases recorded by a team in a year.

Explanatory variables used for prediction    $R^2$
$x_1$                                        .65
$x_2$                                        .22
$x_3$                                        .08
$x_1$, $x_2$                                 .78
$x_1$, $x_2$, $x_3$                          .80

• The batting average is the best predictor, followed by the number of home runs.

• Base stealing has little impact on runs. The stolen bases explain merely 8% of the variation in runs.

• However, when the three variables are used for prediction, $R^2$ becomes .80, which is the largest value in the table. From this, the number of stolen bases appears to be an important predictor, which is actually a misjudgment.

• Although base stealing has little impact on runs, $R^2$ slightly increases by including $x_3$.

1.5.3 Adjusted $R^2$: $\bar{R}^2$

• This problem typically occurs when $n - p - 1$ is small. If $n - p - 1$ is small, $R^2$ approaches 1; however, such a regression analysis is scientifically meaningless.

• Adding more explanatory variables leads to a higher $R^2$ even if some of them have no practical importance.

• This problem is called overfitting (過学習).

• To avoid overestimating $R^2$, we use the adjusted $R^2$:

$\bar{R}^2 = 1 - \frac{\sum_{i}(y_i - f_i)^2 / (n - p - 1)}{\sum_{i}(y_i - \bar{y})^2 / (n - 1)}$ (1.5.3)

• The numerator and the denominator are each divided by their own degrees of freedom (自由度).

• The degrees of freedom are the number of values that can still vary freely once statistics such as a mean and a standard deviation are fixed.

• Adjustment by the degrees of freedom works as a penalty. For a regression model using 3 explanatory variables ($p = 3$), the adjustment is determined as follows. A small denominator ($n - p - 1$) leads to a big penalty.

$\bar{R}^2 = 1 - \frac{n - 1}{n - 3 - 1}\,(1 - R^2)$ (1.5.4)

• Hence, the adjusted $R^2$ values are calculated as follows.

Explanatory variables used for prediction    $R^2$    Adjusted $R^2$
$x_1$, $x_2$                                 .78      .73
$x_1$, $x_2$, $x_3$                          .80      .70
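The penalty can be written as a one-line function. This sketch assumes the usual rearrangement of the adjusted $R^2$ into $1 - \frac{n-1}{n-p-1}(1 - R^2)$, which is equivalent to dividing the residual and total sums of squares by their degrees of freedom; the function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p explanatory variables."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)

# Two-variable baseball model from the text: n = 12 clubs, R^2 = .78.
two_var = adjusted_r2(0.78, n=12, p=2)
print(round(two_var, 2))

# The penalty factor (n - 1) / (n - p - 1) grows quickly as p approaches n - 1.
penalties = {p: (12 - 1) / (12 - p - 1) for p in (1, 2, 3, 8, 10)}
for p, factor in penalties.items():
    print(p, round(factor, 2))
```

With twelve samples, the factor is still close to 1 for one or two variables but reaches 11 when ten variables are used, which is why a small $n - p - 1$ is penalized heavily.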

• If we look at the adjusted R‐square values, the model involving 2 explanatory variables is better than the one with 3 explanatory variables.

• The 2‐variable model is simpler but explains the sample data variation as well as the 3 variable model. Hence, practically, the 2‐variable model is better.

• When comparing multiple regression models with different degrees of freedom (different sample numbers or different number of explanatory variables), adjusted R‐square should be used.


1.6 Variable selection (変数選択)

• Some variables are good predictors, but others are not. How can we select good explanatory variables?

• There is a systematic method for selecting explanatory variables to establish a statistically valid regression model.

[Figure: Run as the objective variable, with Batting, Homerun, Steal, and Walk as candidate predictors, each marked "?". Caption: Which variables are good predictors?]

• Stepwise method
  • Candidate explanatory variables are added to the regression model one by one. At each step, the effect of the added variable is statistically tested.
  • The regression model adopts a variable only if its effect is found to be significant.
  • The result depends on the order in which the variables are tested. Hence, there are several variations of the variable selection method. Here, the basic one (the forward method) is introduced.

• Matlab (and Octave) provides
  • the regress function for multiple regression analysis
  • the stepwisefit function for variable selection

• Model 0
  • We start from model 0, which has no explanatory variables. Model 0 estimates the objective variable $y$ solely by its mean value $\bar{y}$. Hence, model 0 is expressed by

$y_i = \bar{y} + \epsilon_i$ (1.6.1)

• Model 1
  • Model 1 includes one explanatory variable. Choose one variable arbitrarily among the candidates, and establish

$y_i = a_1 x_{i1} + a_0 + \epsilon_i$ (1.6.2)

  • We then statistically test the effect of $x_1$ by using the F-test.

$F(1, n-1-1) = \frac{SSR(\text{Model 0}) - SSR(\text{Model 1})}{SSR(\text{Model 1}) / (n-1-1)}$ (1.6.3)

where $SSR$ is the sum of squared residuals of a model.

• If $x_1$ has no effect, this $F$ value follows the F distribution with 1 and $n-1-1$ degrees of freedom.

• In our baseball example, $n = 12$; hence the distribution is $F(1, 10)$, which looks like the next figure.

• If $F$ is greater than 4.96, the effect of the explanatory variable is considered to be significant with $p < 0.05$. In other words, the decrease in SSR is unlikely to be due to chance.

[Figure: Probability density of the F(1, 10) distribution plotted against the F value (0 to 5 and beyond). The upper tail is marked as the 5% area and the rest as the 95% area.]

• If the p value is smaller than 0.05, model 1 is accepted.
• Otherwise, $x_1$ is abandoned and another variable ($x_2$) is considered as a candidate for model 1.

• Once model 1 is determined, we investigate model 2, which includes one more explanatory variable.

• In general, when model $k$ adds one variable to model $k-1$, the F value is determined by

$F(1, n-k-1) = \frac{SSR(\text{Model } k-1) - SSR(\text{Model } k)}{SSR(\text{Model } k) / (n-k-1)}$ (1.6.4)

• This process is continued until all candidate variables have been tested and the model is finalized.
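One step of the forward method can be sketched as a small function that turns two sums of squared residuals into an F value. The SSR numbers below are hypothetical and are not taken from the baseball data; the threshold 4.96 is the 5% critical value of F(1, 10) quoted in the text, so it applies only when $n - k - 1 = 10$.

```python
def partial_f(ssr_prev, ssr_new, n, k):
    """F value for adding one variable: model k-1 (ssr_prev) -> model k (ssr_new).

    Returns the F value and its second degree of freedom, n - k - 1.
    """
    dof = n - k - 1
    return (ssr_prev - ssr_new) / (ssr_new / dof), dof

F_CRIT_1_10 = 4.96  # 5% critical value of F(1, 10), as quoted in the text

# Hypothetical sums of squared residuals for model 0 and model 1.
f_value, dof = partial_f(ssr_prev=100.0, ssr_new=50.0, n=12, k=1)
accept = f_value > F_CRIT_1_10
print(f_value, dof, accept)
```

For other degrees of freedom the critical value changes, so a real implementation would look it up from the F distribution rather than hard-coding it.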

1.7 Example of variable selection

• The winning rate is predicted by using the following 9 variables:
  • ERA: Earned run average per game (平均失点)
  • Batting average
  • Home runs
  • Stolen bases
  • Run: Average runs per game (平均得点)
  • Sacrifice (犠打)
  • Pinch hitter batting average (代打率)
  • Errors (失策数)
  • Quality starts

• ERA is then predicted by using:
  • Errors (失策数)
  • Quality starts

• Run is predicted by using:
  • Batting average
  • Home runs
  • Stolen bases
  • Sacrifice (犠打)
  • Pinch hitter batting average (代打率)

• Layered causality model based on multiple regression analyses

[Figure: Layered causality diagram. Winning is predicted from Run (offensive factor) and ERA (defensive factor); Run is predicted from Batting and Homerun; ERA is predicted from Qstart. Each arrow carries a regression coefficient.]

* Run is the total number of runs that a team scores in a year.

1.8 Collinearity （共線性）

• If some of the explanatory variables exhibit large correlation coefficients with each other, the computation of $(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}$ becomes unstable.
  • $\boldsymbol{X}^{T}\boldsymbol{X}$ tends to be rank deficient (non-regular), and its inverse matrix cannot be computed accurately.

• In such cases, the partial regression coefficients may take extreme values, and we may reach wrong conclusions.

• Before applying multiple regression analysis, check the correlation coefficients among all the candidate explanatory variables, and use only moderately independent variables.


• As an example, let’s consider the weather of Nagoya in May over the last 18 years.

• Objective variable: Mean wind velocity

• Explanatory variables
  • Mean temperature
  • Mean atmospheric pressure
  • Amount of precipitation
  • Humidity
  • Duration of sunshine
  • Amount of cloud

Correlation matrix

             Pressure    Precipit.    Humid.    Sunshine    Cloud
Temp.        -.04        -.16         -.15      .17         -.02
Pressure                 .08          .42       -.42        .21
Precipit.                             .39       -.47        .53
Humid.                                          -.80        .67
Sunshine                                                    -.91

• The amount of cloud and duration of sunshine are strongly correlated. Only one of these two variables should be used in the regression model.
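A collinearity check of this kind is easy to automate. The sketch below scans the correlation values given in the matrix above and reports strongly correlated pairs; the cutoff of 0.9 is an assumption made for this illustration (the text does not state a threshold), and the function name `collinear_pairs` is hypothetical.

```python
# Upper triangle of the correlation matrix given in the text.
corr = {
    ("Temp.", "Pressure"): -0.04, ("Temp.", "Precipit."): -0.16,
    ("Temp.", "Humid."): -0.15, ("Temp.", "Sunshine"): 0.17,
    ("Temp.", "Cloud"): -0.02,
    ("Pressure", "Precipit."): 0.08, ("Pressure", "Humid."): 0.42,
    ("Pressure", "Sunshine"): -0.42, ("Pressure", "Cloud"): 0.21,
    ("Precipit.", "Humid."): 0.39, ("Precipit.", "Sunshine"): -0.47,
    ("Precipit.", "Cloud"): 0.53,
    ("Humid.", "Sunshine"): -0.80, ("Humid.", "Cloud"): 0.67,
    ("Sunshine", "Cloud"): -0.91,
}

THRESHOLD = 0.9  # assumed cutoff for "strongly correlated"; not from the text

def collinear_pairs(corr, threshold):
    """Return the variable pairs whose |r| is at or above the threshold."""
    return sorted(pair for pair, r in corr.items() if abs(r) >= threshold)

flagged = collinear_pairs(corr, THRESHOLD)
print(flagged)
```

A lower cutoff such as 0.8 would also flag humidity against sunshine, so the threshold is a judgment call rather than a fixed rule.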

Regression coefficients of the models for Wind vel.

Model                 Temp.    Sunshine    Cloud    Adjusted $R^2$
Sunshine is used      -.12     .0055                .759
Cloud is used         -.07                 -.30     .66
Both are used         -.12     .0052       -.016    .757

• If we use cloud and sunshine together in a single model, we underestimate the effect of cloud.

1.9 Regression analysis of nonlinear model

• Multiple regression analysis is a linear technique; however, it can be applied to some nonlinear models.

• For this purpose, we linearize nonlinear models. Here, we look at representative examples.


1.9.1 Model with high‐order components

• When the model includes high-order components, we can linearize the model by introducing new variables.

• When the model is a quadratic function of $x$:

$y = a_1 x + a_2 x^2 + a_0$ (1.9.1)

we define new variables:

$x_1 = x, \quad x_2 = x^2$

• Then, (1.9.1) is expressed as a linear model:

$y = a_1 x_1 + a_2 x_2 + a_0$ (1.9.2)

1.9.2 Model with interaction terms

• Interaction terms (交差項)

$y = a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_0$ (1.9.3)

• We define

$x_3 = x_1 x_2$ (1.9.4)

• Then, (1.9.3) is linearized as

$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_0$ (1.9.5)
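Introducing a new variable for a nonlinear term only changes the columns of the design matrix; the least squares step itself stays linear. The sketch below assumes an interaction model of the form $y = a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_0$ with made-up, noise-free data, and hypothetical helper names. A quadratic term is handled the same way by adding a column $x^2$.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def least_squares(X, y):
    """Estimate a from the normal equations (X^T X) a = X^T y."""
    cols = list(zip(*X))
    XtX = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)

# Made-up data generated exactly from y = 2*x1 + 3*x2 + 0.5*x1*x2 + 1.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (2, 3)]
y = [2 * x1 + 3 * x2 + 0.5 * x1 * x2 + 1 for x1, x2 in points]

# New variable x3 = x1 * x2 turns the interaction model into a linear one.
X = [[x1, x2, x1 * x2, 1.0] for x1, x2 in points]

a_hat = least_squares(X, y)  # ordered as [a_1, a_2, a_3, a_0]
print([round(v, 6) for v in a_hat])
```

The model is nonlinear in the original variables but linear in the coefficients, which is the property that multiple regression analysis actually requires.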

1.9.3 Power function model


(1.9.6)

• Power function (指数関数)

• We define

• Then, (1.9.7) is linearized as

(1.9.8)

(1.9.7)
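As a hedged illustration of how a power function model is typically linearized, assume a model of the form $y = a_0 x_1^{a_1} x_2^{a_2}$ with positive $y$, $x_1$, $x_2$, and $a_0$. This specific form and the capital-letter symbols are assumptions made for this sketch and may differ from the equations on the original slide.

```latex
% Assumed power function model
y = a_0 \, x_1^{a_1} \, x_2^{a_2}

% Taking logarithms of both sides
\log y = a_1 \log x_1 + a_2 \log x_2 + \log a_0

% Define new variables
Y = \log y, \quad X_1 = \log x_1, \quad X_2 = \log x_2, \quad A_0 = \log a_0

% Linearized form, ready for multiple regression analysis
Y = a_1 X_1 + a_2 X_2 + A_0
```

After fitting, the original constant is recovered as $a_0 = \exp(A_0)$.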

1.9.4 Probabilistic distribution function

• Probability distribution function (確率分布関数)

(1.9.9)

• The failure rate of a machine depends on its age and the temperature of the atmosphere.

[Figure: Probability of failure of a machine ($p$), ranging from 0 to 1.0, plotted against $a_1 x_{age} + a_2 x_{temp} + a_0$.]

• Quiz: Linearize (1.9.10) so that it can be handled by multiple regression analysis.

$p = \frac{1}{1 + \exp\left(-(a_1 x_1 + a_2 x_2 + a_0)\right)}$ (1.9.10)

• Eq. (1.9.10) can be transformed to

$\frac{p}{1 - p} = \exp(a_1 x_1 + a_2 x_2 + a_0)$ (1.9.11)

• Use the logarithmic transformation

$\log\frac{p}{1 - p} = a_1 x_1 + a_2 x_2 + a_0$ (1.9.12)

• Define $y = \log\frac{p}{1 - p}$, and the linearized form is

$y = a_1 x_1 + a_2 x_2 + a_0$ (1.9.13)

1.9.5 Example

• Side effects of medicine: the probability that people suffer from certain symptoms after medication. Gender is also known as a risk factor.

Probability        Medication: No    Medication: Yes
Gender: Male       0.03              0.07
Gender: Female     0.06              0.12

* Multivariate regression analysis of probability distribution functions is better solved by the maximum likelihood method than by the least squares method.

• Compute $\log\frac{p}{1-p}$. When $p$ is the probability, $\frac{p}{1-p}$ is called the odds (オッズ).

log(odds)          Medication: No    Medication: Yes
Gender: Male       -3.47             -2.59
Gender: Female     -2.75             -1.99

• E.g., $\log\frac{0.03}{1 - 0.03} \approx -3.47$ (top left cell)

• Medication ($x_1$) and gender ($x_2$) are considered as possible factors that determine the probability of occurrence of symptoms.

log(odds) $y$    Medication $x_1$: No (0), Yes (1)    Gender $x_2$: Male (0), Female (1)
-3.47            0                                    0
-2.75            0                                    1
-2.59            1                                    0
-1.99            1                                    1

Acquired model: $y = 0.82\,x_1 + 0.66\,x_2 - 3.44$

• Compare the estimated and observed probabilities (estimates in parentheses).

Probability        Medication: No    Medication: Yes
Gender: Male       0.03 (0.031)      0.07 (0.068)
Gender: Female     0.06 (0.058)      0.12 (0.123)
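The whole example can be reproduced in a few lines: convert the four observed probabilities to log odds, fit $y = a_1 x_1 + a_2 x_2 + a_0$ by least squares, and transform the fitted values back to probabilities. This is a sketch using plain least squares as in the text (not maximum likelihood); the helper names are hypothetical, and the logistic back-transformation $p = 1/(1 + e^{-y})$ is assumed as the inverse of the log odds.

```python
from math import exp, log

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def least_squares(X, y):
    """Estimate a from the normal equations (X^T X) a = X^T y."""
    cols = list(zip(*X))
    XtX = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)

# Observed probabilities from the table; key = (x1 medication, x2 gender).
observed = {(0, 0): 0.03, (1, 0): 0.07, (0, 1): 0.06, (1, 1): 0.12}

rows = sorted(observed)
X = [[x1, x2, 1.0] for x1, x2 in rows]
y = [log(observed[r] / (1.0 - observed[r])) for r in rows]  # log odds

a1, a2, a0 = least_squares(X, y)

def predict(x1, x2):
    """Back-transform the linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-(a1 * x1 + a2 * x2 + a0)))

estimated = {r: round(predict(*r), 3) for r in rows}
print(round(a1, 2), round(a2, 2), round(a0, 2))
print(estimated)
```

The positive coefficients indicate that both medication and being female raise the log odds of symptoms in this data set.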

1.10 Study more deeply

• Ridge regression analysis (リッジ回帰分析)
  • Used when the degrees of freedom are small. This helps to avoid overfitting, although not perfectly.
  • Also effective for overcoming the collinearity issue.
  • Lasso regression analysis is also used for the same purpose.
  • Known as regularization (正則化)

• Cross validation (交差検定)
  • Used when ample samples are not available
  • To avoid overfitting

• Auto-regression model (自己回帰)
  • Used for temporally evolving values
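For orientation, the commonly used form of the ridge estimate is sketched below. The regularization parameter $\lambda \ge 0$ is a symbol introduced here (it does not appear in the text), and the sketch ignores the common practice of leaving the constant term unpenalized.

```latex
% Ridge regression minimizes the sum of squared errors plus a penalty on the coefficients
S_{\mathrm{ridge}} = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{a})^{T}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{a})
                   + \lambda\, \boldsymbol{a}^{T}\boldsymbol{a}

% Setting the derivative with respect to a to zero gives
\hat{\boldsymbol{a}}_{\mathrm{ridge}}
  = (\boldsymbol{X}^{T}\boldsymbol{X} + \lambda \boldsymbol{I})^{-1}\boldsymbol{X}^{T}\boldsymbol{y}
```

Adding $\lambda \boldsymbol{I}$ keeps the matrix to be inverted away from rank deficiency, which is why the method also helps with collinearity; $\lambda = 0$ recovers the ordinary least squares estimate.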

First report

• Apply multiple regression analysis to your own data. Include
  • What kind of data?
  • Where did you get the data?
  • What are the objective and explanatory variables?
  • Collinearity check
  • Results
  • Discussion (Are the results reasonable?)

• See the sample codes on the course web site.
• Please see the example report on the course web site.
• Due: May 17th

• Submit on NUCT. Send a .ppt file or a pdf file by e-mail.

[email protected]‐u.ac.jp
