04/05/06 BUS304 – Chapter 12-13 Multivariate Analysis: Chapter 12 Correlation & Regression


Upload: gyles-dawson

Post on 16-Jan-2016


04/05/06 BUS304 – Chapter 12-13 Multivariate Analysis 1

Chapter 12 Correlation & Regression

Examine the relationship among

two or more random variables

Visual Display

Numerical Analysis

Correlation Analysis

Regression Analysis


Visual Display

How can we display the relationship between two variables?

E.g. the relationship between a car's mileage and its value.

Use a scatter plot!

Exercise: create a scatter plot from the data file


Typical Scatter Plots

Positive Relation

Negative Relation

No Correlation

Non-linear Relation


Numerical Measure for the Relation

Numerical measures are needed:

to formally capture the relationship

to be able to conduct higher-level analysis

Commonly used measurements:

Covariance

• Can be any real number: positive, negative, or 0

• Captures the co-movement of the two variables

• The sign indicates the direction of the trend line

Correlation

• A standardized measurement derived from the covariance

• The value ranges from -1 to 1

• Measures the degree of linearity


Correlation Coefficient

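Formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Use Excel to compute the correlation:

use the Excel function =CORREL()

use the Data Analysis tool: Correlation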
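As a cross-check on =CORREL(), the correlation formula can be evaluated directly from the five sums it contains. The following is a minimal Python sketch; the mileage and value numbers are made up for illustration and are not taken from the course data file.

```python
import math

def correlation(xs, ys):
    """Sample correlation r computed from the sums in the formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data: mileage (thousand miles) and value (thousand dollars).
mileage = [10, 20, 30, 40, 50]
value = [20, 18, 15, 11, 9]
r = correlation(mileage, value)
print(f"r = {r:.4f}")  # strongly negative: value falls as mileage rises
```

Points that lie exactly on an upward line, such as x = 1, 2, 3 and y = 2, 4, 6, give r = 1.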


Correlation estimation and typical Scatter Plots


Values of correlation

If the scatter plot is exactly a line:

sloping upwards, the correlation is +1

sloping downwards, the correlation is -1

The correlation of a random variable with itself is +1.

If the value of x has no impact on y, then the correlation is 0.

Example: the payoff of the first round of a coin-flip game and the payoff of the second round.


Test the population correlation

Population correlation coefficient: ρ

Sample correlation coefficient: r

Determine whether ρ ≥ 0, ρ ≤ 0, or ρ = 0 based on the sample coefficient r.

Theorem: the t-value for r is

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

When ρ = 0, this t-value follows a Student's t-distribution with n - 2 degrees of freedom.

When r > 0, the t-value is positive.

When r < 0, the t-value is negative.

When r = 0, the t-value is 0.
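The t-value is a one-line computation once r and n are known. A minimal Python sketch follows; the sample values r = -0.6 and n = 18 are hypothetical and chosen only because they give a round result.

```python
import math

def t_value(r, n):
    """t statistic for a sample correlation r based on n paired observations."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Hypothetical sample: r = -0.6 from n = 18 pairs, so n - 2 = 16 degrees of freedom.
t_neg = t_value(-0.6, 18)
print(f"t = {t_neg:.3f}")
```

The sign of the result always matches the sign of r, as stated on the slide.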


Hypothesis test

Take the example of problem 12.6 (p478)

Write down the pair of hypotheses:

H0: ρ ≥ 0

HA: ρ < 0

Write down the decision rule:

If t < -tα, reject the hypothesis H0.

If t ≥ -tα, do not reject the hypothesis H0.

(Here tα is the positive critical value from the t table with n - 2 degrees of freedom, so -tα is the lower-tail cutoff.)

Make the decision:

compute r, then the t-value of r

find tα using the t table

compare t and -tα to make the decision

Reject when the t-value of the sample r is too low.
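The decision rule can be sketched in Python. This sketch assumes the critical value is passed in as a positive number read from a t table, so the lower-tail cutoff is its negative. The sample (r = -0.6, n = 18) is hypothetical and is not the data of Problem 12.6.

```python
import math

def lower_tail_correlation_test(r, n, t_critical):
    """Test H0: rho >= 0 against HA: rho < 0.

    t_critical is the positive one-tail value read from a t table
    for n - 2 degrees of freedom at the chosen significance level.
    """
    t = r / math.sqrt((1 - r ** 2) / (n - 2))
    reject = t < -t_critical  # reject only when t is far enough below zero
    return t, reject

# 1.746 is the t-table value for alpha = 0.05 (one tail) and 16 degrees of freedom.
t, reject_h0 = lower_tail_correlation_test(-0.6, 18, 1.746)
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

For an upper-tail test (HA: ρ > 0) the comparison would flip to t > tα.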


Exercise

Problem 12.7


Practice on correlation model

Type 1: start with a conjecture

e.g. there is a negative correlation between the amount of money a person spends on grocery shopping and the amount spent on dining out.

Justification: a person tends to do less grocery shopping when he/she eats in restaurants more often.

Collect data and conduct the test to verify the conjecture.

Type 2: start without a clear conjecture

Based on the available data, find out for any pair of things whether there is a strong correlation.

If there is one, treat it as a "warning": observe and study why. You may find a surprising answer.

This is data mining.


Comments on correlation analysis

It can only identify the co-movement; it cannot indicate causality.

Sometimes there is a third variable (factor) that explains the co-movement. Correlation analysis cannot help you find the underlying factor.

Sometimes there are multiple factors affecting the co-movement. The interaction among factors makes the co-movement unpredictable.

We need higher-level analysis to get a better understanding.


Simple Regression Analysis

Also called "Bivariate Regression"

It analyzes the relationship between two variables

It is regarded as a higher level of analysis than correlation analysis

It specifies one dependent variable (the response) and one independent variable (the predictor, the cause).

It assumes a linear relationship between the dependent and independent variable.

The output of the analysis is a linear regression model, which is generally used to predict the dependent variable.


The regression Model

The model assumes a linear relationship

Two variables:

x – independent variable (the reason)

y – dependent variable (the result)

For example,

• x can represent the number of customers dining in a restaurant

• y can represent the amount of tips collected by the waiter

Parameters:

β0: the intercept – represents the expected value of y when x = 0.

β1: the slope (also called the coefficient of x) – represents the expected increment of y when x increases by 1.

ε: the error term – the uncontrolled part

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
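To see how the three parts of the model fit together, the restaurant example can be simulated. The Python sketch below is illustrative only: the parameter values (intercept 4.0, slope 2.5, error standard deviation 3.0) and the normal error distribution are assumptions, not values from the course.

```python
import random

def simulate_tips(customers, beta0, beta1, sigma, seed=0):
    """Generate y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [beta0 + beta1 * x + rng.gauss(0, sigma) for x in customers]

customers = [5, 10, 15, 20, 25]  # x: number of customers (hypothetical)
tips = simulate_tips(customers, beta0=4.0, beta1=2.5, sigma=3.0)

for x, y in zip(customers, tips):
    expected = 4.0 + 2.5 * x  # the part of y explained by the line
    print(f"x = {x:2d}  expected y = {expected:5.1f}  observed y = {y:5.1f}")
```

The gap between the expected and observed values in each row is the error term for that observation.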


Graphical explanation of the parameters

Assume this is a scatter plot of the population.

[Figure: population scatter plot with the model parameters marked on the line.]


Building the model

The regression model is used to

predict the value of y

explain the impact of x on y

Scenarios:

x is easily observable, but y is not; or

x is easily controllable, but y is not; or

x will affect y, but y cannot affect x.

The causality should be carefully justified before building the model.

When assigning x and y, make sure which is the reason and which is the result; otherwise, the model is wrong!

Example: Information Systems research:

• "Ease of use" vs. "Usefulness"

There may always be second thoughts about the causality.


Example

Build up the regression models

1. At State University, a study was done to

establish whether a relationship existed

between a student’s GPA when graduating

and SAT score when entering the

university.

2. The Skeleton Manufacturing Company

recently did a study of its customers. A

random sample of 50 customer accounts

was pulled from the computer records.

Two variables were observed:

a) The total dollar volume of business this year

b) Miles away the customer is from corporate

headquarters


Estimate the coefficient

Regression Model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Given β0 = 2 and β1 = 3, if we know x = 4, we can expect y = 2 + 3 × 4 = 14.

How do we know that β0 = 2 and β1 = 3?

To know β0 and β1, we need the population data for all x and y.

Normally, we only have a sample.

The trend line determined by a sample is an estimate of the population trend line.

The Fitted Model

$$\hat{y} = b_0 + b_1 x$$

b0 and b1 are estimates of β0 and β1; they are sample statistics.

The hat indicates a predicted value.


Estimate the coefficients

Based on the sample collected:

Run "simple regression analysis" to find the "best fitted line".

The intercept of the line: b0

The slope of the line: b1

They are estimates of β0 and β1.

We can use b0 and b1 to predict y when we know x.

The prediction model:

$$\hat{y} = b_0 + b_1 x$$


How to determine the trend line?

The trend line is also called the "best fitted line".

How to define the "best fitted line"? There could be many criteria.

The most commonly used one:

• "Ordinary Least Squares" (OLS) regression

• Find the line with the least aggregate squared residual

• Residual: for each sample data point i, the observed y value ($y_i$) is not likely to be exactly the predicted value ($\hat{y}_i$). The difference is the residual:

$$e_i = y_i - \hat{y}_i$$


Solution for OLS regression

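The objective function:

$$\min_{b_0, b_1} \sum_{i=1}^{n} e_i^2 = \min_{b_0, b_1} \left(e_1^2 + e_2^2 + \cdots + e_n^2\right)$$

Find the best b0 and b1, which minimize the sum of squared residuals.

Solution:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

Use Excel:

Add a trend line

Run a regression analysis (Data Analysis toolkit)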
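Outside Excel, the closed-form OLS solution can be computed in a few lines. The following Python sketch applies the slope and intercept formulas to a small made-up sample (not the course data file).

```python
def ols_fit(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical sample.
xs = [1, 2, 3, 4, 5]
ys = [5.1, 6.9, 9.2, 10.8, 13.0]
b0, b1 = ols_fit(xs, ys)
print(f"fitted line: y = {b0:.2f} + {b1:.2f} * x")
```

If the points lie exactly on a line, for example x = 0, 1, 2 and y = 2, 5, 8, the function recovers that line (intercept 2, slope 3).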


Exercise

Open “Midwest.xls”

Create a scatter plot

Add a trend line.

Provide your estimation of y when

x = 10

x = 0

x = 4

Residual: e_i, for each sample data point.

In regression analysis, we assume that the residuals are normally distributed with mean 0.

The smaller the variance of the residuals, the stronger the linear relationship.


Add a trend line

Step 1: In your scatter plot, right-click a data point and choose the option to "add trend line".

Step 2: Go to the "Options" tab, check "Display equation on chart", and click "OK".

y = 175.8 + 49.91 * x
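Once the trend line equation is displayed, predictions are a matter of plugging in x. A small Python sketch using the coefficients shown on the slide (the function name `predict` is just an illustrative choice):

```python
def predict(x, b0=175.8, b1=49.91):
    """Predicted y from the fitted trend line y = b0 + b1 * x."""
    return b0 + b1 * x

# The three x values asked for in the exercise.
for x in (10, 0, 4):
    print(f"x = {x:2d} -> predicted y = {predict(x):.2f}")
```

This gives 674.90 for x = 10, 175.80 for x = 0 (the intercept), and 375.44 for x = 4.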


The “Fitness”

Sometimes, it is just not a good idea to use a line to represent the relationship:

Just see how well the sample data form a line

-- how well the model predicts

[Figure: scatter plots of Y against X, labeled "Not good!", "kinda good", and "better".]


The measurement for the fitness

The Sum of Squared Errors (SSE):

$$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$

The smaller the SSE, the better the fit.

In the extreme case, if every point lies on the line, there is no residual at all and SSE = 0 (every prediction is accurate).

SSE also increases when the sample size gets larger (more terms to sum up); however, this doesn't indicate a worse fit.

Other associated terms:

SST – total sum of squares:

• Total variation of y

$$SST = \sum (y_i - \bar{y})^2$$

SSR – sum of squares regression:

• Total variation of y explained by the model

$$SSR = \sum (\hat{y}_i - \bar{y})^2$$

It can be shown that SST, SSR, and SSE have the following relationship:

$$SST = SSE + SSR$$
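The decomposition can be checked numerically. The Python sketch below fits the OLS line to a small made-up sample and computes the three sums of squares from their definitions; it is an illustration, not the course data.

```python
def sums_of_squares(xs, ys):
    """Fit the OLS line, then return (SST, SSR, SSE)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the line
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # left in the residuals
    return sst, ssr, sse

# Hypothetical sample.
sst, ssr, sse = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, SSR + SSE = {ssr + sse:.2f}")
```

Note that the identity holds for the least-squares line; for an arbitrary line through the data, SSE + SSR generally does not equal SST.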


R²

A standardized measure of fitness:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Interpretation:

The proportion of the total variation in the dependent variable (y) that is explained by the regression model.

In other words, the proportion that is not left unexplained in the residuals.

The larger the R², the better the fit.

In the simple linear regression model, R² = r².

Compute the correlation and verify.


Read the regression report

Step 1: check the fitness (R Square), i.e. whether the model is correct. Better if greater than 0.3; the greater the better.

Step 2: check the coefficients: is the slope of x too small?

Regression Statistics

Multiple R            0.832534056
R Square              0.693112955
Adjusted R Square     0.662424251
Standard Error        92.10553441
Observations          12

                     Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept            175.8288191    54.98988674      3.197476   0.009532   53.30372    298.3539
Years with Midwest   49.91007584    10.50208428      4.752397   0.000777   26.50997    73.31018

Fitted line: y = 175.8 + 49.91 * x

P-value column: the p-value for testing β0 = 0 (Intercept row) and β1 = 0 (Years with Midwest row).

Interval estimation of β0 and β1 (confidence level: 95%):

β0: 53.3 ~ 298.35

β1: 26.5 ~ 73.31
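The columns of the report are tied together by simple arithmetic, which can be checked by hand or in code. The Python sketch below recomputes R Square, Adjusted R Square, the t Stats and the 95% limits from the other numbers in the report. The critical value 2.228 is an assumption taken from a standard t table (two-tailed 95%, 12 - 2 = 10 degrees of freedom); Excel uses more digits, so the last decimals differ slightly.

```python
# Values copied from the regression report.
n = 12
multiple_r = 0.832534056
rows = {
    "Intercept": (175.8288191, 54.98988674),           # (coefficient, standard error)
    "Years with Midwest": (49.91007584, 10.50208428),
}

# Assumption: t-table value for a two-tailed 95% interval with n - 2 = 10 df.
t_crit = 2.228

r_square = multiple_r ** 2
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
print(f"R Square = {r_square:.4f}, Adjusted R Square = {adjusted_r_square:.4f}")

intervals = {}
for name, (coef, se) in rows.items():
    t_stat = coef / se  # t Stat column
    intervals[name] = (coef - t_crit * se, coef + t_crit * se)
    low, high = intervals[name]
    print(f"{name}: t Stat = {t_stat:.3f}, 95% interval = ({low:.2f}, {high:.2f})")
```

The recomputed limits match the Lower 95% and Upper 95% columns to about two decimals, which confirms that the interval is the coefficient plus or minus the critical value times the standard error.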


Confidence Interval Estimation

Input the required confidence level

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 90.0%   Upper 90.0%
Intercept      32.64209       2.60924          12.51019   1.56E-06   26.62517    38.659      27.7900       37.494
X Variable 1   -0.64049       0.126544         -5.06142   0.000975   -0.9323     -0.34868    -0.8758       -0.4051


Hypothesis Test

People are normally interested in whether β1 is 0 or not; in other words, whether x has an impact on y.

Based on the report from Excel, it is very convenient to conduct such a test.

Simply compare whether the p-value of the coefficient is smaller than α or not.

Hypotheses:

H0: β1 = 0

HA: β1 ≠ 0

Decision rules:

If p < α, reject the null hypothesis.

If p ≥ α, do not reject the null hypothesis.

Compare p and α, and make the decision.
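The p-value rule is simple enough to write as a one-line check. A Python sketch, using the slope p-value 0.000777 from the "Years with Midwest" row of the earlier regression report; the significance levels are illustrative choices.

```python
def slope_test(p_value, alpha=0.05):
    """Two-sided test of H0: beta1 = 0 using the p-value from the report."""
    return "reject H0" if p_value < alpha else "do not reject H0"

# Slope p-value from the report, tested at alpha = 0.05.
print(slope_test(0.000777))
# The same p-value against a much stricter, hypothetical alpha.
print(slope_test(0.000777, alpha=0.0001))
```

At α = 0.05 the null hypothesis is rejected, so the data support an effect of x on y; at the stricter level it is not rejected.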


When you don't have a good fit

The fitness is not good when the correlation between x and y is not strong enough.

It is always a good idea to check the scatter plot first.

Cases

• Case 1: Maybe there are outliers (explain the outliers)

[Figure: two scatter plots with the vertical axis running from 15 to 24; the horizontal axis runs from 19 to 23 in the first plot and from 19 to 31 in the second.]


Not a good fit?

Case 2: Check the variation of x.

In order to have a good prediction model, the independent variable should cover a certain range.

Collect more data while ensuring enough variation in x.

Case 3: Inherently non-linear relationship

Non-linear regression (not required)

Segment regression

• Separate your data into groups and run a regression on each group separately.

[Figure: scatter plots of Y against X.]


Exercise

Problem 12.14 (Page 498)

Problem 12.15

Problem 12.19