statistics practical solutions - correlations and regression

ACE2013: STATISTICS FOR MARKETING AND MANAGEMENT

(SEMESTER 2)

2010

NAME: ATIQAH ISMAIL

BSC MARKETING (N500)

PRACTICAL 1: CORRELATION AND REGRESSION

3. Atkins Service Corporation

(a) Multiple linear regression analysis

Minitab output:

Regression Analysis: Production versus Shifts, Bonus, Overtime, Morale

The regression equation is
Production = 518 + 187 Shifts + 42.1 Bonus + 97.7 Overtime - 40.8 Morale

Predictor    Coef  SE Coef      T      P
Constant    518.1    444.2   1.17  0.296
Shifts     186.74    49.76   3.75  0.013
Bonus       42.14    14.95   2.82  0.037
Overtime    97.71    10.43   9.37  0.000
Morale     -40.85    46.23  -0.88  0.417

S = 279.892 R-Sq = 95.9% R-Sq(adj) = 92.6%

(b) Full estimated regression equation:

Y = 518.1 + 186.74 X1 + 42.14 X2 + 97.71 X3 - 40.85 X4 + ε, ε ~ N(0, 279.892²)

(c) Significance of the predictor variables:

Variable X4 (Morale) should be removed, since it has the largest p-value of the four predictors. Its p-value of 0.417 (or 41.7%) is greater than 0.10 (or 10%), so we have no evidence against the null hypothesis H0: β4 = 0 and we retain H0. Morale is therefore the least significant predictor in the model, and it does not appear to be an important predictor of production.

Hence we remove X4 (Morale) from the model and re-fit the model using only Shifts, Bonus and Overtime.
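As a rough check on the choice of variable to drop, the T column can be reproduced as Coef divided by SE Coef. The sketch below assumes only the figures printed in the Minitab table above; the names `full_model` and `t_ratio` are illustrative and not part of Minitab. Because every coefficient is tested on the same residual degrees of freedom, the predictor with the smallest absolute T value is also the one with the largest p-value.

```python
# Coefficients and standard errors copied from the Minitab table above.
full_model = {
    "Shifts": (186.74, 49.76),
    "Bonus": (42.14, 14.95),
    "Overtime": (97.71, 10.43),
    "Morale": (-40.85, 46.23),
}


def t_ratio(coef, se):
    """T statistic for testing H0: beta = 0 (coefficient over its standard error)."""
    return coef / se


t_values = {name: t_ratio(coef, se) for name, (coef, se) in full_model.items()}

# The predictor with the smallest |T| is the weakest candidate in the model.
weakest = min(t_values, key=lambda name: abs(t_values[name]))

for name, t in t_values.items():
    print(f"{name:<9} T = {t:6.2f}")
print("Smallest |T|:", weakest)
```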

Minitab output:

Regression Analysis: Production versus Shifts, Bonus, Overtime

The regression equation is
Production = 250 + 201 Shifts + 36.5 Bonus + 101 Overtime

Predictor     Coef  SE Coef      T      P
Constant     250.1    318.4   0.79  0.462
Shifts      201.40    46.05   4.37  0.005
Bonus        36.49    13.26   2.75  0.033
Overtime   100.986    9.569  10.55  0.000

S = 274.725 R-Sq = 95.2% R-Sq(adj) = 92.8%

Variables X1, X2 and X3 all have p-values of less than 0.05, so we reject each of the null hypotheses:

H0: β1 = 0,
H0: β2 = 0, and
H0: β3 = 0.

Thus, number of shifts worked, bonus rates paid, and the average hours of overtime are important indicators of production.

Final regression equation model:

Y = 250.1 + 201.40 X1 + 36.49 X2 + 100.986 X3 + ε, ε ~ N(0, 274.725²)

(d) The R² statistic

R² = 95.2%

R² has reduced only slightly, from 95.9% to 95.2%, after removing one predictor variable from the full regression model.

95.2% of the variation in production is explained by the variation in the number of shifts worked, bonus rates paid, and the average hours of overtime.
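The R-Sq and R-Sq(adj) values can be recovered from the sums of squares in the Analysis of Variance table reported in part (e). The following is a minimal sketch, assuming n = 10 observations (Total DF = 9) and k = 3 predictors; the variable names are our own.

```python
# Sums of squares taken from the Analysis of Variance table in part (e).
ss_regression = 9036966
ss_total = 9489810
n = 10  # number of observations, since Total DF = n - 1 = 9
k = 3   # predictors in the reduced model: Shifts, Bonus, Overtime

ss_error = ss_total - ss_regression

# R-Sq: share of total variation explained by the regression.
r_sq = ss_regression / ss_total

# R-Sq(adj): the same idea, but penalised for the number of predictors.
r_sq_adj = 1 - (ss_error / (n - k - 1)) / (ss_total / (n - 1))

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
```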

(e) F-test

Minitab output:

Regression Analysis: Production versus Shifts, Bonus, Overtime

The regression equation is
Production = 250 + 201 Shifts + 36.5 Bonus + 101 Overtime

Predictor     Coef  SE Coef      T      P
Constant     250.1    318.4   0.79  0.462
Shifts      201.40    46.05   4.37  0.005
Bonus        36.49    13.26   2.75  0.033
Overtime   100.986    9.569  10.55  0.000

S = 274.725 R-Sq = 95.2% R-Sq(adj) = 92.8%

Analysis of Variance

Source          DF       SS       MS      F      P
Regression       3  9036966  3012322  39.91  0.000
Residual Error   6   452844    75474
Total            9  9489810

Source    DF   Seq SS
Shifts     1   290858
Bonus      1   340149
Overtime   1  8405958

An overall test of the model:

H0: β1 = β2 = β3 = 0
H1: At least one of β1, β2 and β3 is not zero

F = 39.91

The p-value is 0.000, which is less than 0.01 (or 1%), so we have strong evidence against H0. We therefore reject H0 in favour of H1: some, or all, of the predictor variables used in the fit are useful in predicting production, hence this model is useful for prediction.
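The F statistic and the value of S can both be reproduced from the Analysis of Variance table: F is the regression mean square over the residual mean square, and S is the square root of the residual mean square. This is a small sketch using only the SS and DF figures printed above; the variable names are illustrative.

```python
import math

# SS and DF copied from the Analysis of Variance table above.
ms_regression = 9036966 / 3  # mean square for Regression
ms_error = 452844 / 6        # mean square for Residual Error

f_stat = ms_regression / ms_error  # overall F statistic
s = math.sqrt(ms_error)            # residual standard deviation (Minitab's S)

print(f"F = {f_stat:.2f}, S = {s:.3f}")
```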

(f) If X1 = 6, X2 = 22 and X3 = 15

Y = 250.1 + 201.40 X1 + 36.49 X2 + 100.986 X3

Y = 250.1 + (201.40 × 6) + (36.49 × 22) + (100.986 × 15)

Y = 3776.07
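The same substitution can be written as a short function, which makes it easy to try other settings. This is only a sketch of the fitted equation above; the function name `predict_production` is our own.

```python
def predict_production(shifts, bonus, overtime):
    """Point prediction from the final fitted model (Shifts, Bonus, Overtime)."""
    return 250.1 + 201.40 * shifts + 36.49 * bonus + 100.986 * overtime


# Values from part (f): X1 = 6, X2 = 22, X3 = 15.
y_hat = predict_production(shifts=6, bonus=22, overtime=15)
print(round(y_hat, 2))
```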

PRACTICAL 2: FURTHER TOPICS IN REGRESSION

4. Obesity

(a)

Minitab output:

Logistic Regression Table

                                                 Odds      95% CI
Predictor         Coef   SE Coef      Z      P  Ratio  Lower  Upper
Constant      -8.19369   4.17538  -1.96  0.050
Hours worked  0.239177  0.110792   2.16  0.031   1.27   1.02   1.58

E[Y] = e^(-8.19369 + 0.239177 X) / (1 + e^(-8.19369 + 0.239177 X))

(b)

P-value = 0.031

We reject the null hypothesis H0: β1 = 0 because β1 has a small p-value, smaller than 0.05. We have moderate evidence against H0, and so we reject H0 in favour of the alternative H1. It appears that the number of hours worked is an important predictor of obesity.
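The Z statistic, p-value, odds ratio and its 95% interval in the logistic regression table can be approximately reproduced from the coefficient and its standard error. The sketch below assumes the usual Normal (Wald) approximation, which appears to match the printed figures; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Coefficient and standard error for "Hours worked" from the table above.
coef, se = 0.239177, 0.110792

z = coef / se                                 # Wald Z statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

odds_ratio = math.exp(coef)                   # multiplicative change in odds per extra hour
z_crit = NormalDist().inv_cdf(0.975)          # about 1.96
ci_lower = math.exp(coef - z_crit * se)
ci_upper = math.exp(coef + z_crit * se)

print(f"Z = {z:.2f}, P = {p_value:.3f}")
print(f"Odds ratio = {odds_ratio:.2f} ({ci_lower:.2f}, {ci_upper:.2f})")
```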

(c)

X = 40

Pr(Y = 1 | X = 40) = e^(-8.19369 + 0.239177 × 40) / (1 + e^(-8.19369 + 0.239177 × 40))

= 0.79793

There is a 0.79793 chance of a worker being obese if he usually works a 40 hour week.
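The calculation in part (c) can be checked with a short function that evaluates the fitted logistic curve. This is a minimal sketch of the equation from part (a); the function name `prob_obese` is our own.

```python
import math


def prob_obese(hours):
    """Fitted probability of obesity for a given number of hours worked per week."""
    eta = -8.19369 + 0.239177 * hours  # linear predictor
    return math.exp(eta) / (1 + math.exp(eta))


p40 = prob_obese(40)
print(round(p40, 5))
```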

5. Pennie Enterprises

(a)

Scatterplot of average weekly productivity against the length of service

Comment:

There is a non-linear relationship between productivity and length of service: the relationship appears to be negative and curved.

(b)

Minitab output:

Regression Analysis: Productivity versus Service, ServiceSquared

The regression equation is
Productivity = 398 - 7.31 Service + 0.0414 ServiceSquared

Predictor           Coef   SE Coef      T      P
Constant          398.49     19.59  20.34  0.000
Service          -7.3072    0.9099  -8.03  0.000
ServiceSquared  0.041411  0.007696   5.38  0.000

S = 29.6329 R-Sq = 93.7% R-Sq(adj) = 92.6%

Full estimated regression equation:

Y = 398.49 - 7.3072 X + 0.041411 X² + ε, ε ~ N(0, σ²)
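To see how the fitted quadratic behaves, the equation can be evaluated at a few lengths of service. The sketch below uses the coefficients from the Minitab output; the service values in the loop are hypothetical and are not taken from the data, so any value outside the observed range should be treated with caution.

```python
def fitted_productivity(service):
    """Fitted mean productivity from the quadratic model (error term omitted)."""
    return 398.49 - 7.3072 * service + 0.041411 * service ** 2


# Hypothetical lengths of service, chosen only for illustration.
for service in (10, 40, 80):
    print(service, round(fitted_productivity(service), 2))
```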

(c)

(i)

Both X = Service and X² = Service² are important predictor variables in the model: both β1 and β2 have small p-values (0.000), so we reject each of the null hypotheses H0: β1 = 0 and H0: β2 = 0. Therefore, X and X² are both significant predictors of Y (productivity).

(ii)

R² = 93.7%

93.7% of the variation in productivity (Y) is explained by the variation in the length of service (X) and service squared (X²).

(d)

Histogram of residuals

Plot of the residuals against the fitted values

Comment:

The quadratic regression model is only valid if the assumption holds that the random term ε is Normally distributed with mean zero and constant variance σ².

The histogram of residuals does not seem to peak at about zero, and it has irregular peaks around its highest point rather than a bell shape, so the assumption of Normality does not seem plausible. However, the plot of residuals against fitted values does not show any pattern, just a random scatter of points, which means that the assumption that the residuals have constant variance is plausible.

Since the assumptions regarding the residuals are not adequately verified, the quadratic regression model is invalid.

(e)

A Scatterplot of average weekly productivity (Y) against length of service (X) with fitted quadratic regression line

PRACTICAL 4: FURTHER TIME SERIES AND FORECASTING

1.

Comment:

There is an obvious positive trend, with clear and simple seasonal (cyclical) variation. Revenue seems to peak in the 4th quarter, around October to December, every year. There is no obvious outlier.

3. Regression table produced by Minitab:

Regression Analysis: MA versus Time index

The regression equation is
MA = 50.7 + 3.48 Time index

35 cases used, 4 cases contain missing values

Predictor      Coef  SE Coef       T      P
Constant    50.7124   0.0915  553.99  0.000
Time index  3.48142  0.00409  852.10  0.000

S = 0.244117 R-Sq = 100.0% R-Sq(adj) = 100.0%

Regression equation:

Y* = 50.712 + 3.481 T + ε, ε ~ N(0, 0.244117²)

Significance of the slope:

Yes, the slope term β1 is significant: the p-value is 0.000, which is less than 0.05, so we reject the null hypothesis H0: β1 = 0 in favour of the alternative hypothesis H1: β1 ≠ 0.

4.

5. Seasonal Effects

Minitab output:

Seasonal Indices

Period     Index
     1   22.7682
     2  -26.3280
     3  -28.2082
     4   31.7679

6. Residual Series

Time series plot of the residual series:

7. Partial Autocorrelation

8. AR(1)

9. Final Estimates of Parameters

Minitab output:

Final Estimates of Parameters

Type         Coef  SE Coef     T      P
AR 1       0.5746   0.1344  4.27  0.000
Constant  0.01528  0.04050  0.38  0.708
Mean      0.03592  0.09520

10. Estimated model:

Y_t = 0.03592 + 0.5746 (Y_{t-1} - 0.03592) + ε_t
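The estimated model is written here in terms of the Mean, while Minitab also reports a Constant. For an AR(1) model the two are linked by Constant = Mean × (1 - AR coefficient), which the sketch below checks against the printed estimates. The function names and the test value 0.5 are hypothetical and are used only to show that the two forms agree.

```python
phi = 0.5746    # AR 1 coefficient from the Minitab output
mean = 0.03592  # Mean from the Minitab output

# Constant implied by the mean form of the model.
constant = mean * (1 - phi)


def ar1_mean_form(y_prev):
    """One-step prediction written around the mean."""
    return mean + phi * (y_prev - mean)


def ar1_constant_form(y_prev):
    """The same prediction written with the constant term."""
    return constant + phi * y_prev


print(round(constant, 5))
print(ar1_mean_form(0.5), ar1_constant_form(0.5))  # 0.5 is an arbitrary test value
```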

11. Forecasted values

Minitab output:

Forecasts from period 39

                        95% Limits
Period  Forecast      Lower     Upper  Actual
    40  0.017677  -0.478117  0.513472
    41  0.025440  -0.546370  0.597250
    42  0.029900  -0.564876  0.624677
    43  0.032463  -0.569703  0.634630
    44  0.033936  -0.570651  0.638522
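Beyond the first step, each AR(1) forecast is obtained by applying the estimated model to the previous forecast, so the forecasts move gradually towards the mean. The sketch below starts from Minitab's forecast for period 40 and rebuilds periods 41 to 44 with the rounded estimates; small differences in the last decimal places are expected because Minitab works with unrounded values. The dictionary names are our own.

```python
phi, mean = 0.5746, 0.03592  # rounded estimates from the Minitab output

# Forecasts printed by Minitab, used as the starting value and for comparison.
minitab = {40: 0.017677, 41: 0.025440, 42: 0.029900, 43: 0.032463, 44: 0.033936}

forecasts = {40: minitab[40]}
for period in range(41, 45):
    # Apply the estimated AR(1) model to the previous forecast.
    forecasts[period] = mean + phi * (forecasts[period - 1] - mean)

for period in sorted(forecasts):
    print(period, f"{forecasts[period]:.6f}", f"{minitab[period]:.6f}")
```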

12. Forecast revenue in April-June 2010 (time-point 40)

Y* = 50.712 + 3.481 T + ε, ε ~ N(0, 0.244117²)

i. Estimate of the trend:
Y*_40 = 50.712 + 3.481 × 40 = 189.952

ii. Full forecast:
0.017677 + 189.952 + (-26.3280) = 163.642,
or about 163,642 Australian dollars

iii. 95% Confidence Interval for forecasted revenue in April-June 2010
Lower: -0.478117 + 189.952 + (-26.3280) = 163.146
Upper: 0.513472 + 189.952 + (-26.3280) = 164.137

i.e. about (163,146, 164,137) Australian dollars.

13. Forecast revenue in July-September 2010 (time-point 41)

Y* = 50.712 + 3.481 T + ε, ε ~ N(0, 0.244117²)

i. Estimate of the trend:
Y*_41 = 50.712 + 3.481 × 41 = 193.433

ii. Full forecast:
0.025440 + 193.433 + (-28.2082) = 165.250,
or about 165,250 Australian dollars

iii. 95% Confidence Interval for forecasted revenue in July-September 2010
Lower: -0.546370 + 193.433 + (-28.2082) = 164.678
Upper: 0.597250 + 193.433 + (-28.2082) = 165.822

i.e. about (164,678, 165,822) Australian dollars.
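The two forecasts above follow the same recipe: trend estimate plus seasonal index plus the AR(1) forecast (or limit) for the residual series. A minimal sketch of that recipe is given below. The function names are our own, and the pairing of time-point 40 with seasonal period 2 and time-point 41 with period 3 is taken from the working above.

```python
# Seasonal indices from the Minitab output in question 5.
SEASONAL = {1: 22.7682, 2: -26.3280, 3: -28.2082, 4: 31.7679}


def trend(t):
    """Trend estimate from the fitted regression of MA on the time index."""
    return 50.712 + 3.481 * t


def full_forecast(t, period, residual):
    """Trend + seasonal index + residual-series forecast (or interval limit)."""
    return trend(t) + SEASONAL[period] + residual


# Lower limit, point forecast and upper limit of the residual series (question 11).
q40 = [full_forecast(40, 2, r) for r in (-0.478117, 0.017677, 0.513472)]
q41 = [full_forecast(41, 3, r) for r in (-0.546370, 0.025440, 0.597250)]

print([round(v, 3) for v in q40])
print([round(v, 3) for v in q41])
```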

14.

Comment:

The histogram of residuals does not seem to be Normally distributed. It peaks at roughly zero, but it does not tail off smoothly on both sides. Hence, the assumption of Normality is not plausible. However, the plot of residuals does not show any pattern, just a random scatter of points, which means that the assumption that the residuals have constant variance is plausible.

Since the assumptions regarding the residuals are not adequately verified, the fitted model is invalid.

PRACTICAL 5: STATISTICAL PROCESS CONTROL

3.

Comment:

It appears that chance variation occurs at samples 6 and 7: these samples fall below the lower control limit, but the successive samples return to within the control limits. However, samples 21, 22, 23 and 24 appear to show assignable variation, as four consecutive samples go out of control, and a level shift seems to occur here.

4.

Comment:

The mean x̄ = 5 does not change between the x-charts with 3-sigma control limits and with 2-sigma control limits. However, the lower and upper control limits have changed: the area within the limits is narrower with 2-sigma control limits than with 3-sigma control limits. The 2-sigma control limits lead to more samples appearing out of control, with a greater chance of a false alarm occurring.

6. Probability of a false alarm

2-sigma control limits:

Minitab output:

Cumulative Distribution Function

Normal with mean = 0 and standard deviation = 1

 x  P( X <= x )
-2    0.0227501

Probability of a false alarm: 2 × 0.0227501 = 0.0455002

2.5-sigma control limits:

Minitab output:

Cumulative Distribution Function

Normal with mean = 0 and standard deviation = 1

   x  P( X <= x )
-2.5    0.0062097

Probability of a false alarm: 2 × 0.0062097 = 0.0124194
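Both false alarm probabilities come from the same rule: for k-sigma control limits on an in-control process with Normally distributed sample means, the chance of a point falling outside either limit is 2 × P(Z <= -k). The sketch below reproduces the two values with Python's standard library in place of Minitab; the function name `false_alarm_probability` is our own.

```python
from statistics import NormalDist


def false_alarm_probability(k):
    """Probability that an in-control sample mean falls outside k-sigma limits."""
    return 2 * NormalDist().cdf(-k)  # both tails of the standard Normal


for k in (2, 2.5):
    print(f"{k}-sigma limits: {false_alarm_probability(k):.7f}")
```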

8.

9. The x-chart using x̄ and S shows a much higher UCL and a much lower LCL than when we assumed μ and σ were known, and it does not show the chance variation which appeared in the x-chart with known μ and σ. The x-chart using x̄ and S is more reliable because it uses estimates from the data; it is unrealistic to assume that we know μ and σ.
