
____________________________________________________________________________

IUT de Saint-Etienne – Département TC –J.F.Ferraris – Math – S2 – Stat2Var – TExCorr – Rev2020

SALES AND MARKETING Department

MATHEMATICS

2nd Semester

________ Bivariate statistics ________

SOLUTIONS of tutorials and exercises

Online document: http://jff-dut-tc.weebly.com section DUT Maths S2


Exercise 1. (Tutorial for lesson page 5)

Are people’s behaviour in relation to tobacco and their gender related, at a 10% significance level?

Here are the results of a survey made on a sample of 51 men and 66 women:

G : variable "gender"
Gm : men
Gw : women

B : variable "behaviour in relation to tobacco"
Bn : never smoked
Bs : smoke
Bss : stopped smoking

Observed frequencies:

         Gm      Gw      total
Bn       12      23       35
Bs       31      26       57
Bss       8      17       25
total    51      66      117

Theoretical frequencies according to H0:

         Gm      Gw      total
Bn       15.26   19.74    35
Bs       24.85   32.15    57
Bss      10.90   14.10    25
total    51      66      117

Detailed Chi-squares and total:

         Gm        Gw
Bn       0.69507   0.53710
Bs       1.52417   1.17777
Bss      0.77038   0.59529
total (χ²calc): 5.300

1) Place the subtotals and the general total in the first table, and in the second one, identically.

2) Fill the second table (6 central theoretical values) following proportional calculations.

3) Table #3: calculate the six Chi-square, then add them to get the value χ²calc.

4) Test writing:

Null hypothesis: H0 : Gender and tobacco behaviour are independent

Observed χ²

Value of the variable χ² between the observed and the theoretical samples: χ²calc = 5.3

Rejection area

Significance level: α = 10 %

Number of dof: (r-1)(k-1) = (3 – 1)(2 - 1) = 2

Value of the variable χ² limit until rejection : χ²lim = 4.61

Comparison and decision:

As χ²calc > χ²lim , H0 can be rejected, at a 10% significance level.

In other words, we can say, with less than a 10% risk of being wrong, that men and women behave differently with tobacco. However, we could not reject our null hypothesis at a 5% significance level: χ²lim is then 5.99, which χ²calc does not reach, so claiming dependence would carry more than a 5% risk of being wrong.
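The calculations of steps 1) to 3) can be checked with a few lines of Python. This is only a sketch: the function name chi_square_independence is ours (hypothetical), and the limit value 4.61 is simply the one read from the χ² table above, not computed.

```python
def chi_square_independence(observed):
    """Return (expected table, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Under H0, each theoretical frequency is (row total x column total) / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of the partial chi-squares (observed - expected)^2 / expected.
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, chi2, dof


# Exercise 1: rows Bn, Bs, Bss; columns Gm, Gw.
observed = [[12, 23], [31, 26], [8, 17]]
expected, chi2, dof = chi_square_independence(observed)

chi2_lim = 4.61  # assumption: value read from the chi-square table (10 %, 2 dof)
reject_h0 = chi2 > chi2_lim
print(round(chi2, 3), dof, reject_h0)
```

The same function can be reused for exercises 2 to 5 by changing the observed table and the limit value.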

Exercise 2.

Two candidates A and B compete for a presidential election. In a little town, there are 500 voters. 100 are

retired people, 50 are unemployed and 350 are employees. There, the vote results are:

voters        candidate A   candidate B   blank/abstention
unemployed    24            16            10
employees     122           148           80
retired       36            27            37

1) Decide, with a 1% significance level, whether people’s opinion depends on their social group or not.

* H0: "The type of vote is independent of the social group"

* Let’s perform the necessary calculations in order to get χ²calc:


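Observations:

              A       B       blank/abst.   total
unemployed    24      16      10            50
employees     122     148     80            350
retired       36      27      37            100
total         182     191     127           500

In theory (independence):

              A       B       blank/abst.   total
unemployed    18.2    19.1    12.7          50
employees     127.4   133.7   88.9          350
retired       36.4    38.2    25.4          100
total         182     191     127           500

Chi-square:

              A       B       blank/abst.
unemployed    1.848   0.503   0.574
employees     0.229   1.529   0.891
retired       0.004   3.284   5.298

Chi²calc = 14.16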

* Rejection area: with α = 1 %, and with 4 degrees of freedom : Chi²lim = 13.28

* Decision: as Chi²calc > Chi²lim, we can reject H0 (so: claim that people’s opinion depends on their social group) with less than a 1% risk of being wrong.

2. What can we say if we do not include blank votes and abstentions?

Let’s redo the analysis, excluding blank votes and abstentions:

* Observations:

              A       B       total
unemployed    24      16      40
employees     122     148     270
retired       36      27      63
total         182     191     373

In theory (independence):

              A       B       total
unemployed    19.52   20.48   40
employees     131.7   138.3   270
retired       30.74   32.26   63
total         182     191     373

Chi-square:

              A       B
unemployed    1.03    0.981
employees     0.72    0.687
retired       0.9     0.858

Chi²calc = 5.175

* with 2 dof : Chi²lim = 5.991 with α = 5 % and Chi²lim = 4.605 with α = 10 %.

We can assert that people’s opinion depends on their social group with a 10 % risk of being wrong, but we could not assert it if we accepted only a 5 % risk of being wrong.

Exercise 3.

The table shows attendance in two stores A and B: how many people

made at least one purchase. These clients are sorted by age group (10 to

15 years old, and so on).

1. Say, with a 5% significance level, whether the chosen store depends on

the age of a client.

age        store A   store B
10 - 15    46        24
15 - 20    29        35
20 - 40    14        17
> 40       12        18

* Observed frequencies (obs):

            A       B       total
10 to 15    46      24      70
15 to 20    29      35      64
20 to 40    14      17      31
40 +        12      18      30
total       101     94      195

Theoretical frequencies (th):

            A       B       total
10 to 15    36.26   33.74   70
15 to 20    33.15   30.85   64
20 to 40    16.06   14.94   31
40 +        15.54   14.46   30
total       101     94      195

Chi-squares (χ²):

            A        B        total
10 to 15    2.6185   2.8135   5.4320
15 to 20    0.5192   0.5579   1.0771
20 to 40    0.2634   0.2830   0.5464
40 +        0.8058   0.8658   1.6716
total       4.2069   4.5202   8.727

* with 3 dof and a 5% level, the table gives χ²lim = 7.815.

* Thus, this limit value has been exceeded. With a 5 % significance level, we can reject the hypothesis that

the choice of the store and the age group are independent.

2) What age group mostly contributes to the previous result? Explain.

The age group « 10 to 15 year old » mostly contributes to the total χ². It could be easily stated that people

that are over 15 year old show quite the same purchasing behaviour. On the contrary, the first age group

shows a very different frequency distribution (first table, in blue), compared to other customers.

3) Give the meaning of the “5% significance level” on your first answer.

We claim that age and chosen store are dependent, accepting a 5 % risk of being wrong.

4) According to your Chi² table, can you be more accurate about the chance taken in this statement (your first

answer)?

To reach a 2% level, χ²calc would have to exceed 9.837, which it does not (8.727). So, the χ² table (form) doesn’t allow us to say more than “the risk is between 2% and 5%”.


Exercise 4.

In a survey, 100 people were asked about their age and their cinema attendance. We name X the variable "age" and Y the variable "number of annual cinema shows". The survey result is the following table of counts (fr.: citations) :

Y X [15 ; 25[ [25 ; 50[ ≥ 50

none 4 6 13

1 to 11 10 16 15

12 to 23 13 8 4

≥ 24 6 3 2

1) By a χ² independence test, with a 2% significance level, decide whether there’s a link or not between the

age and the level of attendance at the cinema.

For each age class X: observed frequency (obs), theoretical frequency (th) and partial χ².

            [15 ; 25[             [25 ; 50[             50 and more           total
Y           obs   th      χ²      obs   th      χ²      obs   th      χ²      obs   th    χ²
none        4     7.59    1.698   6     7.59    0.333   13    7.82    3.431   23    23    5.462
1 to 11     10    13.53   0.921   16    13.53   0.451   15    13.94   0.081   41    41    1.453
12 to 23    13    8.25    2.735   8     8.25    0.008   4     8.5     2.382   25    25    5.125
≥ 24        6     3.63    1.547   3     3.63    0.109   2     3.74    0.81    11    11    2.466
total       33    33      6.901   33    33      0.901   34    34      6.704   100   100   14.51

With 6 dof and α = 2%, the χ² table gives Chi²lim = 15.03.

Our Chi²calc (14.51) doesn’t exceed it. So, at a 2% significance level, we can’t reject the idea that age and

level of attendance at the cinema are independent.

2) Using your form table, discuss the level of confidence you can assign to the assertion : “they are

dependent”.

Our Chi²calc (14.51) is located between both Chi²lim of levels 2% and 5%. Thus, we can assume dependence

with more than 95% confidence, but with less than 98% confidence.

3) Identify the most important partial Chi-2s and give the meaning of these high values.

The biggest partial Chi² has been obtained with the “50 year old and more” whose attendance is zero: the

observed frequency (13) is much higher than the expected one (7.82).

The partial Chi² of the “50 year old and more” whose attendance is “between 12 and 23 times a year” is big

too: the observed frequency is much lower than the theoretical one

The partial Chi² of the “15 to 25 year old” whose attendance is “between 12 and 23 times a year” is big too:

the observed frequency is much higher than the theoretical one.

Exercise 5.

Using the data series introduced in exercise 11, decide, by means of a Chi-square test, whether both variables are independent or not.

         [0 ; 15[               [15 ; 25[               [25 ; 40[               total
X        obs   th      χ²       obs   th      χ²        obs   th      χ²        obs   th    χ²
1        23    60.06   22.87    92    84.63   0.642     80    50.31   17.52     195   195   41.03
2        77    59.75   4.979    84    84.2    0.0005    33    50.05   5.809     194   194   10.79
3        42    27.72   7.356    35    39.06   0.422     13    23.22   4.498     90    90    12.28
4        12    6.468   4.731    6     9.114   1.064     3     5.418   1.079     21    21    6.875
total    154   154     39.93    217   217     2.128     129   129     28.91     500   500   70.97

(columns: classes of Y)

With 6 dof and α = 1%, the χ² table gives Chi²lim = 16.8.

Our Chi²calc (70.97) is much bigger. We can claim that both variables are dependent with more than 99% confidence.



Exercise 6.

Let’s take a close look at the evolution of a company’s turnover through time.

Year    N                        N+1                      N+2                      N+3
        tri1  tri2  tri3  tri4   tri1  tri2  tri3  tri4   tri1  tri2  tri3  tri4   tri1  tri2  tri3  tri4
(M€)    28    45    49    36     30    44    48    40     28    46    52    37     31    42    54    39

Though there are big seasonal variations, due to its particular activity, is it possible to find out a global

trend on several years?

Let’s decide to calculate and display the moving means, each for a one-year duration:

(do it as a group job: divide the set of calculations with your neighbours and share your results)

1-5 2-6 3-7 … 12-16

X 3 4 5 6 7 8 9 10 11 12 13 14

Y 39.75 39.875 39.625 40 40.25 40.25 41 41.125 41.125 41 40.75 41.25

calculations:

The values of X (on the graph) count the trimesters since the beginning: 1st trimester year N → x = 1 ; 2nd trimester year N → x = 2 ; and so on. Each moving mean covers a one-year duration centred on a trimester: the two end values of the window count for one half each, and the weighted sum is divided by 4. We deduce that the values of X to be entered in the table are 3, 4, 5, and so on:

1st value = (1/2 + 2 + 3 + 4 + 5/2)/4 = 3 ; 2nd value = (2/2 + 3 + 4 + 5 + 6/2)/4 = 4 ; and so on until the 12th value, (12/2 + 13 + 14 + 15 + 16/2)/4 = 14.

The values of Y calculated in the table above are the corresponding weighted average turnovers over the five considered trimesters:

1st value of Y = (28/2 + 45 + 49 + 36 + 30/2)/4 = 39.75 ; 2nd value of Y = (45/2 + 49 + 36 + 30 + 44/2)/4 = 39.875 ; and so on.
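The twelve moving means can also be obtained with a short program. The sketch below assumes a window of 4 trimesters (one year), with half weights on the two end values as in the calculations above; the function name centered_moving_means is ours (hypothetical).

```python
def centered_moving_means(values, period=4):
    """Centred moving means over `period` values (period assumed even)."""
    means = []
    for start in range(len(values) - period):
        window = values[start:start + period + 1]
        # The two end values count for one half each, the inner ones for one.
        total = window[0] / 2 + sum(window[1:-1]) + window[-1] / 2
        means.append(total / period)
    return means


# Turnover (M€) for the 16 trimesters of years N to N+3.
turnover = [28, 45, 49, 36, 30, 44, 48, 40, 28, 46, 52, 37, 31, 42, 54, 39]
y = centered_moving_means(turnover)
x = list(range(3, 3 + len(y)))  # trimester ranks 3 to 14

for rank, value in zip(x, y):
    print(rank, value)
```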


Exercise 7.

Let’s return to one of the examples introduced page 3 (lessons doc): effect of the amount of fertilizer on the harvested production.

fertilizer harvest

plot # X (kg.ha-1) Y (q.ha-1)

1 150 46

2 80 37

3 120 46

4 220 51

5 100 43

1) For each half-cloud, determine the mean points coordinates.

Half-clouds have to be defined: since there are 5 pairs of results, let’s choose a cut with 3 points on the left and 2 points on the right (the reverse would also have been allowed), always separating them according to the X values:


1st half-cloud: (80, 37), (100, 43), (120, 46); mean point: G1(100, 42)

2nd half-cloud: (150, 46), (220, 51); mean point: G2(185, 48.5)

2) Determine the expression of the Mayer’s line (G1G2).

slope: a = (48.5 − 42)/(185 − 100) = 6.5/85 ≈ 0.07647

y = 0.07647 x + b can be written with the coordinates of G1 (for instance): 42 = 0.07647×100 + b,

which gives us b = 34.35.

Expression of the Mayer’s line: y’ = 0.07647 x + 34.35

3) On a graph, plot the initial table and draw this line.
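Mayer's method is easy to sketch in Python. The function below (mayer_line, a hypothetical name) sorts the points by X, splits them into two half-clouds and returns the slope and intercept of the line (G1G2); the size of the left half-cloud is a choice left to the user, here 3 as in the solution above.

```python
def mayer_line(points, left_size):
    """Slope and intercept of the line through the two half-cloud mean points."""
    ordered = sorted(points)  # half-clouds are always separated by the X values
    left, right = ordered[:left_size], ordered[left_size:]

    def mean_point(group):
        xs, ys = zip(*group)
        return sum(xs) / len(xs), sum(ys) / len(ys)

    (x1, y1), (x2, y2) = mean_point(left), mean_point(right)
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1  # the line passes through G1
    return a, b


# Exercise 7: (fertilizer in kg/ha, harvest in q/ha) for the five plots.
plots = [(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)]
a, b = mayer_line(plots, left_size=3)
print(round(a, 5), round(b, 2))
```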

Exercise 8.

Determine the expression of the Mayer’s line, using the case given in exercise 6.

The 16 values are split into 8 for years N and N+1 and 8 for years N+2 and N+3.

xG1 = (1 + 2 + ... + 8)/8 = 4.5 and yG1 = (28 + 45 + ... + 40)/8 = 40

xG2 = (9 + 10 + ... + 16)/8 = 12.5 and yG2 = (28 + 46 + ... + 39)/8 = 41.125

slope: a = 1.125/8 = 0.140625

y’ = 0.140625 x + b can be written with the coordinates of G1 (for instance): 40 = 0.140625×4.5 + b,

which gives us b = 39.367.

Expression of the Mayer’s line: y’ = 0.140625 x + 39.367


Exercise 9.

Calculate or display on your calculator: the means and standard deviations; the covariance.

1) Taking the data of exercise 7 (fertilizer/harvest)

x̄ = 134 kg.ha-1 and ȳ = 44.6 q.ha-1 ; σ(X) = 48.826 kg.ha-1 and σ(Y) = 4.5869 q.ha-1 (Stat mode).

Cov(X, Y) = (Σ xi yi)/n − x̄ ȳ = 30900/5 − 134 × 44.6 = 203.6

2) Taking the data of exercise 4 (age/# of cinema shows) – choose 60 as average age for the class 50 and more;

choose 36 as average number of shows for the class 24 and more.

x̄ = 39.375 years and ȳ = 10.795 shows ; σ(X) = 16.422 years and σ(Y) = 10.833 shows (Stat mode).

Cov(X, Y) = (Σ xi yi)/n − x̄ ȳ = 36890/100 − 39.375 × 10.795 = −56.15



Exercise 10.

Let’s consider the following time series: a company’s annual expenses in advertising.

X : year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018

Y : expense (k€) 41 60 55 66 87 61 90 95 82 120 125 118

The corresponding scatter plot is represented:

Determine the expression of the Y on X fitting line, following the least square method; then, draw it.

(D) : y’ = 7.0629 x + 37.42, where x is the rank of the year (x = 1 for 2007)
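For readers without the calculator at hand, the coefficients of (D) can be recomputed from the covariance and the variance of X. This is a minimal sketch; the function name least_squares_line is ours (hypothetical) and the years are coded by their rank, 1 for 2007.

```python
def least_squares_line(xs, ys):
    """Slope a and intercept b of the Y on X least square line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    a = cov_xy / var_x
    b = mean_y - a * mean_x  # the line passes through the mean point G
    return a, b


ranks = list(range(1, 13))  # 2007 -> 1, ..., 2018 -> 12
expenses = [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118]  # k€
a, b = least_squares_line(ranks, expenses)
print(round(a, 4), round(b, 2))
```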

Exercise 11.

500 people, having passed their driving license exam, are sorted in the table below.

They are distributed with respect to the number X of times they took the exam before passing it and to the

number Y of hours of driving lessons before their first attempt.

         Y
X        [0 ; 15[   [15 ; 25[   [25 ; 40[
1        23         92          80
2        77         84          33
3        42         35          13
4        12         6           3

1) Define a margin frequency. Then, give an example from the table.

A margin frequency is the total number of individuals associated to a value of one of the variables.

e.g.: 195 (margin frequency) people passed their exam following their first attempt (value: X = 1).

2) Describe, shortly, the way to enter the data set in your calculator.

We usually enter the frequencies in List3 (12 values here); List1 and List2 are used for entering the corresponding X and Y values.

3) Calculate the covariance of the pair (X, Y) and give a concrete comment about this value.

Cov(X, Y) = 16815/500 − 1.874 × 19.375 = −2.679, negative. Globally, the more hours of driving lessons one takes, the fewer attempts one needs to pass the exam.

4) Among those who took between 15 and 25 hours of driving lessons, what is the rate of those who passed

their exam on the third attempt? 35/217 = 16.13 %

5) Among those who passed their exam on the third attempt, what is the rate of those who took between 15

and 25 hours of driving lessons? 35/90 = 38.89 %
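The covariance of question 3 can be reproduced from the table, provided each class of Y is replaced by its centre. The sketch below assumes the centres 7.5, 20 and 32.5 (consistent with the mean 19.375 used above); the function name grouped_covariance is ours (hypothetical).

```python
def grouped_covariance(x_values, y_values, counts):
    """Covariance of (X, Y) from a two-way table of frequencies.

    counts[i][j] is the number of individuals with X = x_values[i]
    and Y = y_values[j].
    """
    n = sum(sum(row) for row in counts)
    # Marginal frequencies: row totals for X, column totals for Y.
    mean_x = sum(x * sum(row) for x, row in zip(x_values, counts)) / n
    mean_y = sum(y * sum(col) for y, col in zip(y_values, zip(*counts))) / n
    mean_xy = sum(
        x * y * c
        for x, row in zip(x_values, counts)
        for y, c in zip(y_values, row)
    ) / n
    return mean_xy - mean_x * mean_y


attempts = [1, 2, 3, 4]
hours = [7.5, 20, 32.5]  # assumption: class centres of [0;15[, [15;25[, [25;40[
table = [[23, 92, 80], [77, 84, 33], [42, 35, 13], [12, 6, 3]]
cov = grouped_covariance(attempts, hours, table)
print(round(cov, 3))
```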

(Scatter plot legend for exercise 10: year 1 = 2007.)


Exercise 12.

A sales agent wishes to analyse his (or her) activity and efficiency. For each appointment with a prospect, the length of the product presentation (X, in minutes) and the sold quantity (Y) have been noted. The twelve values inside the table show the number of appointments that correspond to each pair (X, Y).

1) Give the meaning of the frequency "8" found inside the table.

During each of 8 appointments with prospects, the sales agent made a 10 to 20 min-long presentation and

then sold 2 units.

2) Calculate, manually, the average time spent per appointment.

Margin frequencies of the three values of X: 7, 19 and 21. The corresponding lengths are 5, 15 and 25 (in

minutes). Total number of appointments: 47.

The average time is then (5×7 + 19×15 + 21×25)/47 = 17.98 minutes per appointment (about 18 minutes).

3) Give the covariance of the pair (X, Y).

Cov(X, Y) = 1595/47 − 17.9787 × 1.80851 ≈ 1.42

Exercise 13.

The following table indicates the sales price (€) of an equipment and the number of sold items, for 4 years.

year rank 1 2 3 4

sales price (€) X 300 210 270 375

# of sold items Y 198 240 222 160

1) Build the scatter plot with an orthogonal frame. The axes intersection must be the point (210, 160);

scales: 1 cm for €15 on the abscissas axis, 1 cm for 10 items on the ordinates axis.

2) Determine the coordinates of G, mean point of the cloud.

G(288.75 ; 205)

3) a. Determine the expression of the Y on X fitting line, following the least square method.

The coefficients will be expressed with 6 significant figures.

y’ = -0.498274 x + 348.876

b. Draw this regression line on the graph.

4) Which year saw the highest turnover? For which amount?

The turnover is X×Y. Its four values are, from year 1 to year 4: €59,400, €50,400, €59,940 and €60,000. The highest turnover, €60,000, was reached in year # 4.



going further:

5) Now, we assume that, each year, the number of sold items y and the sales price x are related this way:

y = – 0.498 x + 349. We denote S(x) the turnover achieved by selling y items, €x each.

a. Express S(x) with respect to x.

S(x) = xy = -0.498 x² + 349 x

b. Find the variations of the function S defined in [210 ; 375].

S’(x) = -0.996 x + 349 > 0 iff x < 350.4. S is increasing in [210 ; 350.4] and decreasing in [350.4 ; 375].

c. Deduce the sales price we would have to set for a fifth year if we want a maximum turnover. How many

items will be sold (round to one unit)? For what turnover?

We have to set the sales price at €350.4. number of sold items: y = – 0.498×350.4 + 349 = 174.5.

Considering 174 items, x = €350.4/unit and turnover = €60969.6;

considering 175 items, x = €350.4/unit and turnover = €61320.
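The optimum found in question 5 can be double-checked numerically. The sketch below only uses the rounded model y = –0.498 x + 349 given in the statement; the names turnover, best_price and items are ours (hypothetical).

```python
def turnover(price):
    """Modelled turnover S(x) = x * (-0.498 x + 349)."""
    return price * (-0.498 * price + 349)


# S'(x) = -0.996 x + 349 vanishes at the top of the parabola.
best_price = 349 / 0.996
items = -0.498 * best_price + 349

print(round(best_price, 1), round(items, 1), round(turnover(best_price)))
```

Evaluating turnover at prices on either side of best_price confirms that S increases before this price and decreases after it.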

Exercise 14.

A survey compares people's spending on high-tech equipment with their income. Each column of the table T below represents, in a given French region, the average monthly income of people (X) and the average monthly expense (Y) in high-tech equipment.

region A B C D E F

income X (€) 1550 1620 1770 1850 1930 2000

expense Y (€) 57 61 66 73 76 82

1) Calculate the covariance and then the linear correlation coefficient of the pair (X, Y).

Give an interpretation of both parameters.

Cov(X, Y) = 749720/6 − 1786.66667 × 69.1666667 ≈ 1375.5556, positive, showing a global upward trend of the expense, as the income increases.

r ≈ 1375.5556/(160.2775 × 8.66827) ≈ 0.9901, very close to 1, hence an excellent linear correlation between X and Y.

2) a. Give, by means of your calculator, the expression of the Y on X regression line.

y’ = 0.05355 x – 26.50

b. Obtain the expression of the Mayer's line of the series, from the table T.

Let's split the table into two groups: {A, B, C} and {D, E, F} (the values of X are already sorted in ascending order). The coordinates of both mean points are:

G1(1646.6667 ; 61.333333) and G2(1926.66667 ; 77)

The Mayer's fitting line, (G1G2), has a typical expression y’ = ax + b, where

a = (yG2 − yG1)/(xG2 − xG1) ≈ 0.05595 and b = yG1 − a × xG1 ≈ −30.80 ; (DM) : y’ = 0.05595 x − 30.80

c. Both lines slightly differ. Find the income for which they both give the same expense. What makes this

common point special, inside the point cloud?

Let's act as if we didn't already know this common point.

We can seek it by equating both expressions: 0.05595 x – 30.80 = 0.05355 x – 26.50.

That gives: 0.0024 x = 4.3 and then x = 1791.67. We can deduce the value of y: 69.44.

Both lines give an estimated average expense of €69.44 for an average income of €1791.67.

This common point is in fact the mean point G of the cloud: the actual average value of X in the table is 1786.67 and the actual average value of Y is 69.17 (the small differences come from the rounded coefficients used four lines above).

This property is general, as explained in the lessons of this chapter: a least square fitting line, as well as a Mayer's fitting line, meets Mayer's criterion, which is equivalent to "the line passes through G".




Exercise 15.

Data about the fuel consumption of a motorcycle have been collected (consumption: Y, in L/100km; speed: X, in km/h):

X 10 20 30 40 50 60 70 80 90

Y 15.2 11.6 9.3 7.8 7 6.6 6.9 8 9.6

The scatter plot, on the right, clearly shows us that a linear

regression would be inappropriate to describe the evolution of the

consumption with respect to the speed. Thus, we will propose a

variable change.

1) Let’s define the variable T by: T = (X – 60)².

Complete the following table:

T 2500 1600 900 400 100 0 100 400 900

Y 15.2 11.6 9.3 7.8 7 6.6 6.9 8 9.6

2) Perform a linear regression of Y on T.

Cov(T, Y) = 81280/9 – 766.66667×9.111111 = 2045.926 ; r = 2045.926/780.3133/2.62782 = 0.997759

r is very close to 1, a linear fitting is appropriate, between T and Y.

Least square regression line: y’ = 0.00336 t + 6.535

3) Thus, deduce the expression of the regression curve, for the initial scatter plot.

Regression curve of the pair (X, Y) : y’ = 0.00336 (x – 60)² + 6.535
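The variable change can be followed step by step in Python: build T from X, fit Y on T by least squares, then express the prediction with respect to the speed. This is a sketch under the same choice T = (X – 60)² as above; the names least_squares_line and predicted_consumption are ours (hypothetical).

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the least square line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var = sum(x * x for x in xs) / n - mean_x ** 2
    a = cov / var
    return a, mean_y - a * mean_x


speeds = [10, 20, 30, 40, 50, 60, 70, 80, 90]                 # X, km/h
consumption = [15.2, 11.6, 9.3, 7.8, 7, 6.6, 6.9, 8, 9.6]     # Y, L/100km

# Variable change: T = (X - 60)^2, then linear regression of Y on T.
t_values = [(x - 60) ** 2 for x in speeds]
a, b = least_squares_line(t_values, consumption)


def predicted_consumption(speed):
    """Regression curve of (X, Y): y' = a (x - 60)^2 + b."""
    return a * (speed - 60) ** 2 + b


print(round(a, 5), round(b, 3), round(predicted_consumption(100), 2))
```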

Exercise 16. quadratic fitting

A company took note of its profits Y with respect to X, produced and sold quantity:

X (tons) 2 3 5 7 11

Y (k€) 38 55 72 69 24

T -16 -9 -1 -1 -25

1) Thanks to your calculator, give the linear correlation coefficient between X and Y. Comment.

Cov(X, Y) = 1348/5 – 5.6×51.6 = -19.36 ; r = -19.36/3.2/18.315 = -0.3303

This is far from -1: the linear correlation between X and Y is very weak.

2) Let’s settle the variable T = -(X - 6)².

a. Complete the table.

b. Calculate Cov(T, Y) and then the linear correlation coefficient between both variables.

Cov(T, Y) = -1844/5 - (-10.4)×51.6 = 167.84 ; r = 167.84/9.2/18.315 = 0.9961

c. Is a linear fitting of Y on T appropriate?

r is very close to 1, a linear fitting is appropriate, between T and Y.

d. Determine the expression of the Y on T fitting line, following the least square method.

y’ = 1.983 t + 72.22

e. Deduce an expression of the regression of Y on X.

y’ = -1.983(x - 6)² + 72.22



Exercise 17. quadratic fitting

A market study was conducted on a new type of product. The table below gives, for several proposed sales

price, the number of people willing to pay that price.

unit price (€) X 2 3 4 5 6 7

number of people Y 66 47 34 25 18 14

X: unit price ; Y: number of people ; T = X(X-20) ; Y’: number of people given by the model ; CA = XY and CA’ = XY’: turnovers

X    Y     T      Y’      CA     CA’
2    66    -36    62.97   132    125.9
3    47    -51    48.88   141    146.6
4    34    -64    36.66   136    146.7
5    25    -75    26.33   125    131.7
6    18    -84    17.88   108    107.3
7    14    -91    11.3    98     79.13

1) Calculate the covariance of the variables X and Y, then comment its sign.

Cov(X, Y) = 740/6 − 4.5 × 34 = −29.67, negative: Y tends to increase as X decreases.

2) We set T = X(X - 20)

a. Calculate the linear correlation coefficient between both variables T and Y.

Cov(T, Y) = −11610/6 − (−66.8333) × 34 = 337.33.

r = 337.33/(18.95096 × 17.93507) = 0.992487

b. Comment its value.

This coefficient (0.992487) is an excellent one.

c. Determine the expression of the Y on T fitting line, following the least square method.

y’ = 0.9393 t + 96.78

d. Deduce an expanded expression of the regression of Y with respect to X.

y’ = 0.9393 (x² - 20x) + 96.78 = 0.9393 x² - 18.79 x + 96.78

3) Here we examine the expected turnover (unit selling price × number of sales), if the numbers of citations

obtained in the survey are considered to be the numbers of units sold.

a. Calculate the turnovers that can be extracted from the initial table.

See the table above (turnover = CA = XY)

b. Calculate, for the same values of X, the turnovers CA' that can be got thanks to the formula obtained in question 2)d.

See the table above (turnover = CA’ = XY’)

c. What unit selling price should we fix, so that the best turnover would be reached?

According to the model, it seems that CA’ would be maximum when X is between €3 and €4.

Let’s be a little more accurate:

X 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4

CA' 146.6 147.4 148.1 148.5 148.8 148.8 148.7 148.4 148 147.4 146.6

We will recommend a selling price at about € 3.5 for an optimized turnover.

Exercise 18. inverse fitting

A perfumery, on analysing its turnover, connects the sales quantities (Y) to various perfume brands and

models prices (X). The results are gathered in the following table:

X, bottle’s price (€) 15 25 30 40 45 60 75 90

Y, # of sold bottles 202 117 107 82 78 60 55 48

Answer the questions beginning with "calculate" by using your calculator’s results.



calculator’s results:

1) a. Calculate the covariance of X and Y; comment its sign.

Cov(X, Y) = 28000/8 − 47.5 × 93.625 = −947.2, negative. Y is globally a decreasing function of X.

b. Calculate the linear correlation coefficient of X and Y; comment its value.

rXY = −947.2/(24.109 × 46.843) = −0.8387, not very close to −1. The linear correlation between X and Y is not excellent (the point cloud may be noisy or following a curve).

2) In order to have a more precise idea of how X and Y are related, we set the variable change: T = 850/X

a. After having calculated the list of values of T, in a third list (calculator), justify that the linear correlation

is excellent between T and Y.

The values of T are shown above. The calculations for the pair (T, Y) lead to r = 0.9971, very close to 1. Their linear correlation is excellent.

b. Give the expression of the Y on T regression line, according to the least square method.

y’ = 3.215 t + 15.62

c. What is the least square criterion?

The sum of the squared residuals must be minimum (which makes the fitting line unique).

d. Deduce from question 2)b a modelled expression of Y with respect to X.

y’ = at + b = a × 850/x + b = 2733/x + 15.62

e. According to this model, how many bottles priced at €150 would the perfumery expect to sell?

If x = 150, the estimate of y is: 2733/150 + 15.62 ≈ 33.84 ≈ 34: it can expect to sell 34 bottles.


Exercise 19.

Calculate the point estimates, in the given situations.

1) Taking back exercise 10, give an estimate of the expense in 2020.

y’ = 7.0629 x + 37.42 ; x0 = 14 ; hence y’0 = k€ 136.3

2) Taking back exercise 7, give an estimate of the quantity of fertilizer that would offer a harvest of 60 q/ha.

y’ = 0.07647 x + 34.35 ; y’0 = 60 q/ha ; hence x’0 = 335.4 kg/ha

3) Taking back exercise 15, give an estimate of the fuel consumption when the speed is 100 km/h.

y’ = 0.00336 (x – 60)² + 6.535 ; x0 = 100 ; hence y’0 = 11.91 L/100km


Exercise 20.

Let’s return to exercise 10. We want to estimate the expense, for the year 2020, by a 95% confidence interval.



1) a. Get the values of Y’, from the values of X and the expression of the fitting line;

b. Get the values of Z, by dividing Y by Y’;

c. Then, give the mean and standard deviation of Z.

z̄ = 1.000971 ; σZ = 0.125286

2) Give the point estimate of the expense in 2020.

see exercise 19, question 1: y’0 = k€ 136.3

3) Give the coefficient u corresponding to the confidence level.

u = 1.96

4) Then, give the confidence interval.

[136.3(1.000971 – 1.96×0.125286) ; 136.3(1.000971 + 1.96×0.125286)] = [103.0 ; 169.9]


Exercise 21.

With the data of exercise 7, estimate by a 99% confidence interval the harvest obtained with 300 kg/ha of fertilizer.

1) a. Get the values of Y’, from the values of X and the expression of the fitting line;

b. Get the values of Z, by dividing Y by Y’;

c. Then, give the mean and standard deviation of Z.

z̄ = 0.9991106 ; σZ = 0.0472554

2) Give a point estimate of the harvest.

y’ = 0.07647 x + 34.35 ; x0 = 300 kg/ha ; hence y’0 = 57.29 q/ha

3) Give the coefficient u corresponding to the confidence level.

u = 2.58

4) Then, give the confidence interval.

[57.29(0.9991 – 2.58×0.047255) ; 57.29(0.9991 + 2.58×0.047255)] = [50.25 ; 64.22]


Exercise 22.

For each person in a sample, a survey noted the age class (X) and the visual acuity (Y, 1/10 = 0.1):

         X
Y        [5 ; 35[   [35 ; 45[   [45 ; 55[   [55 ; 65[
0.3      1          5           10          20
0.6      8          12          25          18
0.9      55         30          14          6

Estimate the visual acuity of an 80-year-old person, by a 99% confidence interval.

Y’ = -0.008430X + 1.0424.

Results on Z: z̄ = 0.99886 ; σZ = 0.29616

Point estimate:

y’ = -0.008430 x + 1.0424 ; x0 = 80 ; hence y’0 = 0.3680

Coefficient u: u = 2.58

Confidence interval:

[0.3680(0.99886 – 2.58×0.29616) ;

0.3680(0.99886 + 2.58×0.29616)]

= [0.0864 ; 0.6488]



Exercise 23.

In a country, two variables are compared: the consumer force index and the turnover of its car industry:

consumer force (index) X 3.26 3.85 3.44 3.08 3.6

car industry turnover (G€) Y 9.3 9.56 9.36 9.24 9.47

1) Give the expression of the Y on X Mayer’s line.

Two ways to cut this data set (3 points then 2, or 2 points then 3) as X increases.

case 1: G1(3.26 ; 9.3) and G2(3.725 ; 9.515) y = 0.4624 x + 7.793

case 2: G1(3.17 ; 9.27) and G2(3.63 ; 9.463) y = 0.4203 x + 7.938

2) By means of a point estimate, give a value of the consumer force that would correspond to a G€ 10 car industry turnover.

case #1: y = 10 iff x ≈ 4.77

case #2: y = 10 iff x ≈ 4.91

3) Is a strong correlation between two variables a sign of a cause and effect relationship between them?

Not necessarily. This numerical relationship may just be a coincidence.

Exercise 24. least square + confidence interval

Monthly revenues of a commercial website are listed below, from January to December 2018:

in k€ : 3 5 4 8 10 9 13 12 17 18 18 21

1) In a few words, describe the least square method.

This method consists in finding the line that minimizes the sum of the squared residuals (vertical deviations between the points and the line).

2) Thanks to the global trend of the evolution of the monthly revenue, give the 95% confidence interval of the

predictable revenue in December 2019. (number the months from 1 for January 2018)

month, X      1     2       3       4       5       6       7       8       9       10      11      12
revenue, Y    3     5       4       8       10      9       13      12      17      18      18      21
Y ’           2.5   4.136   5.773   7.409   9.045   10.68   12.32   13.95   15.59   17.23   18.86   20.5
Z             1.2   1.209   0.693   1.08    1.106   0.843   1.055   0.86    1.09    1.045   0.954   1.024

Expression of the Y on X regression line: y’ = 1.636 x + 0.8636

Point estimate of the revenue in December 2019 (x = 24): y’0 = k€ 40.14

Variable Z : z̄ = 1.0132222 and σZ = 0.14538387

Coefficient u for a 95 % confidence level: u = 1.96

Confidence interval: [29.23 ; 52.10]

3) Give the probability that, in December 2019, the revenue would be less than k€ 29.23.

There is a 95% chance that this revenue lies inside the interval. Moreover, the concept of confidence interval involves a symmetric probability distribution (the normal law); thus, there is a 2.5% chance that the revenue is less than the values included in the interval, and a 2.5% chance that it is more than them. Answer: 2.5%.

4) Build the scatter plot (scale: 2 cm for one month), draw the regression line and finally represent the

confidence interval.
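The steps of question 2 (values of Y’, ratios Z = Y/Y’, mean and standard deviation of Z, then the interval) can be sketched as follows. The regression coefficients are the rounded ones given above, the standard deviation is the population one (as displayed by the calculator's Stat mode), and the function name ratio_confidence_interval is ours (hypothetical).

```python
from statistics import mean, pstdev


def ratio_confidence_interval(xs, ys, a, b, x0, u):
    """Interval y'0 (z_mean - u sigma_Z) ; y'0 (z_mean + u sigma_Z)."""
    fitted = [a * x + b for x in xs]               # values of Y'
    ratios = [y / f for y, f in zip(ys, fitted)]   # values of Z = Y / Y'
    z_mean = mean(ratios)
    z_std = pstdev(ratios)                         # population standard deviation
    y0 = a * x0 + b                                # point estimate
    return y0 * (z_mean - u * z_std), y0 * (z_mean + u * z_std)


months = list(range(1, 13))
revenue = [3, 5, 4, 8, 10, 9, 13, 12, 17, 18, 18, 21]  # k€

# December 2019 is month 24; u = 1.96 for a 95 % confidence level.
low, high = ratio_confidence_interval(months, revenue, a=1.636, b=0.8636, x0=24, u=1.96)
print(round(low, 2), round(high, 2))
```

The same function applies to exercises 20 to 22 and 25 with their own data, line and coefficient u.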



Exercise 25. Mayer + confidence interval

The given table includes eight among the major cities of a country. The variable X gives, in thousands, the number of city residents; the variable Y gives, in thousands, the number of students in this city.

city    X      Y
A       850    58
B       623    37
C       587    38
D       360    20
E       312    16
F       275    15
G       262    12
H       244    12

1) Build the scatter plot from this data series.

see below

2) Give the coordinates of the mean point of the cloud.

G(439.1 ; 26)

3) a. Using Mayer’s method, determine manually the expression of the Y on X regression line.

G1(273.3 ; 13.75) and G2(605 ; 38.25) ; slope: a = 0.07385

With G1: b = y – ax = -6.430 ; expression: y’ = 0.07385 x - 6.43

b. Draw this line. Does G belong to it?

G always belongs to it.

c. Give "Mayer’s principle".

The sum of the residuals must be zero.

4) We will use here another fitting line, whose expression is: y' = 0.07 x – 6.

a. With this line, give the 95% confidence interval of the predictable number of students in a town that has

two million inhabitants.

[Figures: scatter plot of exercise 24, revenue Y (k€) against month X; scatter plot of exercise 25, number of students Y (thousands) against number of residents X (thousands).]


X      850     623     587     360     312     275     262     244
Y      58      37      38      20      16      15      12      12
Y ’    53.5    37.61   35.09   19.2    15.84   13.25   12.34   11.08
Z      1.084   0.984   1.083   1.042   1.01    1.132   0.972   1.083

Expression of the Y on X regression line: y’ = 0.07 x - 6

Point estimate of the number of students (x = 2000): y’0 = 134


Coefficient u associated to a 95 % confidence level: u = 1.96

Confidence interval: [126.7 ; 154.3]

b. What can we say about the chance that the number of students exceeds 155,000 in such a town?

It is slightly below 2.5 %, since 155 lies just above the upper bound (154.3) of the 95 % interval.
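The interval of question 4 can be reproduced with a short computation. The sketch below rests on assumptions that are not spelled out in this section: Z is taken as the ratio Y / Y’, the interval as y’0 × (mean of Z ± u × standard deviation of Z), and the standard deviation as the population one (division by n). These choices match the figures given above; the function name `ratio_interval` is hypothetical.

```python
import math


def ratio_interval(xs, ys, a, b, x0, u):
    """Return (y0, low, high) using the ratios Z = Y / Y' (assumed method)."""
    fitted = [a * x + b for x in xs]                 # Y' for each observed x
    z = [y / f for y, f in zip(ys, fitted)]          # Z = Y / Y'
    z_mean = sum(z) / len(z)
    # Population standard deviation (divide by n), an assumption of this sketch.
    z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))
    y0 = a * x0 + b                                  # point estimate at x0
    low, high = sorted((y0 * (z_mean - u * z_std), y0 * (z_mean + u * z_std)))
    return y0, low, high


xs = [850, 623, 587, 360, 312, 275, 262, 244]
ys = [58, 37, 38, 20, 16, 15, 12, 12]

y0, low, high = ratio_interval(xs, ys, a=0.07, b=-6, x0=2000, u=1.96)
print(round(y0, 1), round(low, 1), round(high, 1))
```

Sorting the two bounds keeps the function usable when the point estimate is negative, in which case multiplying by the larger factor gives the lower bound.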

Exercise 26. logarithmic fitting + confidence interval

The service life of a set of identical office equipment has been studied. In the following table, ti represents the

duration of use, expressed in thousands of hours, and R(ti) the proportion of equipment still in use at time ti

(e.g. after 1,000 hours, ti = 1, 90 % of the equipment is still in use: R(ti) = 0.90).

ti      1      2      3      4      5      6      7      8      9
R(ti)   0.9    0.66   0.53   0.4    0.32   0.25   0.19   0.14   0.1

1) We set yi = ln[R(ti)], where ln is the natural logarithm. Fill in the following table, then build the scatter plot

of the points Mi(ti, yi) in an orthogonal coordinate system.

ti   1        2        3        4        5        6        7        8        9
yi   -0.105   -0.416   -0.635   -0.916   -1.139   -1.386   -1.661   -1.966   -2.303

2) Does a linear fit seem relevant for this scatter plot?

Calculate the linear correlation coefficient between T and Y.

These points are almost collinear (r ≈ -0.999); a linear fit appears to be relevant.

3) Using the least squares method, determine an expression of the Y on T regression line.

Deduce from this expression that there are two positive real numbers k and λ such that: $R(t) = k\,e^{-\lambda t}$.

y’ = -0.26604 t + 0.1605. Since y = ln R(t), $R(t) = e^{y} = e^{-0.26604\,t + 0.1605} = e^{0.1605} \times e^{-0.26604\,t} = 1.174\,e^{-0.26604\,t}$, hence k = 1.174 and λ = 0.26604.
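As a check of questions 2 and 3, the following sketch computes the correlation coefficient and the least squares line on the transformed values y = ln R(t), then converts the intercept and slope into k and λ. It uses only the data of the table; the variable names are illustrative choices.

```python
import math

t = [1, 2, 3, 4, 5, 6, 7, 8, 9]
R = [0.9, 0.66, 0.53, 0.4, 0.32, 0.25, 0.19, 0.14, 0.1]
y = [math.log(r) for r in R]           # y = ln R(t)

n = len(t)
t_mean = sum(t) / n
y_mean = sum(y) / n

# Centered sums of squares and cross-products.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_yy = sum((yi - y_mean) ** 2 for yi in y)
s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))

slope = s_ty / s_tt                    # least squares slope of Y on T
intercept = y_mean - slope * t_mean    # the line passes through the mean point
r = s_ty / math.sqrt(s_tt * s_yy)      # linear correlation coefficient

k = math.exp(intercept)                # R(t) = k * exp(-lam * t)
lam = -slope

print("y' =", round(slope, 5), "t +", round(intercept, 4))
print("r =", round(r, 4), " k =", round(k, 3), " lambda =", round(lam, 5))
```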

4) In this question, we'll take k = 1.174 and λ = 0.266.

a. Determine the predicted rate of equipment still in use after 10,000 hours.

After 10,000 hours, t = 10; hence $R(10) = 1.174\,e^{-2.66} \approx 0.0821$, i.e. 8.2 % rounded.

b. After how long is exactly 50 % of the equipment still in use?

$R(t) = 0.5$ iff $1.174\,e^{-0.266\,t} = 0.5$ iff $e^{-0.266\,t} = 0.5/1.174$ iff $-0.266\,t = \ln(0.5/1.174)$

iff $t = \ln(0.5/1.174) / (-0.266) = 3.209$. Answer: after 3,209 hours.
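Both computations of question 4 follow from the model R(t) = k e^(-λt) with the rounded values k = 1.174 and λ = 0.266. The sketch below evaluates the model at t = 10 and inverts it for a target rate of 50 %; the function names `rate` and `time_for_rate` are hypothetical.

```python
import math

k, lam = 1.174, 0.266        # rounded values given in question 4


def rate(t):
    """Proportion of equipment still in use after t thousand hours."""
    return k * math.exp(-lam * t)


def time_for_rate(target):
    """Duration (thousands of hours) at which the rate equals `target`."""
    return math.log(target / k) / (-lam)


r10 = rate(10)
t_half = time_for_rate(0.5)

print("R(10) =", round(r10, 4))
print("t for 50 % =", round(t_half, 3), "thousand hours")
```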

5) Give a 99% confidence interval of the rate of equipment still in use after 10,000 hours of service.

T     1        2        3        4        5        6        7        8        9
Y     -0.105   -0.416   -0.635   -0.916   -1.139   -1.386   -1.661   -1.966   -2.303
Y’    -0.106   -0.372   -0.638   -0.904   -1.170   -1.436   -1.702   -1.968   -2.234
Z     0.998    1.118    0.996    1.014    0.974    0.966    0.976    0.999    1.031

Expression of the Y on T regression line: y’ = -0.26604 t + 0.1605

Point estimate of y = ln R at t = 10: y’0 = -2.5

Coefficient u associated with a 99 % confidence level: u = 2.58

Confidence interval on y: [-2.8003 ; -2.2395], and the interval on R = e^y is: [0.0608 ; 0.1065].
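The 99 % interval can be reproduced in two steps: an interval on y = ln R, then the exponential of each bound. As an assumption of this sketch, Z is the ratio Y / Y’, the interval on y is y’0 × (mean of Z ± u × population standard deviation of Z), and y’0 is computed from the unrounded line, so the bounds may differ from those above in the fourth decimal place.

```python
import math

t = list(range(1, 10))
R = [0.9, 0.66, 0.53, 0.4, 0.32, 0.25, 0.19, 0.14, 0.1]
a, b = -0.26604, 0.1605      # regression line y' = a*t + b from question 3
u = 2.58                     # coefficient for a 99 % confidence level

y = [math.log(r) for r in R]
fitted = [a * ti + b for ti in t]
z = [yi / fi for yi, fi in zip(y, fitted)]       # Z = Y / Y' (assumed definition)

z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))  # divide by n

y0 = a * 10 + b                                  # point estimate of ln R at t = 10
# y0 is negative, so the bounds are sorted after multiplication.
y_low, y_high = sorted((y0 * (z_mean - u * z_std), y0 * (z_mean + u * z_std)))

# The exponential is increasing, so it maps the bounds on y to bounds on R.
r_low, r_high = math.exp(y_low), math.exp(y_high)

print("interval on y:", round(y_low, 4), round(y_high, 4))
print("interval on R:", round(r_low, 4), round(r_high, 4))
```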

Exercise 27.

100 children have been classified by age (X, in years) and size (Y, in cm):

X \ Y       [95 ; 105[   [105 ; 125[   [125 ; 135[
[3 ; 5[     15           10            0
[5 ; 7[     8            32            5
[7 ; 9[     2            13            15

1) Enter this table in your calculator.

2) Give the means and standard deviations of X and Y, calculate their covariance.

$\bar{x} = 6.1$ years, $V(X) = \frac{3940}{100} - 6.1^2 = 2.19$, $\sigma(X) = 1.480$ years;

$\bar{y} = 114.25$ cm, $V(Y) = \frac{1315375}{100} - 114.25^2 = 100.6875$, $\sigma(Y) = 10.03$ cm.

$\mathrm{Cov}(X, Y) = \frac{70540}{100} - 6.1 \times 114.25 = 8.475$.

3) Calculate their linear correlation coefficient. Comment on this value.

$r = \frac{8.475}{1.480 \times 10.03} = 0.5709$, a very weak linear correlation (the cloud may be noisy and curved).
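The values of questions 2 and 3 can be checked from the contingency table. The sketch below assumes, as the sums 3940, 1315375 and 70540 suggest, that each class is represented by its midpoint (4, 6, 8 years and 100, 115, 130 cm) and that variances are computed as the mean of squares minus the square of the mean.

```python
import math

ages = [4, 6, 8]               # midpoints of [3;5[, [5;7[, [7;9[
sizes = [100, 115, 130]        # midpoints of [95;105[, [105;125[, [125;135[
counts = [                     # counts[i][j]: age class i, size class j
    [15, 10, 0],
    [8, 32, 5],
    [2, 13, 15],
]

cells = [(ages[i], sizes[j], counts[i][j]) for i in range(3) for j in range(3)]
n = sum(c for _, _, c in cells)

x_mean = sum(c * x for x, _, c in cells) / n
y_mean = sum(c * y for _, y, c in cells) / n
var_x = sum(c * x * x for x, _, c in cells) / n - x_mean ** 2
var_y = sum(c * y * y for _, y, c in cells) / n - y_mean ** 2
cov = sum(c * x * y for x, y, c in cells) / n - x_mean * y_mean
r = cov / math.sqrt(var_x * var_y)

print("mean X =", round(x_mean, 2), " sigma X =", round(math.sqrt(var_x), 3))
print("mean Y =", round(y_mean, 2), " sigma Y =", round(math.sqrt(var_y), 2))
print("cov =", round(cov, 3), " r =", round(r, 4))
```

The coefficient printed here is computed from unrounded standard deviations, so it can differ from 0.5709 in the last decimal places.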

4) Nevertheless, does the table allow us to see some trend?

We see that the most frequent size class changes from one age class to the next.

However, these largest frequencies do not represent an overwhelming majority within their age class,

which reflects a high variability of sizes among children of the same age. Modelling the growth of a child

by a straight line, or even by a well-defined curve, is therefore difficult.

5) Assuming that the relationship between age and size is linear until the age of 12, give the 95% confidence

interval of the size of a 12-year-old child.

X     4         6         8         4         6         8         4         6         8
Y     100       100       100       115       115       115       130       130       130
n     15        8         2         10        32        13        0         5         15
Y’    106.12    113.86    121.6     106.12    113.86    121.6     106.12    113.86    121.6
Z     0.94233   0.87827   0.82237   1.08368   1.01001   0.94572   1.22503   1.14175   1.06908

Expression of the Y on X regression line: y’ = 3.87 x + 90.64

Point estimate of the size of a 12-year-old child (x = 12): y’0 = 137.08 cm

Coefficient u corresponding to a 95 % confidence level: u = 1.96

Confidence interval on y: [106.1 ; 171.6] (cm).
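The bounds above can be recovered numerically under the following assumptions, which are inferred from the figures rather than stated in this section: Z is the ratio Y / Y’, the interval is y’0 × (mean of Z ± u × population standard deviation of Z), and the nine Z values are treated as a plain list, without weighting by the counts n.

```python
import math

a, b = 3.87, 90.64            # regression line y' = a*x + b given above
u = 1.96                      # coefficient for a 95 % confidence level

# The nine (age, size) midpoint pairs, in the order of the table.
points = [(x, y) for y in (100, 115, 130) for x in (4, 6, 8)]

z = [y / (a * x + b) for x, y in points]         # Z = Y / Y' (assumed definition)
z_mean = sum(z) / len(z)                         # unweighted mean (assumption)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))

y0 = a * 12 + b                                  # point estimate at x = 12
low, high = y0 * (z_mean - u * z_std), y0 * (z_mean + u * z_std)

print(round(y0, 2), round(low, 1), round(high, 1))
```

Weighting each Z by its count n would give a different, narrower interval; the unweighted version is used here only because it matches the bounds printed above.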
