Multivariate Data. Regression and Correlation: The Scatter Plot
TRANSCRIPT
![Page 1: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/1.jpg)
Multivariate data
![Page 2: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/2.jpg)
Regression and Correlation
![Page 3: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/3.jpg)
The Scatter Plot
[Figure: scatter plot; horizontal axis from 40 to 140, vertical axis from 0 to 160]
![Page 4: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/4.jpg)
Pearson’s correlation coefficient
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
![Page 5: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/5.jpg)
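where
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$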
![Page 6: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/6.jpg)
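Spearman’s rank correlation coefficient
Spearman’s rank correlation coefficient is computed as
$$1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where for each case i, d_i = r_i – s_i = the difference between the rank of x_i and the rank of y_i.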
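The rank formula can be tried out on a small data set. The following is a minimal sketch in Python; the helper names `ranks` and `spearman` and the data values are illustrative assumptions (not from the slides), and the sketch assumes there are no tied values.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest value); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Rank correlation via 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); no ties assumed."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((r - s) ** 2 for r, s in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical illustrative data (not taken from the slides)
x = [10, 20, 30, 40, 50]
y = [1.2, 0.9, 3.5, 4.1, 8.0]
print(spearman(x, y))
```

Only the first two y values are out of order relative to x, so the coefficient stays close to +1; a perfectly reversed ordering gives –1.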
![Page 7: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/7.jpg)
Simple Linear Regression
Fitting straight lines to data
![Page 8: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/8.jpg)
The Least Squares Line The Regression Line
• When data is correlated it falls roughly about a straight line.
[Figure: scatter plot of correlated data falling roughly about a straight line; horizontal axis from 40 to 140, vertical axis from 0 to 160]
![Page 9: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/9.jpg)
In this situation one wants to:
• Find the equation of the straight line through the data that yields the best fit.
The equation of any straight line is of the form:
Y = a + bX
b = the slope of the line
a = the intercept of the line
![Page 10: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/10.jpg)
For any equation of a straight line
Y = a + b X
The predicted value of Y when X = x_i (the ith case) can be computed:
$$\hat{y}_i = a + b x_i$$
The error in the prediction is given by:
$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$
This is called the residual for the ith case.
![Page 11: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/11.jpg)
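The residuals
$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$
can be computed for each case in the sample,
$$r_1 = y_1 - \hat{y}_1,\quad r_2 = y_2 - \hat{y}_2,\quad \ldots,\quad r_n = y_n - \hat{y}_n$$
The residual sum of squares (RSS)
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
is a measure of the goodness of fit of the line
Y = a + bX to the data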
![Page 12: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/12.jpg)
The optimal choice of a and b will result in the residual sum of squares
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
attaining a minimum.
If this is the case then the line:
Y = a + bX
is called the Least Squares Line
![Page 13: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/13.jpg)
Then the slope of the least squares line can be shown to be:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
![Page 14: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/14.jpg)
and the intercept of the least squares line can be shown to be:
$$a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
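The slope and intercept formulas translate directly into code. Below is a minimal Python sketch under stated assumptions: the function names `least_squares` and `rss` and the five data points are made up for illustration and are not part of the slides.

```python
def least_squares(x, y):
    """Return (a, b) for the line Y = a + bX using b = Sxy / Sxx, a = ybar - b * xbar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def rss(a, b, x, y):
    """Residual sum of squares of the line Y = a + bX on the data."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical illustrative data (not taken from the slides)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
best = rss(a, b, x, y)
print(a, b, best)

# Moving either coefficient away from the fitted value increases the RSS
print(rss(a + 0.1, b, x, y) > best, rss(a, b + 0.1, x, y) > best)
```

Comparing the RSS of the fitted line with that of nearby lines is a quick sanity check that the formulas really do pick out the minimum.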
![Page 15: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/15.jpg)
Computing the residual sum of squares for the least squares line
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Once a and b have been determined this can be computed using the far right hand side.
This can also be computed using the values of Sxx, Syy and Sxy.
For the Least Squares Line
$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$
![Page 16: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/16.jpg)
The following data showed the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.
TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

| Country (i) | Xi | Yi |
|---|---|---|
| Australia | 48 | 18 |
| Canada | 50 | 15 |
| Denmark | 38 | 17 |
| Finland | 110 | 35 |
| Great Britain | 110 | 46 |
| Holland | 49 | 24 |
| Iceland | 23 | 6 |
| Norway | 25 | 9 |
| Sweden | 30 | 11 |
| Switzerland | 51 | 25 |
| USA | 130 | 20 |

![Page 17: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/17.jpg)
[Figure: scatter plot of death rates from lung cancer (1950), vertical axis 0 to 50, against per capita consumption of cigarettes, horizontal axis 0 to 140, with each point labelled by country]
![Page 18: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/18.jpg)
![Page 19: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/19.jpg)
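Fitting the Least Squares Line
$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$
$$\sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914$$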
![Page 20: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/20.jpg)
Fitting the Least Squares Line - continued
First compute the following three quantities:
$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$
$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$
$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
![Page 21: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/21.jpg)
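Computing Estimate of Slope and Intercept
$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$
$$a = \bar{y} - b\,\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$
(The value 6.756 is obtained with the unrounded slope, b = 0.22844.)
Computing the Residual Sum of Squares
$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 1374.72727 - \frac{(3271.81818)^2}{14322.545} = 627.3186$$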
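The hand computation for the cigarette data can be reproduced in a few lines. This is a sketch in Python using the eleven (Xi, Yi) pairs from the table and the computational forms of Sxx, Syy and Sxy; the variable names are illustrative assumptions.

```python
# Per capita cigarette consumption (1930) and lung cancer death rate (1950),
# in the table's country order: Australia, Canada, ..., USA
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

# Computational forms: sum of squares (or products) minus a correction term
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx                      # slope
a = sum(y) / n - b * sum(x) / n      # intercept
rss = s_yy - s_xy ** 2 / s_xx        # residual sum of squares

print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2))
print(round(b, 3), round(a, 3), round(rss, 2))
```

Carrying the unrounded slope into the intercept matters here: using b rounded to 0.228 would shift the intercept in the second decimal place.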
![Page 22: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/22.jpg)
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, points labelled by country, with the least squares line drawn through the data]
Y = 6.756 + (0.228)X
![Page 23: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/23.jpg)
Interpretation of the slope and intercept
1. Intercept – value of Y at X = 0.
– Predicted death rate from lung cancer (6.756 per 100,000) for men in 1950 in countries with no smoking in 1930 (X = 0).
2. Slope – rate of increase in Y per unit increase in X.
– Death rate from lung cancer for men in 1950 increases by 0.228 units (deaths per 100,000) for each increase of 1 cigarette per month in per capita consumption in 1930.
![Page 24: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/24.jpg)
Relationship between correlation and Linear Regression
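1. Pearson’s correlation.
• Takes values between –1 and +1
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$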
![Page 25: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/25.jpg)
2. Least squares line Y = a + bX
– Minimises the Residual Sum of Squares:
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
– The Sum of Squares that measures the variability in Y that is unexplained by X.
– This can also be denoted by: $SS_{Unexplained}$
![Page 26: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/26.jpg)
Some other Sum of Squares:
– The Sum of Squares that measures the total variability in Y (ignoring X).
$$SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
![Page 27: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/27.jpg)
– The Sum of Squares that measures the variability in Y that is explained by X.
$$SS_{Explained} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
![Page 28: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/28.jpg)
It can be shown:
(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$SS_{Total} = SS_{Explained} + SS_{Unexplained}$$
![Page 29: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/29.jpg)
It can also be shown:
$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
= proportion of variability in Y explained by X.
= the coefficient of determination
![Page 30: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/30.jpg)
Further:
$$1 - r^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
= proportion of variability in Y that is unexplained by X.
![Page 31: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/31.jpg)
Example
TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

| Country (i) | Xi | Yi |
|---|---|---|
| Australia | 48 | 18 |
| Canada | 50 | 15 |
| Denmark | 38 | 17 |
| Finland | 110 | 35 |
| Great Britain | 110 | 46 |
| Holland | 49 | 24 |
| Iceland | 23 | 6 |
| Norway | 25 | 9 |
| Sweden | 30 | 11 |
| Switzerland | 51 | 25 |
| USA | 130 | 20 |

![Page 32: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/32.jpg)
Computing r and r²
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{3271.82}{\sqrt{(14322.55)(1374.73)}} = 0.737$$
$$r^2 = (0.737)^2 = 0.544$$
54.4% of the variability in Y (death rate due to lung cancer in 1950) is explained by X (per capita cigarette smoking in 1930).
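The link between r² and the sums of squares can be checked numerically on the same eleven countries. The Python sketch below fits the least squares line, forms the three sums of squares, and compares the explained share with r²; the variable names are illustrative assumptions.

```python
import math

x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((u - x_bar) ** 2 for u in x)
s_yy = sum((v - y_bar) ** 2 for v in y)
s_xy = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))

# Least squares line and its fitted values
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * u for u in x]

ss_total = s_yy
ss_explained = sum((f - y_bar) ** 2 for f in y_hat)
ss_unexplained = sum((v - f) ** 2 for v, f in zip(y, y_hat))

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3), round(r ** 2, 3))
print(round(ss_explained / ss_total, 3), round(ss_unexplained / ss_total, 3))
```

The two ratios printed on the last line are the explained and unexplained proportions; they add to 1, and the first agrees with r².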
![Page 33: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/33.jpg)
Categorical Data
Techniques for summarizing, displaying and graphing
![Page 34: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/34.jpg)
The frequency table
The bar graph
Suppose we have collected data on a categorical variable X having k categories – 1, 2, … , k.
To construct the frequency table we simply count for each category (i) of X, the number of cases falling in that category (fi).
To plot the bar graph we simply draw a bar of height fi above each category (i) of X.
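Counting cases per category and drawing bars of height fi can be sketched in a few lines of Python. The observations below are hypothetical and the text-based bars are only a stand-in for a real plot.

```python
from collections import Counter

# Hypothetical observations of a categorical variable X (not from the slides)
observations = ["B", "A", "C", "B", "B", "A"]

# Frequency table: f_i = number of cases falling in category i
table = Counter(observations)

# Crude bar graph: one '#' per case, so the bar length equals f_i
for category in sorted(table):
    f = table[category]
    print(f"{category} {f:3d} {'#' * f}")
```

A plotting library would replace the `print` loop, but the counting step is the same.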
![Page 35: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/35.jpg)
Example
In this example data has been collected for n = 34,118 subjects.
• The purpose of the study was to determine the relationship between the use of Antidepressants, Mood medication, Anxiety medication, Stimulants and Sleeping pills.
• In addition the study was interested in examining the effects of the independent variables (gender, age, income, education and role) on both individual use of the medications and the multiple use of the medications.
![Page 36: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/36.jpg)
The variables were:
1. Antidepressant use,
2. Mood medication use,
3. Anxiety medication use,
4. Stimulant use and
5. Sleeping pills use.
6. gender,
7. age,
8. income,
9. education and
10. Role –
i. Parent, worker, partner
ii. Parent, partner
iii. Parent, worker
iv. worker, partner
v. worker only
vi. Parent only
vii. Partner only
viii. No roles
![Page 37: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/37.jpg)
Frequency Table for Age

| Age - (G) | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| 20-29 | 5349 | 15.7 | 15.7 | 15.7 |
| 30-39 | 6758 | 19.8 | 19.8 | 35.5 |
| 40-49 | 6420 | 18.8 | 18.8 | 54.3 |
| 50-59 | 5528 | 16.2 | 16.2 | 70.5 |
| 60-69 | 4400 | 12.9 | 12.9 | 83.4 |
| 70+ | 5663 | 16.6 | 16.6 | 100.0 |
| Total | 34118 | 100.0 | 100.0 | |

![Page 38: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/38.jpg)
Bar Graph for Age
[Figure: bar graph of Count (0 to 7,000) against Age - (G), with one bar for each of the groups 20-29, 30-39, 40-49, 50-59, 60-69 and 70+]
![Page 39: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/39.jpg)
Frequency Table for Role

| role | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| parent, partner, worker | 6614 | 19.4 | 24.5 | 24.5 |
| parent, partner | 1068 | 3.1 | 4.0 | 28.5 |
| parent, worker | 1351 | 4.0 | 5.0 | 33.5 |
| partner, worker | 5427 | 15.9 | 20.1 | 53.6 |
| worker only | 5711 | 16.7 | 21.2 | 74.7 |
| parent only | 456 | 1.3 | 1.7 | 76.4 |
| partner only | 3262 | 9.6 | 12.1 | 88.5 |
| no roles | 3097 | 9.1 | 11.5 | 100.0 |
| Total (valid) | 26986 | 79.1 | 100.0 | |
| Missing (System) | 7132 | 20.9 | | |
| Total | 34118 | 100.0 | | |

![Page 40: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/40.jpg)
Bar Graph for Role
[Figure: bar graph of Count (0 to 7,000) against role, with one bar for each of: parent, partner, worker; parent, partner; parent, worker; partner, worker; worker only; parent only; partner only; no roles]
![Page 41: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/41.jpg)
The two-way frequency table
The χ² statistic
Techniques for examining dependence between two categorical variables
![Page 42: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/42.jpg)
Situation
• We have two categorical variables R and C.
• The number of categories of R is r.
• The number of categories of C is c.
• We observe n subjects from the population and count x_ij = the number of subjects for which R = i and C = j.
• R = rows, C = columns
![Page 43: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/43.jpg)
Example
Both Systolic Blood pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.
The categories for Blood Pressure are:
<127 127-146 147-166 167+
The categories for Cholesterol are:
<200 200-219 220-259 260+
![Page 44: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/44.jpg)
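Table: two-way frequency (rows: Serum Cholesterol; columns: Systolic Blood pressure)

| Serum Cholesterol | <127 | 127-146 | 147-166 | 167+ | Total |
|---|---|---|---|---|---|
| <200 | 117 | 121 | 47 | 22 | 307 |
| 200-219 | 85 | 98 | 43 | 20 | 246 |
| 220-259 | 119 | 209 | 68 | 43 | 439 |
| 260+ | 67 | 99 | 46 | 33 | 245 |
| Total | 388 | 527 | 204 | 118 | 1237 |
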
![Page 45: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/45.jpg)
Example
This comes from the drug use data.
The two variables are:
1. Age (C) and
2. Antidepressant Use (R)
measured for a sample of n = 33,957 subjects.
![Page 46: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/46.jpg)
Two-way Frequency Table
Took anti-depressants - 12 mo * Age - (G) Crosstabulation (Count)

| Took anti-depressants - 12 mo | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70+ | Total |
|---|---|---|---|---|---|---|---|
| YES | 322 | 523 | 570 | 522 | 265 | 249 | 2451 |
| NO | 5007 | 6201 | 5822 | 4982 | 4114 | 5380 | 31506 |
| Total | 5329 | 6724 | 6392 | 5504 | 4379 | 5629 | 33957 |

Percentage antidepressant use vs Age

| Age - (G) | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70+ |
|---|---|---|---|---|---|---|
| Percentage | 6.04% | 7.78% | 8.92% | 9.48% | 6.05% | 4.42% |

![Page 47: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/47.jpg)
Antidepressant Use vs Age
[Figure: bar graph of percentage antidepressant use (0.0% to 10.0%) for the age groups 20-29, 30-39, 40-49, 50-59, 60-69 and 70+]
![Page 48: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/48.jpg)
The χ² statistic for measuring dependence between two categorical variables
Define
$$R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total}$$
$$C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total}$$
$$E_{ij} = \frac{R_i\,C_j}{n}$$
= Expected frequency in the (i,j)th cell in the case of independence.
![Page 49: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/49.jpg)
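Observed frequencies x_ij (rows 1 to 4, columns 1 to 5):

| Row | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|
| 1 | x11 | x12 | x13 | x14 | x15 | R1 |
| 2 | x21 | x22 | x23 | x24 | x25 | R2 |
| 3 | x31 | x32 | x33 | x34 | x35 | R3 |
| 4 | x41 | x42 | x43 | x44 | x45 | R4 |
| Total | C1 | C2 | C3 | C4 | C5 | n |

$$R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total}, \qquad C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total}$$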
![Page 50: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/50.jpg)
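Expected frequencies E_ij (rows 1 to 4, columns 1 to 5):

| Row | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|
| 1 | E11 | E12 | E13 | E14 | E15 | R1 |
| 2 | E21 | E22 | E23 | E24 | E25 | R2 |
| 3 | E31 | E32 | E33 | E34 | E35 | R3 |
| 4 | E41 | E42 | E43 | E44 | E45 | R4 |
| Total | C1 | C2 | C3 | C4 | C5 | n |

$$E_{ij} = \frac{R_i\,C_j}{n}$$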
![Page 51: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/51.jpg)
Justification: if
$$E_{ij} = \frac{R_i\,C_j}{n} \quad\text{then}\quad \frac{E_{ij}}{R_i} = \frac{C_j}{n}$$
that is, (proportion in column j for row i) = (overall proportion in column j).
![Page 52: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/52.jpg)
and if
$$E_{ij} = \frac{R_i\,C_j}{n} \quad\text{then}\quad \frac{E_{ij}}{C_j} = \frac{R_i}{n}$$
that is, (proportion in row i for column j) = (overall proportion in row i).
![Page 53: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/53.jpg)
The χ² statistic
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$
E_ij = expected frequency in the (i,j)th cell in the case of independence.
x_ij = observed frequency in the (i,j)th cell
![Page 54: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/54.jpg)
Example: studying the relationship between Systolic Blood pressure and Serum Cholesterol
In this example we are interested in whether Systolic Blood pressure and Serum Cholesterol are related or whether they are independent.
Both were measured for a sample of n = 1237 cases
![Page 55: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/55.jpg)
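Observed frequencies (rows: Serum Cholesterol; columns: Systolic Blood pressure)

| Serum Cholesterol | <127 | 127-146 | 147-166 | 167+ | Total |
|---|---|---|---|---|---|
| <200 | 117 | 121 | 47 | 22 | 307 |
| 200-219 | 85 | 98 | 43 | 20 | 246 |
| 220-259 | 119 | 209 | 68 | 43 | 439 |
| 260+ | 67 | 99 | 46 | 33 | 245 |
| Total | 388 | 527 | 204 | 118 | 1237 |
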
![Page 56: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/56.jpg)
Expected frequencies (rows: Serum Cholesterol; columns: Systolic Blood pressure)

| Serum Cholesterol | <127 | 127-146 | 147-166 | 167+ | Total |
|---|---|---|---|---|---|
| <200 | 96.29 | 130.79 | 50.63 | 29.29 | 307 |
| 200-219 | 77.16 | 104.80 | 40.57 | 23.47 | 246 |
| 220-259 | 137.70 | 187.03 | 72.40 | 41.88 | 439 |
| 260+ | 76.85 | 104.38 | 40.40 | 23.37 | 245 |
| Total | 388 | 527 | 204 | 118 | 1237 |

In the case of independence the distribution across a row is the same for each row.
The distribution down a column is the same for each column.
![Page 57: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/57.jpg)
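Table: Expected frequencies, (Observed frequencies), Standardized Residuals (rows: Serum Cholesterol; columns: Systolic Blood pressure)

| Serum Cholesterol | Quantity | <127 | 127-146 | 147-166 | 167+ | Total |
|---|---|---|---|---|---|---|
| <200 | Expected | 96.29 | 130.79 | 50.63 | 29.29 | 307 |
| | (Observed) | (117) | (121) | (47) | (22) | |
| | Std. residual | 2.11 | -0.86 | -0.51 | -1.35 | |
| 200-219 | Expected | 77.16 | 104.80 | 40.57 | 23.47 | 246 |
| | (Observed) | (85) | (98) | (43) | (20) | |
| | Std. residual | 0.89 | -0.66 | 0.38 | -0.72 | |
| 220-259 | Expected | 137.70 | 187.03 | 72.40 | 41.88 | 439 |
| | (Observed) | (119) | (209) | (68) | (43) | |
| | Std. residual | -1.59 | 1.61 | -0.52 | 0.17 | |
| 260+ | Expected | 76.85 | 104.38 | 40.40 | 23.37 | 245 |
| | (Observed) | (67) | (99) | (46) | (33) | |
| | Std. residual | -1.12 | -0.53 | 0.88 | 1.99 | |
| Total | | 388 | 527 | 204 | 118 | 1237 |

χ² = 20.85
$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$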
![Page 58: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/58.jpg)
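Standardized residuals
$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
The χ² statistic
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 20.85$$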
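The expected frequencies, standardized residuals and χ² for the blood pressure and cholesterol table can be recomputed from the observed counts. The Python sketch below is one way to do it; the function name `chi_square` is an illustrative assumption, and the count 119 is used for the 220-259 / <127 cell because that is the value consistent with the row total (439) and column total (388).

```python
import math


def chi_square(table):
    """Return (expected, residuals, chi2) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]            # R_i
    col_totals = [sum(col) for col in zip(*table)]      # C_j
    n = sum(row_totals)
    # E_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # r_ij = (x_ij - E_ij) / sqrt(E_ij)
    residuals = [
        [(x - e) / math.sqrt(e) for x, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    chi2 = sum(r * r for row in residuals for r in row)
    return expected, residuals, chi2


# Rows: serum cholesterol (<200, 200-219, 220-259, 260+)
# Columns: systolic blood pressure (<127, 127-146, 147-166, 167+)
observed = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]

expected, residuals, chi2 = chi_square(observed)
print(round(chi2, 2))
for row in residuals:
    print([round(r, 2) for r in row])
```

Cells with large positive or negative standardized residuals (here the corner cells) are the ones contributing most to χ².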
![Page 59: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/59.jpg)
Example
This comes from the drug use data.
The two variables are:
1. Role (C) and
2. Antidepressant Use (R)
measured for a sample of n = 26,968 subjects.
![Page 60: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/60.jpg)
Two-way Frequency Table
Took anti-depressants - 12 mo * role Crosstabulation (Count)

| Took anti-depressants - 12 mo | parent, partner, worker | parent, partner | parent, worker | partner, worker | worker only | parent only | partner only | no roles | Total |
|---|---|---|---|---|---|---|---|---|---|
| YES | 344 | 101 | 201 | 275 | 455 | 63 | 224 | 414 | 2077 |
| NO | 6268 | 967 | 1150 | 5150 | 5249 | 392 | 3036 | 2679 | 24891 |
| Total | 6612 | 1068 | 1351 | 5425 | 5704 | 455 | 3260 | 3093 | 26968 |

Percentage antidepressant use vs Role

| Role | parent, partner, worker | parent, partner | parent, worker | partner, worker | worker only | parent only | partner only | no roles |
|---|---|---|---|---|---|---|---|---|
| Percentage | 5.20% | 9.46% | 14.88% | 5.07% | 7.98% | 13.85% | 6.87% | 13.39% |

![Page 61: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/61.jpg)
Antidepressant Use vs Role
[Figure: bar graph of percentage antidepressant use (0.0% to 20.0%) for the roles: parent, partner, worker; parent, partner; parent, worker; partner, worker; worker only; parent only; partner only; no roles]
χ² = 381.961
![Page 62: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/62.jpg)
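Calculation of χ²
The raw data

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Total |
|---|---|---|---|---|---|---|---|---|---|
| YES | 344 | 101 | 201 | 275 | 455 | 63 | 224 | 414 | 2077 |
| NO | 6268 | 967 | 1150 | 5150 | 5249 | 392 | 3036 | 2679 | 24891 |
| Total | 6612 | 1068 | 1351 | 5425 | 5704 | 455 | 3260 | 3093 | 26968 |

Expected frequencies

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Total (R_i) |
|---|---|---|---|---|---|---|---|---|---|
| YES | 509.24 | 82.25 | 104.05 | 417.82 | 439.31 | 35.04 | 251.08 | 238.21 | 2077 |
| NO | 6102.76 | 985.75 | 1246.95 | 5007.18 | 5264.69 | 419.96 | 3008.92 | 2854.79 | 24891 |
| Total (C_j) | 6612 | 1068 | 1351 | 5425 | 5704 | 455 | 3260 | 3093 | 26968 |

$$E_{ij} = \frac{R_i\,C_j}{n}, \qquad r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$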
![Page 63: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/63.jpg)
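The Residuals
$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| YES | -7.32 | 2.07 | 9.50 | -6.99 | 0.75 | 4.72 | -1.71 | 11.39 |
| NO | 2.12 | -0.60 | -2.75 | 2.02 | -0.22 | -1.36 | 0.49 | -3.29 |

The calculation of χ²
$$\chi^2 = \sum_{i} \sum_{j} r_{ij}^2 = \sum_{i} \sum_{j} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = 381.961$$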
![Page 64: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/64.jpg)
Probability Theory
Modelling random phenomena
![Page 65: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/65.jpg)
Some counting formulae
Permutations
the number of ways that you can order n objects is:
n! = n(n-1)(n-2)(n-3)…(3)(2)(1)
Example:
the number of ways you can order the three letters A, B, and C is 3! = 3(2)(1) = 6
ABC ACB BAC BCA CAB CBA
![Page 66: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/66.jpg)
Permutations
the number of ways that you can choose k objects from n objects in a specific order is:
$${}_nP_k = \frac{n!}{(n-k)!} = n(n-1)\cdots(n-k+1)$$
Example:
the number of ways you can choose two letters from the four letters A, B, C, D in a specific order is
$${}_4P_2 = \frac{4!}{(4-2)!} = \frac{4!}{2!} = (4)(3) = 12$$
![Page 67: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/67.jpg)
AB BA AC CA AD DA
BC CB BD DB CD DC
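The counts n! and nPk can be checked by enumeration. This is a short Python sketch using the standard library (`math.perm` requires Python 3.8 or later); the variable names are illustrative.

```python
import math
from itertools import permutations

# Orderings of three letters: 3! = 6
orderings = ["".join(p) for p in permutations("ABC")]
print(len(orderings), math.factorial(3), orderings)

# Ordered choices of two letters from four: 4P2 = 4!/(4-2)! = 12
ordered_pairs = ["".join(p) for p in permutations("ABCD", 2)]
print(len(ordered_pairs), math.perm(4, 2), ordered_pairs)
```

Because order matters, both AB and BA appear among the twelve ordered pairs.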
![Page 68: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/68.jpg)
Combinations
the number of ways that you can choose k objects from n objects (order irrelevant) is:
$${}_nC_k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k(k-1)\cdots(1)}$$
![Page 69: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/69.jpg)
Example:
the number of ways you can choose two letters from the four letters A, B, C, D
$${}_4C_2 = \binom{4}{2} = \frac{4!}{2!\,(4-2)!} = \frac{4!}{2!\,2!} = \frac{(4)(3)}{(2)(1)} = \frac{12}{2} = 6$$
{A,B} {A,C} {A,D} {B,C} {B,D} {C,D}
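When order is irrelevant, each unordered pair is counted once. The Python sketch below enumerates the pairs and compares the count with the formula (`math.comb` requires Python 3.8 or later); the variable names are illustrative.

```python
import math
from itertools import combinations

# Unordered choices of two letters from four: 4C2 = 4!/(2! 2!) = 6
pairs = [set(c) for c in combinations("ABCD", 2)]
print(len(pairs), math.comb(4, 2))

# Each unordered pair corresponds to 2! = 2 ordered pairs, so 4P2 = 2! * 4C2
print(math.perm(4, 2) == math.factorial(2) * math.comb(4, 2))
```

The second check reflects the relation between permutations and combinations: dividing nPk by k! removes the orderings of the chosen objects.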
![Page 70: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/70.jpg)
Example:
Suppose we have a committee of 10 people and we want to choose a sub-committee of 3 people. How many ways can this be done?
$${}_{10}C_3 = \binom{10}{3} = \frac{10!}{3!\,7!} = \frac{(10)(9)(8)}{(3)(2)(1)} = 120$$
![Page 71: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/71.jpg)
Example: Random sampling
Suppose we have a club of N = 1000 persons and we want to choose a sample of k = 250 of these individuals to determine their opinion on a given issue. How many ways can this be performed?
$${}_{1000}C_{250} = \binom{1000}{250} = \frac{1000!}{250!\,750!} = 4.823 \times 10^{242}$$
The choice of the sample is called random sampling if all of the choices have the same probability of being selected.
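Both the count of possible samples and the act of drawing one can be illustrated in Python. In this sketch the club members are simply numbered 1 to 1000 (an assumption for illustration), and `random.sample` draws without replacement so that every subset of size 250 is equally likely.

```python
import math
import random

N, k = 1000, 250

# Number of possible samples of size k (an exact integer)
count = math.comb(N, k)
print(f"{count:.3e}")          # the same number in scientific notation
print(len(str(count)))         # number of decimal digits

# Draw one random sample of k distinct members (members numbered 1..N)
sample = random.sample(range(1, N + 1), k)
print(sorted(sample)[:10])
```

Python integers have arbitrary precision, so the binomial coefficient is computed exactly even though it has hundreds of digits.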
![Page 72: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/72.jpg)
Important Note:
0! is always defined to be 1.
Also
$${}_nC_k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
are called Binomial Coefficients
![Page 73: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/73.jpg)
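Reason:
The Binomial Theorem
$$(x + y)^n = {}_nC_0\,x^0 y^n + {}_nC_1\,x^1 y^{n-1} + {}_nC_2\,x^2 y^{n-2} + \cdots + {}_nC_n\,x^n y^0$$
$$= \binom{n}{0} x^0 y^n + \binom{n}{1} x^1 y^{n-1} + \binom{n}{2} x^2 y^{n-2} + \cdots + \binom{n}{n} x^n y^0$$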
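The Binomial Theorem is easy to spot-check numerically. The Python sketch below sums the terms of the expansion and compares the result with (x + y)^n computed directly; the function name `binomial_expansion` and the test values are illustrative assumptions.

```python
from math import comb


def binomial_expansion(x, y, n):
    """Sum of nCk * x^k * y^(n-k) for k = 0, 1, ..., n."""
    return sum(comb(n, k) * x ** k * y ** (n - k) for k in range(n + 1))


# Integer inputs keep the comparison exact
print(binomial_expansion(2, 3, 5), (2 + 3) ** 5)

# With x = y = 1 the theorem says the coefficients in row n sum to 2^n
print(sum(comb(6, k) for k in range(7)), 2 ** 6)
```

A numerical check on a few values is not a proof, but it catches mistakes in the exponents or coefficients quickly.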
![Page 74: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/74.jpg)
Binomial Coefficients can also be calculated using Pascal’s triangle
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
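Each interior entry of Pascal's triangle is the sum of the two entries above it, which gives a simple way to generate the rows. The following Python sketch builds the first seven rows and checks them against the binomial coefficients; the function name `pascal_rows` is an illustrative assumption.

```python
from math import comb


def pascal_rows(count):
    """Return the first `count` rows of Pascal's triangle as lists."""
    rows = [[1]]
    while len(rows) < count:
        previous = rows[-1]
        # Interior entries are sums of adjacent entries in the previous row
        interior = [previous[i] + previous[i + 1] for i in range(len(previous) - 1)]
        rows.append([1] + interior + [1])
    return rows


rows = pascal_rows(7)
for row in rows:
    print(" ".join(str(value) for value in row))

# Row n (counting from 0) holds the coefficients nC0, nC1, ..., nCn
print(all(rows[n][k] == comb(n, k) for n in range(7) for k in range(n + 1)))
```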
![Page 75: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/75.jpg)
Random Variables
Probability distributions
![Page 76: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/76.jpg)
Definition:
A random variable X is a number whose value is determined by the outcome of a random experiment (random phenomena)
![Page 77: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/77.jpg)
Examples
1. A die is rolled and X = number of spots showing on the upper face.
2. Two dice are rolled and X = Total number of spots showing on the two upper faces.
3. A coin is tossed n = 100 times and X = number of times the coin toss resulted in a head.
4. A person is selected at random from a population and X = weight of that individual.
![Page 78: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/78.jpg)
5. A sample of n = 100 individuals is selected at random from a population (i.e. all samples of n = 100 have the same probability of being selected).
X = the average weight of the 100 individuals.
![Page 79: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/79.jpg)
In all of these examples X fits the definition of a random variable, namely:
– a number whose value is determined by the outcome of a random experiment (random phenomena)
![Page 80: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/80.jpg)
Probability distribution of a Random Variable
![Page 81: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/81.jpg)
Random variables are either
• Discrete
– Integer valued
– The set of possible values for X are integers
• Continuous
– The set of possible values for X are all real numbers
– Range over a continuum.
![Page 82: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/82.jpg)
Examples
• Discrete
– A die is rolled and X = number of spots showing on the upper face.
– Two dice are rolled and X = Total number of spots showing on the two upper faces.
– A coin is tossed n = 100 times and X = number of times the coin toss resulted in a head.
![Page 83: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/83.jpg)
Examples
• Continuous
– A person is selected at random from a population and X = weight of that individual.
– A sample of n = 100 individuals is selected at random from a population (i.e. all samples of n = 100 have the same probability of being selected). X = the average weight of the 100 individuals.
![Page 84: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/84.jpg)
The probability distribution of a discrete random variable is described by its:
probability function p(x).
p(x) = the probability that X takes on the value x.
![Page 85: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/85.jpg)
Examples
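• Discrete
– A die is rolled and X = number of spots showing on the upper face.

| x | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| p(x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |

– Two dice are rolled and X = Total number of spots showing on the two upper faces.

| x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| p(x) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |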
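The two-dice probabilities can be checked by listing all 36 equally likely outcomes and counting how many give each total. The following is a minimal sketch of that count; the variable names are illustrative and not part of the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes give each total x.
counts = Counter(a + b for a, b in outcomes)

# p(x) = (number of outcomes with total x) / 36, kept as an exact fraction.
p = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x in p:
    print(x, f"{counts[x]}/36")
```

Exact fractions are used so that the probabilities can be compared with the table without rounding.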
![Page 86: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/86.jpg)
Graphs
To plot a graph of p(x), draw bars of height p(x) above each value of x.
[Figure: bar graph of p(x) for rolling a die, x = 1, …, 6]
![Page 87: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/87.jpg)
[Figure: bar graph of p(x) for rolling two dice]
![Page 88: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/88.jpg)
Note:
1. $0 \le p(x) \le 1$
2. $\sum_{x} p(x) = 1$
3. $P[a \le X \le b] = \sum_{x=a}^{b} p(x)$
![Page 89: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/89.jpg)
The probability distribution of a continuous random variable is described by its :
probability density curve f(x).
![Page 90: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/90.jpg)
i.e. a curve which has the following properties:
• 1. f(x) is never negative.
• 2. The total area under the curve f(x) is one.
• 3. The area under the curve f(x) between a and b is the probability that X lies between the two values.
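The area properties can be illustrated numerically. The sketch below assumes a hypothetical triangular density on the interval [0, 2] (it is not the curve from the slides) and approximates areas with the midpoint rule; the function names `f` and `area` are chosen for illustration only.

```python
def f(x):
    """Hypothetical triangular density on [0, 2]; an assumption, not from the slides."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


def area(func, a, b, steps=10_000):
    """Approximate the area under func between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


total = area(f, 0.0, 2.0)   # property 2: total area should be close to 1
prob = area(f, 0.5, 1.5)    # property 3: approximates P[0.5 <= X <= 1.5]

print(round(total, 6))
print(round(prob, 6))
```

For this assumed density the exact probability of the interval from 0.5 to 1.5 is 0.75, which the approximation should reproduce closely.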
![Page 91: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/91.jpg)
[Figure: a probability density curve f(x), plotted for x from 0 to 120]
![Page 92: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/92.jpg)
An Important discrete distribution
The Binomial distribution
Suppose we have an experiment with two outcomes – Success(S) and Failure(F).
Let p denote the probability of S (Success).
In this case q=1-p denotes the probability of Failure(F).
Now suppose this experiment is repeated n times independently.
![Page 93: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/93.jpg)
Let X denote the number of successes occurring in the n repetitions.
Then X is a random variable.
Its possible values are
0, 1, 2, 3, 4, … , (n – 2), (n – 1), n
and p(x) for any of the above values of x is given by:
$$p(x) = \binom{n}{x} p^{x} (1-p)^{n-x} = \binom{n}{x} p^{x} q^{n-x}$$
![Page 94: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/94.jpg)
X is said to have the Binomial distribution with parameters n and p.
![Page 95: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/95.jpg)
Summary:
X is said to have the Binomial distribution with parameters n and p.
1. X is the number of successes occurring in the n repetitions of a Success-Failure Experiment.
2. The probability of success is p.
3. $p(x) = \binom{n}{x} p^{x} (1-p)^{n-x}$
![Page 96: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/96.jpg)
Examples:
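1. A coin is tossed n = 5 times. X is the number of heads occurring in the 5 tosses of the coin. In this case p = ½ and

$$p(x) = \binom{5}{x} \left(\tfrac{1}{2}\right)^{x} \left(\tfrac{1}{2}\right)^{5-x} = \binom{5}{x} \left(\tfrac{1}{2}\right)^{5} = \binom{5}{x} \frac{1}{32}$$

| x | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| p(x) | 1/32 | 5/32 | 10/32 | 10/32 | 5/32 | 1/32 |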
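The binomial probability function can be evaluated directly from its formula. The sketch below does so for the coin example (n = 5, p = ½) using exact fractions; the helper name `binomial_pmf` is an illustrative choice, not something defined in the slides.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """p(x) = C(n, x) * p^x * (1 - p)^(n - x) for a Binomial(n, p) variable."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 5, Fraction(1, 2)
table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

# Fractions print in lowest terms, e.g. 10/32 is shown as 5/16.
for x, px in table.items():
    print(x, px)
```

Passing `p` as a `Fraction` keeps every probability exact, so the values can be compared with the table entries 1/32, 5/32 and 10/32 without rounding.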
![Page 97: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/97.jpg)
Random Variables
Numerical quantities whose values are determined by the outcome of a random experiment
![Page 98: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/98.jpg)
Discrete Random Variables
Discrete Random Variable: A random variable usually assuming an integer value.
• A discrete random variable assumes values that are isolated points along the real line. That is, neighbouring values are not “possible values” for a discrete random variable.
Note: Usually associated with counting
• The number of times a head occurs in 10 tosses of a coin
• The number of auto accidents occurring on a weekend
• The size of a family
![Page 99: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/99.jpg)
Continuous Random Variables
Continuous Random Variable: A quantitative random variable that can vary over a continuum
• A continuous random variable can assume any value along a line interval, including every possible value between any two points on the line
Note: Usually associated with a measurement
• Blood Pressure
• Weight gain
• Height
![Page 100: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/100.jpg)
Probability Distributions of a Discrete Random Variable
![Page 101: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/101.jpg)
Probability Distribution & Function
Probability Distribution: A mathematical description of how probabilities are distributed among the possible values of a random variable.
Notes:
• The probability distribution allows one to determine probabilities of events related to the values of a random variable.
• The probability distribution may be presented in the form of a table, chart, or formula.
Probability Function: A rule that assigns probabilities to the values of the random variable
![Page 102: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/102.jpg)
Example
In baseball the number of individuals, X, on base when a home run is hit ranges in value from 0 to 3. The probability distribution is known and is given below:

| x | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| p(x) | 6/14 | 4/14 | 3/14 | 1/14 |

$$P[\text{the random variable } X \text{ equals } 2] = p(2) = \frac{3}{14}$$

$$P[\text{the random variable } X \text{ is at least } 2] = p(2) + p(3) = \frac{3}{14} + \frac{1}{14} = \frac{4}{14}$$

Note: This chart implies the only values x takes on are 0, 1, 2, and 3. If the random variable X is observed repeatedly, the probability p(x) represents the proportion of times the value x appears in that sequence.
![Page 103: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/103.jpg)
A Bar Graph
[Figure: bar graph of p(x) against the number of persons on base when a home run is hit; bar heights 0.429, 0.286, 0.214 and 0.071 for x = 0, 1, 2, 3]
![Page 104: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/104.jpg)
Comments:
Every probability function must satisfy:
1. The probability assigned to each value of the random variable must be between 0 and 1, inclusive:
$$0 \le p(x) \le 1$$
2. The sum of the probabilities assigned to all the values of the random variable must equal 1:
$$\sum_{x} p(x) = 1$$
3. $$P[a \le X \le b] = \sum_{x=a}^{b} p(x) = p(a) + p(a+1) + \dots + p(b)$$
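These three properties can be checked mechanically for any tabulated probability function. The sketch below uses the home-run table (6/14, 4/14, 3/14, 1/14) as input; the function names `is_probability_function` and `prob_between` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Probability function from the home-run example: x -> p(x).
p = {0: Fraction(6, 14), 1: Fraction(4, 14), 2: Fraction(3, 14), 3: Fraction(1, 14)}


def is_probability_function(p):
    """Properties 1 and 2: each p(x) lies in [0, 1] and the p(x) sum to 1."""
    return all(0 <= px <= 1 for px in p.values()) and sum(p.values()) == 1


def prob_between(p, a, b):
    """Property 3: P[a <= X <= b] = p(a) + p(a+1) + ... + p(b)."""
    return sum(px for x, px in p.items() if a <= x <= b)


print(is_probability_function(p))
print(prob_between(p, 2, 2))   # P[X = 2]
print(prob_between(p, 2, 3))   # P[X is at least 2], printed in lowest terms
```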
![Page 105: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/105.jpg)
Mean and Variance of a Discrete Probability Distribution
• Describe the centre and spread of a probability distribution
• The mean (denoted by the Greek letter μ (mu)) measures the centre of the distribution.
• The variance (σ²) and the standard deviation (σ) measure the spread of the distribution.
σ is the Greek letter for s.
![Page 106: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/106.jpg)
Mean of a Discrete Random Variable
• The mean, μ, of a discrete random variable X is found by multiplying each possible value of x by its own probability and then adding all the products together:

$$\mu = \sum_{x} x\,p(x) = x_1 p(x_1) + x_2 p(x_2) + \dots + x_k p(x_k)$$

Notes:
• The mean is a weighted average of the values of X.
• The mean is the long-run average value of the random variable.
• The mean is the centre of gravity of the probability distribution of the random variable.
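The "long-run average" interpretation can be illustrated by simulation. The sketch below computes the mean of a single die roll from the weighted-average formula and compares it with the average of many simulated rolls; the seed and the number of rolls are arbitrary choices made for this illustration.

```python
import random
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

# Weighted average: mu = sum of x * p(x).
mu = sum(x * px for x, px in zip(values, probs))

# Simulate many rolls; a fixed seed keeps the run reproducible.
rng = random.Random(0)
rolls = rng.choices(values, weights=[float(px) for px in probs], k=100_000)
long_run_average = sum(rolls) / len(rolls)

print(float(mu))
print(long_run_average)
```

The simulated average should be close to, but not exactly equal to, the mean; the gap tends to shrink as the number of rolls grows.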
![Page 107: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/107.jpg)
[Figure: bar graph of a discrete probability distribution over x = 1, …, 11]
![Page 108: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/108.jpg)
Variance and Standard Deviation
Variance of a Discrete Random Variable: The variance, σ², of a discrete random variable X is found by multiplying each possible value of the squared deviation from the mean, (x − μ)², by its own probability and then adding all the products together:

$$\sigma^2 = \sum_{x} (x-\mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = \sum_{x} x^2 p(x) - \mu^2$$

Standard Deviation of a Discrete Random Variable: The positive square root of the variance:

$$\sigma = \sqrt{\sigma^2}$$
![Page 109: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/109.jpg)
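Example
The number of individuals, X, on base when a home run is hit ranges in value from 0 to 3.

| x | p(x) | x p(x) | x² | x² p(x) |
|---|---|---|---|---|
| 0 | 0.429 | 0.000 | 0 | 0.000 |
| 1 | 0.286 | 0.286 | 1 | 0.286 |
| 2 | 0.214 | 0.429 | 4 | 0.857 |
| 3 | 0.071 | 0.214 | 9 | 0.643 |
| Total | 1.000 | 0.929 | | 1.786 |

The totals are $\sum p(x)$, $\sum x\,p(x)$ and $\sum x^2 p(x)$.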
![Page 110: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/110.jpg)
• Computing the mean:

$$\mu = \sum_{x} x\,p(x) = 0.929$$

Note:
• 0.929 is the long-run average value of the random variable
• 0.929 is the centre of gravity of the probability distribution of the random variable
![Page 111: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/111.jpg)
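• Computing the variance:

$$\sigma^2 = \sum_{x} (x-\mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = 1.786 - (0.929)^2 = 0.923$$

• Computing the standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{0.923} = 0.961$$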
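The same calculation can be reproduced in a few lines. The sketch below works from the exact probabilities 6/14, 4/14, 3/14 and 1/14 and evaluates the variance both from its definition and from the shortcut form; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# Home-run example: x -> p(x).
p = {0: Fraction(6, 14), 1: Fraction(4, 14), 2: Fraction(3, 14), 3: Fraction(1, 14)}

# Mean: sum of x * p(x).
mu = sum(x * px for x, px in p.items())

# Variance from the definition: sum of (x - mu)^2 * p(x).
var_definition = sum((x - mu) ** 2 * px for x, px in p.items())

# Variance from the shortcut: sum of x^2 * p(x), minus mu^2.
var_shortcut = sum(x**2 * px for x, px in p.items()) - mu**2

sigma = sqrt(float(var_definition))

print(round(float(mu), 3), round(float(var_definition), 3), round(sigma, 3))
print(var_definition == var_shortcut)
```

Because the arithmetic is exact, the two variance expressions agree exactly rather than only to rounding.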
![Page 112: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/112.jpg)
The Binomial distribution
1. We have an experiment with two outcomes – Success (S) and Failure (F).
2. Let p denote the probability of S (Success).
3. In this case q = 1 − p denotes the probability of Failure (F).
4. This experiment is repeated n times independently.
5. X denotes the number of successes occurring in the n repetitions.
![Page 113: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/113.jpg)
The possible values of X are
0, 1, 2, 3, 4, … , (n – 2), (n – 1), n
and p(x) for any of the above values of x is given by:
$$p(x) = \binom{n}{x} p^{x} (1-p)^{n-x} = \binom{n}{x} p^{x} q^{n-x}$$
X is said to have the Binomial distribution with parameters n and p.
![Page 114: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/114.jpg)
Summary:
X is said to have the Binomial distribution with parameters n and p.
1. X is the number of successes occurring in the n repetitions of a Success-Failure Experiment.
2. The probability of success is p.
3. The probability function is
$$p(x) = \binom{n}{x} p^{x} (1-p)^{n-x}$$
![Page 115: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/115.jpg)
Example:
1. A coin is tossed n = 5 times. X is the number of heads occurring in the 5 tosses of the coin. In this case p = ½ and
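$$p(x) = \binom{5}{x} \left(\tfrac{1}{2}\right)^{x} \left(\tfrac{1}{2}\right)^{5-x} = \binom{5}{x} \left(\tfrac{1}{2}\right)^{5} = \binom{5}{x} \frac{1}{32}$$

| x | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| p(x) | 1/32 | 5/32 | 10/32 | 10/32 | 5/32 | 1/32 |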
![Page 116: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/116.jpg)
[Figure: bar graph of p(x) against the number of heads]
![Page 117: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/117.jpg)
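Computing the summary parameters for the distribution – μ, σ², σ

| x | p(x) | x p(x) | x² | x² p(x) |
|---|---|---|---|---|
| 0 | 0.03125 | 0.000 | 0 | 0.000 |
| 1 | 0.15625 | 0.156 | 1 | 0.156 |
| 2 | 0.31250 | 0.625 | 4 | 1.250 |
| 3 | 0.31250 | 0.938 | 9 | 2.813 |
| 4 | 0.15625 | 0.625 | 16 | 2.500 |
| 5 | 0.03125 | 0.156 | 25 | 0.781 |
| Total | 1.000 | 2.500 | | 7.500 |

The totals are $\sum p(x)$, $\sum x\,p(x)$ and $\sum x^2 p(x)$.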
![Page 118: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/118.jpg)
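• Computing the mean:

$$\mu = \sum_{x} x\,p(x) = 2.5$$

• Computing the variance:

$$\sigma^2 = \sum_{x} (x-\mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = 7.5 - (2.5)^2 = 1.25$$

• Computing the standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{1.25} = 1.118$$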
![Page 119: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/119.jpg)
Example:
• A surgeon performs a difficult operation n = 10 times.
• X is the number of times that the operation is a success.
• The success rate for the operation is 80%. In this case p = 0.80 and
• X has a Binomial distribution with n = 10 and p = 0.80.
![Page 120: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/120.jpg)
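Computing p(x) for x = 0, 1, 2, …, 10

$$p(x) = \binom{10}{x} (0.80)^{x} (0.20)^{10-x}$$

| x | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| p(x) | 0.0000 | 0.0000 | 0.0001 | 0.0008 | 0.0055 | 0.0264 |

| x | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|
| p(x) | 0.0881 | 0.2013 | 0.3020 | 0.2684 | 0.1074 |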
![Page 121: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/121.jpg)
The Graph
[Figure: bar graph of p(x) against the number of successes, x = 0, 1, …, 10]
![Page 122: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/122.jpg)
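Computing the summary parameters for the distribution – μ, σ², σ

| x | p(x) | x p(x) | x² | x² p(x) |
|---|---|---|---|---|
| 0 | 0.0000 | 0.000 | 0 | 0.000 |
| 1 | 0.0000 | 0.000 | 1 | 0.000 |
| 2 | 0.0001 | 0.000 | 4 | 0.000 |
| 3 | 0.0008 | 0.002 | 9 | 0.007 |
| 4 | 0.0055 | 0.022 | 16 | 0.088 |
| 5 | 0.0264 | 0.132 | 25 | 0.661 |
| 6 | 0.0881 | 0.528 | 36 | 3.171 |
| 7 | 0.2013 | 1.409 | 49 | 9.865 |
| 8 | 0.3020 | 2.416 | 64 | 19.327 |
| 9 | 0.2684 | 2.416 | 81 | 21.743 |
| 10 | 0.1074 | 1.074 | 100 | 10.737 |
| Total | 1.000 | 8.000 | | 65.600 |

The totals are $\sum p(x)$, $\sum x\,p(x)$ and $\sum x^2 p(x)$.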
![Page 123: Multivariate data. Regression and Correlation The Scatter Plot](https://reader035.vdocuments.site/reader035/viewer/2022062805/5697c0281a28abf838cd68b4/html5/thumbnails/123.jpg)
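• Computing the mean:

$$\mu = \sum_{x} x\,p(x) = 8.0$$

• Computing the variance:

$$\sigma^2 = \sum_{x} (x-\mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = 65.6 - (8.0)^2 = 1.60$$

• Computing the standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{1.60} = 1.265$$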
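The summary parameters for the surgeon example (n = 10, p = 0.80) can be recomputed from the binomial probability function. The sketch below uses exact fractions and finds μ = 8, σ² = 1.6 and σ = √1.60 ≈ 1.265. It also compares the results with the closed forms np and np(1 − p), which are standard results for the binomial distribution but are not stated on these slides.

```python
from fractions import Fraction
from math import comb, sqrt

n, p = 10, Fraction(4, 5)  # success rate 80%

# Binomial probability function p(x) for x = 0, 1, ..., n.
pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

mu = sum(x * px for x, px in pmf.items())
var = sum(x**2 * px for x, px in pmf.items()) - mu**2
sigma = sqrt(float(var))

print(float(mu), float(var), round(sigma, 3))

# Comparison with the closed forms (not from the slides).
print(mu == n * p, var == n * p * (1 - p))
```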