Multivariate data

Graphical Techniques

• The scatter plot

• The two dimensional Histogram

Some Scatter Patterns

Circular

• No relationship between X and Y

• Unable to predict Y from X

Ellipsoidal

• Positive relationship between X and Y

• Increases in X correspond to increases in Y (but not always)

• Major axis of the ellipse has positive slope

Ellipsoidal

• Negative relationship between X and Y

• Increases in X correspond to decreases in Y (but not always)

• Major axis of the ellipse has negative slope

Non-Linear Patterns

[Figure: scatter plots showing non-linear patterns]

Measures of strength of a relationship (Correlation)

• Pearson’s correlation coefficient (r)

• Spearman’s rank correlation coefficient (rho, ρ)

Pearson’s correlation coefficient is defined as follows:

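$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where:

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$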

Properties of Pearson’s correlation coefficient r

1. The value of r is always between –1 and +1.

2. If the relationship between X and Y is positive, then r will be positive.

3. If the relationship between X and Y is negative, then r will be negative.

4. If there is no relationship between X and Y, then r will be zero.

5. The value of r will be +1 if the points (xi, yi) lie on a straight line with positive slope.

6. The value of r will be –1 if the points (xi, yi) lie on a straight line with negative slope.

[Figure: ten scatter plots illustrating r = 1, 0.95, 0.7, 0.4, 0, -0.4, -0.7, -0.8, -0.95 and -1]

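Computing formulae for the statistics:

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$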
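As a quick numerical check that the definitional and computing forms of Sxx, Syy and Sxy agree, here is a minimal Python sketch. The function names and the four-point data set are illustrative assumptions and do not come from the notes.

```python
def sums_by_definition(x, y):
    """Return (Sxx, Syy, Sxy) from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy


def sums_by_computing_formulae(x, y):
    """Return (Sxx, Syy, Sxy) from raw sums, squares and cross products."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 7.0, 8.0]

by_definition = sums_by_definition(x, y)
by_formula = sums_by_computing_formulae(x, y)
print(by_definition)  # (21.0, 26.0, 22.0)
print(by_formula)     # (21.0, 26.0, 22.0)
```

The computing formulae need only one pass over the data, which is why they suit hand calculation. In floating-point arithmetic the definitional form is usually the safer choice when the values are large relative to their spread.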

Spearman’s rank correlation coefficient (rho)

Spearman’s rank correlation coefficient is computed as follows:

• Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.

• Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.

• For any case (i) let (xi, yi) denote the observations on X and Y and let (ri, si) denote the ranks on X and Y.

• If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y. (The largest X should be paired with the largest Y, the smallest X with the smallest Y.)

• If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y. (The largest X should be paired with the smallest Y, the smallest X with the largest Y.)

• If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed relative to the ranks on Y.

Spearman’s rank correlation coefficient is defined as follows:

For each case let di = ri – si = difference in the two ranks.

Then Spearman’s rank correlation coefficient (ρ) is:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

Properties of Spearman’s rank correlation coefficient ρ

1. The value of ρ is always between –1 and +1.

2. If the relationship between X and Y is positive, then ρ will be positive.

3. If the relationship between X and Y is negative, then ρ will be negative.

4. If there is no relationship between X and Y, then ρ will be zero.

5. The value of ρ will be +1 if the ranks of X completely agree with the ranks of Y.

6. The value of ρ will be –1 if the ranks of X are in reverse order to the ranks of Y.

Example

xi 25.0 33.9 16.7 37.4 24.6 17.3 40.2

yi 24.3 38.7 13.4 32.1 28.0 12.5 44.9

Ranking the X’s and the Y’s we get:

ri 4 5 1 6 3 2 7

si 3 6 2 5 4 1 7

Computing the differences in ranks gives us:

di 1 -1 -1 1 -1 1 0

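$$\sum_{i=1}^{n} d_i^2 = 6$$

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6(6)}{7(7^2-1)} = 1 - \frac{36}{7(48)} = 1 - \frac{3}{7(4)} = \frac{25}{28} = 0.893$$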
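The ranking and the calculation of ρ for this example can be sketched in a few lines of Python. The helper names `ranks` and `spearman_rho` are illustrative, and the ranking helper assumes there are no tied values, as is the case here.

```python
def ranks(values):
    """Assign ranks 1..n in increasing order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from the rank differences d_i = r_i - s_i."""
    n = len(x)
    r, s = ranks(x), ranks(y)
    sum_d2 = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]

print(ranks(x))  # [4, 5, 1, 6, 3, 2, 7]
print(ranks(y))  # [3, 6, 2, 5, 4, 1, 7]
rho = spearman_rho(x, y)
print(round(rho, 3))  # 0.893
```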

Computing Pearson’s correlation coefficient, r, for the same problem:

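$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$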

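To compute r, first compute S_xx, S_yy and S_xy. With n = 7:

$$A = \sum_{i=1}^{n} x_i = 195.1, \qquad B = \sum_{i=1}^{n} y_i = 193.9$$

$$C = \sum_{i=1}^{n} x_i^2 = 5972.35, \qquad D = \sum_{i=1}^{n} y_i^2 = 6254.41, \qquad E = \sum_{i=1}^{n} x_i y_i = 6053.78$$

Then

$$S_{xx} = C - \frac{A^2}{n} = 5972.35 - \frac{(195.1)^2}{7} = 534.63$$

$$S_{yy} = D - \frac{B^2}{n} = 6254.41 - \frac{(193.9)^2}{7} = 883.38$$

$$S_{xy} = E - \frac{AB}{n} = 6053.78 - \frac{(195.1)(193.9)}{7} = 649.51$$

and

$$r = \frac{649.51}{\sqrt{(534.63)(883.38)}} = 0.945$$

Compare with

$$\rho = 0.893$$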
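The hand calculation of r can be reproduced with a short Python sketch. It follows the computing formulae and reuses the labels A to E from the working; nothing here is a library routine.

```python
import math

x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]
n = len(x)

A = sum(x)                                   # sum of x_i
B = sum(y)                                   # sum of y_i
C = sum(xi ** 2 for xi in x)                 # sum of x_i squared
D = sum(yi ** 2 for yi in y)                 # sum of y_i squared
E = sum(xi * yi for xi, yi in zip(x, y))     # sum of cross products

s_xx = C - A ** 2 / n
s_yy = D - B ** 2 / n
s_xy = E - A * B / n
r = s_xy / math.sqrt(s_xx * s_yy)

print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2))  # 534.63 883.38 649.51
print(round(r, 3))  # 0.945
```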

Comments: Spearman’s rank correlation coefficient ρ and Pearson’s correlation coefficient r

1. The value of ρ can also be computed from the following (the two forms agree when there are no tied ranks):

$$\rho = \frac{\sum_{i=1}^{n}(r_i-\bar{r})(s_i-\bar{s})}{\sqrt{\sum_{i=1}^{n}(r_i-\bar{r})^2\,\sum_{i=1}^{n}(s_i-\bar{s})^2}}$$

2. That is, Spearman’s ρ is Pearson’s r computed from the ranks.

3. Spearman’s ρ is less sensitive to extreme observations (outliers).

4. The value of Pearson’s r is much more sensitive to extreme outliers.

This is similar to the comparison between the median and the mean, and between the standard deviation and the pseudo-standard deviation. The mean and standard deviation are more sensitive to outliers than the median and pseudo-standard deviation.
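Both comments can be illustrated with the example data. The sketch below computes ρ as Pearson's r on the ranks, then replaces the largest x value by a much larger one to show that ρ is unchanged while r moves. The outlier value 402.0 is a hypothetical choice made for illustration, and the helper names are not from the notes.

```python
import math


def pearson_r(u, v):
    """Pearson's r from deviations about the means."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def ranks(values):
    """Assign ranks 1..n in increasing order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]

r = pearson_r(x, y)
rho = pearson_r(ranks(x), ranks(y))  # Spearman's rho as Pearson's r on ranks

# Hypothetical outlier: the largest x value (40.2) becomes 402.0.
# The ranks of x are unchanged, so rho is unchanged, but r is not.
x_outlier = x[:-1] + [402.0]
r_outlier = pearson_r(x_outlier, y)
rho_outlier = pearson_r(ranks(x_outlier), ranks(y))

print(round(rho, 3), round(rho_outlier, 3))  # 0.893 0.893
print(round(r, 3), round(r_outlier, 3))
```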

Simple Linear Regression

Fitting straight lines to data

The Least Squares Line (The Regression Line)

• When data is correlated it falls roughly about a straight line.

In this situation one wants to:

• Find the equation of the straight line through the data that yields the best fit.

The equation of any straight line is of the form:

Y = a + bX

b = the slope of the line
a = the intercept of the line

For any two points (x1, y1) and (x2, y2) on the line, with Run = x2 – x1 and Rise = y2 – y1:

$$b = \frac{\text{Rise}}{\text{Run}} = \frac{y_2-y_1}{x_2-x_1}$$

• a is the value of Y when X is zero.

• b is the rate at which Y increases per unit increase in X.

• For a straight line this rate is constant.

• For non-linear curves the rate at which Y increases per unit increase in X varies with X.

[Figure: two plots contrasting a linear relationship with a non-linear one]

Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:

Age Class          30-40   40-50   50-60   60-70   70-80
Midpoint Age (X)   35      45      55      65      75
Median BP (Y)      114     124     143     158     166

Graph: [Figure: the data plotted with the fitted line Y = 65.1 + 1.38 X]

Interpretation of the slope and intercept

1. Intercept – value of Y at X = 0.

– Predicted blood pressure of a newborn (65.1).

– This interpretation remains valid only if linearity holds down to X = 0.

2. Slope – rate of increase in Y per unit increase in X.

– Blood pressure increases 1.38 units each year.

The Least Squares Line

Fitting the best straight line to “linear” data

Reasons for fitting a straight line to data

1. It provides a precise description of the relationship between Y and X.

2. The interpretation of the parameters of the line (slope and intercept) leads to an improved understanding of the phenomenon that is under study.

3. The equation of the line is useful for prediction of the dependent variable (Y) from the independent variable (X).

Assume that we have collected data on two variables X and Y. Let

(x1, y1) (x2, y2) (x3, y3) … (xn, yn)

denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

Let

Y = a + b X

denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.

For example, if X = xi (as for the ith case) then the predicted value of Y is:

$$\hat{y}_i = a + b x_i$$

For example, if

Y = a + b X = 25.2 + 2.0 X

is the equation of the straight line, and if X = xi = 20 (for the ith case), then the predicted value of Y is:

$$\hat{y}_i = a + b x_i = 25.2 + (2.0)(20) = 65.2$$

If the actual value of Y is yi = 70.0 for case i, then the difference

$$y_i - \hat{y}_i = 70 - 65.2 = 4.8$$

is the error in the prediction for case i. The quantity

$$r_i = y_i - \hat{y}_i = y_i - a - b x_i$$

is also called the residual for case i.

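If the residual

$$r_i = y_i - \hat{y}_i = y_i - a - b x_i$$

is computed for each case in the sample,

$$r_1 = y_1 - \hat{y}_1,\quad r_2 = y_2 - \hat{y}_2,\quad \ldots,\quad r_n = y_n - \hat{y}_n,$$

then the residual sum of squares (RSS)

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

is a measure of the “goodness of fit” of the line Y = a + bX to the data.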

[Figure: points (x1, y1), (x2, y2), (x3, y3), (x4, y4) plotted about the line Y = a + bX, with the residuals r1, r2, r3, r4 marked]

The optimal choice of a and b will result in the residual sum of squares

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

attaining a minimum.

If this is the case then the line

Y = a + bX

is called the Least Squares Line.

[Figure: six scatter plots of the same data, each with a different candidate line and its residual sum of squares]

Line                   R.S.S
Y = 10 + (0.5)X        3389.9
Y = 15 + (0.5)X        1861.9
Y = 20 + (0.5)X        833.9
Y = 20 + (1)X          883.1
Y = 20 + (0.7)X        303.98
Y = 26.46 + (0.55)X    225.74
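The data behind these panels are not listed in the notes, so the following Python sketch illustrates the same idea with the blood pressure data from the earlier example. It evaluates the RSS for a few candidate lines; the first two candidates are arbitrary choices made for illustration, and the third is the fitted line Y = 65.1 + 1.38 X.

```python
def rss(a, b, x, y):
    """Residual sum of squares of the line Y = a + bX for the data (x, y)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


age = [35, 45, 55, 65, 75]          # midpoint age (X)
bp = [114, 124, 143, 158, 166]      # median blood pressure (Y)

# (intercept, slope) pairs; the first two are hypothetical candidates.
candidates = [(60.0, 1.5), (70.0, 1.3), (65.1, 1.38)]
for a, b in candidates:
    print(f"Y = {a} + ({b})X   RSS = {rss(a, b, age, bp):.2f}")
```

The last line gives the smallest RSS of the three, as expected for the least squares line.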

The equation for the least squares line

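Let

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

Computing Formulae:

$$S_{xx} = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$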

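Then the slope of the least squares line can be shown to be:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

and the intercept of the least squares line can be shown to be:

$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$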
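The two formulas translate directly into code. The sketch below is a minimal illustration, with a hypothetical function name and a sanity check on points that lie exactly on a known line; it assumes the x values are not all equal, so that Sxx is non-zero.

```python
def least_squares_line(x, y):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx          # slope: Sxy / Sxx
    a = y_bar - b * x_bar    # intercept: y-bar minus b times x-bar
    return a, b


# Hypothetical points lying exactly on Y = 1 + 2X.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = least_squares_line(x, y)
print(a, b)  # 1.0 2.0
```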

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi     Yi
Australia        48     18
Canada           50     15
Denmark          38     17
Finland         110     35
Great Britain   110     46
Holland          49     24
Iceland          23      6
Norway           25      9
Sweden           30     11
Switzerland      51     25
USA             130     20

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with each point labelled by country]

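Fitting the Least Squares Line

$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$

$$\sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018$$

First compute the following three quantities:

$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$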

Computing Estimate of Slope and Intercept

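$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$

$$a = \bar{y} - b\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$

(The intercept is evaluated with the unrounded slope, b ≈ 0.22844.)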
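As a check on the arithmetic, the following Python sketch recomputes the sums of squares, slope and intercept for the cigarette data using the computing formulae. Variable names are illustrative.

```python
# Per capita cigarette consumption (1930) and lung cancer death rate (1950),
# in the order: Australia, Canada, Denmark, Finland, Great Britain, Holland,
# Iceland, Norway, Sweden, Switzerland, USA.
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
s_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx                      # slope
a = sum(y) / n - b * sum(x) / n      # intercept

print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2))  # 14322.55 1374.73 3271.82
print(round(b, 3), round(a, 3))                        # 0.228 6.756
```

This agrees with the fitted line Y = 6.756 + (0.228)X.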

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, labelled by country, with the fitted line]

Y = 6.756 + (0.228)X

Interpretation of the slope and intercept

1. Intercept – value of Y at X = 0.

– Predicted death rate from lung cancer (6.756) for men in 1950 in countries with no smoking in 1930 (X = 0).

2. Slope – rate of increase in Y per unit increase in X.

– Death rate from lung cancer for men in 1950 increases 0.228 units for each additional cigarette per capita per month consumed in 1930.

Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:

Age Class          30-40   40-50   50-60   60-70   70-80
Midpoint Age (X)   35      45      55      65      75
Median BP (Y)      114     124     143     158     166

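$$\sum_{i=1}^{n} x_i = 275, \qquad \sum_{i=1}^{n} y_i = 705$$

$$\sum_{i=1}^{n} x_i^2 = 16{,}125, \qquad \sum_{i=1}^{n} x_i y_i = 40{,}155, \qquad \sum_{i=1}^{n} y_i^2 = 101{,}341$$

$$S_{xx} = 16125 - \frac{(275)^2}{5} = 1000$$

$$S_{yy} = 101341 - \frac{(705)^2}{5} = 1936$$

$$S_{xy} = 40155 - \frac{(275)(705)}{5} = 1380$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{1380}{1000} = 1.38$$

$$a = \bar{y} - b\bar{x} = \frac{705}{5} - (1.380)\left(\frac{275}{5}\right) = 65.1$$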

Graph: [Figure: the data plotted with the fitted line Y = 65.1 + 1.38 X]

Relationship between correlation and Linear Regression

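1. Pearson’s correlation.

• Takes values between –1 and +1

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$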

2. Least squares line Y = a + bX

– Minimises the Residual Sum of Squares:

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

– The Sum of Squares that measures the variability in Y that is unexplained by X.

– This can also be denoted by:

$$SS_{Unexplained}$$

Some other Sums of Squares:

– The Sum of Squares that measures the total variability in Y (ignoring X).

$$SS_{Total} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

– The Sum of Squares that measures the total variability in Y that is explained by X.

$$SS_{Explained} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$$

It can be shown:

(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)

$$\sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

$$SS_{Total} = SS_{Explained} + SS_{Unexplained}$$

It can also be shown:

$$r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

= proportion of variability in Y explained by X.

= the coefficient of determination

Further:

$$1 - r^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

= proportion of variability in Y that is unexplained by X.
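The decomposition and both expressions for r² can be verified numerically. The sketch below uses the blood pressure data from the earlier example; the variable names are illustrative, and the line is fitted with the slope and intercept formulas given above.

```python
age = [35, 45, 55, 65, 75]          # midpoint age (X)
bp = [114, 124, 143, 158, 166]      # median blood pressure (Y)
n = len(age)

x_bar = sum(age) / n
y_bar = sum(bp) / n
s_xx = sum((xi - x_bar) ** 2 for xi in age)
s_yy = sum((yi - y_bar) ** 2 for yi in bp)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(age, bp))

b = s_xy / s_xx                      # least squares slope
a = y_bar - b * x_bar                # least squares intercept
fitted = [a + b * xi for xi in age]  # predicted values y-hat

ss_total = sum((yi - y_bar) ** 2 for yi in bp)
ss_explained = sum((fi - y_bar) ** 2 for fi in fitted)
ss_unexplained = sum((yi - fi) ** 2 for yi, fi in zip(bp, fitted))

r_squared = s_xy ** 2 / (s_xx * s_yy)

print(round(ss_total, 1), round(ss_explained, 1), round(ss_unexplained, 1))
print(round(r_squared, 4), round(ss_explained / ss_total, 4))
```

The first printed line shows the three sums of squares, with the total equal to the sum of the other two. The second shows that r² computed from Sxx, Syy and Sxy matches the ratio of explained to total variability.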

Web sites demonstrating statistical principles using Java applets:

These can be found at the link: http://www.csustan.edu/ppa/llg/stat_demos.htm

Example

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi     Yi
Australia        48     18
Canada           50     15
Denmark          38     17
Finland         110     35
Great Britain   110     46
Holland          49     24
Iceland          23      6
Norway           25      9
Sweden           30     11
Switzerland      51     25
USA             130     20

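$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$

$$a = \bar{y} - b\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$

(The intercept is evaluated with the unrounded slope, b ≈ 0.22844.)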

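Computing r and r²

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{3271.82}{\sqrt{(14322.55)(1374.73)}} = 0.737$$

$$r^2 = (0.737)^2 = 0.544 \quad \text{(computed from the unrounded } r\text{)}$$

54.4% of the variability in Y (death rate due to lung cancer in 1950) is explained by X (per capita cigarette smoking in 1930).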
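A short Python sketch reproduces r and r² for these data. It also checks that the unexplained share, computed directly from the residuals of the least squares line, equals 1 − r². The variable names are illustrative.

```python
import math

x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]   # cigarettes (1930)
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]        # death rate (1950)
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Residual sum of squares of the least squares line, computed directly.
b = s_xy / s_xx
a = y_bar - b * x_bar
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

print(round(r, 3), round(r_squared, 3))  # 0.737 0.544
print(round(rss / s_yy, 3))              # 0.456
```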

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, labelled by country, with the fitted line]

Y = 6.756 + (0.228)X
