introduction to biostatistics, harvard extension school © scott evans, ph.d. and lynne peeples,...

© Scott Evans, Ph.D. and Lynne Peeples, M.S.

Introduction to Biostatistics, Harvard Extension School

Descriptive Statistics, The Normal Distribution, and Standardization



Happy Valentine’s Day!

How many candy hearts in a box of NECCO Sweethearts?

1, 2, 3, 4, …, 40?



Big Picture revisited…

[Diagram: Population (μ, σ, σ²) and Sample (x̄, s, s²), linked by Step I, Step II, and Step III: Statistical Inference (w/ Probability)]



SAMPLE = Boxes of Sweethearts

x̄, s, s²

POPULATION = All boxes of Sweethearts

μ, σ, σ²

~ 8 billion hearts made each year at NECCO!!

Step I: Take the Sample

Want a representative sample of boxes (i.e., ideally from different batches, so boxes were purchased at different stores in Cambridge and Boston)

The larger the sample, the better



Step I: Take the Sample

SAMPLE = 12 boxes

Sweetheart counts ranging from 27 to 36

x1 = 29, x2 = 31, x3 = 32, x4 = 27, x5 = 36, x6 = 35, x7 = 29, x8 = 30, x9 = 31, x10 = 29, x11 = 28, x12 = 33



Step II: Describe the Sample

Descriptive Statistics
• Measures of Central Tendency
• Measures of Variability
• Other Descriptive Measures

How can we describe our Sweetheart sample?



Measures of Central Tendency

Measures the “center” of the data

Examples:
• Mean
• Median
• Mode

The choice of which to use depends on the situation. It is okay to report more than one; they are simply descriptive (not inferential).

However, when space is limited (e.g., in a journal article), you may be forced to choose.



Mean

The “average”. If the data are made up of n observations:

x1, x2,…, xn, then the mean is given by the sum of the observations divided by the number of observations.

For example, if the data are: x1=1, x2=2, x3=3, then the mean is (1+2+3)/3=2.

Often denoted as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$



Mean

The population mean is often denoted by μ. This is usually unknown (although we try to make inferences about this).

The sample mean is an estimator of the population mean.



Mean

What is the mean of our sample of Sweethearts?

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{29 + 31 + \dots + 28 + 33}{12} = \frac{370}{12} = 30.83$$

≈ 31 Sweethearts
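The same calculation can be checked in a few lines of Python. This is only a sketch of the arithmetic; the variable names (`counts`, `sample_mean`) are illustrative and not part of the course material.

```python
# Sweetheart counts for the 12 sampled boxes, as listed in Step I.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

n = len(counts)          # number of observations
total = sum(counts)      # sum of the observations
sample_mean = total / n  # sample mean = sum / n

print(f"n = {n}, sum = {total}, mean = {sample_mean:.2f}")
```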



Median

The “middle observation” according to its rank in the data.

The median is:
• The observation with rank (n+1)/2 if n is odd. For example, if the data are {1,2,3}, then the median is 2.
• The average of the observations with rank n/2 and (n+2)/2 if n is even. For example, if the data are {1,2,3,4}, then the median is 2.5.



Median

What is the median of our sample of Sweethearts? Sort our 12 boxes in order by counts:

27, 28, 29, 29, 29, 30 | 31, 31, 32, 33, 35, 36

In our example, 30 and 31 are our middle numbers…So, the median = 30.5.
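The rank rule can be written out directly. The function below is a minimal sketch (the name `median_by_rank` is hypothetical); ranks start at 1 while Python list indices start at 0, hence the `- 1`.

```python
def median_by_rank(data):
    """Median via the rank rule: rank (n+1)/2 if n is odd,
    otherwise the average of ranks n/2 and (n+2)/2."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]
sweetheart_median = median_by_rank(counts)
print(sweetheart_median)
```

Python's built-in `statistics.median` follows the same rule and could be used instead.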



Median

Another example: Income level with Bill Gates in the room.

The median is more robust than the mean to extreme observations.

If data are skewed to the right, then mean > median (in general). For example, if the data are {1,2,3,4,20} then median=3 and mean=6.

If data are skewed to the left, then mean < median (in general).

For example, if the data are {1,15,16,18,20} then median=16 and mean=14.

If data are symmetric, then mean≈median
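A quick way to see this robustness is to make the extreme observation even more extreme and compare the two measures. The sketch below uses the right-skewed example data; the replacement value 2000 is an arbitrary choice made only for illustration.

```python
import statistics

right_skewed = [1, 2, 3, 4, 20]
more_extreme = [1, 2, 3, 4, 2000]  # same data, largest value made far larger

mean_before = statistics.mean(right_skewed)
median_before = statistics.median(right_skewed)
mean_after = statistics.mean(more_extreme)
median_after = statistics.median(more_extreme)

# The mean follows the extreme value; the median does not move.
print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```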



Mode

The value that occurs the most often. For example, if data are {1,1,2,2,2,2,3,3}, the mode is 2.

Good for ordinal or nominal data in which there are a limited number of categories.

Not very useful for continuous data. For example, if data are {2,2,3,4,5,6,7,8,9}, the mode is 2, but it is not a good measure of central tendency in this case.

29 appears the most often (3x) in our Sweetheart example.
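Counting how often each value occurs gives the mode. A minimal sketch using `collections.Counter` from the Python standard library (variable names are illustrative):

```python
from collections import Counter

counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

# most_common(1) returns a list holding the single (value, frequency) pair
# with the highest frequency.
mode_value, mode_frequency = Counter(counts).most_common(1)[0]

print(f"mode = {mode_value}, appearing {mode_frequency} times")
```

If several values tie for the highest frequency, `most_common(1)` reports only one of them; `statistics.multimode` returns all of them.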



Measures of Variability

Measure the “spread” in the data

Example: Age distribution in the Extension School vs. FAS college

Some important measures:
• Variance
• Standard Deviation
• Range
• Interquartile Range

The larger the value of these measures, the larger the spread and variability.



Variance

The sample variance (s²) may be calculated from the data. It is the average of the squared deviations of the observations from the mean, with n − 1 rather than n in the denominator.

The population variance is often denoted by σ2. This is usually unknown.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$



Variance

The deviations are squared because we are only interested in the size of the deviation rather than the direction (larger or smaller than the mean).

Note:

$$\sum_{i=1}^{n}\left(X_i - \bar{X}\right) = 0$$

Why?
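One way to see why the deviations sum to zero is to expand the sum and substitute the definition of the sample mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$:

```latex
\sum_{i=1}^{n}\left(X_i - \bar{X}\right)
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = \sum_{i=1}^{n} X_i - n \cdot \frac{1}{n}\sum_{i=1}^{n} X_i
  = 0
```

Because positive and negative deviations always cancel in this way, the raw deviations cannot measure spread, which is why they are squared.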



Variance

The reason that we divide by n − 1 instead of n has to do with the number of “information units” in the variance. After estimating the sample mean, only n − 1 of the deviations are free to vary: once n − 1 of them are known, the last one is determined, because the deviations sum to zero (degrees of freedom).



Variance

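For our Sweetheart data…

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{12-1}\sum_{i=1}^{12}\left(X_i - 30.83\right)^2$$

$$= \frac{1}{11}\left[(29-30.83)^2 + (31-30.83)^2 + \dots + (33-30.83)^2\right]$$

$$= \frac{(-1.8)^2 + 0.2^2 + \dots + 2.2^2}{11} = \frac{3.24 + 0.04 + \dots + 4.84}{11} = \frac{83.67}{11} = 7.61$$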
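The variance calculation can be reproduced step by step in Python. This is a sketch of the arithmetic only; it keeps the unrounded mean, so intermediate values differ slightly from the rounded ones on the slide, and it uses `statistics.variance` (which also divides by n − 1) as a cross-check.

```python
import statistics

counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]
n = len(counts)
sample_mean = sum(counts) / n

deviations = [x - sample_mean for x in counts]
sum_of_deviations = sum(deviations)                # should be (numerically) zero
sum_of_squares = sum(d ** 2 for d in deviations)   # about 83.67

sample_variance = sum_of_squares / (n - 1)         # divide by n - 1, not n
library_variance = statistics.variance(counts)     # same n - 1 definition

print(f"sum of squared deviations = {sum_of_squares:.2f}")
print(f"s^2 = {sample_variance:.2f}")
```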



Standard Deviation

Square root of the variance

s = sqrt(s²) = sample SD
• Calculate from the data (see formula for s²)

σ = sqrt(σ²) = population SD
• Usually unknown

Expressed in the same units as the mean (instead of squared units like the variance)

In our Sweetheart example,

$$s_{\text{sweetheart}} = \sqrt{7.61} = 2.76$$

Now we have summarized the sample with just 2 numbers!



Range

Maximum − Minimum

Sweetheart example: 36 − 27 = 9

Very sensitive to extreme observations (outliers)



Interquartile Range

IQR = Q3 − Q1
• Q1: the first quartile
• Q3: the third quartile

More robust than the range to extreme observations

In our example, 27, 28, 29 | 29, 29, 30 | 31, 31, 32 | 33, 35, 36

Q1 = 29 and Q3 = 32.5, so IQR = 32.5 − 29 = 3.5 Sweethearts
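The range and IQR for the Sweetheart data can be sketched as follows. The quartile rule here (Q1 and Q3 as the medians of the lower and upper halves of the sorted data) is an assumption chosen to match the four-group split shown above; other quartile conventions, including those used by some software defaults, can give slightly different values.

```python
import statistics

counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]
ordered = sorted(counts)
n = len(ordered)

data_range = max(ordered) - min(ordered)

# Assumed quartile rule: split the sorted data in half (n is even here)
# and take the median of each half.
lower_half = ordered[: n // 2]
upper_half = ordered[n // 2 :]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```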



Other Descriptive Measures

Minimum and Maximum
• Very sensitive to extreme observations

Sample size (N) (i.e., 12 boxes)

Percentiles
Examples:
• Median = 50th percentile
• Q1, Q3 = 25th and 75th percentiles



Small Samples

For very small samples (e.g., <5 observations), summary statistics are not very meaningful (actually can be misleading).

Better to simply list the data.



Example – Firefighter CHD Study

Table 4: CHD Retirements versus Active Firefighters (Controls)

                                          CHD Retirements (n=277)    Active Firefighters (n=310)
                                          Mean (Median), % (n)       Mean (Median), % (n)

Age                                       54.2 (55.0)                39.3 (39.0)
Age ≥ 45 years old                        94% (261)                  21% (64)
Current Smoking                           30% (76)                   10% (31)
Hypertension                              59% (141)                  21% (65)
Cholesterol ≥ 5.18 mmol/L (200 mg/dl)     80% (169)                  63% (196)
Prior Diagnosis of CHD                    22% (48)                   1% (3)
BMI                                       30.3 (29.8)                28.9 (28.4)
Obesity, BMI ≥ 30                         41% (98)                   34% (104)



Example – A5095

                                          TZV (n=382)    Pooled EFV (n=765)

Male                                      81%            81%
Mean age, years                           38.0           38.0
Race or ethnic group
  Non-Hispanic White                      39%            41%
  Non-Hispanic Black                      37%            36%
  Hispanic                                21%            21%
  Other                                   2%             <1%
Mean baseline HIV RNA, log10 c/mL         4.85           4.86
100,000 c/mL at screening                 43%            43%
Mean baseline CD4 count, cells/mm³        234            242



Random Variables

Variable
• A characteristic that can be measured, categorized, quantified, or qualified.

Random variable
• A variable whose value is determined by a random phenomenon (i.e., not determined by study design)

Continuous random variable
• Can take on any value within a specified interval or continuum



Probability Distributions

Every random variable has a corresponding probability distribution

A probability distribution describes the behavior of the random variable
• It identifies possible values of the random variable and provides information about the probability that these values (or ranges of values) will occur.

A particularly important probability distribution is the Normal Distribution…



Normal Distribution

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$



Normal Distribution

“ Bell-shaped curve”

Symmetric about its mean (μ)

The closer that an observation is to the mean, the more frequently it occurs.

Notation: X ~ N(μ, σ)



Location & Shape

μ = LOCATION

σ = SHAPE

Note that distributions may have the same mean but differ in their spread (shape)



Normal Distribution

The normal distribution, N(μ, σ) can be described by the following “density function”:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$



Normal Distribution

The area under this curve (function) is one.

Probabilities may be calculated as the area under the curve (above the x-axis).

Integration (calculus) can help quantify these areas (probabilities).
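The area claim can be checked numerically. The sketch below integrates the normal density with a simple trapezoid rule; the values μ = 100 and σ = 15 are illustrative choices, and the function names (`normal_pdf`, `area_under_curve`) and the integration limits of ±8 SDs are assumptions made for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), where sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_curve(lower, upper, mu, sigma, steps=10_000):
    """Trapezoid-rule approximation of the area between lower and upper."""
    width = (upper - lower) / steps
    total = 0.5 * (normal_pdf(lower, mu, sigma) + normal_pdf(upper, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lower + i * width, mu, sigma)
    return total * width


mu, sigma = 100, 15  # illustrative values only
total_area = area_under_curve(mu - 8 * sigma, mu + 8 * sigma, mu, sigma)
within_one_sd = area_under_curve(mu - sigma, mu + sigma, mu, sigma)

print(f"total area ~ {total_area:.4f}")
print(f"P(mu - sigma < X < mu + sigma) ~ {within_one_sd:.4f}")
```

In practice a standard normal table or a library function replaces this integration; the sketch only shows that "probability = area" is something one can compute.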



Moving towards Step III…

[Diagram: Population (μ, σ, σ²) and Sample (x̄, s, s²), linked by Step I, Step II, and Step III: Statistical Inference (w/ Probability)]



Standard Normal Distribution

A special normal distribution: N(0,1)

Values from this distribution represent the number of SDs away from the mean (0).

Known properties of this distribution
• Can make probabilistic statements using the standard normal table



Standard Normal Distribution

[Figure: normal curve with a Z axis (−4 to 4) and an X axis (μ−2σ, μ−σ, μ, μ+σ, μ+2σ)]

• For any variable X, with mean μ and SD = σ:

$$Z = \frac{X - \mu}{\sigma}$$

• Z now has mean 0 and SD = 1.

• This “standardization” creates a variable Z, such that values of this variable represent the number of SDs away from the mean (0).



Standard Normal Table



Standardization

Common Mistake: If X has mean μ and SD = σ, then Z = (X − μ)/σ ~ N(0,1). This is NOT true!!

It is true that Z has mean 0 and SD = 1 (standardization). However, Z is normal only if X was also normal.
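A small simulation illustrates the point. The sketch below standardizes draws from an exponential distribution, a right-skewed distribution chosen here purely as an example of non-normal data; the seed, sample size, and use of skewness as the check are all assumptions of this illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.expovariate(1.0) for _ in range(20_000)]  # right-skewed, not normal

x_mean = statistics.fmean(x)
x_sd = statistics.stdev(x)
z = [(value - x_mean) / x_sd for value in x]  # standardize

z_mean = statistics.fmean(z)
z_sd = statistics.stdev(z)
# Skewness is 0 for a normal distribution; a clearly positive value
# shows the standardized data are still right-skewed.
z_skewness = statistics.fmean(value ** 3 for value in z)

print(f"mean of Z = {z_mean:.3f}, SD of Z = {z_sd:.3f}")
print(f"skewness of Z = {z_skewness:.2f}")
```

Standardizing shifts and rescales the data but leaves the shape of the distribution unchanged.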



Standardization

However, if X~N(μ,σ), then Z=(X-μ)/σ ~N(0,1). We can then make probabilistic statements about X.

Thus we can make probabilistic statements about any variable with any normal distribution.



Example - IQ

IQ~N(100,15)

What’s the probability that a person chosen at random has an IQ>135?

Z = (135-100)/15 = 2.33

$$Z = \frac{IQ - 100}{15}$$

[Figure: normal curve with a Z axis (−4 to 4) and an IQ axis (70, 85, 100, 115, 130)]



Example - IQ

P(Z>2.33) = 0.010



Example – IQ

[Figure: normal curve with a Z axis (−4 to 4) and an IQ axis (70, 85, 100, 115, 130)]

What’s the probability that a person chosen at random has an IQ<90?

Z = (90-100)/15 = -0.67

By symmetry, P(Z<-0.67) = P(Z>0.67)

Probabilities that a person chosen at random has an IQ between two values may also be obtained.



Example - IQ

P(Z>0.67) = 0.251
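The table lookups in the IQ examples can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8+). This is a sketch: the Z values are rounded to two decimals to mimic a printed standard normal table, so the unrounded probabilities computed afterwards differ slightly in the third decimal place.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # IQ ~ N(100, 15), sigma is the SD
std_normal = NormalDist()          # N(0, 1)

# Table-style calculation: standardize, round Z to two decimals, look up.
z_high = round((135 - 100) / 15, 2)             # 2.33
p_above_135_table = 1 - std_normal.cdf(z_high)  # P(Z > 2.33)

z_low = round((90 - 100) / 15, 2)               # -0.67
p_below_90_table = std_normal.cdf(z_low)        # P(Z < -0.67) = P(Z > 0.67)

# Direct calculation without rounding Z.
p_above_135 = 1 - iq.cdf(135)
p_below_90 = iq.cdf(90)

# Probability of an IQ between two values: difference of two areas.
p_between_90_and_135 = iq.cdf(135) - iq.cdf(90)

print(f"P(IQ > 135) ~ {p_above_135_table:.3f} (table-style)")
print(f"P(IQ < 90)  ~ {p_below_90_table:.3f} (table-style)")
print(f"P(90 < IQ < 135) ~ {p_between_90_and_135:.3f}")
```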



Central Limit Theorem

A very important result in statistics that permits use of the normal distribution for making inferences (hypothesis testing and estimation) concerning the population mean.

$$\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$




Population (any distribution): μ, σ

• All samples of size n

Sample 1: x̄1
Sample 2: x̄2
Sample 3: x̄3
Sample 4: x̄4
Sample 5: x̄5

Sample Means

$$\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$




If the distribution of each observation in the population has mean μ and standard deviation σ, then regardless of whether the distribution is normal or not:

1. The distribution of the sample means (from samples of size n taken from the population) has mean μ, identical to that of the population.

2. The standard deviation of this distribution is

$$\sigma_{\bar{X}_n} = \frac{\sigma}{\sqrt{n}}$$

It decreases as n increases and increases as σ increases.

3. As n gets large, the shape of the distribution of the sample means is approximately that of a normal distribution.




Variable X, population mean=100, SD=15

Samples of size 25 (for example)

Sample 1, mean=90
Sample 2, mean=115
Sample 3, mean=101
Sample 4, mean=94
. .
Sample 30, mean=99




Plot sample means (histogram):

The sample means have mean 100

The sample means have a SD of

$$\frac{15}{\sqrt{25}} = \frac{15}{5} = 3$$

The distribution of sample means would tend to be normal as n gets large.
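A simulation can make this concrete. The sketch below draws many samples of size 25 from a population with mean 100 and SD 15 and summarizes the sample means. The uniform population, the 5,000 repetitions, and the seed are assumptions of this illustration; the uniform was picked only because it is clearly not normal.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 100, 15
SAMPLE_SIZE, NUM_SAMPLES = 25, 5_000

# Assumption: a uniform population. A uniform on [a, b] has SD (b - a) / sqrt(12),
# so a half-width of SD * sqrt(3) gives the required SD of 15.
half_width = POP_SD * math.sqrt(3)
rng = random.Random(42)  # fixed seed so the run is repeatable


def draw_sample_mean():
    """Draw one sample of size SAMPLE_SIZE and return its mean."""
    sample = [
        rng.uniform(POP_MEAN - half_width, POP_MEAN + half_width)
        for _ in range(SAMPLE_SIZE)
    ]
    return statistics.fmean(sample)


sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = POP_SD / math.sqrt(SAMPLE_SIZE)  # 15 / 5 = 3

print(f"mean of sample means ~ {mean_of_means:.2f} (theory: {POP_MEAN})")
print(f"SD of sample means   ~ {sd_of_means:.2f} (theory: {theoretical_sd})")
```

Plotting a histogram of `sample_means` would show the roughly bell-shaped curve described above, even though the individual observations are uniform.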




Now we can combine this normality result from the CLT with standardization to make probabilistic statements about the population mean!



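Sampling Distribution of Sweethearts

Assume μ = 30.83 and σ = 2.71, so that

$$\frac{\sigma}{\sqrt{12}} = \frac{2.71}{3.46} = 0.78$$

[Figure: two curves. Population Distribution with axis ticks 28.1, 30.8, 33.5; Sampling Distribution of Means with axis ticks 30.0, 30.8, 31.6]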



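Sampling Distribution of Sweethearts

[Figure: two curves. Population Distribution with axis ticks 28.1, 30.8, 33.5 aligned with Z = −1, 0, 1; Sampling Distribution of Means with axis ticks 30.0, 30.8, 31.6 aligned with Z = −1, 0, 1]

Population Distribution:

$$Z = \frac{X - 30.83}{2.73}$$

Sampling Distribution of Means:

$$Z = \frac{\bar{X} - 30.83}{0.78}$$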
