introduction to biostatistics, harvard extension school © scott evans, ph.d. and lynne peeples,...
Post on 19-Dec-2015
© Scott Evans, Ph.D. and Lynne Peeples, M.S.
Introduction to Biostatistics, Harvard Extension School
Descriptive Statistics, the Normal Distribution, and Standardization
Happy Valentine’s Day!
How many candy hearts in a box of NECCO Sweethearts?
1, 2, 3, 4, …, 40?
Big Picture revisited…
[Diagram: Population (μ, σ, σ²) and Sample (x̄, s, s²), linked by Step I, Step II, and Step III; Step III is statistical inference (with probability)]
SAMPLE = Boxes of Sweethearts
x̄, s, s²
POPULATION = All boxes of Sweethearts
μ, σ, σ²
~ 8 billion hearts made each year at NECCO!!
Step I: Take the Sample
Want a representative sample of boxes (ideally from different batches, so purchased at different stores in Cambridge and Boston)
The larger the sample, the better
Step I: Take the Sample
SAMPLE = 12 boxes
Sweetheart counts ranging from 27 to 36
x1 = 29, x2 = 31, x3 = 32, x4 = 27, x5 = 36, x6 = 35, x7 = 29, x8 = 30, x9 = 31, x10 = 29, x11 = 28, x12 = 33
Step II: Describe the Sample
Descriptive Statistics:
Measures of Central Tendency
Measures of Variability
Other Descriptive Measures
How can we describe our Sweetheart sample?
Measures of Central Tendency
Measures the “center” of the data
Examples: mean, median, mode
The choice of which to use depends… It is okay to report more than one: they are simply descriptive (not inferential).
However, when space is limited (e.g., in a journal article), you may be forced to choose.
Mean
The “average”. If the data are made up of n observations:
x1, x2,…, xn, then the mean is given by the sum of the observations divided by the number of observations.
For example, if the data are: x1=1, x2=2, x3=3, then the mean is (1+2+3)/3=2.
Often denoted as
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Mean
The population mean is often denoted by μ. This is usually unknown (although we try to make inferences about this).
The sample mean is an estimator of the population mean.
Mean
What is the mean of our sample of Sweethearts?
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{29 + 31 + \dots + 28 + 33}{12} = \frac{370}{12} = 30.83 \approx 31 \text{ Sweethearts}$$
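The arithmetic for the sample mean can be checked with a few lines of Python. This is a minimal sketch that simply re-enters the 12 box counts listed on the Step I slide; the variable names are illustrative.

```python
# Sweetheart counts for the 12 sampled boxes, as listed on the Step I slide.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

n = len(counts)          # number of observations
total = sum(counts)      # sum of the observations
sample_mean = total / n  # sum divided by the number of observations

print(f"n = {n}, sum = {total}, mean = {sample_mean:.2f}")
```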
Median
The “middle observation” according to its rank in the data.
The median is:
The observation with rank (n+1)/2 if n is odd. For example, if the data are {1,2,3}, then the median is 2.
The average of the observations with ranks n/2 and (n+2)/2 if n is even. For example, if the data are {1,2,3,4}, then the median is 2.5.
Median
What is the median of our sample of Sweethearts? Sort our 12 boxes in order by counts:
27, 28, 29, 29, 29, 30 | 31, 31, 32, 33, 35, 36
In our example, 30 and 31 are our middle numbers…So, the median = 30.5.
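The rank rule for the median translates directly into code. The sketch below uses a hypothetical helper, `median_by_rank`, that follows the odd/even rule stated earlier and applies it to the 12 box counts.

```python
def median_by_rank(values):
    """Median using the rank rule: rank (n+1)/2 if n is odd,
    otherwise the average of ranks n/2 and (n+2)/2."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Ranks are 1-based, list indices are 0-based.
        return ordered[(n + 1) // 2 - 1]
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]
print(sorted(counts))
print(median_by_rank(counts))
```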
Median
Another example: Income level with Bill Gates in the room.
The median is more robust than the mean to extreme observations.
If data are skewed to the right, then mean > median (in general). For example, if the data are {1,2,3,4,20}, then median=3 and mean=6.
If data are skewed to the left, then mean < median (in general). For example, if the data are {1,15,16,18,20}, then median=16 and mean=14.
If data are symmetric, then mean≈median.
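The two skewed examples can be verified with Python's standard `statistics` module. This is only a quick check of the numbers quoted above, not a general test for skewness.

```python
import statistics

right_skewed = [1, 2, 3, 4, 20]    # one large value pulls the mean up
left_skewed = [1, 15, 16, 18, 20]  # one small value pulls the mean down

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean = {mean}, median = {median}")
```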
Mode
The value that occurs the most often. For example, if data are {1,1,2,2,2,2,3,3}, the mode is 2.
Good for ordinal or nominal data in which there are a limited number of categories.
Not very useful for continuous data. For example, if data are {2,2,3,4,5,6,7,8,9}, the mode is 2 but is not a good measure of central tendency in this case.
29 appears the most often (3x) in our Sweetheart example.
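Counting how often each value occurs gives the mode. A small sketch using `collections.Counter` from the standard library; `mode_with_count` is a hypothetical helper name.

```python
from collections import Counter


def mode_with_count(values):
    """Return (most frequent value, number of times it occurs)."""
    return Counter(values).most_common(1)[0]


counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]
print(mode_with_count([1, 1, 2, 2, 2, 2, 3, 3]))
print(mode_with_count(counts))
```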
Measures of Variability
Measure the “spread” in the data
Example: age distribution in the Extension School vs. FAS college
Some important measures: variance, standard deviation, range, interquartile range
The larger the value of these measures, the larger the spread and variability.
Variance
The sample variance (s²) may be calculated from the data. It is (approximately) the average of the squared deviations of the observations from the mean: the sum of squared deviations divided by n-1.
The population variance is often denoted by σ². This is usually unknown.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$
Variance
The deviations are squared because we are only interested in the size of the deviation rather than the direction (larger or smaller than the mean).
Note: $\sum_{i=1}^{n}\left(X_i - \bar{X}\right) = 0$. Why?
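One way to see why the deviations from the sample mean always sum to zero is to expand the sum, using only the definition of the sample mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, which gives $\sum_{i=1}^{n} X_i = n\bar{X}$:

```latex
\sum_{i=1}^{n}\left(X_i - \bar{X}\right)
  = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X}
  = n\bar{X} - n\bar{X}
  = 0
```

Positive and negative deviations cancel exactly, which is why the unsquared deviations cannot measure spread.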
Variance
The reason that we divide by n-1 instead of n has to do with the number of “information units” in the variance. After estimating the sample mean, there are only n-1 observations that are a priori unknown (degrees of freedom).
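A small exhaustive example may make the n-1 divisor more concrete. The sketch below assumes a hypothetical toy population {1, 2, 3} (not from the lecture) and enumerates every equally likely sample of size 2 drawn with replacement. Averaged over all samples, dividing by n-1 recovers the population variance exactly, while dividing by n underestimates it.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical toy population, chosen for illustration
N = len(population)
mu = Fraction(sum(population), N)
# True population variance (divide by N because this is the whole population).
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - m) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))  # all 9 ordered samples

avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("population variance:     ", sigma2)
print("average s^2 using n - 1: ", avg_div_n_minus_1)
print("average s^2 using n:     ", avg_div_n)
```

Exact fractions are used so that the comparison does not depend on floating-point rounding.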
Variance
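For our Sweetheart data…
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{12-1}\sum_{i=1}^{12}\left(X_i - 30.83\right)^2$$
$$= \frac{1}{11}\left[(29-30.83)^2 + (31-30.83)^2 + \dots + (33-30.83)^2\right]$$
$$= \frac{(-1.8)^2 + (0.2)^2 + \dots + (2.2)^2}{11} = \frac{3.24 + 0.04 + \dots + 4.84}{11} = \frac{83.67}{11} = 7.61$$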
Standard Deviation
Square root of the variance
s = sqrt(s²) = sample SD: calculate from the data (see formula for s²)
σ = sqrt(σ²) = population SD: usually unknown
Expressed in the same units as the mean (instead of squared units like the variance)
In our Sweetheart example, $s_{\text{sweetheart}} = \sqrt{7.61} = 2.76$
Now we have summarized the sample with just 2 numbers!
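The variance and standard deviation of the Sweetheart sample can be reproduced step by step. This sketch re-enters the 12 counts and follows the n-1 formula; the comparison with `statistics.variance` is just a cross-check against the standard library.

```python
import math
import statistics

counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

n = len(counts)
mean = sum(counts) / n
deviations = [x - mean for x in counts]

sum_of_deviations = sum(deviations)        # zero, up to rounding error
sum_sq = sum(d ** 2 for d in deviations)   # sum of squared deviations
s2 = sum_sq / (n - 1)                      # sample variance
s = math.sqrt(s2)                          # sample standard deviation

print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
print(f"library variance = {statistics.variance(counts):.2f}")
```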
Range
Maximum - Minimum
Sweetheart example: 36 - 27 = 9
Very sensitive to extreme observations (outliers)
Interquartile Range
IQR = Q3 - Q1
Q1: the first quartile
Q3: the third quartile
More robust than the range to extreme observations
In our example: 27, 28, 29 | 29, 29, 30 | 31, 31, 32 | 33, 35, 36
Q1 = 29 and Q3 = 32.5, so IQR = 32.5 - 29 = 3.5 Sweethearts
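The range and IQR can be computed from the sorted counts. The sketch below takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, which matches the way the sample is split into quarters here. This is one assumed convention among several; other quartile definitions (including those in `statistics.quantiles`) can give slightly different values.

```python
import statistics

counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]
ordered = sorted(counts)

data_range = ordered[-1] - ordered[0]  # maximum minus minimum

# Split into lower and upper halves (for odd n the middle value is left out).
half = len(ordered) // 2
lower_half = ordered[:half]
upper_half = ordered[len(ordered) - half:]

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```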
Other Descriptive Measures
Minimum and maximum: very sensitive to extreme observations
Sample size (N) (e.g., 12 boxes)
Percentiles. Examples:
Median = 50th percentile
Q1, Q3 = 25th and 75th percentiles
Small Samples
For very small samples (e.g., <5 observations), summary statistics are not very meaningful (actually can be misleading).
Better to simply list the data.
Example – Firefighter CHD Study
Table 4: CHD Retirements versus Active Firefighters (Controls)

Characteristic                           CHD Retirements (n=277)   Active Firefighters (n=310)
                                         Mean (Median), % (n)      Mean (Median), % (n)
Age                                      54.2 (55.0)               39.3 (39.0)
Age ≥ 45 years old                       94% (261)                 21% (64)
Current Smoking                          30% (76)                  10% (31)
Hypertension                             59% (141)                 21% (65)
Cholesterol ≥ 5.18 mmol/L (200 mg/dl)    80% (169)                 63% (196)
Prior Diagnosis of CHD                   22% (48)                  1% (3)
BMI                                      30.3 (29.8)               28.9 (28.4)
Obesity, BMI ≥ 30                        41% (98)                  34% (104)
Example – A5095
Characteristic                           TZV (n=382)   Pooled EFV (n=765)
Male                                     81%           81%
Mean age, years                          38.0          38.0
Race or ethnic group
  Non-Hispanic White                     39%           41%
  Non-Hispanic Black                     37%           36%
  Hispanic                               21%           21%
  Other                                  2%            <1%
Mean baseline HIV RNA, log10 c/mL        4.85          4.86
100,000 c/mL at screening                43%           43%
Mean baseline CD4 count, cells/mm3       234           242
Random Variables
Variable: a characteristic that can be measured, categorized, quantified, or qualified.
Random variable: a variable whose value is determined by a random phenomenon (i.e., not determined by study design).
Continuous random variable: can take on any value within a specified interval or continuum.
Probability Distributions
Every random variable has a corresponding probability distribution
A probability distribution describes the behavior of the random variable. It identifies possible values of the random variable and provides information about the probability that these values (or ranges of values) will occur.
A particularly important probability distribution is the Normal Distribution…
Normal Distribution
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]$$
Normal Distribution
“Bell-shaped curve”
Symmetric about its mean (μ)
The closer that an observation is to the mean, the more frequently it occurs.
Notation: X ~ N(μ, σ)
Location & Shape
μ = LOCATION
σ = SHAPE
Note that distributions may have the same mean but differ in their spread (shape).
Normal Distribution
The normal distribution, N(μ, σ) can be described by the following “density function”:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
Normal Distribution
The area under this curve (function) is one.
Probabilities may be calculated as the area under the curve (above the x-axis).
Integration (calculus) can help quantify these areas (probabilities).
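The idea of probability as area can be illustrated numerically. The sketch below codes the normal density and approximates areas with the trapezoid rule; the function names, the integration limits of ±8 SDs, and the step count are illustrative assumptions rather than anything prescribed in the lecture.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma), where sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def area(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Trapezoid-rule approximation of the area under the density from a to b."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for k in range(1, steps):
        total += normal_pdf(a + k * h, mu, sigma)
    return total * h


total_area = area(-8, 8)     # essentially the whole curve
within_one_sd = area(-1, 1)  # P(-1 < Z < 1)

print(f"total area    = {total_area:.6f}")
print(f"within one SD = {within_one_sd:.4f}")
```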
Moving towards Step III…
[Diagram: Population (μ, σ, σ²) and Sample (x̄, s, s²), linked by Step I, Step II, and Step III; Step III is statistical inference (with probability)]
Standard Normal Distribution
A special normal distribution: N(0,1)
Values from this distribution represent the number of SDs away from the mean (0).
Known properties of this distribution: can make probabilistic statements using the standard normal table
Standard Normal Distribution
[Figure: normal curve with a Z axis (-4 to 4) and an X axis marked μ-2σ, μ-σ, μ, μ+σ, μ+2σ]
• For any variable X, with mean μ and SD = σ: Z = (X - μ)/σ
• Z now has mean 0 and SD = 1.
• This “standardization” creates a variable Z, such that values of this variable represent the number of SDs away from the mean (0).
Standard Normal Table
Standardization
Common mistake: “X has mean μ and SD = σ, then Z = (X - μ)/σ ~ N(0,1).” This is NOT true!!
It is true that Z has mean 0 and SD = 1 (standardization). However, Z is only normal if X was also normal.
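A quick numerical illustration of this point: standardizing a skewed data set gives mean 0 and SD 1, but the values stay just as lopsided as before. The sketch reuses the right-skewed example {1, 2, 3, 4, 20} from the median discussion and uses the sample SD for the standardization, which is an assumption made for illustration.

```python
import statistics

data = [1, 2, 3, 4, 20]  # right-skewed example data

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample SD (n - 1 divisor)

z = [(x - mean) / sd for x in data]

z_mean = statistics.mean(z)
z_sd = statistics.stdev(z)
z_median = statistics.median(z)

print("standardized values:", [round(v, 2) for v in z])
print(f"mean = {z_mean:.2f}, SD = {z_sd:.2f}, median = {z_median:.2f}")
```

The standardized median is still below the mean, and the largest value sits much farther from 0 than the smallest, so the shape has not become symmetric or bell-shaped.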
Standardization
However, if X~N(μ,σ), then Z=(X-μ)/σ ~N(0,1). We can then make probabilistic statements about X.
Thus we can make probabilistic statements about any variable with any normal distribution.
Example - IQ
IQ~N(100,15)
What’s the probability that a person chosen at random has an IQ>135?
Z = (IQ - 100)/15, so Z = (135 - 100)/15 = 2.33
[Figure: normal curve with a Z axis (-4 to 4) and an IQ axis marked 70, 85, 100, 115, 130]
Example - IQ
P(Z>2.33) = 0.010
Example – IQ
[Figure: normal curve with a Z axis (-4 to 4) and an IQ axis marked 70, 85, 100, 115, 130]
What’s the probability that a person chosen at random has an IQ<90?
Z = (90-100)/15 = -0.67
By symmetry, P(Z<-0.67) = P(Z>0.67)
Probabilities that a person chosen at random has an IQ between two values may also be obtained.
Example - IQ
P(Z>0.67) = 0.251
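The table lookups for the IQ example can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch assuming IQ ~ N(100, 15) as stated; because it does not round Z to two decimals first, the exact probabilities differ slightly from the table values.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # sigma is the standard deviation
std_normal = NormalDist()          # N(0, 1)

# P(IQ > 135): standardize, then take the upper-tail area.
z_high = (135 - 100) / 15
p_above_135 = 1 - iq.cdf(135)

# P(IQ < 90): lower-tail area; by symmetry P(Z < -z) = P(Z > z).
z_low = (90 - 100) / 15
p_below_90 = iq.cdf(90)

# Probability of an IQ between the two values.
p_between = iq.cdf(135) - iq.cdf(90)

print(f"z = {z_high:.2f}, P(IQ > 135) = {p_above_135:.3f}")
print(f"z = {z_low:.2f}, P(IQ < 90)  = {p_below_90:.3f}")
print(f"P(90 < IQ < 135) = {p_between:.3f}")
print(f"table-style values: {1 - std_normal.cdf(2.33):.3f}, {1 - std_normal.cdf(0.67):.3f}")
```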
Central Limit Theorem
A very important result in statistics that permits use of the normal distribution for making inferences (hypothesis testing and estimation) concerning the population mean.
$$\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Central Limit Theorem
[Diagram: from a population (any distribution) with mean μ and SD σ, Samples 1 to 5 are drawn, all of size n, giving sample means x̄1, x̄2, x̄3, x̄4, x̄5]
Sample means: $\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
Central Limit Theorem
If the distribution of each observation in the population has mean μ and standard deviation σ, regardless of whether the distribution is normal or not:
1. The distribution of the sample means (from samples of size n taken from the population) has mean μ, identical to that of the population.
2. The standard deviation of this distribution is $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. It decreases as n increases and increases as σ increases.
3. As n gets large, the shape of the distribution of the sample means is approximately that of a normal distribution.
Central Limit Theorem
Variable X, population mean=100, SD=15
Samples of size 25 (for example):
Sample 1, mean=90
Sample 2, mean=115
Sample 3, mean=101
Sample 4, mean=94
…
Sample 30, mean=99
Central Limit Theorem
Plot the sample means (histogram):
The sample means have mean 100.
The sample means have an SD of $15/\sqrt{25} = 15/5 = 3$.
The distribution of sample means would tend to be normal as n gets large.
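A short simulation can make this concrete. The sketch below assumes a deliberately non-normal population (a uniform distribution scaled to have mean 100 and SD 15, chosen only for illustration), draws many samples of size 25, and compares the mean and SD of the sample means with the values 100 and 3. The seed and the number of samples are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

mu, sigma, n = 100, 15, 25
# A uniform distribution on [mu - w, mu + w] has SD w / sqrt(3).
half_width = sigma * math.sqrt(3)


def draw_sample_mean():
    """Draw one sample of size n from the uniform population and return its mean."""
    sample = [rng.uniform(mu - half_width, mu + half_width) for _ in range(n)]
    return sum(sample) / n


sample_means = [draw_sample_mean() for _ in range(5000)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = sigma / math.sqrt(n)

print(f"mean of sample means = {mean_of_means:.2f}")
print(f"SD of sample means   = {sd_of_means:.2f} (theory: {theoretical_sd:.2f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are uniform.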
Central Limit Theorem
Now we can combine this normality result from the CLT with standardization to make probabilistic statements about the population mean!
Sampling Distribution of Sweethearts
Assume μ = 30.83; SD of the sample means = 2.71/√12 = 2.71/3.46 = 0.78
[Figure: Population Distribution, axis marked 28.1, 30.8, 33.5; Sampling Distribution of Means, axis marked 30.0, 30.8, 31.6]
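Sampling Distribution of Sweethearts
[Figure: Population Distribution, X axis marked 28.1, 30.8, 33.5 above a Z axis marked -1, 0, 1, with Z = (X - 30.83)/2.73; Sampling Distribution of Means, X̄ axis marked 30.0, 30.8, 31.6 above a Z axis marked -1, 0, 1, with Z = (X̄ - 30.83)/0.78]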