Statistics 601
6 Random Sampling and Data Description
Parameters are numerical characteristics of a population.
Statistics are numerical quantities calculated from the sample.
We will now use statistics to gain an understanding of the sample data. We will develop both
graphical and numerical methods of summarizing data.
Before we can use statistics to draw inferences about population parameters, we will need to learn
about statistical models which have probability theory as their basis.
Chapter 6: Random Sampling and Data Description Copyright © 2007 by Thomas E. Wehrly Slide 1
6.1 Pictorial and Tabular Methods in Descriptive Statistics
Consider the Following Data Set:
The concentration of suspended solids in river water is an important environmental characteristic.
The paper “Water Quality in Agricultural Watershed: Impact of Riparian Vegetation During Base
Flow” (Water Resources Bull., 1981, pp. 233-239) reported on concentrations (in parts per million,
or ppm) for several different rivers. Suppose the following 50 observations had been obtained for a
particular river.
55.8 60.9 37.0 91.3 65.8 42.3 33.8 60.6 76.0 69.0
45.9 39.1 35.5 56.0 44.6 71.7 61.2 61.5 47.2 74.5
83.2 40.0 31.7 36.7 62.3 47.3 94.6 56.3 30.0 68.2
75.3 71.4 65.2 52.6 58.2 48.0 61.8 78.8 39.8 65.0
60.7 77.1 59.1 49.5 69.3 69.8 64.9 27.1 87.1 66.3
Question: What do these data tell us about the concentration of suspended solids?
First few steps in analyzing a data set:
1. Organize and summarize the data.
2. Find the center of the data.
3. Examine the spread of the data.
6.2 Stem and Leaf Display
A compact and descriptive method of organizing data without losing any information in the data.
• Leading digits are stems.
• Trailing digits are leaves.
• Indicate units somewhere on the display.
• Option: Sort the leaves.
• Comparative stem & leaf.
• Repeat stems if need be.
Advantages:
• No loss of information.
• Easy to do for small data sets.
Disadvantages:
• Time consuming for large data sets (by hand)
• Cannot be used for categorical data.
• Very space consuming for large data sets.
Stem-and-leaf display of the solids data set with sorted leaves:
2 : 7
3 : 0245779
4 : 002567789
5 : 366689
6 : 111112255566899
7 : 01245679
8 : 37
9 : 15
units: ppm
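The display above can be reproduced with a short script. This is only a sketch: the slides do not say how the one-decimal values were reduced to two digits, so the rounding rule below (round to the nearest whole ppm, ties going down) is an assumption, chosen because it matches the leaves shown. The function name `stem_and_leaf` is made up for illustration.

```python
import math
from collections import defaultdict

solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]

def stem_and_leaf(data):
    """Map each tens-digit stem to a string of sorted units-digit leaves."""
    rows = defaultdict(list)
    for x in data:
        # Assumed rule: round to the nearest integer, with ties rounded down.
        whole = math.ceil(x - 0.5)
        stem, leaf = divmod(whole, 10)
        rows[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in sorted(leaves))
        for stem, leaves in sorted(rows.items())
    }

display = stem_and_leaf(solids)
for stem, leaves in display.items():
    print(f"{stem} : {leaves}")
print("units: ppm")
```

Note that rounding discards the tenths digit, so a display built this way keeps the shape of the data but not every digit of the original values.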
Stem-and-leaf display with two stems per tens place:
2*: 7
3 : 024
3*: 5779
4 : 002
4*: 567789
5 : 3
5*: 66689
6 : 1111122
6*: 55566899
7 : 0124
7*: 5679
8 : 3
8*: 7
9 : 1
9*: 5
units: ppm
Comparative stem-and-leaf display of the current solids data and the data collected two years earlier:
  Two Years Ago : Stem : Current
-------------------------------------
              8 :  1   :
           9851 :  2   : 7
        9887640 :  3   : 0245779
  9997765322111 :  4   : 002567789
      877554200 :  5   : 366689
     9887653221 :  6   : 111112255566899
          72210 :  7   : 01245679
             95 :  8   : 37
                :  9   : 15
units: ppm
Sometimes we redefine the leaves for “narrow” data sets whose values span only a few units:
58, 58, 57, 54, 54, 54, 57, 57, 56, 56, 57, 51, 58, 54, 52, . . . , 52, 54
60 : 0
59 : 00
58 : 00000000000
57 : 0000000000
56 : 0000000000
55 : 0000000000000
54 : 0000000000000
53 : 0000
52 : 000
51 : 0
6.3 Frequency Distributions for Quantitative Data
A very popular way to summarize data is with a frequency distribution. A frequency distribution is
a compact summary of a data set using a table with 3 or 4 columns:
Class interval (or category): disjoint intervals that together contain all observations in the data set
Frequency: number of observations in a class interval = f
Relative frequency: proportion of observations in the interval = f/n
Cumulative frequency: running sum of the relative frequencies up to and including the current class, $\sum_{i=1}^{\text{class}} f_i/n$.
Having too many intervals leads to a very jagged histogram.
Having too few intervals smooths away important features.
The number of classes is usually 5 to 20.
Use at least $(2n)^{1/3}$ classes as a rough guide.
We will form a frequency distribution for the solids data set:
55.8 60.9 37.0 91.3 65.8 42.3 33.8 60.6 76.0 69.0
45.9 39.1 35.5 56.0 44.6 71.7 61.2 61.5 47.2 74.5
83.2 40.0 31.7 36.7 62.3 47.3 94.6 56.3 30.0 68.2
75.3 71.4 65.2 52.6 58.2 48.0 61.8 78.8 39.8 65.0
60.7 77.1 59.1 49.5 69.3 69.8 64.9 27.1 87.1 66.3
50 observations. Approximate number of classes: $\sqrt{50} \approx 7.07$.
Class Interval   Frequency   Relative f   Cumulative f
20–29.9               1         .02           .02
30–39.9               8         .16           .18
40–49.9               8         .16           .34
50–59.9               6         .12           .46
60–69.9              16         .32           .78
70–79.9               7         .14           .92
80–89.9               2         .04           .96
90–99.9               2         .04          1.00
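The table can be rebuilt directly from the raw observations. The following is a minimal sketch, assuming classes of width 10 that start at 20 and include their lower limit; the variable names are illustrative only. It also evaluates the two rough guides for the number of classes mentioned in these notes.

```python
import math

solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]
n = len(solids)

# Rough guides for the number of classes.
print(round(math.sqrt(n), 2))        # square-root guide
print(round((2 * n) ** (1 / 3), 2))  # (2n)^(1/3) lower bound

width = 10
rows = []    # (lower limit, frequency, relative freq., cumulative relative freq.)
running = 0  # running count of observations seen so far
for lower in range(20, 100, width):
    f = sum(1 for x in solids if lower <= x < lower + width)
    running += f
    rows.append((lower, f, f / n, running / n))

for lower, f, rel, cum in rows:
    print(f"{lower}-{lower + width - 0.1:.1f}  {f:3d}  {rel:.2f}  {cum:.2f}")
```

Accumulating the integer counts and dividing once per row avoids the small rounding drift that comes from adding relative frequencies one at a time.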
6.4 Histogram
A histogram is a pictorial representation of a frequency distribution.
1. Draw an x-axis and mark class intervals.
2. Draw a rectangle whose area is proportional to the frequency of that interval.
[Figure: frequency histogram of the solids data. x-axis: solids, 20 to 100; y-axis: frequency, 0 to 15.]
A true histogram, drawn on a density scale, has total area equal to 1.0. In that case we set
$\text{Rectangle Height} = \dfrac{\text{Relative Frequency}}{\text{Base Length}}$
When all the intervals have equal length, the shape of the histogram is unchanged and we only need to relabel the y-axis accordingly.
[Figure: density-scale histogram of the solids data. x-axis: solids, 20 to 100; y-axis: density, 0.000 to 0.030.]
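As a quick check of the density scale, the rectangle heights can be computed from the class frequencies of the solids data and the total area verified. This is a sketch using the frequencies from the frequency distribution and the common class width of 10; the variable names are illustrative.

```python
# Class frequencies for 20-29.9, 30-39.9, ..., 90-99.9 (solids data).
frequencies = [1, 8, 8, 6, 16, 7, 2, 2]
width = 10                      # base length of every class
n = sum(frequencies)            # 50 observations

# Height = relative frequency / base length.
heights = [f / n / width for f in frequencies]

# Each rectangle has area height * width; the areas sum to 1.
total_area = sum(h * width for h in heights)

for f, h in zip(frequencies, heights):
    print(f"frequency {f:2d} -> height {h:.3f}")
print(f"total area = {total_area:.3f}")
```

The tallest bar (the 60s class) has height 0.032, which is why the density axis runs a little past 0.030.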
Histograms often exhibit particular shapes:
• unimodal
• bimodal
• multimodal
• symmetric
• positively skewed
• negatively skewed
6.5 Measures of Location
Another step in gaining understanding of our data is to find the “center” of our data. What is the
center?
6.6 Mean / Average
We calculate the sample mean or average as follows:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$x_i$: the $i$th observation in the sample.
$n$: sample size.
Example: Calculate the average concentration of solids.
$\bar{x} = \frac{1}{n} \sum_{i=1}^{50} x_i = \frac{1}{50} \times 2927 = 58.54$
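The same calculation in code, as a minimal sketch on the 50 solids observations:

```python
solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]

n = len(solids)       # sample size
total = sum(solids)   # sum of the observations
mean = total / n      # sample mean

print(n, round(total, 1), round(mean, 2))
```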
6.7 Median
Median: The middle observation of the sorted data set.
Sample Median = x̃
We calculate the median:
n odd: $\tilde{x} = x_{((n+1)/2)}$
n even: $\tilde{x} = \left(x_{(n/2)} + x_{((n+2)/2)}\right)/2$
Example: Calculate the median of the solid concentrations.
27.1 30.0 31.7 33.8 35.5 36.7 37.0 39.1 39.8 40.0
42.3 44.6 45.9 47.2 47.3 48.0 49.5 52.6 55.8 56.0
56.3 58.2 59.1 60.6 60.7 60.9 61.2 61.5 61.8 62.3
64.9 65.0 65.2 65.8 66.3 68.2 69.0 69.3 69.8 71.4
71.7 74.5 75.3 76.0 77.1 78.8 83.2 87.1 91.3 94.6
$n/2 = 25$, $\tilde{x} = (x_{(25)} + x_{(26)})/2 = (60.7 + 60.9)/2 = 60.8$
Discussion: How do outliers affect the mean and median?
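One way to explore the discussion question is to corrupt a single observation and recompute both statistics. The sketch below uses Python's standard `statistics.median`; the corrupted value (the largest observation 94.6 mistyped as 946.0) is a made-up example, not part of the data set.

```python
import statistics

solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]

mean = sum(solids) / len(solids)
median = statistics.median(solids)   # averages the 25th and 26th sorted values

# Hypothetical data-entry error: 94.6 recorded as 946.0.
corrupted = [946.0 if x == 94.6 else x for x in solids]
mean_bad = sum(corrupted) / len(corrupted)
median_bad = statistics.median(corrupted)

print(f"original : mean = {mean:.2f}, median = {median:.2f}")
print(f"corrupted: mean = {mean_bad:.2f}, median = {median_bad:.2f}")
```

The single bad value pulls the mean from 58.54 up to about 75.57, while the median stays at 60.8, because the median depends only on the middle of the ordered data.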
6.8 Other Measures of Location
6.8.1 Trimmed Mean
A trimmed mean is a compromise between x̄ and x̃ in that outliers will have some effect on the
trimmed mean but not as much as they have on the mean. It is calculated by eliminating a certain
percentage of the observations from both ends and calculating the average of the remaining data.
For example a 10% trimmed mean would eliminate 10% of the observations from each end of the
data (20% total) and average the remaining 80% of the observations.
Example: Calculate the 10% trimmed mean for the solid concentrations.
We have n = 50 observations. 10% of this is 50× .10 = 5. Therefore we eliminate 5
observations from each end for a total of 10 observations:
$\bar{x}_{\mathrm{tr}(10)} = \frac{1}{40} \sum_{i=6}^{45} x_{(i)} = \frac{1}{40} \times 2333.9 \approx 58.35$
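A trimmed mean is easy to sketch in code: sort, drop the same number of observations from each end, and average what is left. The function name `trimmed_mean` is illustrative, and rounding the trim count to the nearest integer is an assumption, since the notes only cover the case where the count is a whole number. With the 50 observations as listed in Section 6.1, the 40 retained values sum to 2333.9, giving about 58.35.

```python
solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]

def trimmed_mean(data, fraction):
    """Average after dropping `fraction` of the observations from each end."""
    xs = sorted(data)
    k = round(len(xs) * fraction)   # observations dropped per end (assumed rounding)
    kept = xs[k:len(xs) - k]
    if not kept:
        raise ValueError("trimming fraction removes every observation")
    return sum(kept) / len(kept)

tm10 = trimmed_mean(solids, 0.10)
print(round(tm10, 4))
```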
6.8.2 Percentiles and Quartiles
The 100p-th percentile is the value such that 100p% of the observations in our data set are less than or equal to it. The median is the 50th percentile.
The following is a general approach to calculate the 100pth percentile x[p]:
1. Let $x_{(i)}$, $i = 1, \dots, n$, refer to our data set in ascending order.
2. Let $i_p = np$.
3. Find the first index $i$ such that $i > i_p$.
4. The 100p-th percentile is then:
$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } i - 1 = i_p \\ x_{(i)} & \text{otherwise} \end{cases}$
In short: if $i_p$ is an integer, we average the $i_p$th and $(i_p + 1)$th observations. Otherwise we round $i_p$ up and take the $\lceil i_p \rceil$th observation.
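The rule above can be sketched as a small function. The name `percentile`, the restriction to 0 < p < 1, and the numerical tolerance used to decide whether np is an integer are all assumptions made for this illustration. The nine sample values are the ones used in a quartile example later in these notes.

```python
import math

def percentile(data, p):
    """100p-th percentile using the average-or-round-up rule."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    ip = n * p
    nearest = round(ip)
    # Treat ip as an integer if it is within floating-point noise of one.
    if 1 <= nearest < n and math.isclose(ip, nearest, abs_tol=1e-9):
        # Average the ip-th and (ip + 1)-th ordered observations (1-based).
        return (xs[nearest - 1] + xs[nearest]) / 2
    # Otherwise round ip up and take that ordered observation.
    return xs[math.ceil(ip) - 1]

nine = [2, 4, 9, 17, 22, 43, 65, 88, 103]
print(percentile(nine, 0.25), percentile(nine, 0.50), percentile(nine, 0.75))
print(percentile([20, 30, 40, 50, 60, 70], 0.50))
```

Other textbooks and software packages use different interpolation rules, so percentiles from a statistics library may not match this rule exactly.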
Q1 : First Quartile = 25th percentile
Q2 : Second Quartile = 50th percentile
Q3 : Third Quartile = 75th percentile
IQR = Q3 −Q1 = “Interquartile Range”
We can calculate quartiles by using our rules for finding the median. We consider two cases:
• n even:
– To obtain Q1, find the median of $x_{(1)}, \dots, x_{(n/2)}$.
– To obtain Q3, find the median of $x_{(n/2+1)}, \dots, x_{(n)}$.
• n odd:
– To obtain Q1, find the median of $x_{(1)}, \dots, x_{((n+1)/2)}$.
– To obtain Q3, find the median of $x_{((n+1)/2)}, \dots, x_{(n)}$.
Example: Calculate Q1 and Q3 for the solid concentrations.
27.1 30.0 31.7 33.8 35.5 36.7 37.0 39.1 39.8 40.0
42.3 44.6 45.9 47.2 47.3 48.0 49.5 52.6 55.8 56.0
56.3 58.2 59.1 60.6 60.7 60.9 61.2 61.5 61.8 62.3
64.9 65.0 65.2 65.8 66.3 68.2 69.0 69.3 69.8 71.4
71.7 74.5 75.3 76.0 77.1 78.8 83.2 87.1 91.3 94.6
Q1 = median of $\{x_{(1)}, \dots, x_{(25)}\}$ = $x_{(13)}$ = 45.9
Q3 = median of $\{x_{(26)}, \dots, x_{(50)}\}$ = $x_{(38)}$ = 69.3
Example: Calculate Q1 and Q3 for the values {2, 4, 9, 17, 22, 43, 65, 88, 103}.
n = 9
Q1 = median of $\{x_{(1)}, \dots, x_{(5)}\}$ = $x_{(3)}$ = 9
Q3 = median of $\{x_{(5)}, \dots, x_{(9)}\}$ = $x_{(7)}$ = 65
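The median-of-halves rule for quartiles can be sketched as follows, using `statistics.median` from the Python standard library. The function name `quartiles` is illustrative. For odd n the middle observation is included in both halves, as in the rule stated earlier.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 0:
        lower, upper = xs[: n // 2], xs[n // 2 :]
    else:
        mid = (n + 1) // 2              # 1-based position of the median
        lower, upper = xs[:mid], xs[mid - 1 :]
    return statistics.median(lower), statistics.median(upper)

solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]
nine = [2, 4, 9, 17, 22, 43, 65, 88, 103]

q1, q3 = quartiles(solids)
print(q1, q3, round(q3 - q1, 1))   # quartiles and IQR for the solids data
print(quartiles(nine))
```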
6.8.3 Boxplots
Box plots are useful in summarizing various aspects of the data. Side-by-side box plots provide
useful comparisons of two or more sets of data.
1. Form an axis that includes all possible values of the data.
2. Draw a box extending from Q1 to Q3.
3. Draw a vertical bar at the median.
4. Draw whiskers (horizontal lines) to the most extreme observation within 1.5 IQR from
each end of the box.
5. Indicate mild outliers with a “◦”
6. Indicate extreme outliers with a “∗”
Example: Calculate the summary statistics x̄, x̃, Q1, Q3 for the water quality data set. Then
construct a box plot.
x̄ = 58.54
Min = 27.10
Q1 = 45.9
x̃ = 60.80
Q3 = 69.3
Max = 94.60
IQR = 69.3− 45.9 = 23.4
[Figure: box plot of the solids data, titled “Particulate Matter”. Axis: solids (ppm), with ticks at 40, 60, and 80.]
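The numbers needed to draw this box plot can be computed as in the sketch below. The slides do not define mild and extreme outliers; the cutoffs used here (more than 1.5 IQR from the box for any outlier, more than 3 IQR for an extreme one) are a common textbook convention and are an assumption. Q1 and Q3 are taken from the example rather than recomputed.

```python
solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]

q1, q3 = 45.9, 69.3                 # quartiles from the example
iqr = q3 - q1

# Assumed convention: inner fences at 1.5 IQR, outer fences at 3 IQR.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# Whiskers reach the most extreme observations inside the inner fences.
whisker_low = min(x for x in solids if x >= inner_low)
whisker_high = max(x for x in solids if x <= inner_high)

mild = [x for x in solids
        if outer_low <= x < inner_low or inner_high < x <= outer_high]
extreme = [x for x in solids if x < outer_low or x > outer_high]

print(f"IQR = {iqr:.1f}, inner fences = ({inner_low:.1f}, {inner_high:.1f})")
print(f"whiskers = ({whisker_low}, {whisker_high})")
print(f"mild outliers = {mild}, extreme outliers = {extreme}")
```

For the solids data every observation lies inside the inner fences (10.8 to 104.4), so the whiskers run to the minimum and maximum and no outliers are marked.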
6.9 Measures of Variability
The mean, median, etc. do not give us a complete overview (summary) of our data.
For Example: Consider the following three data sets:
Data set                 Range   IQR   Variance   Std. Dev.
1: 20 30 40 50 60 70       50     30      350       18.71
2: 20 43 44 46 47 70       50      4      252       15.87
3: 40 43 44 46 47 50       10      4       12        3.46
– The mean and median are 45 for all three data sets.
– These data sets have very different spreads.
Ways to measure spread:
Range: range = maximum observation – minimum observation
Interquartile Range: IQR = Q3 −Q1
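The spread measures for the three small data sets can be checked with a few lines of code. This sketch uses `statistics.median`, `statistics.variance`, and `statistics.stdev` from the Python standard library (the latter two divide by n − 1), and computes the IQR with the median-of-halves rule for even n; the dictionary layout is illustrative.

```python
import statistics

datasets = {
    1: [20, 30, 40, 50, 60, 70],
    2: [20, 43, 44, 46, 47, 70],
    3: [40, 43, 44, 46, 47, 50],
}

summary = {}
for label, values in datasets.items():
    xs = sorted(values)
    half = len(xs) // 2   # n is even here, so the halves do not overlap
    iqr = statistics.median(xs[half:]) - statistics.median(xs[:half])
    summary[label] = (
        max(xs) - min(xs),          # range
        iqr,                        # interquartile range
        statistics.variance(xs),    # sample variance
        statistics.stdev(xs),       # sample standard deviation
    )

for label, (rng, iqr, var, sd) in summary.items():
    print(f"{label}: range={rng}, IQR={iqr}, variance={var}, sd={sd:.2f}")
```

Data sets 1 and 2 share the same range but differ sharply in IQR, while data sets 2 and 3 share the same IQR but differ in range and variance, which is why no single measure of spread tells the whole story.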
Average Deviation from the Mean: We define the $i$th deviation to be $x_i - \bar{x}$. Intuitive idea: average the deviations, $\frac{1}{n} \sum (x_i - \bar{x})$.
Problem: this does not give us anything useful!
$\frac{1}{n} \sum (x_i - \bar{x}) = \frac{1}{n} \sum x_i - \frac{1}{n} \sum \bar{x} = \frac{1}{n} \sum x_i - \frac{1}{n} n \bar{x} = \bar{x} - \bar{x} = 0$
The result is always equal to zero!
Variance: We sum the squared deviations from the mean and divide by $n - 1$ instead of $n$ to get a measure of spread called the sample variance:
$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$
Calculation formula:
$s^2 = \frac{1}{n-1} \left( \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \right)$
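The calculation formula follows from expanding the square and using $\sum x_i = n\bar{x}$. A short derivation, with all sums running over $i = 1, \dots, n$:

```latex
\begin{align*}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x} \sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
\end{align*}
```

Dividing both sides by $n - 1$ gives the calculation formula for $s^2$.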
Standard Deviation: The units of the variance are units of the data squared. To make
the units the same as that of the data set we take the square root of the variance. This
is called the sample standard deviation:
$s = \sqrt{s^2}$
Note:
s is translation invariant:
s(x1, ..., xn) = s(x1 + a, ..., xn + a) for all a.
s is scale equivariant:
s(ax1, ..., axn) = |a|s(x1, ..., xn) for all a.
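Both properties are easy to check numerically. The sketch below uses `statistics.stdev` on one of the small data sets from Section 6.9; the shift 7.5 and the scale factor −3 are arbitrary values chosen for illustration.

```python
import math
import statistics

x = [20, 43, 44, 46, 47, 70]
s = statistics.stdev(x)

shift = 7.5    # arbitrary constant added to every observation
scale = -3     # arbitrary (negative) multiplier

s_shifted = statistics.stdev([xi + shift for xi in x])
s_scaled = statistics.stdev([scale * xi for xi in x])

print(f"s = {s:.4f}")
print(f"after adding {shift}: {s_shifted:.4f}")                        # unchanged
print(f"after multiplying by {scale}: {s_scaled:.4f}")                  # |a| times s
```

The negative multiplier shows why the absolute value is needed: flipping the sign of the data reverses its order but cannot make the spread negative.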
Example: Calculate the range, variance and standard deviation of the particulate solid data.
Range = Maximum − Minimum = 94.6 − 27.1 = 67.5
$s^2 = 270.8469$
$s = \sqrt{270.8469} \approx 16.457$
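These values can be verified from the raw data. The sketch below computes the variance both from the definition and from the calculation formula, and also confirms that the deviations from the mean sum to zero (up to floating-point rounding). Variable names are illustrative.

```python
import math

solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]
n = len(solids)
mean = sum(solids) / n

# The deviations from the mean cancel out.
deviation_sum = sum(x - mean for x in solids)

# Sample variance from the definition ...
s2_def = sum((x - mean) ** 2 for x in solids) / (n - 1)
# ... and from the calculation formula.
s2_calc = (sum(x * x for x in solids) - sum(solids) ** 2 / n) / (n - 1)

s = math.sqrt(s2_def)
data_range = max(solids) - min(solids)

print(f"sum of deviations = {deviation_sum:.1e}")
print(f"range = {data_range:.1f}")
print(f"s^2 = {s2_def:.4f} (definition), {s2_calc:.4f} (calculation formula)")
print(f"s = {s:.4f}")
```

The calculation formula subtracts two large, nearly equal numbers, so in floating-point arithmetic the definitional form is usually the safer choice when the mean is large relative to the spread.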