
Statistics 601

6 Random Sampling and Data Description

Parameters are numerical characteristics of a population.

Statistics are numerical quantities calculated from the sample.

We will now use statistics to gain an understanding of the sample data. We will develop both graphical and numerical methods of summarizing data.

Before we can use statistics to draw inferences about population parameters, we will need to learn about statistical models, which have probability theory as their basis.

Chapter 6: Random Sampling and Data Description. Copyright © 2007 by Thomas E. Wehrly.


6.1 Pictorial and Tabular Methods in Descriptive Statistics

Consider the Following Data Set:

The concentration of suspended solids in river water is an important environmental characteristic. The paper “Water Quality in Agricultural Watershed: Impact of Riparian Vegetation During Base Flow” (Water Resources Bull., 1981, pp. 233-239) reported on concentrations (in parts per million, or ppm) for several different rivers. Suppose the following 50 observations had been obtained for a particular river.

55.8 60.9 37.0 91.3 65.8 42.3 33.8 60.6 76.0 69.0
45.9 39.1 35.5 56.0 44.6 71.7 61.2 61.5 47.2 74.5
83.2 40.0 31.7 36.7 62.3 47.3 94.6 56.3 30.0 68.2
75.3 71.4 65.2 52.6 58.2 48.0 61.8 78.8 39.8 65.0
60.7 77.1 59.1 49.5 69.3 69.8 64.9 27.1 87.1 66.3

Question: What do these data tell us about the concentration of suspended solids?

First few steps in analyzing a data set:

1. Organize and summarize the data.

2. Find the center of the data.

3. Examine the spread of the data.



6.2 Stem and Leaf Display

A compact and descriptive method of organizing data without losing any information in the data.

• Leading digits are stems.

• Trailing digits are leaves.

• Indicate units somewhere on the display.

• Option: Sort the leaves.

• Comparative stem & leaf.

• Repeat stems if need be.

Advantages:

• No loss of information.

• Easy to do for small data sets.

Disadvantages:

• Time consuming for large data sets (by hand)

• Cannot be used for categorical data.

• Very space consuming for large data sets.



Stem-and-leaf display of the solids data set with sorted leaves:

2 : 7

3 : 0245779

4 : 002567789

5 : 366689

6 : 111112255566899

7 : 01245679

8 : 37

9 : 15 units: ppm

Stem-and-leaf display with two stems per tens place:

2*: 7

3 : 024

3*: 5779

4 : 002

4*: 567789

5 : 3

5*: 66689

6 : 1111122

6*: 55566899

7 : 0124

7*: 5679

8 : 3

8*: 7

9 : 1

9*: 5 units: ppm
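The displays above can be produced mechanically. The following is a minimal Python sketch, not the method used to build the slides: the function name `stem_and_leaf` and the `split` flag are hypothetical, and the sketch assumes the values are rounded to whole units before being split into stem and leaf. Python's `round` resolves exact halves to the nearest even integer, so for data with values such as 35.5 a few leaves may differ from a display built with another rounding rule.

```python
from collections import defaultdict


def stem_and_leaf(values, split=False):
    """Return the lines of a stem-and-leaf display for non-negative data.

    Assumption of this sketch: values are rounded to whole units first, so the
    tens part is the stem and the ones digit is the leaf. With split=True each
    stem appears twice: leaves 0-4 on the plain stem, leaves 5-9 on the
    starred stem. Stems with no leaves are omitted.
    """
    groups = defaultdict(list)
    for v in sorted(int(round(x)) for x in values):
        stem, leaf = divmod(v, 10)
        starred = split and leaf >= 5
        groups[(stem, starred)].append(str(leaf))
    lines = []
    for stem, starred in sorted(groups):
        mark = "*" if starred else " "
        lines.append(f"{stem}{mark}: {''.join(groups[(stem, starred)])}")
    return lines


# Small made-up example (whole numbers, so no rounding takes place).
for line in stem_and_leaf([27, 30, 32, 34, 45, 46, 51], split=True):
    print(line)
```

Sorting the rounded values once up front is what yields sorted leaves on every stem.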



Comparative stem-and-leaf display of the current solids data set and the data set taken two years earlier:

Two Years Ago       Current
-------------------------------------
            8 : 1 :
         9851 : 2 : 7
      9887640 : 3 : 0245779
9997765322111 : 4 : 002567789
    877554200 : 5 : 366689
   9887653221 : 6 : 111112255566899
        72210 : 7 : 01245679
           95 : 8 : 37
              : 9 : 15
units: ppm

Sometimes we redefine the leaves for low-numbered or “narrow” data sets:

58, 58, 57, 54, 54, 54, 57, 57, 56, 56, 57, 51, 58, 54, 52, . . . , 52, 54

60 : 0

59 : 00

58 : 00000000000

57 : 0000000000

56 : 0000000000

55 : 0000000000000

54 : 0000000000000

53 : 0000

52 : 000

51 : 0



6.3 Frequency Distributions for Quantitative Data

A very popular way to summarize data is with a frequency distribution. A frequency distribution is a compact summary of a data set using a table with 3 or 4 columns:

Class interval (or category): disjoint intervals containing all observations in the data set

Frequency: number of observations in a class interval, $f$

Relative frequency: proportion of observations in the interval, $f/n$

Cumulative frequency: sum of the relative frequencies up to and including the current class, $\sum_{i=1}^{\text{class}} f_i/n$

Having too many intervals leads to a very jagged histogram.

Having too few intervals smooths away important features.

The number of classes is usually 5 to 20.

As a rough guide, use at least $(2n)^{1/3}$ classes.



We will form a frequency distribution for the solids data set:

55.8 60.9 37.0 91.3 65.8 42.3 33.8 60.6 76.0 69.0
45.9 39.1 35.5 56.0 44.6 71.7 61.2 61.5 47.2 74.5
83.2 40.0 31.7 36.7 62.3 47.3 94.6 56.3 30.0 68.2
75.3 71.4 65.2 52.6 58.2 48.0 61.8 78.8 39.8 65.0
60.7 77.1 59.1 49.5 69.3 69.8 64.9 27.1 87.1 66.3

50 observations. Approximate number of classes: $\sqrt{50} \approx 7.07$.

Class Interval   [Tally]   Frequency   Relative f   Cumulative f
20–29.9                    1           .02          .02
30–39.9                    8           .16          .18
40–49.9                    8           .16          .34
50–59.9                    6           .12          .46
60–69.9                    16          .32          .78
70–79.9                    7           .14          .92
80–89.9                    2           .04          .96
90–99.9                    2           .04          1.0
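The table can be rebuilt from the raw observations. The sketch below is one way to do so in Python; the function name `frequency_table` and its parameters are hypothetical, and it assumes equal-width classes that are closed on the left and open on the right (so the class written 20–29.9 is treated as 20 ≤ x < 30).

```python
SOLIDS = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]


def frequency_table(values, start, width, n_classes):
    """Return rows (lower, upper, f, relative f, cumulative relative f).

    Class k covers start + k*width <= x < start + (k + 1)*width.
    """
    n = len(values)
    counts = [0] * n_classes
    for x in values:
        k = int((x - start) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    rows, cumulative = [], 0
    for k, f in enumerate(counts):
        cumulative += f
        lower = start + k * width
        rows.append((lower, lower + width, f, f / n, cumulative / n))
    return rows


for lower, upper, f, rel, cum in frequency_table(SOLIDS, start=20, width=10, n_classes=8):
    print(f"[{lower:>3}, {upper:>3})  f={f:>2}  rel={rel:.2f}  cum={cum:.2f}")
```

Accumulating the counts and dividing once at the end keeps the last cumulative value exactly 1.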



6.4 Histogram

A histogram is a pictorial representation of a frequency distribution.

1. Draw an x-axis and mark class intervals.

2. Draw a rectangle whose area is proportional to the frequency of that interval.

[Figure: frequency histogram of the solids data. x-axis: solids (20 to 100); y-axis ticks: 0 to 15.]



A true histogram, drawn on a density scale, has total area equal to 1.0. In that case the height of each rectangle is:

$\text{Rectangle Height} = \dfrac{\text{Relative Frequency}}{\text{Base Length}}$

When all the intervals are of equal length, the shape of the histogram does not change; all we need to do is add the appropriately labeled y-axis.

[Figure: density-scale histogram of the solids data. x-axis: solids (20 to 100); y-axis ticks: 0.000 to 0.030.]
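To illustrate the density scale, the following Python sketch computes the rectangle heights as relative frequency divided by base length and checks that the total area is 1. The helper name `density_heights` is hypothetical; the counts are the class frequencies of the solids data with classes of width 10 starting at 20.

```python
def density_heights(counts, edges):
    """Height of each histogram bar on the density scale.

    counts[k] is the frequency of the class from edges[k] to edges[k + 1].
    """
    n = sum(counts)
    return [(f / n) / (hi - lo) for f, lo, hi in zip(counts, edges[:-1], edges[1:])]


counts = [1, 8, 8, 6, 16, 7, 2, 2]             # class frequencies, solids data
edges = [20, 30, 40, 50, 60, 70, 80, 90, 100]  # class boundaries, width 10
heights = density_heights(counts, edges)
area = sum(h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:]))
print(heights)
print(area)  # total area, equal to 1 up to floating-point rounding
```

Because each height is divided by its own base length, the same function also handles classes of unequal width.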



Histograms often exhibit particular shapes:

• unimodal

• bimodal

• multimodal

• symmetric

• positively skewed

• negatively skewed



6.5 Measures of Location

Another step in gaining understanding of our data is to find the “center” of our data. What is the center?

6.6 Mean / Average

We calculate the sample mean or average as follows:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

$x_i$: The $i$th observation in the sample.

$n$: Sample size.

Example: Calculate the average concentration of solids.

$\bar{x} = \frac{1}{n} \sum_{i=1}^{50} x_i = \frac{1}{50} \times 2927 = 58.54$



6.7 Median

Median: The middle observation of the sorted data set.

Sample Median = x̃

We calculate the median:

$n$ odd: $\tilde{x} = x_{((n+1)/2)}$

$n$ even: $\tilde{x} = \left( x_{(n/2)} + x_{((n+2)/2)} \right) / 2$

Example: Calculate the median of the solid concentrations.

27.1 30.0 31.7 33.8 35.5 36.7 37.0 39.1 39.8 40.0
42.3 44.6 45.9 47.2 47.3 48.0 49.5 52.6 55.8 56.0
56.3 58.2 59.1 60.6 60.7 60.9 61.2 61.5 61.8 62.3
64.9 65.0 65.2 65.8 66.3 68.2 69.0 69.3 69.8 71.4
71.7 74.5 75.3 76.0 77.1 78.8 83.2 87.1 91.3 94.6

$n/2 = 25$, so $\tilde{x} = (x_{(25)} + x_{(26)})/2 = (60.7 + 60.9)/2 = 60.8$

Discussion: How do outliers affect the mean and median?
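As a starting point for this discussion, the Python sketch below implements the mean and the odd/even median rule and compares them on a small made-up data set before and after one value is replaced by a gross outlier. The data values and the function names are illustrative assumptions, not part of the course material.

```python
def mean(xs):
    """Sample mean: sum of the observations divided by the sample size."""
    return sum(xs) / len(xs)


def median(xs):
    """Sample median: middle value (n odd) or average of the two middle values (n even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # x_((n+1)/2) in 1-based notation
    return (s[mid - 1] + s[mid]) / 2   # average of x_(n/2) and x_((n+2)/2)


data = [3, 5, 7, 8, 12]            # made-up values
with_outlier = [3, 5, 7, 8, 120]   # largest value replaced by a hypothetical recording error

print(mean(data), median(data))                  # 7.0 7
print(mean(with_outlier), median(with_outlier))  # 28.6 7
```

The single outlier moves the mean from 7.0 to 28.6 while the median stays at 7.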



6.8 Other Measures of Location

6.8.1 Trimmed Mean

A trimmed mean is a compromise between x̄ and x̃ in that outliers will have some effect on the trimmed mean but not as much as they have on the mean. It is calculated by eliminating a certain percentage of the observations from both ends and calculating the average of the remaining data. For example, a 10% trimmed mean would eliminate 10% of the observations from each end of the data (20% total) and average the remaining 80% of the observations.

Example: Calculate the 10% trimmed mean for the solid concentrations.

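We have n = 50 observations, and 10% of this is 50 × 0.10 = 5. Therefore we eliminate 5 observations from each end, for a total of 10 observations, and average the remaining 40 sorted values $x_{(6)}, \ldots, x_{(45)}$:

$\bar{x}_{tr(10)} = \frac{1}{40} \sum_{i=6}^{45} x_{(i)} = \frac{1}{40} \times 2333.9 \approx 58.35$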
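The trimming procedure can be sketched in Python as follows. The function name `trimmed_mean` is hypothetical, and the sketch assumes that when n times the trimming proportion is not a whole number, the count dropped from each end is rounded down; other conventions exist. The demonstration uses a small made-up data set.

```python
def trimmed_mean(xs, proportion):
    """Mean after dropping the given proportion of observations from EACH end.

    Assumption: the number dropped per end is n * proportion rounded down.
    """
    s = sorted(xs)
    n = len(s)
    k = int(n * proportion + 1e-9)   # small guard against floating-point error
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = s[k:n - k]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up values with one outlier
print(sum(data) / len(data))              # ordinary mean: 14.5
print(trimmed_mean(data, 0.10))           # drops 1 and 100: 5.5
```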



6.8.2 Percentiles and Quartiles

The 100pth percentile is the value in our data set such that 100p% of the observations are less than or equal to it. The median is the 50th percentile.

The following is a general approach to calculate the 100pth percentile $x_{[p]}$:

1. Let $x_{(i)}$, $i = 1, \ldots, n$, refer to our data set in ascending order.

2. Let $i_p = np$.

3. Find the first index $i$ such that $i > i_p$.

4. The 100pth percentile is then:

$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } i - 1 = i_p \\ x_{(i)} & \text{otherwise} \end{cases}$

In short: if $i_p$ is an integer, we average the $i_p$th and $(i_p + 1)$th observations. Otherwise we round $i_p$ up and take the $\lceil i_p \rceil$th observation.
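The rule above translates almost line by line into code. The following Python sketch is one possible rendering; the function name `percentile` is hypothetical, the sketch is restricted to 0 < p < 1, and a small tolerance is an added assumption used to decide whether np is an integer in floating-point arithmetic.

```python
import math


def percentile(xs, p):
    """100p-th percentile by the averaging / round-up rule, for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    s = sorted(xs)
    ip = len(s) * p
    if abs(ip - round(ip)) < 1e-9:      # ip is (numerically) an integer
        i = int(round(ip))
        return (s[i - 1] + s[i]) / 2    # average of x_(ip) and x_(ip + 1), 1-based
    return s[math.ceil(ip) - 1]         # x_(ceil(ip)), 1-based


values = [2, 4, 9, 17, 22, 43, 65, 88, 103]
print(percentile(values, 0.25))        # ip = 2.25, so take x_(3) = 9
print(percentile(values, 0.50))        # ip = 4.5,  so take x_(5) = 22
print(percentile([1, 2, 3, 4], 0.50))  # ip = 2, so average x_(2), x_(3): 2.5
```

Library routines often interpolate between order statistics instead, so their results can differ from this rule.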



Q1 : First Quartile = 25th percentile

Q2 : Second Quartile = 50th percentile

Q3 : Third Quartile = 75th percentile

IQR = Q3 − Q1 = “Interquartile Range”

We can calculate quartiles by using our rules for finding the median. We consider two cases:

• n even:

– To obtain Q1, find the median of $x_{(1)}, \ldots, x_{(n/2)}$.

– To obtain Q3, find the median of $x_{((n/2)+1)}, \ldots, x_{(n)}$.

• n odd:

– To obtain Q1, find the median of $x_{(1)}, \ldots, x_{((n+1)/2)}$.

– To obtain Q3, find the median of $x_{((n+1)/2)}, \ldots, x_{(n)}$.



Example: Calculate Q1 and Q3 for the solid concentrations.

27.1 30.0 31.7 33.8 35.5 36.7 37.0 39.1 39.8 40.0
42.3 44.6 45.9 47.2 47.3 48.0 49.5 52.6 55.8 56.0
56.3 58.2 59.1 60.6 60.7 60.9 61.2 61.5 61.8 62.3
64.9 65.0 65.2 65.8 66.3 68.2 69.0 69.3 69.8 71.4
71.7 74.5 75.3 76.0 77.1 78.8 83.2 87.1 91.3 94.6

Q1 = median of $\{x_{(1)}, \ldots, x_{(25)}\} = x_{(13)} = 45.9$

Q3 = median of $\{x_{(26)}, \ldots, x_{(50)}\} = x_{(38)} = 69.3$

Example: Calculate Q1 and Q3 for the values {2, 4, 9, 17, 22, 43, 65, 88, 103}.

n = 9

Q1 = median of $\{x_{(1)}, \ldots, x_{(5)}\} = x_{(3)} = 9$

Q3 = median of $\{x_{(5)}, \ldots, x_{(9)}\} = x_{(7)} = 65$
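The median-of-halves rule used in these examples can be sketched in Python as below. The function names are hypothetical; for odd n the middle observation is included in both halves, as in the nine-value example.

```python
def median(sorted_xs):
    """Median of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2


def quartiles(xs):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 0:
        lower, upper = s[:n // 2], s[n // 2:]
    else:
        # the middle observation belongs to both halves
        lower, upper = s[:(n + 1) // 2], s[(n - 1) // 2:]
    return median(lower), median(upper)


print(quartiles([2, 4, 9, 17, 22, 43, 65, 88, 103]))  # (9, 65)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))            # (2.5, 6.5)
```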



6.8.3 Boxplots

Box plots are useful in summarizing various aspects of the data. Side-by-side box plots provide useful comparisons of two or more sets of data.

1. Form an axis that includes all possible values of the data.

2. Draw a box extending from Q1 to Q3.

3. Draw a vertical bar at the median.

4. Draw whiskers (horizontal lines) to the most extreme observation within 1.5 IQR from each end of the box.

5. Indicate mild outliers with a “◦”

6. Indicate extreme outliers with a “∗”
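The quantities needed for steps 4 to 6 can be computed before any drawing is done. The Python sketch below is illustrative only: the function name `boxplot_summary` is hypothetical, and it assumes a common convention that the steps above do not spell out, namely that mild outliers lie more than 1.5 IQR but at most 3 IQR beyond the box, and extreme outliers lie more than 3 IQR beyond it. The quartiles are passed in so that any quartile rule can be used.

```python
def boxplot_summary(xs, q1, q3):
    """Whisker ends and outliers for a box plot, given the quartiles.

    Assumed convention: mild outliers are 1.5 to 3 IQR beyond the box,
    extreme outliers are more than 3 IQR beyond it.
    """
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    mild = [x for x in xs
            if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi]
    extreme = [x for x in xs if x < outer_lo or x > outer_hi]
    return {
        "whiskers": (min(inside), max(inside)),
        "mild": mild,
        "extreme": extreme,
    }


# Made-up data; Q1 = 11 and Q3 = 15 follow from the median-of-halves rule.
data = [2, 10, 11, 12, 13, 14, 15, 22, 40]
print(boxplot_summary(data, q1=11, q3=15))
# {'whiskers': (10, 15), 'mild': [2, 22], 'extreme': [40]}
```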



Example: Calculate the summary statistics x̄, x̃, Q1, Q3 for the water quality data set. Then construct a box plot.

x̄ = 58.54

Min = 27.10

Q1 = 45.9

x̃ = 60.80

Q3 = 69.3

Max = 94.60

IQR = 69.3− 45.9 = 23.4

[Figure: box plot titled “Particulate Matter”. Axis: solids (ppm), with ticks at 40, 60, 80.]



6.9 Measures of Variability

The mean, median, etc. do not give us a complete overview (summary) of our data.

For Example: Consider the following three data sets:

Data                      Measures of Spread
                          Range   IQR   Variance   Std. Dev.
1: 20 30 40 50 60 70      50      30    350        18.71
2: 20 43 44 46 47 70      50      4     252        15.87
3: 40 43 44 46 47 50      10      4     12         3.46

– The mean and median are 45 for all three data sets.

– These data sets have very different spreads.

Ways to measure spread:

Range: range = maximum observation – minimum observation

Interquartile Range: IQR = Q3 − Q1
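As a quick check of these two measures, the Python sketch below computes the range and the IQR for the three small data sets. The function names are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data.

```python
def median(sorted_xs):
    """Median of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2


def data_range(xs):
    """Maximum observation minus minimum observation."""
    return max(xs) - min(xs)


def iqr(xs):
    """Q3 - Q1, with the quartiles taken as medians of the two halves."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half] if n % 2 == 0 else s[:half + 1]  # middle value shared when n is odd
    upper = s[half:]
    return median(upper) - median(lower)


data_sets = [
    [20, 30, 40, 50, 60, 70],
    [20, 43, 44, 46, 47, 70],
    [40, 43, 44, 46, 47, 50],
]
for xs in data_sets:
    print(data_range(xs), iqr(xs))   # 50 30, then 50 4, then 10 4
```

Sets 1 and 2 share the same range although their middle values are spread very differently, which the IQR picks up.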



Average Deviation from the Mean: We define the $i$th deviation to be $x_i - \bar{x}$. Intuitive: we average the deviations, $\frac{1}{n} \sum (x_i - \bar{x})$.

Problem: this does not give us anything useful!

$\frac{1}{n} \sum (x_i - \bar{x}) = \frac{1}{n} \sum x_i - \frac{1}{n} \sum \bar{x} = \frac{1}{n} \sum x_i - \frac{1}{n} n\bar{x} = \bar{x} - \bar{x} = 0$

The result is always equal to zero!

Variance: We average the squared deviations from the mean, dividing by $n - 1$ instead of $n$, to get a measure of spread called the sample variance:

$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$

Calculation formula:

$s^2 = \frac{1}{n-1} \left( \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n} \right)$
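The two expressions for the sample variance can be compared numerically. The Python sketch below does so for the first of the three small data sets (20, 30, 40, 50, 60, 70) and also confirms that the plain deviations sum to zero; the function names are hypothetical, and `statistics.variance` from the standard library is used only as an independent check.

```python
import statistics


def variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """Calculation formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


data = [20, 30, 40, 50, 60, 70]
xbar = sum(data) / len(data)

print(sum(x - xbar for x in data))   # deviations sum to 0.0
print(variance_definition(data))     # 350.0
print(variance_shortcut(data))       # 350.0
print(statistics.variance(data))     # 350, library check
```

The calculation formula needs only the running totals of x and x squared, but it can lose accuracy in floating point when the mean is large relative to the spread.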



Standard Deviation: The units of the variance are units of the data squared. To make the units the same as that of the data set we take the square root of the variance. This is called the sample standard deviation:

$s = \sqrt{s^2}$

Note:

$s$ is translation invariant:

$s(x_1, \ldots, x_n) = s(x_1 + a, \ldots, x_n + a)$ for all $a$.

$s$ is scale equivariant:

$s(ax_1, \ldots, ax_n) = |a| \, s(x_1, \ldots, x_n)$ for all $a$.
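Both properties are easy to check numerically. The Python sketch below uses `statistics.stdev` from the standard library on a small data set with an arbitrarily chosen shift and scale factor; it illustrates the properties on one example and is not a proof.

```python
import math
import statistics

data = [20, 30, 40, 50, 60, 70]
shift = 100    # arbitrary constant a for the translation check
scale = -3     # arbitrary constant a for the scaling check (negative on purpose)

s = statistics.stdev(data)
s_shifted = statistics.stdev([x + shift for x in data])
s_scaled = statistics.stdev([scale * x for x in data])

print(s, s_shifted)              # equal: adding a constant does not change s
print(s_scaled, abs(scale) * s)  # equal: s is multiplied by |a|
```

A negative scale factor is used to show why the absolute value appears in the second property.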

Example: Calculate the range, variance and standard deviation of the particulate solid data.

Range = Maximum − Minimum = 94.6 − 27.1 = 67.5

$s^2 = 270.8469$

$s = \sqrt{270.8469} \approx 16.4574$
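These summary values can be recomputed directly from the 50 observations. The Python sketch below does so with the standard library `statistics` module; the variable names are illustrative.

```python
import statistics

solids = [
    55.8, 60.9, 37.0, 91.3, 65.8, 42.3, 33.8, 60.6, 76.0, 69.0,
    45.9, 39.1, 35.5, 56.0, 44.6, 71.7, 61.2, 61.5, 47.2, 74.5,
    83.2, 40.0, 31.7, 36.7, 62.3, 47.3, 94.6, 56.3, 30.0, 68.2,
    75.3, 71.4, 65.2, 52.6, 58.2, 48.0, 61.8, 78.8, 39.8, 65.0,
    60.7, 77.1, 59.1, 49.5, 69.3, 69.8, 64.9, 27.1, 87.1, 66.3,
]

data_range = max(solids) - min(solids)   # maximum minus minimum
s2 = statistics.variance(solids)         # sample variance, divisor n - 1
s = statistics.stdev(solids)             # square root of the sample variance

print(round(data_range, 1), round(s2, 4), round(s, 4))
```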

