Descriptive Statistics, Histograms, and Normal Approximations

Math 1680

Overview

Obtaining Data Sets
Descriptive Statistics
Histograms
The Normal Curve
Standardization
Normal Approximation
Summary

Obtaining Data Sets

Before we can analyze a data set, we need to have a data set
  How far do you travel to get to class, in miles?
  How tall are you?

Today, numerical data is easily stored and organized (and even analyzed) by several computer programs

Notice that in its raw form, the data is difficult to deal with

By sorting the data, we can get a better picture of its distribution, or shape

We are often interested in…
  Where the data are centered
  How spread out they are
  With what frequency numbers appear

Usually, the entire data set is too large to work with directly

We want ways to summarize the data

We have quantitative (numerical) and pictorial descriptions available to us
  Descriptive Statistics
  Histograms

Descriptive Statistics

We can summarize the data set with a few simple numbers, called descriptive (or summary) statistics

The first and most often-used summary statistic is the average (or mean)
  It represents the central tendency of the data set
  It gives an idea of where the bulk of the points lie

To calculate the average, add up the values of all of the points and divide by the total number of points in the set


Calculate the average of the following data sets

60 60 60 60 60
  (60 + 60 + 60 + 60 + 60)/5 = 300/5 = 60

18 59 60 63 100
  (18 + 59 + 60 + 63 + 100)/5 = 300/5 = 60

18 35 60 87 100
  (18 + 35 + 60 + 87 + 100)/5 = 300/5 = 60


Despite having the same average, the three data sets are clearly different

The average alone usually does not describe a data set uniquely
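
As a quick check, the averages of the three example sets can be reproduced in a few lines of Python. This is only an illustrative sketch; the helper name `average` is ours and is not part of the course materials.

```python
def average(values):
    """Add up the values and divide by how many there are."""
    return sum(values) / len(values)

data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

averages = [average(d) for d in data_sets]
print(averages)  # [60.0, 60.0, 60.0]
```

All three sets share the same average even though their values are spread very differently, which is why a second summary number is needed.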


The median is another central tendency measure

The median marks the point where exactly half of the data are less than (or equal to) the median
  If there are an odd number of data points, then the median is just the number in the middle of the sorted set
  Otherwise, the median is the average of the two points in the middle of the sorted set
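
The odd/even rule can be sketched directly in Python. The function below is a minimal illustration written for this purpose (the name `median` and the two small test lists are our own choices); Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Middle of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle pair

print(median([7, 1, 15, 4, 18, 5, 10]))  # 7 points, prints 7
print(median([1, 4, 5, 7, 10, 15]))      # 6 points, prints 6.0
```

Note that the data are sorted first, so the input order does not matter.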


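Calculate the median of each data set

1 4 5 7 10 15 18
  Median = 7 (the middle of the 7 sorted values)

52 49 44 44 38 38 35 34 31 31 31 30 29 29
29 28 28 27 26 24 23 22 21 21 20 20 20 17
17 17 15 13 13 13 13 10 10 9 9 3 3 0
  Median = (23 + 22)/2 = 22.5 (the average of the 21st and 22nd of the 42 sorted values)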


The average is like a balance point
  It represents the place where the data set is equally “heavy” on both sides
  If there are outliers on one side of the data set, the average will be pulled toward them

The median is more robust
  It is usually less affected by outliers or data entry errors


In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off

Here are the grades, in order:

98 94 93 87 84 81 79 78 68 55 0 0 0

Calculate the class median…
  Including all students: 79
  Not counting those who slept in: (84 + 81)/2 = 82.5


For the same class (grades 98 94 93 87 84 81 79 78 68 55 0 0 0), calculate the class average…
  Including all students: 817/13 ≈ 62.8
  Not counting those who slept in: 817/10 = 81.7


Suppose the teacher mistyped the grade of 55 as a 15

98 94 93 87 84 81 79 78 68 15 0 0 0 (with the typo)
98 94 93 87 84 81 79 78 68 55 0 0 0 (original)

Not counting the sleepers,
  What is the new median? 82.5
  What is the new average? 77.7
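
The effect of the typo on each statistic can be compared side by side. The sketch below uses Python's standard `statistics` module and the ten grades of the students who took the exam; the variable names are ours.

```python
from statistics import mean, median

grades = [98, 94, 93, 87, 84, 81, 79, 78, 68, 55]    # the 10 students who took the exam
mistyped = [98, 94, 93, 87, 84, 81, 79, 78, 68, 15]  # 55 entered as 15

median_shift = median(mistyped) - median(grades)
mean_shift = mean(mistyped) - mean(grades)

print(median_shift)  # 0.0: the two middle grades (84 and 81) did not change
print(mean_shift)    # about -4: the 40-point error is spread over 10 students
```

A single bad entry moves the average by four points but leaves the median where it was, which is what "robust" means here.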


Earlier, we saw that the average did not necessarily uniquely describe a data set

We use the standard deviation (SD) to measure spread in a data set

When paired, the average and SD are highly effective summary statistics


The Root-Mean-Square (RMS) measures the typical absolute value of data points in a set

It is calculated by reading its name backwards
  Square all entries in the data set
  Take their mean
  Take the square root of that mean

Find the average and then the RMS size of the numbers in the list 1, -3, 5, -6, 3

Average = (1 - 3 + 5 - 6 + 3)/5 = 0
RMS = √((1 + 9 + 25 + 36 + 9)/5) = √(16) = 4
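
The three RMS steps translate almost word for word into code. This is a minimal sketch; the function name `rms` is ours, and the example list is an assumed one chosen so that its average is 0 and its RMS is 4.

```python
import math

def rms(values):
    """Root-mean-square, computed by reading the name backwards."""
    squares = [v ** 2 for v in values]             # square every entry
    mean_of_squares = sum(squares) / len(squares)  # take the mean of the squares
    return math.sqrt(mean_of_squares)              # take the square root of that mean

example = [1, -3, 5, -6, 3]
print(sum(example) / len(example))  # 0.0
print(rms(example))                 # 4.0
```

The average is 0 because positive and negative entries cancel, while the RMS ignores signs and reports how large the entries typically are.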


The SD embodies the same concept of “typical” distance

Where the RMS measures typical distance from 0, the SD measures typical distance from the data set’s average

This is accomplished by subtracting the average from every data point and then taking the RMS of the differences (or deviations from the mean)


1 4 5 7 10 15 has an average of 7

The deviations are then -6 -3 -2 0 3 8

Note how the subtraction process re-centers the data set so that the average is at 0


Taking the RMS of the deviations gives the standard deviation

Normally, about two thirds to three quarters of a data set should be within one SD of the mean

1 4 5 7 10 15 has an average of 7 and an SD of about 4.5


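Average = (1+4+5+7+10+15)/6 = 7

Data point    Deviation from average    Squared deviation
1             1 - 7 = -6                (-6)² = 36
4             4 - 7 = -3                (-3)² = 9
5             5 - 7 = -2                (-2)² = 4
7             7 - 7 = 0                 0² = 0
10            10 - 7 = 3                3² = 9
15            15 - 7 = 8                8² = 64

Mean of the squared deviations = (36+9+4+0+9+64)/6 = 122/6 ≈ 20.3

SD = √(20.3) ≈ 4.5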
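
The same calculation can be written as a short function: subtract the average from each point, then take the RMS of the deviations. This is an illustrative sketch (the name `sd` is ours), and it divides by the number of points n, as in the worked example.

```python
import math

def sd(values):
    """Standard deviation as the RMS of the deviations from the average."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]                      # re-center the data at 0
    mean_square = sum(d ** 2 for d in deviations) / len(values)  # mean of squared deviations
    return math.sqrt(mean_square)

data = [1, 4, 5, 7, 10, 15]
result = sd(data)
print(round(result, 1))  # 4.5
```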


What we had on the previous slide is the SD of the data set itself, calculated by dividing by n (where n is the number of points). However, if the goal is to use the data as a sample to estimate the SD of a larger population, we would divide by n-1 instead of n and call the result the sample SD.

Many calculators report the sample SD

In general, the higher a set’s SD, the more spread out its points are

An SD of 0 indicates that every point in the data set has the same value
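
Python's standard `statistics` module offers both versions, which makes the difference easy to see: `pstdev` divides by n and `stdev` divides by n-1. The sketch below reuses the six-point example; the variable names are ours.

```python
from statistics import pstdev, stdev

data = [1, 4, 5, 7, 10, 15]

sd_n = pstdev(data)          # divides by n: the SD of the data set itself
sd_n_minus_1 = stdev(data)   # divides by n - 1: the sample SD

print(round(sd_n, 2))          # 4.51
print(round(sd_n_minus_1, 2))  # 4.94
print(pstdev([60, 60, 60, 60, 60]))  # 0.0, since every point has the same value
```

The sample SD is always a little larger, and the gap shrinks as the number of points grows.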


Calculate the SD’s of the data sets

60 60 60 60 60
  SD = 0

18 59 60 63 100
  SD ≈ 26.0

18 35 60 87 100
  SD ≈ 30.7

Histograms

Often, we would prefer a pictorial representation of a data set to a two-number summary

The most common way to graphically represent a data set is to draw a frequency histogram (or just histogram)

Histograms tend to look like city skylines

In a histogram, the area under the curve between two points on the horizontal axis represents the proportion of data points between those two points

Continuing the city skyline analogy, the size of the building determines how many people live there
  A long, low building can house as many people as a thin skyscraper

To draw a histogram, we first need to organize our data into bins (or class intervals)
  Often, the bins are dictated to us
  If we get to choose them, we try to pick the bins so that they give a fair representation of the data

Then mark a horizontal axis with the bin values, spacing them correctly

Often, data is given in percentage form
  If not, divide the number of points in the bin by the number of points in the data set (and multiply by 100) to get the percentage

Draw a box for each bin so that the area of the box is the percentage of the data in that bin
  To get the correct height of the box, divide the percentage in the bin by the width of the bin
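
The height rule matters most when the bins have unequal widths. The sketch below works through it on a small made-up data set; the commute distances, the bin edges, the function name `histogram_heights`, and the convention that each bin includes its left endpoint but not its right are all assumptions for illustration.

```python
def histogram_heights(data, bin_edges):
    """Return (left, right, percent, height) per bin, where height = percent / width.

    Assumed convention: each bin includes its left endpoint and excludes its right,
    so the last edge must be larger than the largest data value.
    """
    n = len(data)
    rows = []
    for left, right in zip(bin_edges, bin_edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        percent = 100 * count / n                 # area of the box
        rows.append((left, right, percent, percent / (right - left)))
    return rows

# Hypothetical commute distances in miles, grouped into bins of unequal width
distances = [0.5, 1, 1, 2, 3, 4, 6, 8, 12, 18]
rows = histogram_heights(distances, [0, 2, 5, 10, 20])

for left, right, percent, height in rows:
    print(f"{left}-{right} miles: {percent}% of the data, height {height}% per mile")
```

The first two bins each hold 30% of the data, but the narrower bin gets the taller box (15% per mile versus 10% per mile), so that the areas, not the heights, carry the percentages.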

Note that the average and median can be visually located on a histogram
  If the histogram were balanced on a see-saw, the fulcrum would meet the histogram at the average
  If you draw a vertical line through the histogram so that it splits the area in half, then the line passes through the median

On a symmetric histogram, the average and median tend to coincide

Asymmetric tails pull the average in the direction of the tail

The Normal Curve

A great many data sets have similarly-shaped histograms
  SAT scores
  Attendance at baseball games
  Battery life
  Cash flow of a bank
  Heights of adult males/females

These histograms are similar to one generated by a very special distribution

It is called the normal distribution, and it is identified by two parameters we are already familiar with
  Average
  Standard deviation

This is the standard normal curve, where the average is 0 and the SD is 1

Though the equation used to draw the curve is not easy to work with, there is a table of values for the standard normal distribution

We will use this table to find areas under the curve
  The table is on page A-105 of your text

Properties of the standard normal curve
  The curve is “bell-shaped” with its highest point at 0
  It is symmetric about a vertical line through 0
  The curve approaches the horizontal axis, but the curve and the horizontal axis never meet

Area underneath the standard normal curve
  Half the area lies to the left of 0; half lies to the right
  Approximately 68% of the area lies between –1 and 1
  Approximately 95% of the area lies between –2 and 2
  Approximately 99.7% of the area lies between –3 and 3
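
These three areas can be checked numerically without the table. The sketch below relies on a standard identity: the area under the standard normal curve between -z and z equals erf(z/√2), which Python exposes as `math.erf`. The helper name `central_area` is ours.

```python
import math

def central_area(z):
    """Percent of the area under the standard normal curve between -z and z."""
    return 100 * math.erf(z / math.sqrt(2))

for z in (1, 2, 3):
    print(z, round(central_area(z), 2))  # roughly 68%, 95%, and 99.7%
```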

Standardization

Most data sets do not have a mean of 0 and an SD of 1

To be able to use the standard normal curve, we’ll need to standardize numbers in the original data set
  To standardize a number, subtract the data set’s average and then divide the difference by the data set’s SD

Standardizing is basically a change of scale
  Like converting feet to miles

Suppose there are two different sections of the same course
  The scores for the midterm in each section were approximately normally distributed
  In the first section, the average was 64 and the standard deviation was 5
    Tina scored a 74 in the first section
  In the second section, the average was 72 and the standard deviation was 10
    Jack scored an 82 in the second section

Which of the two scores is more impressive, relative to the students in his/her section?

Convert the following scores in the first section to standard units
  Alice got a 50: (50 - 64)/5 = -2.8
  Bob got a 61: (61 - 64)/5 = -0.6
  Carol got a 64: (64 - 64)/5 = 0
  Dan got a 77: (77 - 64)/5 = 2.6
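
Standardizing is a one-line formula, z = (score - average)/SD, so it is easy to sketch in code. The function name `standard_units` below is ours; the averages, SDs, and scores are the ones given for the two sections.

```python
def standard_units(score, average, sd):
    """How many SDs a score lies above (+) or below (-) the average."""
    return (score - average) / sd

# First section: average 64, SD 5
first_section = {
    name: standard_units(score, 64, 5)
    for name, score in [("Alice", 50), ("Bob", 61), ("Carol", 64), ("Dan", 77)]
}
print(first_section)

# Comparing scores across sections with different averages and SDs
tina = standard_units(74, 64, 5)   # first section
jack = standard_units(82, 72, 10)  # second section
print(tina, jack)  # 2.0 1.0
```

Both Tina and Jack scored 10 points above their section's average, but in standard units Tina is 2 SDs above and Jack is 1 SD above, so Tina's score stands out more within her section.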

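In Jack’s section, students with grades between 62 and 82 received a B

What percentage of students in this section received Bs?
  68.27% (62 and 82 are each one SD from the average of 72)

Is this percentage exact?
  No, it is only an approximation based on the normal curve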

Normal Approximation

According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches


The normal curve is a smooth-curve histogram for normally distributed data

We can estimate percentages within a given range
  Find the area under the curve between the endpoints of that range using the standard normal table


Sometimes this will require cutting and pasting different areas together

The standard normal table on page A-105 takes a standard score z
  It returns to you the area under the curve between –z and z


Find the area between –1.2 and 1.2 under the standard normal curve

76.99%


Find the area between 0 and 1.65 under the standard normal curve

45.055%


Find the area between 0 and 3.3 under the standard normal curve

49.9515%


Find the area between –0.35 and 0.95 under the standard normal curve

46.58%


Find the area between 1.2 and 1.85 under the normal curve

8.29%


Find the area between –2.1 and –1.05 under the normal curve

12.9%


Find the area to the right of 1 under the normal curve

15.865%


Find the area to the left of 0.85 under the normal curve

80.235%
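
The cutting and pasting in these exercises can be sketched in code. The function `table_area` below stands in for a table lookup (the area between -z and z), computed here with `math.erf` rather than read from the printed table, so results can differ from table-based answers in the last digit. All function names are ours.

```python
import math

def table_area(z):
    """Percent of area between -z and z, i.e. what a table lookup returns."""
    return 100 * math.erf(abs(z) / math.sqrt(2))

def area_between(a, b):
    """Percent of area between a and b (a < b), built from table lookups."""
    half_a = table_area(a) / 2   # area between 0 and |a|
    half_b = table_area(b) / 2   # area between 0 and |b|
    if a < 0 <= b:
        return half_a + half_b   # region straddles 0: paste the two pieces together
    return abs(half_b - half_a)  # same side of 0: cut the smaller piece from the larger

def area_right_of(z):
    """Percent of area to the right of z."""
    half = table_area(z) / 2
    return 50 - half if z >= 0 else 50 + half

def area_left_of(z):
    """Percent of area to the left of z."""
    return 100 - area_right_of(z)

print(round(area_between(-1.2, 1.2), 2))
print(round(area_between(-0.35, 0.95), 2))
print(round(area_between(1.2, 1.85), 2))
print(round(area_right_of(1), 2))
print(round(area_left_of(0.85), 2))
```

Sketching the curve and shading the target region first makes it clear which of the three cases applies.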


If a data set is approximately normal in distribution, we can use the normal curve in place of the data set’s histogram

If you want to estimate the percentage of the data set between two numbers…
  Standardize the numbers to get z scores
  Look each z score up in the standard normal table
  Cut and paste the areas to match the region you originally wanted

The percentage under the curve will be close to the percentage in the data set


It is generally helpful to sketch the curve first and shade in the desired area
  This will remind you what the target area is


According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

What percentage of women have heights between 60 and 68 inches?
  In standard units, 60 is (60 - 63.5)/2.5 = -1.4 and 68 is (68 - 63.5)/2.5 = 1.8
  About 88.33%


According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

What percentage of women are taller than 66 inches?
  In standard units, 66 is (66 - 63.5)/2.5 = 1
  15.865%
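
The normal approximation for the height questions can be sketched with `statistics.NormalDist` from Python's standard library (Python 3.8 or later), which takes care of the standardizing internally. This assumes, as the slides do, that the heights are approximately normal; the variable names are ours.

```python
from statistics import NormalDist

# Normal curve with the HANES average and SD for women's heights (inches)
heights = NormalDist(mu=63.5, sigma=2.5)

# cdf(x) is the fraction of the area to the left of x
between_60_and_68 = 100 * (heights.cdf(68) - heights.cdf(60))
taller_than_66 = 100 * (1 - heights.cdf(66))

print(round(between_60_and_68, 2))
print(round(taller_than_66, 2))
```

These are estimates of the percentages in the data set, not exact counts, because the normal curve only approximates the actual histogram.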


Sometimes, you will be given the percentage of the data set and asked to find the score(s) which mark(s) off that percentage
  Adjust the area to “center” it
  Look up the z score associated with that area in the table
  Unstandardize the z score by multiplying it by the SD and adding the average to the product


For a certain population of high school students, the SAT-M scores are normally distributed with average 500 and SD 100

A certain engineering college will accept only high school seniors with SAT-M scores in the top 5%

What is the minimum SAT-M score for this program?
  The top 5% corresponds to a centered area of 90%, which the table gives at about z = 1.65
  500 + 1.65 × 100 = 665
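
Going from a percentage back to a score can be sketched with the `inv_cdf` method of `statistics.NormalDist`, which plays the role of reading the table in reverse. The function name `cutoff_for_top` is ours. Because it uses a more precise z (about 1.645) than the table's 1.65, it lands slightly below 665.

```python
from statistics import NormalDist

def cutoff_for_top(percent, average, sd):
    """Score that marks off the top `percent` of a normally distributed data set."""
    z = NormalDist().inv_cdf(1 - percent / 100)  # z score with `percent` of the area to its right
    return average + z * sd                      # unstandardize: multiply by the SD, add the average

minimum_score = cutoff_for_top(5, 500, 100)
print(round(minimum_score))
```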


One way to determine how large a number is in the data set is to find its percentile rank
  The kth percentile is the value such that k percent of the data set have values below it

Percentile ranks can be calculated for any data set


In one year, the 1600-point SAT scores were approximately normal with an average of 1030 and an SD of 190

If a student scores a 1460, what is her percentile rank?
  98th percentile
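
A percentile rank under the normal approximation is the area to the left of the score's z value. The sketch below uses `statistics.NormalDist` for that area; the function name `percentile_rank` is ours.

```python
from statistics import NormalDist

def percentile_rank(score, average, sd):
    """Estimated percent of the data set below `score`, using the normal curve."""
    z = (score - average) / sd         # standardize the score
    return 100 * NormalDist().cdf(z)   # area to the left of z, as a percentage

rank = percentile_rank(1460, 1030, 190)
print(int(rank))  # 98
```

For a data set that is not approximately normal, the percentile rank would instead be found by counting the values below the score directly.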

Summary

It is often useful to describe a data set with summary statistics
  The average and median are central tendency statistics
  The average is more sensitive to outliers

The standard deviation (SD) is the most common summary statistic for describing a data set’s spread
  The SD is calculated by taking the RMS of the deviations from the mean of each data point in the set
  Most of the points in most data sets will lie within one or two SD’s of the average

We can represent a data set graphically by drawing a histogram
  The percentage of the data set in a bin is the area under the histogram over that bin
  The height of each block in a histogram is the percentage of the data in the corresponding bin divided by the width of the bin
  The total area under any histogram is 100%

The average of a data set is located at the balance point of the histogram
  Long tails pull the average in the direction of the tail

Using the average and SD, we can standardize numbers in the data set
  The standard score (z) of a number is its distance from the average in terms of SD’s

We can also take a standard score and convert it back to a raw score

Many data sets are approximately normal

We can estimate the percentage of points in a data set that fall between two numbers
  Convert the numbers to standard units
  Find the area under the standard normal curve by using the normal table

If a data set is approximately normal, we can use the normal table to estimate percentile ranks
