math 3680 lecture #1 graphical representation of data

Math 3680

Lecture #1

Graphical Representation of Data

In this first lecture, we will discuss a few short quantitative summaries that capture the essential properties of a data set. Such summaries are often important in presentations:

– It is often not necessary to report exactly how each subject fared in an experiment.
– Instead, report succinct summaries of the data.
– Your audience has a short attention span.
– Communicate only the most important information.

Types of Variables

Population - some generalization about a class of individuals, set of measurements, either existing or conceptual

Sample - subset of measurements from the population, some part of the population being examined

Units/Subjects - the things/people in a population

Inferences - a generalization made about a population based on a sample

Parameters - numerical facts about a population that investigators want to know

Statistics - numbers which can be computed from a sample. Parameters are estimated by statistics.

DEFINITIONS

Variables: Data can be characterized in two broad ways, qualitative and quantitative.

Qualitative or categorical variables have answers which are descriptive words or phrases.
– Ordinal: can be meaningfully ranked (e.g. survey data, grades)
– Nominal: cannot be meaningfully ranked (e.g. race, gender, etc.)

Quantitative variables have answers which are numbers.
– Discrete variables (e.g. number of home runs) have gaps between possible values
– Continuous variables (e.g. household income) have no gaps between possible values

Exercise: Classify the following variables as qualitative (nominal or ordinal) or quantitative (discrete or continuous).

• occupation
• weight
• opinion of teaching effectiveness
• region of residence
• grade point average
• height
• number of televisions owned
• blood type
• size of wrench randomly chosen from a wrench set

Median, Interquartile Range and Box-and-Whiskers Plot

Definition: Mode

The mode is the most frequently occurring value. (With rare exceptions, the mode is useless.)

What is the mode for the given data (points scored in the NFL postseason, 1992-94)?

26 48 28 38  9 35 15 44 13 17 21 22  9 29

13 30 21 38  3 44 17 27 13 30 20 28 23 29

17 52 20 30 13 20 10 34 10 29  3 24  0 31
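As a quick check on the question above, the mode can be found by tallying how often each value occurs. The sketch below assumes the 42 values are read off the slide as listed; the helper name `modes` is made up for this illustration. It returns every value tied for the highest count, since a data set may have more than one mode.

```python
from collections import Counter

# NFL postseason points (1992-94): the 42 values as read from the slide.
points = [
    26, 48, 28, 38, 9, 35, 15, 44, 13, 17, 21, 22, 9, 29,
    13, 30, 21, 38, 3, 44, 17, 27, 13, 30, 20, 28, 23, 29,
    17, 52, 20, 30, 13, 20, 10, 34, 10, 29, 3, 24, 0, 31,
]


def modes(values):
    """Return (sorted list of most frequent values, their frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


mode_values, mode_count = modes(points)
print("mode(s):", mode_values, "occurring", mode_count, "times")
```

Only 4 of the 42 scores share the most common value, which illustrates why the mode says little about a data set like this one.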

Definition: Median

The median is chosen so that half of the data lies above the median and half lies below.

What is the median for the given data?

• To find the median, we first order the data, counting multiplicities:

52 49 44 44 38 38 35 34 31 31 31 30 29 29

29 28 28 27 26 24 23 22 21 21 20 20 20 17

17 17 15 13 13 13 13 10 10  9  9  3  3  0

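• If there is an even number of data values, the median is the average of the two middle values. Here there are 42 values; the 21st and 22nd in the ordered list are 23 and 22, so the median is 22.5.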

• If there is an odd number of data values, the median is simply the middle value.

Short Cut: For a data set with n values, the median rank is the $\frac{n+1}{2}$th entry. If this rank ends in 0.5, we take the average of the data values in the adjacent positions.
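The short cut translates directly into a few lines of code. This is a minimal sketch; the function name `median_by_rank` is made up here, and the data are the ordered values as they appear on the slide.

```python
def median_by_rank(values):
    """Median via the (n + 1)/2 rank rule."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2                  # 1-based median rank
    if rank.is_integer():               # odd n: a single middle entry
        return ordered[int(rank) - 1]
    lower = ordered[int(rank) - 1]      # entry just below the .5 rank
    upper = ordered[int(rank)]          # entry just above the .5 rank
    return (lower + upper) / 2


# Ordered data from the slide (42 values, so the rank is 21.5).
ordered_points = [
    52, 49, 44, 44, 38, 38, 35, 34, 31, 31, 31, 30, 29, 29,
    29, 28, 28, 27, 26, 24, 23, 22, 21, 21, 20, 20, 20, 17,
    17, 17, 15, 13, 13, 13, 13, 10, 10, 9, 9, 3, 3, 0,
]

print(median_by_rank(ordered_points))   # average of the 21st and 22nd values
print(median_by_rank([7, 1, 4]))        # odd n: the middle value
```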

While the median is often a useful summary for data, it is not complete by itself.

In particular, it does not provide information about the spread of the data.

Example: three data sets with median 60:

100 60 60 60 60 60 60 60 60 60 0

100 99 95 91 89 60 20 15 3 1 0

100 78 76 75 74 60 54 51 50 49 0

Definition: Range. The range of a data set is the difference between the largest element and the smallest element. That is:

range = largest – smallest.

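While the range measures variation, it is not perfect. All three data sets below have the same range, 100 – 0 = 100, even though their values are spread very differently:

100 60 60 60 60 60 60 60 60 60 0

100 99 95 91 89 60 20 15 3 1 0

100 78 76 75 74 60 54 51 50 49 0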
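A short check confirms that neither the median nor the range separates these three data sets. The labels A, B and C are added here only for convenience.

```python
import statistics

# The three example data sets (labels A, B, C are for illustration only).
data_sets = {
    "A": [100, 60, 60, 60, 60, 60, 60, 60, 60, 60, 0],
    "B": [100, 99, 95, 91, 89, 60, 20, 15, 3, 1, 0],
    "C": [100, 78, 76, 75, 74, 60, 54, 51, 50, 49, 0],
}

summary = {
    name: (statistics.median(values), max(values) - min(values))
    for name, values in data_sets.items()
}

for name, (med, rng) in summary.items():
    print(f"{name}: median = {med}, range = {rng}")
```

Every set reports the same pair of numbers, so a measure that looks at more than the two extreme values is needed.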

Definitions:

• First Quartile. The first quartile is chosen so that 25% of the data lie at or below it.

• Second Quartile. The second quartile is chosen so that 50% of the data lie at or below it.

• Third Quartile. The third quartile is chosen so that 75% of the data lie at or below it.

1. Rank the data from smallest to largest.

2. Find the median – it is the second quartile.

3. Take the lower half of the data. (If there is an odd number of measurements, include the median.) The median of this lower half is the first quartile, Q1.

4. Repeat for the upper half to find the third quartile, Q3.

5. The difference Q3 - Q1 is called the interquartile range (IQR).
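The five steps above can be sketched as follows. The function name `quartiles_by_halves` is made up for this illustration, and the sketch assumes that for an odd number of measurements the median is included in the upper half as well as the lower half (the steps state this only for the lower half). It is applied to the three data sets with median 60 from earlier in the lecture.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3, IQR) using the median-of-halves steps."""
    ordered = sorted(values)                    # step 1: rank the data
    n = len(ordered)
    q2 = statistics.median(ordered)             # step 2: second quartile
    half = (n + 1) // 2                         # half size; counts the median if n is odd
    q1 = statistics.median(ordered[:half])      # step 3: median of the lower half
    q3 = statistics.median(ordered[n - half:])  # step 4: median of the upper half
    return q1, q2, q3, q3 - q1                  # step 5: IQR = Q3 - Q1


a = quartiles_by_halves([100, 60, 60, 60, 60, 60, 60, 60, 60, 60, 0])
b = quartiles_by_halves([100, 99, 95, 91, 89, 60, 20, 15, 3, 1, 0])
c = quartiles_by_halves([100, 78, 76, 75, 74, 60, 54, 51, 50, 49, 0])

for label, result in zip("ABC", (a, b, c)):
    print(label, "Q1, Q2, Q3, IQR =", result)
```

Unlike the range, the IQR differs across the three sets, so it does distinguish their spreads.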

Computing quartiles:

Computing quartiles may be facilitated by using Microsoft Excel:

31   0  29  23  29   9
24   3  28  20  22  21
29  10  30  13  17  13
34  10  27  17  44  15
20  13  44   3  35   9
30  20  38  21  38  28
52  17  30  13  48  26

QUARTILE(A1:F7,1)
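For readers without Excel, a rough Python counterpart is sketched below. Spreadsheet quartile functions typically interpolate between ranked values, so the result need not equal the median-of-halves answer. The sketch assumes that `statistics.quantiles` with `method="inclusive"` uses the same interpolation as the spreadsheet formula; that equivalence is an assumption worth checking against Excel itself.

```python
import statistics

# The 7 x 6 grid of scores from the spreadsheet range A1:F7, row by row.
scores = [
    31, 0, 29, 23, 29, 9,
    24, 3, 28, 20, 22, 21,
    29, 10, 30, 13, 17, 13,
    34, 10, 27, 17, 44, 15,
    20, 13, 44, 3, 35, 9,
    30, 20, 38, 21, 38, 28,
    52, 17, 30, 13, 48, 26,
]

# Interpolated quartiles (assumed analogue of the spreadsheet function).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Median-of-halves first quartile, for comparison.
ordered = sorted(scores)
q1_halves = statistics.median(ordered[: len(ordered) // 2])

print("interpolated Q1, Q2, Q3:", q1, q2, q3)
print("median-of-halves Q1:", q1_halves)
```

The two first quartiles differ slightly here, which is normal: several quartile conventions are in use, and they agree only approximately.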

BOX-AND-WHISKER PLOTS

1. Draw a vertical scale to include the low and high values.

2. To the scale’s right, draw a box between the first and third quartiles.

3. Draw a line through the box at the median value.

4. Draw lines (whiskers) from the box to the low and high values.

5. Often the whiskers are instead drawn only to the most extreme values within 1.5 IQR of Q1 and Q3. The symbols + and * are then used, respectively, to mark each possible outlier (between 1.5 and 3 IQR beyond Q1 or Q3) and each probable outlier (more than 3 IQR beyond Q1 or Q3).
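The numbers needed for step 5 can be computed before drawing anything. This is a minimal sketch: the function name `boxplot_summary` and the dictionary keys are made up here, the quartiles use the median-of-halves method, and the data are one of the example sets with median 60.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 / 3 IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                       # includes the median when n is odd
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[n - half:])
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3 * iqr, q3 + 3 * iqr
    inside = [x for x in ordered if inner_lo <= x <= inner_hi]
    return {
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme values inside the fences
        "possible": [x for x in ordered
                     if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi],
        "probable": [x for x in ordered if x < outer_lo or x > outer_hi],
    }


summary = boxplot_summary([100, 78, 76, 75, 74, 60, 54, 51, 50, 49, 0])
print(summary)
```

For this data set the lower whisker stops at 49, and the value 0 is flagged as a possible outlier rather than joined to the box.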

Exercise: Draw a boxplot for the domestic gross receipts of the top 100 movies of all time:

www.boxofficemojo.com/alltime/domestic.htm

Note: Many statistical software packages (SPSS, SAS, etc.) can create boxplots automatically. Unfortunately, Excel is not one of them.

Stem-and-Leaf Plots

33  92  63  72  81
82  81  42  87  55
82  48  49  72  73
95 102 101  92  89
95  74  73  99
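A stem-and-leaf plot for the 24 values given in this section can be built by splitting each number into a stem (the tens) and a leaf (the ones digit). The sketch below is one possible layout, not a prescribed format; the function name `stem_and_leaf` is made up for this illustration.

```python
from collections import defaultdict

data = [
    33, 92, 63, 72, 81,
    82, 81, 42, 87, 55,
    82, 48, 49, 72, 73,
    95, 102, 101, 92, 89,
    95, 74, 73, 99,
]


def stem_and_leaf(values):
    """Return the plot as a list of text lines, one per stem."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)      # stem = tens, leaf = ones digit
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


plot = stem_and_leaf(data)
print("\n".join(plot))
```

Every stem between the smallest and largest is printed, even if it has no leaves, so that gaps in the data stay visible.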

Histograms: Continuous Data

In a histogram:

1) A histogram is a special kind of bar chart.

2) Percentages are represented by areas, not heights.

3) The height of a block represents the percentage per horizontal unit.

4) Be sure to decide on the endpoint convention.
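The relationship between frequency, area and height can be made concrete with a small table builder. This is a sketch under stated assumptions: the data are the NFL postseason scores from earlier in the lecture, the class edges (0, 10, 20, 30, 40, 60) are chosen here only to show unequal widths, and the endpoint convention is "left endpoint included, right endpoint excluded".

```python
scores = [
    31, 0, 29, 23, 29, 9, 24, 3, 28, 20, 22, 21, 29, 10,
    30, 13, 17, 13, 34, 10, 27, 17, 44, 15, 20, 13, 44, 3,
    35, 9, 30, 20, 38, 21, 38, 28, 52, 17, 30, 13, 48, 26,
]


def histogram_table(values, edges):
    """Rows of (low, high, frequency, percent, density per unit)."""
    n = len(values)
    rows = []
    for low, high in zip(edges, edges[1:]):
        freq = sum(1 for v in values if low <= v < high)  # left endpoint included
        percent = 100 * freq / n                          # area of the block
        density = percent / (high - low)                  # height of the block
        rows.append((low, high, freq, percent, density))
    return rows


table = histogram_table(scores, [0, 10, 20, 30, 40, 60])
for low, high, freq, percent, density in table:
    print(f"{low:>2}-{high:<2}  freq={freq:>2}  {percent:5.1f}%  {density:.2f}% per point")

total_area = sum(density * (high - low) for low, high, _, _, density in table)
```

The last class is twice as wide as the others, so its block is drawn at half the height its percentage alone would suggest; the areas still add up to 100%.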

Ex. Construct a histogram for the 2007 salaries of the 50 U.S. governors (p. 27, from Council of State Governments):

http://www.stateline.org/live/details/story?contentId=207914

Class            Frequency    Relative frequency    Density per $1000

$ 70- 90,000

$ 90-110,000

$110-120,000

$120-130,000

$130-150,000

$150-170,000

$170-210,000

Ex: Draw a histogram for the domestic gross receipts of all movies that grossed at least $100 million:

www.boxofficemojo.com/alltime/domestic.htm

Class            Frequency    Relative frequency    Density per $1M

$100 - 110M

$110 - 120M

$120 - 130M

$130 - 150M

$150 - 175M

$175 - 200M

$200 - 250M

$250 - 300M

$300 - 800M

How do you decide on the classes?

1. Too few classes: very undescriptive.

2. Too many classes: very choppy.

3. Sturges’ rule of thumb: for a data set of size n, use

$$k \approx \log_2 n = \frac{\ln n}{\ln 2}$$

classes, rounded up to the nearest integer.

4. For long tails, use wide classes as appropriate.

5. Within these guidelines, there are no absolute rules.
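The rule of thumb in item 3 is easy to evaluate for any n. The sketch below follows the form stated in this lecture (log base 2 of n, rounded up); note that many textbooks state the rule with an extra 1 added, so the count can differ by one class. The function name `suggested_classes` is made up for this illustration.

```python
import math


def suggested_classes(n):
    """Number of classes k = log2(n), rounded up, as stated in the lecture."""
    return math.ceil(math.log(n) / math.log(2))


for n in (42, 50, 100, 1000):
    print(n, "data values ->", suggested_classes(n), "classes")
```

The result is only a starting point: long tails or natural break points in the data can justify a different choice.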

Conclusion:

Do NOT use Excel to make histograms

Other software packages (R, Minitab, SPSS etc.) can make correct histograms.

For now, just draw histograms by hand.

Histograms: Discrete Data

Example: A production line inspector records the number of defective items produced each hour of an eight-hour shift:

4 2 4 5 10 5 3 6

Number of items    Frequency    Relative frequency (Density)

2

3

4

5

6

7

8

9

10

Notice that the bar for 2 stretches from 1.5 to 2.5, giving that bar a width of 1.
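The table for this example can be filled in by counting. In the sketch below, each bar for the value k is taken to run from k – 0.5 to k + 0.5, as in the remark about the bar for 2; since every bar has width 1, the density equals the relative frequency. Variable names are made up for this illustration.

```python
from collections import Counter

defects = [4, 2, 4, 5, 10, 5, 3, 6]   # defective items per hour over the shift
counts = Counter(defects)
n = len(defects)

# One row per possible value, including values that never occurred.
table = [
    (k, counts[k], counts[k] / n, (k - 0.5, k + 0.5))
    for k in range(min(defects), max(defects) + 1)
]

for k, freq, rel, (left, right) in table:
    print(f"{k:>2}  freq={freq}  rel. freq={rel:.3f}  bar from {left} to {right}")
```

Values 7, 8 and 9 get rows with frequency 0, which leaves a visible gap in the histogram between 6 and 10.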
