math 3680 lecture #1 graphical representation of data
Math 3680
Lecture #1
Graphical Representation of Data
In this first lecture, we will discuss some brief quantitative measures that capture essential properties of a data set. This is often important in presentations:
– It is often not necessary to report exactly how each subject fared in an experiment.
– Instead, report succinct summaries of the data.
– Your audience has a short attention span.
– Communicate only the most important information.
Types of Variables
Population - some generalization about a class of individuals, set of measurements, either existing or conceptual
Sample - subset of measurements from the population, some part of the population being examined
Units/Subjects - the things/people in a population
Inferences - a generalization made about a population based on a sample
Parameters - numerical facts about a population that investigators want to know
Statistics - numbers which can be computed from a sample. Parameters are estimated by statistics.
DEFINITIONS
Variables: There are several ways to characterize data – qualitative and quantitative.
Qualitative or categorical variables have answers which are descriptive words or phrases.
– Ordinal: can be meaningfully ranked (e.g. survey data, grades)
– Nominal: cannot be meaningfully ranked (e.g. race, gender, etc.)
Quantitative variables have answers which are numbers.
– Discrete variables (e.g. number of home runs) have gaps between possible values
– Continuous variables (e.g. household income) have no gaps between possible values
Exercise: Classify the following variables as qualitative (nominal or ordinal) or quantitative (discrete or continuous).
• occupation
• weight
• opinion of teaching effectiveness
• region of residence
• grade point average
• height
• number of televisions owned
• blood type
• size of wrench randomly chosen from a wrench set
Median, Interquartile Range and Box-and-Whiskers Plot
Definition: Mode
The mode is the most frequently occurring value. (With rare exceptions, the mode is useless.)
What is the mode for the given data (points scored in the NFL postseason, 1992-94)?
26 48 28 38  9 35 15 44 13 17 21 22  9 29
13 30 21 38  3 44 17 27 13 30 20 28 23 29
17 52 20 30 13 20 10 34 10 29  3 24  0 31
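As a rough check on the question above, the mode can be found by tallying the values. This is only a sketch: it assumes the run-together digits on the slide split into the same 42 values that appear in the spreadsheet grid later in the lecture, and the names `scores`, `counts` and `modes` are illustrative.

```python
from collections import Counter

# Points scored, read row by row from the slide (assumed split into 42 values).
scores = [
    26, 48, 28, 38, 9, 35, 15, 44, 13, 17, 21, 22, 9, 29,
    13, 30, 21, 38, 3, 44, 17, 27, 13, 30, 20, 28, 23, 29,
    17, 52, 20, 30, 13, 20, 10, 34, 10, 29, 3, 24, 0, 31,
]

counts = Counter(scores)          # value -> number of occurrences
highest = max(counts.values())    # the largest number of occurrences

# A data set can have more than one mode, so keep every value that ties.
modes = sorted(value for value, count in counts.items() if count == highest)

print("mode(s):", modes, "each occurring", highest, "times")
```

Collecting every tied value, rather than only the first one found, avoids hiding the case where the mode is not unique.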
Definition: Median
The median is chosen so that half of the data lies above the median and half lies below.
What is the median for the given data?
• To find the median, we first order the data, counting multiplicities:
52 48 44 44 38 38 35 34 31 30 30 30 29 29
29 28 28 27 26 24 23 22 21 21 20 20 20 17
17 17 15 13 13 13 13 10 10  9  9  3  3  0
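• If there is an even number of data values, the median is the average of the two middle values. Here n = 42, so the median is the average of the 21st and 22nd values: (23 + 22)/2 = 22.5.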
• If there is an odd number of data values, the median is simply the middle value.
Short Cut: For a data set with n values, the median rank is the $\frac{n+1}{2}$th entry. If this rank ends in 0.5, we take the average of the data values in the adjacent positions.
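The short cut can be sketched in a few lines of Python. The function name `median_by_rank` is hypothetical, and the data are the postseason scores, assumed to split into 42 values as in the spreadsheet grid later in the lecture.

```python
def median_by_rank(values):
    """Median via the (n + 1)/2 rank short cut (ranks start at 1)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2
    if rank.is_integer():
        # Odd n: the rank is a whole number, so take that entry directly.
        return ordered[int(rank) - 1]
    # Even n: the rank ends in 0.5, so average the two adjacent entries.
    lower = ordered[int(rank) - 1]   # entry just below the rank
    upper = ordered[int(rank)]       # entry just above the rank
    return (lower + upper) / 2


scores = [
    26, 48, 28, 38, 9, 35, 15, 44, 13, 17, 21, 22, 9, 29,
    13, 30, 21, 38, 3, 44, 17, 27, 13, 30, 20, 28, 23, 29,
    17, 52, 20, 30, 13, 20, 10, 34, 10, 29, 3, 24, 0, 31,
]

print("median rank:", (len(scores) + 1) / 2)
print("median:", median_by_rank(scores))
```

Python lists are indexed from 0, which is why the code subtracts 1 from the rank before looking up an entry.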
While the median is often a useful summary for data, it is not complete by itself.
In particular, it does not provide information about the spread of the data.
Example: three data sets with median 60:
100 60 60 60 60 60 60 60 60 60 0
100 99 95 91 89 60 20 15 3 1 0
100 78 76 75 74 60 54 51 50 49 0
Definition: Range. The range of a data set is the difference between the largest element and the smallest element. That is:
range = largest – smallest.
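While the range measures variation, it is not perfect. All three data sets below have range 100 – 0 = 100, even though their values are spread very differently:
100 60 60 60 60 60 60 60 60 60 0
100 99 95 91 89 60 20 15 3 1 0
100 78 76 75 74 60 54 51 50 49 0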
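A quick sketch confirms that the range cannot tell these three data sets apart. The labels "A", "B", "C" and the helper `data_range` are illustrative names, not part of the lecture.

```python
# The three data sets with median 60, in the order shown on the slide.
data_sets = {
    "A": [100, 60, 60, 60, 60, 60, 60, 60, 60, 60, 0],
    "B": [100, 99, 95, 91, 89, 60, 20, 15, 3, 1, 0],
    "C": [100, 78, 76, 75, 74, 60, 54, 51, 50, 49, 0],
}


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


ranges = {name: data_range(values) for name, values in data_sets.items()}

for name, value in ranges.items():
    print(f"data set {name}: range = {value}")
```

Because the range depends only on the two extreme values, it ignores how the remaining nine values in each set are distributed.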
Definitions:
• First Quartile. The first quartile is chosen so that 25% of the data lie at or below it.
• Second Quartile. The second quartile is chosen so that 50% of the data lie at or below it.
• Third Quartile. The third quartile is chosen so that 75% of the data lie at or below it.
1. Rank the data from smallest to largest.
2. Find the median – it is the second quartile.
3. Take the lower half of the data. (If there are an odd number of measurements, include the median.) The median of this lower half is the first quartile, Q1.
4. Repeat for the upper half to find the third quartile, Q3.
5. The difference Q3 - Q1 is called the interquartile range (IQR).
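The five steps above can be sketched as follows, applied to the three data sets with median 60. The function name `quartiles` and the labels "A", "B", "C" are hypothetical; the only library call is `statistics.median` from the Python standard library.

```python
import statistics


def quartiles(values):
    """Q1, median and Q3 by the median-of-halves procedure.

    When the number of values is odd, the median is included in both halves.
    """
    ordered = sorted(values)              # step 1: rank the data
    n = len(ordered)
    half = (n + 1) // 2                   # half size; counts the median if n is odd
    q2 = statistics.median(ordered)       # step 2: second quartile
    q1 = statistics.median(ordered[:half])        # step 3: lower half
    q3 = statistics.median(ordered[n - half:])    # step 4: upper half
    return q1, q2, q3


data_sets = {
    "A": [100, 60, 60, 60, 60, 60, 60, 60, 60, 60, 0],
    "B": [100, 99, 95, 91, 89, 60, 20, 15, 3, 1, 0],
    "C": [100, 78, 76, 75, 74, 60, 54, 51, 50, 49, 0],
}

iqrs = {}
for name, values in data_sets.items():
    q1, q2, q3 = quartiles(values)
    iqrs[name] = q3 - q1                  # step 5: interquartile range
    print(f"{name}: Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqrs[name]}")
```

Unlike the range, the IQR separates the three data sets, since it measures the spread of the middle half of the data.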
Computing quartiles:
Computing quartiles may be facilitated by using Microsoft Excel:
31   0  29  23  29   9
24   3  28  20  22  21
29  10  30  13  17  13
34  10  27  17  44  15
20  13  44   3  35   9
30  20  38  21  38  28
52  17  30  13  48  26
QUARTILE(A1:F7,1)
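For readers without Excel, a comparable calculation is available in Python's standard library. This is a sketch under one assumption: that `statistics.quantiles` with `method="inclusive"` interpolates the same way as Excel's QUARTILE. The name `grid` is illustrative and mirrors the range A1:F7.

```python
import statistics

# The 7 x 6 grid of values entered in cells A1:F7.
grid = [
    [31, 0, 29, 23, 29, 9],
    [24, 3, 28, 20, 22, 21],
    [29, 10, 30, 13, 17, 13],
    [34, 10, 27, 17, 44, 15],
    [20, 13, 44, 3, 35, 9],
    [30, 20, 38, 21, 38, 28],
    [52, 17, 30, 13, 48, 26],
]

# Flatten the grid into one list, as the cell range does.
values = [value for row in grid for value in row]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print("Q1 =", q1, " median =", q2, " Q3 =", q3, " IQR =", q3 - q1)
```

Interpolated quartiles of this kind can differ slightly from the median-of-halves procedure, so it is worth stating which method was used when reporting an IQR.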
BOX-AND-WHISKER PLOTS
1. Draw a vertical scale to include the low and high values.
2. To the scale’s right, draw a box between the first and third quartiles.
3. Draw a line through the box at the median value.
4. Draw lines (whiskers) from the box to the low and high values.
5. Often the whiskers are drawn to the most extreme values within 1.5 IQR of both Q1 and Q3. Symbols (+, *) are used to mark each possible outlier between 1.5 and 3 IQR, and each probable outlier beyond 3 IQR of both Q1 and Q3, respectively.
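The numbers needed to draw such a plot by hand can be sketched as below, using the postseason scores (assumed to split into 42 values as in the spreadsheet grid) and the median-of-halves quartiles. All variable names are illustrative.

```python
import statistics

scores = [
    26, 48, 28, 38, 9, 35, 15, 44, 13, 17, 21, 22, 9, 29,
    13, 30, 21, 38, 3, 44, 17, 27, 13, 30, 20, 28, 23, 29,
    17, 52, 20, 30, 13, 20, 10, 34, 10, 29, 3, 24, 0, 31,
]

ordered = sorted(scores)
n = len(ordered)
half = (n + 1) // 2                      # includes the median when n is odd

q1 = statistics.median(ordered[:half])
median = statistics.median(ordered)
q3 = statistics.median(ordered[n - half:])
iqr = q3 - q1

# Cut-offs at 1.5 IQR and 3 IQR beyond the quartiles.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the most extreme values within 1.5 IQR of the box.
inside = [v for v in ordered if inner_low <= v <= inner_high]
whisker_low, whisker_high = inside[0], inside[-1]

# Possible outliers lie between 1.5 and 3 IQR; probable outliers lie beyond 3 IQR.
possible = [v for v in ordered
            if outer_low <= v < inner_low or inner_high < v <= outer_high]
probable = [v for v in ordered if v < outer_low or v > outer_high]

print("box:", q1, "to", q3, "with median line at", median)
print("whiskers:", whisker_low, "to", whisker_high)
print("possible outliers:", possible, " probable outliers:", probable)
```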
Exercise: Draw a boxplot for the domestic gross receipts of the top 100 movies of all time:
www.boxofficemojo.com/alltime/domestic.htm
Note: Many statistical software packages (SPSS, SAS, etc.) can create boxplots automatically. Unfortunately, Excel is not one of them.
Stem-and-Leaf Plots
33  92  63  72  81
82  81  42  87  55
82  48  49  72  73
95 102 101  92  89
95  74  73  99
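A stem-and-leaf plot can be sketched in Python as follows. This assumes the 24 values listed alongside this heading are the data intended for the plot, and that stems are tens and leaves are units; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

data = [
    33, 92, 63, 72, 81,
    82, 81, 42, 87, 55,
    82, 48, 49, 72, 73,
    95, 102, 101, 92, 89,
    95, 74, 73, 99,
]


def stem_and_leaf(values):
    """Return the plot as a list of text lines, one line per stem."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # tens digit(s) and units digit
        stems[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, even one with no leaves,
    # so that gaps in the data remain visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


lines = stem_and_leaf(data)
print("\n".join(lines))
```

Sorting before splitting keeps the leaves in increasing order on each stem, which makes the median and quartiles easy to read off the plot.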
Histograms: Continuous Data
In a histogram:
1) A histogram is a special kind of bar chart.
2) Percentages are represented by areas, not heights.
3) The height of a block represents the percentage per horizontal unit.
4) Be sure to decide on the endpoint convention.
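The four points above can be illustrated with a short sketch. The data are the postseason scores from earlier in the lecture (assumed to split into 42 values as in the spreadsheet grid); the class boundaries in `edges` are hypothetical choices made only for this illustration, and the endpoint convention assumed here is that each class includes its left endpoint but not its right.

```python
scores = [
    26, 48, 28, 38, 9, 35, 15, 44, 13, 17, 21, 22, 9, 29,
    13, 30, 21, 38, 3, 44, 17, 27, 13, 30, 20, 28, 23, 29,
    17, 52, 20, 30, 13, 20, 10, 34, 10, 29, 3, 24, 0, 31,
]

# Hypothetical classes of unequal width: 0-10, 10-20, 20-30, 30-40, 40-60.
edges = [0, 10, 20, 30, 40, 60]


def density_table(values, edges):
    """Rows of (left, right, frequency, percent, height in percent per unit)."""
    n = len(values)
    rows = []
    for left, right in zip(edges, edges[1:]):
        # Endpoint convention: left endpoint included, right endpoint excluded.
        frequency = sum(1 for v in values if left <= v < right)
        percent = 100 * frequency / n
        height = percent / (right - left)   # percentage per horizontal unit
        rows.append((left, right, frequency, percent, height))
    return rows


rows = density_table(scores, edges)
for left, right, frequency, percent, height in rows:
    print(f"{left:>2}-{right:<2}  freq {frequency:>2}  "
          f"{percent:5.1f}%  height {height:.2f}% per point")

# Areas (width times height) recover the percentages and add up to 100.
total_area = sum((right - left) * height for left, right, _, _, height in rows)
print("total area:", round(total_area, 6))
```

Dividing by the class width is what keeps the wide 40-60 class from looking taller than its share of the data warrants.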
Ex. Construct a histogram for the 2007 salaries of the 50 U.S. governors (p. 27, from Council of State Governments):
http://www.stateline.org/live/details/story?contentId=207914
Class          Frequency    Relative frequency    Density per $1000
$ 70- 90,000
$ 90-110,000
$110-120,000
$120-130,000
$130-150,000
$150-170,000
$170-210,000
Ex: Draw a histogram for the domestic gross receipts of all movies that grossed at least $100 million:
www.boxofficemojo.com/alltime/domestic.htm
Class          Frequency    Relative frequency    Density per $1M
$100 - 110M
$110 - 120M
$120 - 130M
$130 - 150M
$150 - 175M
$175 - 200M
$200 - 250M
$250 - 300M
$300 - 800M
How do you decide on the classes?
1. Too few classes: very undescriptive.
2. Too many classes: very choppy.
3. Sturges’ rule of thumb: for a data set of size n, use k classes, where
$k \approx \log_2 n = \frac{\ln n}{\ln 2}$,
rounded up to the nearest integer.
4. For long tails, use wide classes as appropriate.
5. Within these guidelines, there are no absolute rules.
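The rule of thumb in item 3 is easy to evaluate. The sketch below follows the form given on the slide (log base 2 of n, rounded up); some references state Sturges' rule as 1 + log2 n, which gives one more class in most cases. The function name `class_count` is hypothetical.

```python
import math


def class_count(n):
    """Suggested number of classes: log base 2 of n, rounded up."""
    return math.ceil(math.log2(n))


# Change of base: log2(n) equals ln(n) / ln(2), up to rounding error.
n = 50
assert math.isclose(math.log2(n), math.log(n) / math.log(2))

for size in (42, 50, 100):
    print(f"n = {size}: about {class_count(size)} classes")
```

The result is only a starting point; the remaining guidelines (wide classes for long tails, avoiding choppy plots) still call for judgment.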
Conclusion:
Do NOT use Excel to make histograms
Other software packages (R, Minitab, SPSS etc.) can make correct histograms.
For now, just draw histograms by hand.
Histograms: Discrete Data
Example: A production line inspector records the number of defective items produced each hour of an eight-hour shift:
4 2 4 5 10 5 3 6
Number of items    Frequency    Relative frequency (Density)
2
3
4
5
6
7
8
9
10
Notice that the bar for 2 stretches from 1.5 to 2.5, giving that bar a width of 1.
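The table for this example can be filled in with a short sketch. The variable names are illustrative; the data are the eight hourly counts given above.

```python
from collections import Counter

# Defective items recorded in each hour of the eight-hour shift.
defects = [4, 2, 4, 5, 10, 5, 3, 6]

counts = Counter(defects)
table = []
for value in range(min(defects), max(defects) + 1):
    frequency = counts[value]                 # Counter gives 0 for unseen values
    relative = frequency / len(defects)
    # Each bar is centered on its value and has width 1, so its height
    # (the density) equals the relative frequency.
    bar = (value - 0.5, value + 0.5)
    table.append((value, frequency, relative, bar))

for value, frequency, relative, bar in table:
    print(f"{value:>2}  freq {frequency}  rel. freq {relative:.3f}  "
          f"bar from {bar[0]} to {bar[1]}")
```

Values that never occur (7, 8 and 9 here) are kept in the table with frequency 0, so the gap before 10 shows up in the histogram.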