UNIT II EXPLORING DATA:
DESCRIBING PATTERNS AND DEPARTURES FROM PATTERNS (Chapters 1-3)
(Chapter 1)
TYPES OF DATA
Data are information, often collected from a smaller group (a sample), used to describe or infer about a
larger group (a population).
• A statistic is a numerical measurement that describes a sample (a sub-collection of elements).
~as with a group of randomly selected doctors…
~English letters: s, x̄, …
• A parameter is a numerical measurement that describes a population (all elements to be studied).
~as with a census of all US adults…
~Greek letters: σ, μ, …
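The distinction above can be sketched in a few lines of Python. This is only an illustration: the heart-rate values are made up, the sample is hand-picked rather than randomly selected, and the variable names (mu, sigma, x_bar, s) are chosen here to mirror the Greek/English letter convention.

```python
import statistics

# Hypothetical population: resting heart rates (bpm) for every member of a small club.
population = [62, 70, 68, 75, 80, 66, 72, 71, 69, 77]

# Hypothetical sample: a sub-collection of four members (hand-picked for illustration).
sample = [70, 75, 66, 77]

# Parameters describe the population (Greek letters).
mu = statistics.mean(population)        # population mean
sigma = statistics.pstdev(population)   # population standard deviation

# Statistics describe the sample (English letters).
x_bar = statistics.mean(sample)         # sample mean
s = statistics.stdev(sample)            # sample standard deviation (divides by n - 1)

print(f"mu = {mu}, sigma = {sigma:.2f}")
print(f"x_bar = {x_bar}, s = {s:.2f}")
```

The sample values generally differ from the population values, which is why the two sets of symbols are kept separate.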
Data are the information we are collecting, represented as counts or percentages (known as frequencies or
relative frequencies, respectively).
A variable is any characteristic of individuals (cases)—people (subjects), animals, objects, or places—that
can take different values for different individuals, such as height, gas mileage, heart rate, and eye color.
The distribution of a variable shows what values a variable takes and how often it takes those values.
Variables are classified as categorical or quantitative.
Categorical data (qualitative)—values are non-numerical labels/names; data are placed into one of
several groups/categories
• Eye color: brown, blue, green, …
Quantitative data (numeric)—values are numbers; data take numerical values for which it is
sensible to find averages
• Measurements: height, weight, length, …
EXAMPLE: Complete the lists of examples of data:
Categorical data                                    Quantitative data
(categories with non-numerical characteristics)     (numbers representing counts or measurements)
— Eye color                                         — Heights
— Social security numbers                           — Number of cars owned
—                                                   —
—                                                   —
Quantitative data can be further distinguished as discrete or continuous.
Discrete data—a finite or countable number of possibilities, such as the number of kids in a family
(“fewer” kids)
Continuous data—infinitely many possible values over a continuous span/scale, such as the amount of
milk produced by a cow (“less” milk)
Another method of data classification involves Four Levels of Measurement:
1. Nominal—categories by name/group
(no measurements or counts)
(cannot be ordered)
Examples:
• Yes/No responses
• Eye color: brown, blue, green
• Political affiliation: Democrat, Republican, Independent
(Levels 1 and 2, nominal and ordinal, are categorical.)
2. Ordinal—categories and some order to show relative comparisons
(differences (when subtracting) between data are not meaningful)
Examples:
• Rankings of cities: 1st—Boston, 2nd—NYC, 3rd—Miami
• Ratings of professors: Excellent, Good, Fair, Poor
3. Interval—arranged with order and meaningful differences between values
(no true zero, no natural zero starting point, no “none of the quantity present”)
Examples:
• Temperature (°F, °C): where 0° F ≠ no heat; the difference
between 68° F and 71° F is 3° F, which has meaning
• Years: where year 0 ≠ no time
(Levels 3 and 4, interval and ratio, are quantitative.)
4. Ratio—the interval level and a natural zero starting point making ratios meaningful
Examples:
• Distance: 0 miles = no distance; 10 mi to 20 mi is double the distance
• Price: $0 = no cost; $30 to $90 is triple the cost
• Age: 0 is natural starting point
~HINT~
Use the “ratio test” to distinguish between interval and ratio levels: if “twice”
correctly describes a quantity and its double, then it is likely ratio level:
➢ 50 lbs. is twice as much weight as 25 lbs.
➢ 80° F is not twice as hot as 40° F; can only interpret as 40° hotter
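One way to see why the ratio test fails for Fahrenheit is to re-express the same two temperatures on a scale that does have a true zero. The sketch below uses the Kelvin scale for that purpose; Kelvin is not part of these notes and is brought in only as an illustration, and the function name is chosen here.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (temp_f - 32) * 5 / 9 + 273.15

# Weight is ratio level: 0 lbs means no weight, so the ratio is meaningful.
weight_ratio = 50 / 25

# Fahrenheit is interval level: the naive ratio of the two readings is 2, ...
naive_temp_ratio = 80 / 40
# ... but on the Kelvin scale the same two temperatures differ by only about 8%.
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print(weight_ratio, naive_temp_ratio, round(kelvin_ratio, 2))
```

The weight ratio stays 2 in any unit (pounds, kilograms), while the temperature "ratio" depends on where the scale happens to put its zero.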
We are working with “univariate data” when we are describing only one variable, such as student GPAs.
Sometimes, we compare distributions of univariate data from multiple populations, such as comparing
student GPAs between juniors and seniors.
We are working with “bivariate data” when we are examining relationships/associations between two
variables about one population, such as time seniors spend studying and their GPAs.
Once we have collected data from an observational study or an experiment, we provide a graph and a written description of the distribution(s) of the data set(s). When we look at data sets for multiple populations, we must also provide a written comparison of the distributions.
CATEGORICAL DATA
Recall that categorical (qualitative) values are non-numerical labels/names where data are placed into one of
several groups/categories, such as eye color, type of car, and preferred superpower.
UNIVARIATE DATA—data for one variable
GRAPH
We will provide a graph of the distribution of each data set to show what values a variable takes and how
often it takes those values.
Frequency tables provide a basis for constructing graphs.
Frequency Distribution—a table that lists data values along with their corresponding frequencies
(counts) or relative frequencies (percentages)
1) Must have a title and two columns (class, frequency).
2) List the categorical data values under “Class.”
3) Show corresponding counts under “Frequency” or percents under “Relative Frequency.”
a. Use tally marks to represent data values for each class interval.
b. Add tallies to find frequencies, or divide frequencies by n to find relative frequencies.
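The steps above can be sketched in Python as follows. The survey responses are hypothetical (they are not real class data), and collections.Counter stands in for the tally marks.

```python
from collections import Counter

# Hypothetical survey responses (not real class data).
responses = [
    "Invisibility", "Teleportation", "Mind Reading", "Teleportation",
    "Time-stopping", "Teleportation", "Invisibility", "Telekinesis",
    "Mind Reading", "Teleportation",
]

frequencies = Counter(responses)   # tallies added up: one count per class
n = len(responses)                 # total number of observations

# Divide each frequency by n to get the relative frequency.
relative_frequencies = {cls: count / n for cls, count in frequencies.items()}

print(f"{'Class':<15}{'Frequency':>10}{'Relative Frequency':>20}")
for cls, count in frequencies.most_common():
    print(f"{cls:<15}{count:>10}{relative_frequencies[cls]:>20.0%}")
```

The relative frequencies should always sum to 1 (100%), which is a quick check on the table.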
EXAMPLE: Construct a frequency table to show the preferred superpowers among your class.
Preferred Superpowers for Our Class
Class Frequency Tallies Relative Frequency
Mind Reading
Invisibility
Teleportation
Telekinesis
Time-stopping
The distributions of categorical univariate data can be graphed using the following:
• Dotplot
• Bar Graph
• Pie Chart
Dotplot—dots represent observations (each dot can represent multiple observations; provide a key); used
with few categories, small data set, small range; column heights show relative frequency
1) Must have a title.
2) Draw using only a horizontal axis (labeled in units).
3) Each value in the range of the data must be represented.
4) This type of graph should generally only be used for integer values.
EXAMPLE: STUDENTS’ FAVORITE SNACKS
Bar Graph—bars represent the number of observations; relative area (height of column) shows relative
frequency (when y-axis starts at zero); (no skewness/symmetry with categorical data)
1) Must have a title.
2) Must label both axes (horizontal contains categories; vertical contains counts/frequencies or
percentages/relative frequencies).
3) Draw bars of equal width, equal spacing between the bars, and space between vertical axis
and first bar.
4) Use a break // in scale of vertical axis only at the beginning of the axis.
5) Omit any categories that contain zero observations.
EXAMPLE: Enrollment in Introductory Courses at Union University
Pie Chart—circular chart divided into sectors that indicate relative sizes, in proportion to the whole
group; percentages
1) Must label with a title.
2) Use proper percentages (proportional amounts of 360°).
3) Provide a key.
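The "proportional amounts of 360°" in step 2 can be worked out as below. The expense categories and dollar amounts are hypothetical and are not the figures from the original chart.

```python
# Hypothetical monthly expenses in dollars (not the figures from the original chart).
expenses = {"Rent": 900, "Food": 450, "Transport": 300, "Other": 150}

total = sum(expenses.values())

# Each sector's share of the whole, and its central angle out of 360 degrees.
sectors = {
    label: {"percent": 100 * amount / total, "degrees": 360 * amount / total}
    for label, amount in expenses.items()
}

for label, sector in sectors.items():
    print(f"{label:<10}{sector['percent']:>6.1f}%{sector['degrees']:>8.1f} deg")
```

The sector angles should add up to 360° and the percentages to 100%.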
EXAMPLE: Accounting of Individual Expenses
DESCRIPTION
We will provide a written description/summary of the distribution(s) of the data set(s) to describe the
overall pattern, based on shape, outliers, center, and spread (SOCS).
We will use more detail to describe quantitative data (see next section).
EXAMPLE: Describe the distribution of courses at Union University.
COMPARISON
When we look at data sets for multiple populations, we must also provide a written comparison of the distributions. Using relative frequencies is especially advantageous over using counts when comparing groups of different sizes.
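A small sketch of why relative frequencies matter when group sizes differ. The group names and counts are hypothetical.

```python
# Hypothetical counts of students who prefer "Teleportation" in two groups
# of different sizes.
groups = {
    "Juniors": {"prefer": 12, "total": 40},
    "Seniors": {"prefer": 9, "total": 20},
}

for name, g in groups.items():
    # Relative frequency = count divided by the size of that group.
    g["relative"] = g["prefer"] / g["total"]
    print(f"{name}: count = {g['prefer']}, relative frequency = {g['relative']:.0%}")
```

By count, more juniors than seniors chose the category, but by relative frequency the preference is more common among seniors. Comparing counts alone would point the wrong way.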
The following graphs can be used:
• Parallel Dotplot
• Side-by-Side Bar Graph (Double Bar Graph for two groups)
o Multiple populations
o Multiple variables (see “Bivariate Data” section below)
Parallel Dotplot—compare groups by stacking distributions on one horizontal axis
EXAMPLE: Describe/compare distributions of ratings by daytime and nighttime workers.
Employee Job Satisfaction Ratings
Side-by-Side Bar Graph—compare groups represented by multiple bars on one graph
EXAMPLE:
BIVARIATE DATA—data for two variables
GRAPH
The distributions of categorical bivariate data can be graphed using the following:
• Double Bar Graph (Side-by-Side Bar Graph) (See also Segmented Bar Graph)
• Two-way Contingency Table
Double Bar Graph—include bars coded to represent different variables (provide key)
EXAMPLE:
EXAMPLE: Personality type and Favorite color
Two-way contingency table—displays counts for two categorical variables; used for examining
relationships/associations (quantitative data can be displayed using a two-way table if the data are
converted to represent categorical data, such as 90% as A, 80% as B, etc.)
EXAMPLE: This table shows marginal totals of the distribution of each variable separately in
the right column and bottom row from a total of 240 individuals.
CAR OWNERSHIP BY SEX AND VEHICLE TYPE
From two-way tables, we can find marginal distributions and conditional distributions.
A marginal distribution is the distribution of values of one of the variables for all the individuals.
Look in the margins for marginal totals, the last column and the bottom row.
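Marginal totals can be computed from a two-way table as sketched below. The counts are hypothetical and do not reproduce the car ownership table referenced above; the dictionary layout is just one convenient way to hold the table.

```python
# Hypothetical two-way table: rows are sex, columns are vehicle type.
table = {
    "Female": {"Car": 30, "Truck": 5, "SUV": 15},
    "Male":   {"Car": 20, "Truck": 20, "SUV": 10},
}

# Row totals (last column): marginal totals for sex.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals (bottom row): marginal totals for vehicle type.
column_totals = {}
for cells in table.values():
    for column, count in cells.items():
        column_totals[column] = column_totals.get(column, 0) + count

grand_total = sum(row_totals.values())

# Marginal distribution of vehicle type, as relative frequencies.
marginal_vehicle = {col: count / grand_total for col, count in column_totals.items()}

print(row_totals)
print(column_totals)
print(marginal_vehicle)
```

Both sets of marginal totals should add up to the same grand total.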
EXAMPLE: Complete the following:
a. Construct a two-way table for preferred superpowers and sex (for our class).
b. Find marginal totals, and construct a bar graph for the marginal distribution of type
of superpower for all individuals. Describe the distribution.
A conditional distribution is the distribution of the values of one of the variables among individuals
with a specific value of the other variable. Since there are two variables in a two-way table, there
are two sets of conditional distributions, one for the row variable and one for the column variable.
With conditional distributions, we can check for a potential association/relationship between the
variables (and, more formally, with statistical inference in Unit IV). Consider which conditional
distributions to compare based on which variable might help to explain/predict (explanatory variable)
changes in the other variable (response variable).
An association exists when knowledge of one of the variables helps to predict changes in the other
variable. To determine if data support a relationship, look for differences in the response variable
between the groups.
We will look at the strength of relationships (correlation) later. Remember that even a strong
association can be influenced by other variables, so do not assume causation.
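A minimal sketch of finding conditional distributions and comparing them. The counts are hypothetical, with sex treated as the explanatory variable and vehicle type as the response variable for the sake of the example.

```python
# Hypothetical two-way table: rows are sex (explanatory), columns are vehicle type (response).
table = {
    "Female": {"Car": 30, "Truck": 5, "SUV": 15},
    "Male":   {"Car": 20, "Truck": 20, "SUV": 10},
}

# Conditional distribution of vehicle type within each sex:
# divide each cell by its own row total.
conditional = {
    row: {col: count / sum(cells.values()) for col, count in cells.items()}
    for row, cells in table.items()
}

for row, dist in conditional.items():
    summary = ", ".join(f"{col} {share:.0%}" for col, share in dist.items())
    print(f"{row}: {summary}")

# Difference in the response between the two groups, category by category.
differences = {
    col: conditional["Female"][col] - conditional["Male"][col]
    for col in table["Female"]
}
print(differences)
```

If the conditional distributions were nearly identical, knowing a person's sex would not help predict vehicle type. Here they differ noticeably (for example, trucks), which suggests an association in this made-up data, though not causation.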
EXAMPLE: Sketch two “AP Enrollment” graphs representing association and no association.
EXAMPLE: Complete the following using the superpowers table.
a. Identify which variable is likely to be the explanatory variable and which is likely to be
the response variable.
1. Explanatory variable:
2. Response variable:
b. Find the conditional distributions of superpower for each sex, and construct a double bar
graph showing the relative frequency of superpower between the sexes.
c. Is there an association between sex and preferred superpower? Give appropriate evidence
to support your answer.
(Additional example: Pew Research Center, cell phone type & age)
Simpson’s Paradox—
(baseball, simpsons, doctors)
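As a hedged illustration of the baseball case: Simpson's paradox is commonly described as a comparison that holds within each of several groups but reverses when the groups are combined. The batting records below are invented for this sketch and do not refer to real players.

```python
# Hypothetical (hits, at_bats) for two players over two seasons.
records = {
    "Player A": {"Season 1": (4, 10), "Season 2": (25, 100)},
    "Player B": {"Season 1": (35, 100), "Season 2": (2, 10)},
}

def season_average(player, season):
    """Batting average for one player in one season."""
    hits, at_bats = records[player][season]
    return hits / at_bats

def combined_average(player):
    """Batting average with both seasons pooled together."""
    hits = sum(h for h, _ in records[player].values())
    at_bats = sum(ab for _, ab in records[player].values())
    return hits / at_bats

for season in ("Season 1", "Season 2"):
    a = season_average("Player A", season)
    b = season_average("Player B", season)
    print(f"{season}: A = {a:.3f}, B = {b:.3f}")

print(f"Combined: A = {combined_average('Player A'):.3f}, "
      f"B = {combined_average('Player B'):.3f}")
```

Player A has the higher average in each season, yet Player B has the higher average overall, because each player's at-bats are concentrated in a different season.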