unit ii exploring data: describing patterns and …

UNIT II EXPLORING DATA:

DESCRIBING PATTERNS AND DEPARTURES FROM PATTERNS (Chapters 1-3)

(Chapter 1)

TYPES OF DATA

Data are information, often collected from a smaller group (a sample), used to describe or draw inferences about a larger group (a population).

• A statistic is a numerical measurement that describes a sample (a sub-collection of elements).

~as with a group of randomly selected doctors…

~English letters: s, x̄, …

• A parameter is a numerical measurement that describes a population (all elements to be studied).

~as with a census of all US adults…

~Greek letters: σ, μ, …
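The statistic/parameter distinction above can be sketched in code. This is only an illustration: the population below is simulated (hypothetical heights in inches), and the names `population`, `sample`, `mu`, `sigma`, `x_bar`, and `s` are chosen here for the example. By common convention the population mean and standard deviation are written μ and σ, and the sample mean and standard deviation are written x̄ and s.

```python
import random
import statistics

# Hypothetical population: simulated heights (inches) of all 1,000 members of a group.
random.seed(1)  # fixed seed so the sketch is repeatable
population = [random.gauss(66, 3) for _ in range(1000)]

# Parameters describe the whole population (Greek letters).
mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# Statistics describe a sample drawn from the population (English letters).
sample = random.sample(population, 25)
x_bar = statistics.fmean(sample)       # sample mean
s = statistics.stdev(sample)           # sample standard deviation

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The statistics will usually be close to, but not equal to, the parameters; a different sample gives different statistics, while the parameters stay fixed.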

Data are the information we are collecting, represented as counts or percentages (known as frequencies or

relative frequencies, respectively).

A variable is any characteristic of individuals (cases)—people (subjects), animals, objects, or places—that

can take different values for different individuals, such as height, gas mileage, heart rate, and eye color.

The distribution of a variable shows what values a variable takes and how often it takes those values.

Variables are classified as categorical or quantitative.

Categorical data (qualitative)—values are non-numerical labels/names; data are placed into one of

several groups/categories

• Eye color: brown, blue, green, …

Quantitative data (numeric): values are numbers; the data take numerical values for which it is sensible to find averages

• Measurements: height, weight, length, …

EXAMPLE: Complete the lists of examples of data:

Categorical data (categories with non-numerical characteristics)

• Eye color

• Social security numbers

•

•

Quantitative data (numbers representing counts or measurements)

• Heights

• Number of cars owned

•

•

Quantitative data can be further distinguished as discrete or continuous.

Discrete data: a finite or countable number of possibilities, such as the number of kids in a family (“fewer” kids)

Continuous data: infinitely many possible values over a continuous span/scale, such as the amount of milk produced by a cow (“less” milk)

Another method of data classification involves Four Levels of Measurement:

1. Nominal—categories by name/group

(no measurements or counts)

(cannot be ordered)

Examples:

• Yes/No responses

• Eye color: brown, blue, green

• Political affiliation: Democrat, Republican, Independent

(Levels 1 and 2, nominal and ordinal, are categorical.)

2. Ordinal—categories and some order to show relative comparisons

(differences (when subtracting) between data are not meaningful)

Examples:

• Rankings of cities: 1st—Boston, 2nd—NYC, 3rd—Miami

• Ratings of professors: Excellent, Good, Fair, Poor

3. Interval—arranged with order and meaningful differences between values

(no true zero, no natural zero starting point, no “none of the quantity present”)

Examples:

• Temperature (°F, °C): where 0° F ≠ no heat; the difference between 68° F and 71° F is 3° F, which has meaning

• Years: where year 0 ≠ no time

(Levels 3 and 4, interval and ratio, are quantitative.)

4. Ratio—the interval level and a natural zero starting point making ratios meaningful

Examples:

• Distance: 0 miles = no distance; 10 mi is half the distance of 20 mi

• Price: $0 = no cost; $90 is triple the cost of $30

• Age: 0 is natural starting point

~HINT~

Use the “ratio test” to distinguish interval-level from ratio-level values: if “twice” correctly describes a quantity and its double, then the data are likely ratio level:

➢ 50 lbs. is twice as much weight as 25 lbs.

➢ 80° F is not twice as hot as 40° F; can only interpret as 40° hotter
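The ratio test can be checked numerically. The sketch below is an illustration, not part of the original notes: it brings in the Kelvin scale (an assumption added here, since Kelvin has a true zero) to show why "twice as hot" fails for Fahrenheit readings. The function name `fahrenheit_to_kelvin` is chosen for this example.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit reading to Kelvin, whose zero is absolute zero."""
    return (deg_f - 32) * 5 / 9 + 273.15

# Weight is ratio level: 0 lbs means no weight, so the ratio is meaningful.
weight_ratio = 50 / 25

# Fahrenheit is interval level: dividing the readings gives 2,
# but that number has no physical meaning because 0° F is not "no heat".
naive_temp_ratio = 80 / 40

# On a scale with a true zero, 80° F is only about 1.08 times 40° F.
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print(f"weight ratio: {weight_ratio}")
print(f"naive Fahrenheit ratio: {naive_temp_ratio}")
print(f"ratio on the Kelvin scale: {kelvin_ratio:.3f}")
```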

We are working with “univariate data” when we are describing only one variable, such as student GPAs.

Sometimes, we compare distributions of univariate data from multiple populations, such as comparing

student GPAs between juniors and seniors.

We are working with “bivariate data” when we are examining relationships/associations between two

variables about one population, such as time seniors spend studying and their GPAs.

Once we have collected data from an observational study or an experiment, we provide a graph and a written description of the distribution(s) of the data set(s). When we look at data sets for multiple populations, we must also provide a written comparison of the distributions.

CATEGORICAL DATA

Recall that categorical (qualitative) values are non-numerical labels/names where data are placed into one of

several groups/categories, such as eye color, type of car, and preferred superpower.

UNIVARIATE DATA—data for one variable

GRAPH

We will provide a graph of the distribution of each data set to show what values a variable takes and how

often it takes those values.

Frequency tables provide a basis for constructing graphs.

Frequency Distribution—a table that lists data values along with their corresponding frequencies

(counts) or relative frequencies (percentages)

1) Must have a title and two columns (class, frequency).

2) List the categorical data values under “Class.”

3) Show corresponding counts under “Frequency” or percents under “Relative Frequency.”

a. Use tally marks to represent data values for each class.

b. Add tallies to find frequencies, or divide frequencies by n to find relative frequencies.
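The steps above can be sketched in a few lines. The responses below are made up for illustration (they are not real class data), and the names `responses`, `counts`, and `relative` are chosen for this example.

```python
from collections import Counter

# Hypothetical responses from 10 students (illustrative only).
responses = [
    "Teleportation", "Invisibility", "Mind Reading", "Teleportation",
    "Time-stopping", "Invisibility", "Teleportation", "Mind Reading",
    "Invisibility", "Teleportation",
]

counts = Counter(responses)  # frequency (count) for each class
n = len(responses)           # total number of observations

# Relative frequency = frequency divided by n.
relative = {power: count / n for power, count in counts.items()}

print(f"{'Class':<15}{'Tallies':<9}{'Frequency':>9}{'Rel. Freq.':>12}")
for power, count in counts.most_common():
    print(f"{power:<15}{'|' * count:<9}{count:>9}{relative[power]:>12.0%}")
```

The relative frequencies should always add to 1 (100%), which is a quick check on the arithmetic.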

EXAMPLE: Construct a frequency table to show the preferred superpowers among your class.

Preferred Superpowers for Our Class

Class Frequency Tallies Relative Frequency

Mind Reading

Invisibility

Teleportation

Telekinesis

Time-stopping

The distributions of categorical univariate data can be graphed using the following:

• Dotplot

• Bar Graph

• Pie Chart

Dotplot: dots represent observations (each dot can represent multiple observations if a key is provided); used with few categories, small data set, small range; column heights show relative frequency

1) Must have a title.

2) Draw using only a horizontal axis (labeled in units).

3) Each value in the range of the data must be represented.

4) This type of graph should generally only be used for integer values.

EXAMPLE: STUDENTS’ FAVORITE SNACKS

Bar Graph: bars represent the number of observations; relative area (height of column) shows relative frequency (when the y-axis starts at zero); no skewness/symmetry with categorical data

1) Must have a title.

2) Must label both axes (horizontal contains categories; vertical contains counts/frequencies or percentages/relative frequencies).

3) Draw bars of equal width, equal spacing between the bars, and space between vertical axis

and first bar.

4) Use a break // in scale of vertical axis only at the beginning of the axis.

5) Omit any categories that contain zero observations.

EXAMPLE: Enrollment in Introductory Courses at Union University

Pie Chart—circular chart divided into sectors that indicate relative sizes, in proportion to the whole

group; percentages

1) Must label with a title.

2) Use proper percentages (proportional amounts of 360°).

3) Provide a key.
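Converting amounts into percentages and sector angles can be sketched as follows. The expense figures are hypothetical (invented for this illustration), and the names `expenses` and `sectors` are chosen for the example.

```python
# Hypothetical monthly expenses in dollars (illustrative only).
expenses = {"Rent": 600, "Food": 300, "Transportation": 200, "Other": 100}

total = sum(expenses.values())

# Each sector gets the same share of 360 degrees as its share of the whole.
sectors = {
    category: (amount / total * 100, amount / total * 360)
    for category, amount in expenses.items()
}

for category, (percent, angle) in sectors.items():
    print(f"{category:<15}{percent:>6.1f}%{angle:>8.1f} degrees")
```

The percentages should total 100 and the angles should total 360.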

EXAMPLE: Accounting of Individual Expenses

DESCRIPTION

We will provide a written description/summary of the distribution(s) of the data set(s) to describe the

overall pattern, based on shape, outliers, center, and spread (SOCS).

We will use more detail to describe quantitative data (see next section).

EXAMPLE: Describe the distribution of courses at Union University.

COMPARISON

When we look at data sets for multiple populations, we must also provide a written comparison of the distributions. Using relative frequencies is especially advantageous over using counts when comparing groups of different sizes.
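A small sketch of why relative frequencies matter when group sizes differ. The counts are hypothetical (invented for this illustration, not taken from any survey), and the variable names are chosen for the example.

```python
# Hypothetical counts of workers who rated their job "satisfied".
day = {"satisfied": 60, "total": 200}    # larger group
night = {"satisfied": 30, "total": 50}   # smaller group

# Counts alone suggest day workers are more satisfied...
more_by_count = "day" if day["satisfied"] > night["satisfied"] else "night"

# ...but relative frequencies adjust for the different group sizes.
day_rel = day["satisfied"] / day["total"]
night_rel = night["satisfied"] / night["total"]
more_by_rate = "day" if day_rel > night_rel else "night"

print(f"by count: {more_by_count} ({day['satisfied']} vs {night['satisfied']})")
print(f"by relative frequency: {more_by_rate} ({day_rel:.0%} vs {night_rel:.0%})")
```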

The following graphs can be used:

• Parallel Dotplot

• Side-by-Side Bar Graph (Double Bar Graph for two groups)

o Multiple populations

o Multiple variables (see “Bivariate Data” section below)

Parallel Dotplot—compare groups by stacking distributions on one horizontal axis

EXAMPLE: Describe/compare distributions of ratings by daytime and nighttime workers.

Employee Job Satisfaction Ratings

Side-by-Side Bar Graph: compare groups represented by multiple bars on one graph

EXAMPLE:

BIVARIATE DATA—data for two variables

GRAPH

The distributions of categorical bivariate data can be graphed using the following:

• Double Bar Graph (Side-by-Side Bar Graph) (See also Segmented Bar Graph)

• Two-way Contingency Table

Double Bar Graph—include bars coded to represent different variables (provide key)

EXAMPLE:

EXAMPLE: Personality type and Favorite color

Two-way contingency table—displays counts for two categorical variables; used for examining

relationships/associations (quantitative data can be displayed using a two-way table if the data are

converted to represent categorical data, such as 90% as A, 80% as B, etc.)

EXAMPLE: This table shows marginal totals of the distribution of each variable separately in

the right column and bottom row from a total of 240 individuals.

CAR OWNERSHIP BY SEX AND VEHICLE TYPE

From two-way tables, we can find marginal distributions and conditional distributions.

A marginal distribution is the distribution of values of one of the variables for all the individuals.

Look in the margins for marginal totals, the last column and the bottom row.
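Finding marginal totals and a marginal distribution can be sketched as below. The two-way table is hypothetical (the counts are invented for this illustration), and the names `table`, `row_totals`, `col_totals`, and `marginal_superpower` are chosen for the example.

```python
# Hypothetical two-way table: rows are sex, columns are preferred superpower.
table = {
    "Female": {"Mind Reading": 4, "Invisibility": 3, "Teleportation": 5},
    "Male":   {"Mind Reading": 2, "Invisibility": 6, "Teleportation": 4},
}

# Marginal totals for the row variable (the last column of the table).
row_totals = {sex: sum(row.values()) for sex, row in table.items()}

# Marginal totals for the column variable (the bottom row of the table).
col_totals = {}
for row in table.values():
    for power, count in row.items():
        col_totals[power] = col_totals.get(power, 0) + count

grand_total = sum(row_totals.values())

# Marginal distribution of superpower: each column total over the grand total.
marginal_superpower = {p: c / grand_total for p, c in col_totals.items()}

print("row totals:", row_totals)
print("column totals:", col_totals)
print("marginal distribution:", marginal_superpower)
```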

EXAMPLE: Complete the following:

a. Construct a two-way table for preferred superpowers and sex (for our class).

b. Find marginal totals, and construct a bar graph for the marginal distribution of type

of superpower for all individuals. Describe the distribution.

A conditional distribution is the distribution of the values of one of the variables among individuals

with a specific value of the other variable. Since there are two variables in a two-way table, there

are two sets of conditional distributions, one for the row variable and one for the column variable.

With conditional distributions, we can check for a potential association/relationship between the

variables (and more formally with statistical inference (in Unit IV)). Consider which conditional

distributions to compare based on which variable might help to explain/predict (explanatory variable)

changes in the other variable (response variable).
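A minimal sketch of conditional distributions, assuming sex is treated as the explanatory variable and superpower as the response variable. The table is hypothetical (invented counts), and the names `table`, `conditional`, and `difference` are chosen for this example.

```python
# Hypothetical two-way table: rows are sex, columns are preferred superpower.
table = {
    "Female": {"Mind Reading": 4, "Invisibility": 3, "Teleportation": 5},
    "Male":   {"Mind Reading": 2, "Invisibility": 6, "Teleportation": 4},
}

# Conditional distribution of superpower for each sex:
# divide each count by its own row total, not by the grand total.
conditional = {
    sex: {power: count / sum(row.values()) for power, count in row.items()}
    for sex, row in table.items()
}

# Differences between the two conditional distributions hint at an association.
difference = {
    power: conditional["Female"][power] - conditional["Male"][power]
    for power in table["Female"]
}

for sex, dist in conditional.items():
    print(sex, {power: f"{share:.1%}" for power, share in dist.items()})
print("Female minus Male:", {p: f"{d:+.1%}" for p, d in difference.items()})
```

If the conditional distributions were nearly identical across the groups, the data would give little evidence of an association; large differences suggest one, though not causation.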

An association exists when knowledge of one of the variables helps to predict changes in the other

variable. To determine if data support a relationship, look for differences in the response variables

between the groups.

We will look at the strength of relationships (correlation) later. Remember that even a strong

association can be influenced by other variables, so do not assume causation.

EXAMPLE: Sketch two “AP Enrollment” graphs representing association and no association.

EXAMPLE: Complete the following using the superpowers table.

a. Identify which variable is likely to be the explanatory variable and which is likely to be

the response variable.

1. Explanatory variable:

2. Response variable:

b. Find the conditional distributions of superpower for each sex, and construct a double bar

graph showing the relative frequency of superpower between the sexes.

c. Is there an association between sex and preferred superpower? Give appropriate evidence

to support your answer.

(Additional example: Pew Research Center, cell phone type & age)

Simpson’s Paradox—

(baseball, simpsons, doctors)
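The notes leave the definition blank. In its standard meaning, Simpson's Paradox occurs when a comparison that holds within each of several groups reverses once the groups are combined. The baseball-style sketch below uses hypothetical numbers invented for this illustration; the player names and the helper functions `average` and `overall` are assumptions of the example.

```python
# Hypothetical (hits, at-bats) for two players in each half of a season.
player_a = {"first half": (4, 10), "second half": (25, 100)}
player_b = {"first half": (35, 100), "second half": (2, 10)}

def average(hits, at_bats):
    """Batting average for one group."""
    return hits / at_bats

def overall(player):
    """Batting average after combining all groups."""
    hits = sum(h for h, _ in player.values())
    at_bats = sum(ab for _, ab in player.values())
    return hits / at_bats

# Player A has the higher average in each half...
for half in player_a:
    a = average(*player_a[half])
    b = average(*player_b[half])
    print(f"{half}: A = {a:.3f}, B = {b:.3f}")

# ...yet Player B has the higher average for the whole season.
print(f"overall: A = {overall(player_a):.3f}, B = {overall(player_b):.3f}")
```

The reversal happens because the players' at-bats are spread very unevenly across the two halves, so the combined averages weight the halves differently.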
