Chapter One: Exploring Data

Post on 01-Jan-2016

DESCRIPTION

Chapter One exploring data. Section 1.1 analyzing categorical data. Distribution of a categorical variable. Frequency table. Relative frequency table. - PowerPoint PPT Presentation

TRANSCRIPT

CHAPTER ONE: EXPLORING DATA

SECTION 1.1: ANALYZING CATEGORICAL DATA

DISTRIBUTION OF A CATEGORICAL VARIABLE

FREQUENCY TABLE

Format                   Count of Stations
Adult Contemporary        1,556
Adult Standards           1,196
Contemporary Hit            569
Country                   2,066
News/Talk/Information     2,179
Oldies                    1,060
Religious                 2,014
Rock                        869
Spanish Language            750
Other Formats             1,579
Total                    13,838

RELATIVE FREQUENCY TABLE

Format                   Percent of Stations
Adult Contemporary       11.2
Adult Standards           8.6
Contemporary Hit          4.1
Country                  14.9
News/Talk/Information    15.7
Oldies                    7.7
Religious                14.6
Rock                      6.3
Spanish Language          5.4
Other Formats            11.4
Total                    99.9

In this case, the individuals are the radio stations and the variable being measured is the kind of programming that each station broadcasts. The first table, which we call a frequency table, displays the counts of stations in each format category. The second, a relative frequency table, shows the percents of stations in each format category.

It’s a good idea to check data for consistency. The counts should add to 13,838, the total number of stations. They do. The percents should add to 100%. In fact, they add to 99.9%. What happened? Each percent is rounded to the nearest tenth. The exact percents would add to 100, but the rounded percents only come close. This is roundoff error. Roundoff errors don’t point to mistakes in our work, just to the effect of rounding off results.
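The consistency check described above can be reproduced in a few lines. This is only a sketch in Python, which the original slides do not use; the variable names are illustrative, and the counts are copied from the frequency table.

```python
# Counts of stations by format, copied from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk/Information": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())  # should match the table total of 13,838

# Exact percents, then each percent rounded to the nearest tenth.
exact = {fmt: 100 * n / total for fmt, n in counts.items()}
rounded = {fmt: round(pct, 1) for fmt, pct in exact.items()}

print("Total count:", total)
print("Sum of exact percents:", round(sum(exact.values()), 6))
print("Sum of rounded percents:", round(sum(rounded.values()), 1))
```

The exact percents sum to 100, while the rounded ones sum to 99.9, which is the roundoff error the slide describes.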

PIE CHARTS

[Pie chart of radio station formats, with slices for Adult Contemporary, Adult Standards, Contemporary Hit, Country, News/Talk, Oldies, Religious, Rock, Spanish, and Other]

• Pie charts are best when emphasizing each category's relation to the whole.

• Bar graphs are also called bar charts.

• Bar graphs are more flexible than pie charts. Both graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities that are measured in the same units.

BAR CHARTS

[Bar graph titled "Radio Station Formats", with one bar each for Adult Contemporary, Adult Standards, Contemporary Hit, Country, News/Talk, Oldies, Religious, Rock, Spanish, and Other]

If I were to give you a list of several age groups and the percent of people in each age group who own an iPod, which do you think would better display the data: a pie chart or a bar graph?

A bar graph, because the percents will not add up to a whole. Each percent describes a separate group, and we are comparing those groups.

Bar graphs can be misleading in two ways:

1. If the bars do not all have the same width, the proportions will be misleading.

2. If the vertical scale does not start at zero, the proportions between the bars will also be misleading.
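To see why the second point matters, the sketch below compares how tall two bars appear relative to each other when the vertical scale starts at zero and when it starts higher. The function name `apparent_ratio` and the baseline of 2,000 are assumptions chosen for illustration; the two counts are the Country and Religious rows of the frequency table.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights when the vertical axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# Counts from the frequency table: Country and Religious stations.
country, religious = 2066, 2014

# Axis starting at zero: drawn heights match the true ratio of the counts.
print(round(apparent_ratio(country, religious), 2))

# Hypothetical axis starting at 2,000: the same counts look very different.
print(round(apparent_ratio(country, religious, baseline=2000), 2))
```

With the axis at zero the Country bar is about 1.03 times as tall as the Religious bar. With the axis starting at 2,000 it is drawn about 4.71 times as tall, even though the counts have not changed.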

WHAT HAPPENS WHEN WE HAVE TWO CATEGORICAL VARIABLES??

A sample of 200 children was asked which superpower they would most like to have, and each child's gender was also recorded. Let's look at the results.

Superpower       Female   Male   Total
Invisibility         17     13      30
Superstrength         3     17      20
Telepathy            39      5      44
Fly                  36     18      54
Freeze Time          20     32      52
Total               115     85     200

This is a two-way table because it describes two categorical variables, gender and superpower preference. Superpower is the row variable because each row in the table describes a different superpower the kids chose. Gender is the column variable. The entries in the table are the counts of individuals in each preference-by-gender class.

The distributions of preference alone and gender alone are called marginal distributions because they appear at the right and bottom margins of the two-way table.

The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.

Now, if we want to display the marginal distribution as percents, we divide each row total by the table total. For Invisibility:

row total / table total = 30 / 200 = 0.15 = 15%

Now let's convert the whole marginal distribution into percents.

Superpower       Total
Invisibility      15%
Superstrength     10%
Telepathy         22%
Fly               27%
Freeze Time       26%
Total            100%
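The same row total over table total calculation can be applied to every row, and to every column, at once. The following is a minimal Python sketch, not part of the original slides; the dictionary layout and names such as `superpower_marginal` are illustrative choices, and the counts are taken from the two-way table.

```python
# Two-way table of counts: superpower (rows) by gender (columns).
table = {
    "Invisibility":  {"Female": 17, "Male": 13},
    "Superstrength": {"Female": 3,  "Male": 17},
    "Telepathy":     {"Female": 39, "Male": 5},
    "Fly":           {"Female": 36, "Male": 18},
    "Freeze Time":   {"Female": 20, "Male": 32},
}

table_total = sum(sum(row.values()) for row in table.values())  # 200 children

# Marginal distribution of superpower: each row total over the table total.
superpower_marginal = {
    power: 100 * sum(row.values()) / table_total for power, row in table.items()
}

# Marginal distribution of gender: each column total over the table total.
gender_totals = {}
for row in table.values():
    for gender, count in row.items():
        gender_totals[gender] = gender_totals.get(gender, 0) + count
gender_marginal = {g: 100 * n / table_total for g, n in gender_totals.items()}

for power, pct in superpower_marginal.items():
    print(f"{power}: {pct:.0f}%")
for gender, pct in gender_marginal.items():
    print(f"{gender}: {pct:.1f}%")
```

The superpower percents reproduce the table above (15%, 10%, 22%, 27%, 26%), and the gender margin works out to 57.5% female and 42.5% male.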

Now, if we were to convert all the counts in the Female column to percents of the Female column total, we would have the conditional distribution of preference among girls.

A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable.
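As a rough illustration of this definition, the sketch below computes the conditional distribution of superpower preference for one gender at a time by dividing each count in that column by the column total. The helper name `conditional_distribution` is hypothetical; the counts come from the two-way table.

```python
# Two-way table of counts: superpower (rows) by gender (columns).
table = {
    "Invisibility":  {"Female": 17, "Male": 13},
    "Superstrength": {"Female": 3,  "Male": 17},
    "Telepathy":     {"Female": 39, "Male": 5},
    "Fly":           {"Female": 36, "Male": 18},
    "Freeze Time":   {"Female": 20, "Male": 32},
}


def conditional_distribution(table, gender):
    """Percents of each superpower among children of the given gender only."""
    column_total = sum(row[gender] for row in table.values())
    return {
        power: 100 * row[gender] / column_total for power, row in table.items()
    }


girls = conditional_distribution(table, "Female")
boys = conditional_distribution(table, "Male")

for power in table:
    print(f"{power}: {girls[power]:.0f}% of females, {boys[power]:.0f}% of males")
```

There is one conditional distribution per gender, and each one sums to 100% because it is taken relative to its own column total (115 females or 85 males) rather than the table total.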

ORGANIZING A STATISTICAL PROBLEM

Although no single strategy will work on every problem, here is a four-step process that can be helpful to follow:

State: What’s the question that you’re trying to answer?

Plan: How will you go about answering the question? What statistical techniques does this problem call for?

Do: Make graphs and carry out needed calculations.

Conclude: Give your practical conclusion in the setting of the real-world problem.

Based on the survey data, can we conclude that boys and girls differ in their preference of superpower? Let’s use the four-step process to support our answer with evidence.

State: What is the relationship between gender and the answer to the question “What superpower would you prefer?”

Plan: We suspect that gender might influence a child’s opinion about superpowers. So we will compare the conditional distributions of responses for females alone and for males alone.

Do: Here is a table and side-by-side bar graph comparing the opinions of males and females. We will use percents instead of counts since the numbers of females and males are different.

Superpower       % of Females   % of Males
Invisibility          15%           15%
Superstrength          3%           20%
Telepathy             34%            6%
Fly                   31%           21%
Freeze Time           17%           38%
Total                100%          100%

Conclude: Based on the sample data, females were much more likely to choose telepathy than males, while males were much more likely to choose superstrength or freeze time than females. Females were slightly more likely to choose flying and equally likely to choose invisibility.

We say that there is an association between two variables if specific values of one variable tend to occur in common with specific values of the other.

So, if females are more likely than males to choose telepathy, there is an association between the variables gender and superpower choice.
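One informal way to look for such an association in the sample is to line up the two conditional distributions and measure the gap between them for each superpower. The sketch below does this in Python. The names and the idea of reporting percentage-point gaps are illustrative choices rather than something the slides prescribe, and the result describes this sample only, not a formal test.

```python
# Two-way table of counts: superpower (rows) by gender (columns).
table = {
    "Invisibility":  {"Female": 17, "Male": 13},
    "Superstrength": {"Female": 3,  "Male": 17},
    "Telepathy":     {"Female": 39, "Male": 5},
    "Fly":           {"Female": 36, "Male": 18},
    "Freeze Time":   {"Female": 20, "Male": 32},
}

female_total = sum(row["Female"] for row in table.values())
male_total = sum(row["Male"] for row in table.values())

# Gap, in percentage points, between the female and male conditional percents.
# Positive values mean the superpower was chosen more often by females.
gaps = {
    power: 100 * row["Female"] / female_total - 100 * row["Male"] / male_total
    for power, row in table.items()
}

for power, gap in sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{power}: {gap:+.1f} percentage points")
```

If gender and superpower choice were not associated in the sample, every gap would be close to zero. Here the gaps for telepathy, freeze time, and superstrength are large, while the gap for invisibility is under one percentage point, which matches the conclusion stated above.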

SUMMARY

• The distribution of a categorical variable lists the categories and gives the count (frequency table) or percent (relative frequency table) of individuals that fall in each category.

• Pie charts and bar graphs display the distribution of a categorical variable. Bar graphs can also compare any set of quantities measured in the same units. When examining any graph, ask yourself, “What do I see?”

• A two-way table of counts organizes data about two categorical variables. Two-way tables are often used to summarize large amounts of information by grouping outcomes into categories.

• The row totals and column totals in a two-way table give the marginal distributions of the two individual variables. It is clearer to present these distributions as percents of the table total. Marginal distributions tell us nothing about the relationship between the variables.

• There are two sets of conditional distributions for a two-way table: the distributions of the row variable for each value of the column variable, and the distributions of the column variable for each value of the row variable.

• A statistical problem has a real-world setting. You can organize many problems using the four steps state, plan, do, and conclude.

• To describe the association between the row and column variables, compare an appropriate set of conditional distributions. Remember that even a strong association between two categorical variables can be influenced by other variables lurking in the background.
