
Scatterplots

Learning Objectives

By the end of this lecture, you should be able to:

– Describe what a scatterplot is

– Be comfortable with the terms explanatory variable and response variable

– Describe a scatterplot in terms of form, direction, and strength

– Define what is meant by an outlier, and be able to identify outliers on a scatterplot

– Recognize why poorly chosen scales on a scatterplot can give misleading impressions of the data

Examining Relationships

Up to this point, we have focused on single-variable (“univariate”) data, e.g., women’s heights, the percentage of Hispanics in each state, SAT scores, etc.

Most statistical studies involve more than one variable. For example, a great deal of analysis goes into examining the relationship between two variables.

Example: We may be interested in the relationship between

• The number of beers students consumed at a party

• Their blood alcohol level (BAC)

With the proper statistical tools, we can try to determine things like:

• Is there a relationship? I.e., does the number of beers affect blood alcohol level?

• If there is a relationship, can we predict how much each beer contributes to BAC?

A great human flaw: It is tempting to just intuitively assume that there is a relationship between two variables. However, this can lead to some highly erroneous conclusions. As humans, we LOVE to assume stuff, find patterns that don’t truly exist, and then jump to conclusions. This is a very well-known flaw in the human character and we should be aware of it. We will discuss this topic in more detail as we progress through the course.

Student Beers Blood Alcohol

S1 5 0.1

S2 2 0.03

S3 9 0.19

S4 7 0.095

S5 3 0.07

S6 3 0.02

S7 4 0.07

S8 5 0.085

S9 8 0.12

S10 3 0.04

S11 5 0.06

S12 5 0.05

S13 6 0.1

S14 7 0.09

S15 1 0.01

S16 4 0.05

Here, we have two quantitative variables for each of 16 students (n = 16):

1) How many beers they drank, and

2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Looking for relationships between variables

• Start with a graph (always, whenever possible).

• Look for an overall pattern and for deviations from the pattern (deviations such as outliers are sometimes the most interesting part!).

• If appropriate, try to provide numerical descriptions of the data and overall pattern.
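As a small illustration of "numerical descriptions of the data", the sketch below summarizes each variable of the beer data on its own. The function name `summarize` and the choice of smallest value, largest value, and mean are assumptions made for this example; they are not the only reasonable summaries.

```python
from statistics import mean

# Beers and BAC for the 16 students, in the order of the first data table (S1 to S16).
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def summarize(values):
    """Return (smallest, largest, mean) for one variable (hypothetical helper)."""
    return min(values), max(values), mean(values)


print("beers:", summarize(beers))
print("BAC:  ", summarize(bac))
```

Note that these summaries describe each variable separately. They say nothing yet about how the two variables move together, which is what the graph is for.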

Student Beers BAC

1 5 0.1

2 2 0.03

3 9 0.19

6 7 0.095

7 3 0.07

9 3 0.02

11 4 0.07

13 5 0.085

4 8 0.12

5 3 0.04

8 5 0.06

10 5 0.05

12 6 0.1

14 7 0.09

15 1 0.01

16 4 0.05

Scatterplots

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

[Figure: scatterplot of the beer data. x axis: Number of Beers (explanatory variable). y axis: Blood Alcohol Content (response variable).]

Explanatory and response variables

A response variable measures or records an outcome of a study. An explanatory variable explains (“causes”) the changes in the response variable.

Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.
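To see the x/y convention in action without any plotting software, here is a rough text-only sketch that places the beer data on a character grid, with beers (explanatory) running left to right and BAC (response) running bottom to top. The function `text_scatter` and the grid size are assumptions for this illustration; in practice you would use a statistics package such as SPSS to draw the plot.

```python
# Beers (explanatory, x) and BAC (response, y) for the 16 students.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def text_scatter(x, y, width=40, height=12):
    """Return the plot as a list of strings; the first string is the top row.

    Assumes x and y each contain at least two distinct values.
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Rescale each value to a column / row index on the grid.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip so large y is at the top
    return ["".join(cells) for cells in grid]


for line in text_scatter(beers, bac):
    print("|" + line)
print("+" + "-" * 40 + "  beers (x) ->")
```

Students who share nearly the same position on the grid are drawn as a single `*`, so the picture can show fewer than 16 marks.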

Terminology: Dependent / Independent

• Instead of explanatory / response, you will often encounter the terms independent and dependent.

– Independent for explanatory

– Dependent for response

• The two pairs of terms are pretty much interchangeable, though there is a subtle difference between them. It is more accurate to use the terms explanatory and response, so I would like you to focus on those terms.

– You will occasionally see SPSS use dependent/independent.

Which should be the explanatory, and which the response?

• The variable that you think “causes” the change in the other variable should be the explanatory variable.

– (This is why it is frequently called the ‘independent’ variable, while the variable that depends on it is called the ‘dependent’ variable. But as was just mentioned, there is a subtle distinction between the two sets of terms, which we may get to down the road.)

• The variable that “responds” to a change in the explanatory variable, is, then, the response variable.

• Example:

– Exercise vs. calories burned?

• Answer: The amount of exercise will (hopefully!) result in a change in calories burned, whereas burning calories does not ‘cause’ a change in exercise. So exercise should be our explanatory variable, and calories burned the response variable.

– Exam score vs. hours studying?

• Answer: We would expect that the number of hours studying would cause a change in exam score rather than the other way around. So ‘hours studying’ would be our explanatory variable, and exam score the response variable.

Describing/Interpreting scatterplots

• When describing a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern …

– Form: linear (a straight line), curved, clusters, no pattern

– Direction: positive, negative, no direction

– Strength: how closely the points fit the “form”

Form of an association: Linear / Nonlinear / No Relationship

[Figure: three example scatterplots, labeled Linear, Nonlinear, and No relationship.]

A linear relationship is given a directional description of Positive or Negative

Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.

Direction of a linear association: Positive or Negative

Note that we only describe the direction of the relationship when the relationship is linear.
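The definition of positive and negative association can be turned into a rough check: split the students into those with an above-median x and the rest, and compare the average y in the two groups. This is only a sketch to make the definition concrete. The function `direction_hint` is made up for this example, it is not a formal method, and it is only meaningful once the plot has shown that the form is roughly linear.

```python
from statistics import mean, median

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def direction_hint(x, y):
    """Compare the mean y for above-median x with the mean y for the rest.

    Assumes at least one x value lies above the median and one at or below it.
    """
    cut = median(x)
    high = [yi for xi, yi in zip(x, y) if xi > cut]
    low = [yi for xi, yi in zip(x, y) if xi <= cut]
    diff = mean(high) - mean(low)
    if diff > 0:
        return "positive"   # high x tends to go with high y
    if diff < 0:
        return "negative"   # high x tends to go with low y
    return "no direction"


print(direction_hint(beers, bac))
```

For the beer data, the students who drank more than the median number of beers have the higher average BAC, which matches the upward trend in the scatterplot.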

Sometimes there isn’t any relationship: X and Y may vary, but are independent of each other. Knowing a value for X tells you nothing about the value for Y. We describe this as ‘no relationship’.

Scatterplot Direction: No Relationship

Scatterplot: Strength of the association

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x.

With a weak relationship, for any x you might get a wide range of y values.

(You could probably make a reasonable argument that the relationship in this plot isn’t even linear.)


This is a strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.

This is a relatively weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.
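To make the idea of "a wide range of y values for a given x" concrete, here is a small sketch using the beer data from earlier in the lecture. It groups the students by number of beers and reports the lowest and highest BAC seen in each group. The function `y_range_by_x` is an assumption for this illustration, and it is not the technique for quantifying strength that we will meet in a later lecture.

```python
from collections import defaultdict

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def y_range_by_x(x, y):
    """Map each distinct x value to the (lowest, highest) y observed for it."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    return {xi: (min(ys), max(ys)) for xi, ys in sorted(groups.items())}


for n_beers, (low, high) in y_range_by_x(beers, bac).items():
    print(f"{n_beers} beers: BAC from {low} to {high}")
```

The narrower these ranges are compared with the overall spread of y, the better you can estimate y from x, which is what we mean by a stronger relationship.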

Describing the strength

• For now we are using the admittedly vague terms ‘strong, moderate, weak’.

• In a subsequent lecture on scatterplots, we will learn a technique for quantifying the strength.

Describing/Interpreting scatterplots

• As mentioned earlier, when you are asked to interpret a scatterplot, you should be familiar with these three terms in particular.

– Form: linear, curved, clusters, no pattern

– Direction: positive, negative, no direction

– Strength: how closely the points fit the “form”

– Note: Recall that if the relationship is not linear, we will not bother to describe direction or strength.

Examples – Describe each plot

• Form: Linear, Direction: positive, Strength: strong

• Form: Linear, Direction: negative, Strength: moderate

• Form: No relationship. Note that knowing a given x does not tell us anything about y. As a result, the terms ‘positive/negative’ don’t apply. Neither does the strength.

Examples

• Form: Non-linear. Therefore, we don’t bother trying to describe direction or strength.

• Form: Linear, Direction: positive, Strength: moderate

• In our next lecture on scatterplots, we will discuss a tool for quantifying the strength of the relationship.

Lying with statistics: How (not) to scale a scatterplot

Using an inappropriate scale for a scatterplot can give an incorrect impression.

Ideally, both variables should be given a similar amount of space:

• Plot roughly square

• Points should occupy most of the plot space

Same data in all four plots

How to scale a scatterplot

Same data in all four plots

In other words, if faced with this group of plots, you should be suspicious of most of them!
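One way to follow the "points should occupy most of the plot space" guideline is to choose axis limits that sit just outside the smallest and largest data values. The sketch below does this for the beer counts from earlier in the lecture. The helper names `axis_limits` and `fill_fraction` and the 5% margin are assumptions for this example, not a rule.

```python
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]


def axis_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits that hug the data with a small margin."""
    lo, hi = min(values), max(values)
    pad = (hi - lo) * pad_fraction
    return lo - pad, hi + pad


def fill_fraction(values, limits):
    """Fraction of the axis length that the data actually covers."""
    lo, hi = limits
    return (max(values) - min(values)) / (hi - lo)


good = axis_limits(beers)   # roughly 0.6 to 9.4
bad = (0, 100)              # a poorly chosen scale for the same data

print("tight axis covers:", round(fill_fraction(beers, good), 2))
print("0 to 100 axis covers:", round(fill_fraction(beers, bad), 2))
```

With the 0 to 100 axis, the points are squeezed into less than a tenth of the available width, which is exactly the kind of plot that should make you suspicious.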

Outliers

An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).

In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

Not an outlier:

The upper right-hand point here is not an outlier of the relationship. It is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.

This point is not in line with the others, so it is an outlier of the relationship.

Outliers

IQ score and Grade point average

Describe in words what this plot shows.

• Looking to see if there is a relationship between IQ score and GPA.

Describe the direction, shape, and strength. Are there outliers?

• Shape: linear

• Direction: positive

• Strength: appears somewhat weak

Outliers present?

There appear to be outliers, but it is hard to say.


Are there outliers present?

The circled datapoints (and perhaps some of the others too) appear to be outliers. Still, it is hard to say. How do we decide?

Recall that on a scatterplot, we consider a datapoint to be an outlier if it is way off the “line”.

If the “regression” line (the line through the points) looks like the one here, then both IQ scores (circled) would almost certainly be considered outliers.


Are there outliers present?

If the regression line looks like the one drawn here, then certainly the lower circled datapoint (and probably some of the others nearby as well) would be considered outliers.


Are there outliers present?

Conversely, if the regression line looks like the one drawn here, then certainly the upper circled datapoint (and probably several of the others nearby as well) would be considered outliers. But the lower one would not be.

WHICH line, then, is the “correct” regression line?

Answer: Once again, we use a mathematical model to draw a regression line. We will discuss how to do so in our next lecture on scatterplots.
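The point that "outlier" depends on which line you draw can be illustrated with the beer data from earlier in the lecture. The sketch below measures each point's vertical distance from a line and flags points that are farther away than some cutoff. Everything specific here is an assumption made for illustration: the function `far_from_line`, the two hand-picked lines, and the cutoff of 0.045 are hypothetical. Neither line is a regression line, and this is not the formal method for identifying outliers.

```python
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def far_from_line(x, y, slope, intercept, cutoff):
    """Return the positions of points whose vertical distance from the
    line y = slope * x + intercept is greater than cutoff."""
    flagged = []
    for i, (xi, yi) in enumerate(zip(x, y)):
        predicted = slope * xi + intercept
        if abs(yi - predicted) > cutoff:
            flagged.append(i)
    return flagged


# Two hand-drawn candidate lines (hypothetical, chosen by eye).
steep = far_from_line(beers, bac, slope=0.02, intercept=-0.02, cutoff=0.045)
flat = far_from_line(beers, bac, slope=0.01, intercept=0.03, cutoff=0.045)

print("flagged with the steeper line:", steep)
print("flagged with the flatter line:", flat)
```

With the steeper line no student is flagged, while with the flatter line the student who drank 9 beers (position 2 in the list, counting from 0) is flagged. The data are the same in both cases; only the line changed, which is why we need a principled way to choose it.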