TRANSCRIPT
Scatterplots
Learning Objectives
By the end of this lecture, you should be able to:
– Describe what a scatterplot is
– Be comfortable with the terms explanatory variable and response variable
– Describe a scatterplot in terms of form, direction, and strength
– Define what is meant by an outlier, and be able to identify outliers on a scatterplot
– Recognize why poorly chosen scales on a scatterplot can give misleading impressions of the data
Examining Relationships
Up to this point, we have focused on single-variable (“univariate”) data. E.g., women’s heights, percentage of Hispanics in each state, SAT scores, etc.
Most statistical studies involve more than one variable. For example, a great deal of analysis goes into examining the relationship between two variables.
Example: We may be interested in the relationship between
• The number of beers each student consumed at a party
• Their blood alcohol level (BAC)
With the proper statistical tools we can try to determine things like:
• IS there a relationship? I.e., does the number of beers affect blood alcohol level?
• If there is a relationship, can we predict how much each beer contributes to BAC?
A great human flaw: It is tempting to just intuitively assume that there is a relationship between two variables. However, this can lead to some highly erroneous conclusions. As humans, we LOVE to assume stuff, find patterns that don’t truly exist, and then jump to conclusions. This is a very well-known flaw in the human character and we should be aware of it. We will discuss this topic in more detail as we progress through the course.
Student Beers Blood Alcohol
S1 5 0.1
S2 2 0.03
S3 9 0.19
S4 7 0.095
S5 3 0.07
S6 3 0.02
S7 4 0.07
S8 5 0.085
S9 8 0.12
S10 3 0.04
S11 5 0.06
S12 5 0.05
S13 6 0.1
S14 7 0.09
S15 1 0.01
S16 4 0.05
Here, we have two quantitative variables for each of 16 students (n=16):
1) How many beers they drank, and
2) Their blood alcohol level (BAC)
We are interested in the relationship between the two variables: How is one affected by changes in the other one?
Looking for relationships between variables
• Start with a graph (always, whenever possible)
• Look for an overall pattern and for deviations from the pattern (deviations such as outliers are sometimes the most interesting part!)
• If appropriate, try to provide numerical descriptions of the data and overall pattern.
Student Beers BAC
1 5 0.1
2 2 0.03
3 9 0.19
6 7 0.095
7 3 0.07
9 3 0.02
11 4 0.07
13 5 0.085
4 8 0.12
5 3 0.04
8 5 0.06
10 5 0.05
12 6 0.1
14 7 0.09
15 1 0.01
16 4 0.05
Scatterplots
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
Figure: scatterplot with Number of Beers (explanatory variable) on the x axis and Blood Alcohol Content (response variable) on the y axis.
Explanatory and response variables
A response variable measures or records an outcome of a study. An explanatory variable explains (“causes”) the changes in the response variable.
Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.
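To make the axis convention concrete, here is a minimal sketch that draws a rough character-grid scatterplot of the beers and BAC data from the table above, with the explanatory variable on the x axis and the response variable on the y axis. The function name text_scatterplot and the grid size are assumptions made for illustration. In practice you would use a statistics or plotting package rather than a text grid.

```python
# Beers (explanatory, x) and blood alcohol content (response, y)
# for the 16 students in the table above.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def text_scatterplot(xs, ys, width=40, height=12):
    """Return a rough character-grid scatterplot as a list of strings.

    Hypothetical helper for illustration only. The explanatory variable
    runs left to right (x axis) and the response variable runs bottom to
    top (y axis). Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    # Left border for the y axis, bottom border for the x axis.
    return ["|" + "".join(line) for line in grid] + ["+" + "-" * width]


plot_lines = text_scatterplot(beers, bac)
print("\n".join(plot_lines))
```

Each student becomes one point. Students who share the same number of beers line up in the same column, and points that land in the same grid cell are drawn as a single star.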
Terminology: Dependent / Independent
• Instead of explanatory / response, you will often encounter the terms independent and dependent.
– Independent for explanatory
– Dependent for response
• They are pretty much interchangeable, but there is a subtle difference. It is more accurate to use the terms explanatory and response, so I would like you to focus on those terms.
– You will occasionally see SPSS use dependent/independent.
Which should be the explanatory, and which the response?
• The variable that you think “causes” the change in the other variable should be the explanatory variable.
– (This is why it is frequently called the ‘independent’ variable. But as was just mentioned, there is a subtle distinction between them which we may get to down the road).
• The variable that “responds” to a change in the explanatory variable, is, then, the response variable.
• Example:
– Exercise vs. calories burned
• Answer: The amount of exercise will (hopefully!) result in a change in calories burned, whereas burning calories does not ‘cause’ a change in exercise. So exercise should be our explanatory variable, and calories burned the response variable.
– Exam score vs. hours studying
• Answer: We would expect that the number of hours studying would cause a change in exam score rather than the other way around. So ‘hours studying’ would be our explanatory variable.
Describing/Interpreting scatterplots
• When describing a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern …
– Form: linear (a straight line), curved, clusters, no pattern
– Direction: positive, negative, no direction
– Strength: how closely the points fit the “form”
Form of an association: Linear / Nonlinear / No Relationship
Figure: three example scatterplots showing a linear relationship, a nonlinear relationship, and no relationship.
A linear relationship is given a directional description of Positive or Negative
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
Direction of a linear association: Positive or Negative
Note that we only describe the direction of the relationship when the relationship is linear.
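One way to check the direction numerically is sketched below. It is not a technique introduced in this lecture: it sums the products of each point's deviations from the two means, so that "high with high" and "low with low" pairs push the total up, and "high with low" pairs push it down. The function name direction is an assumption for illustration, and the result is only meaningful when the form is roughly linear.

```python
# Beers and BAC for the 16 students in the table earlier in the lecture.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def direction(xs, ys):
    """Classify an association as positive, negative, or no direction.

    Hypothetical helper: uses the sign of the summed products of
    deviations from the means. It says nothing about form or strength.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # A pair above both means (or below both) contributes a positive product;
    # a pair above one mean and below the other contributes a negative one.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no direction"


print(direction(beers, bac))
```

For the beers data the total is positive, which matches what the scatterplot shows: students who drank more beers tend to have higher BAC values.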
Sometimes there isn’t any relationship: X and Y may vary, but are independent of each other. Knowing a value for X tells you nothing about the value for Y. We describe this as ‘no relationship’.
Scatterplot Direction: No Relationship
Scatterplot: Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
With a strong relationship, you can get a pretty good estimate of y if you know x.
With a weak relationship, for any x you might get a wide range of y values.
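As a rough illustration of this idea, the sketch below groups the beers data by x value and reports the lowest and highest BAC observed at each number of beers. The helper name y_range_by_x is an assumption for illustration, and this is only an informal look at the scatter, not the technique for quantifying strength that comes in a later lecture.

```python
from collections import defaultdict

# Beers and BAC for the 16 students in the table earlier in the lecture.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def y_range_by_x(xs, ys):
    """Map each distinct x value to the (lowest, highest) y observed there.

    Hypothetical helper: a narrow range at each x suggests a stronger
    relationship, a wide range suggests a weaker one.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: (min(vals), max(vals)) for x, vals in sorted(groups.items())}


ranges = y_range_by_x(beers, bac)
for x, (low, high) in ranges.items():
    print(f"{x} beers: BAC from {low} to {high}")
```

With only 16 students, most x values have just one or a few observations, so treat these ranges as a feel for the scatter rather than a measurement of strength.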
(You could probably make a reasonable argument that the relationship in this plot isn’t even linear.)
This is a strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.
This is a relatively weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.
Describing the strength
• For now we are using the admittedly vague terms ‘strong, moderate, weak’.
• In a subsequent lecture on scatterplots, we will learn a technique for quantifying the strength.
Describing/Interpreting scatterplots
• As mentioned earlier, when you are asked to interpret a scatterplot, you should be familiar with these 3 terms in particular.
– Form: linear, curved, clusters, no pattern
– Direction: positive, negative, no direction
– Strength: how closely the points fit the “form”
– Note: Recall that if the relationship is not linear, we will not bother to describe direction or strength.
Examples – Describe each plot
• Form: Linear, Direction: positive, Strength: strong
• Form: Linear, Direction: negative, Strength: moderate
• Form: No relationship. Note that knowing a given x does not tell us anything new about y. As a result, the terms ‘positive/negative’ don’t apply. Neither does the strength.
Examples
• Form: Non-linear. Therefore, we don’t bother trying to describe direction or strength.
• Form: Linear, Direction: positive, Strength: moderate
• In our next lecture on scatterplots, we will discuss a tool for quantifying the strength of the relationship.
Lying with statistics: How (not) to scale a scatterplot
Using an inappropriate scale for a scatterplot can give an incorrect impression.
Ideally, both variables should be given a similar amount of space:
• Plot roughly square
• Points should occupy most of the plot space
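The second guideline above can be sketched as a small calculation: choose axis limits just outside the smallest and largest data values, then check what fraction of the axis the data actually spans. The helper names and the 5% padding are assumptions chosen for illustration, not a rule from this lecture.

```python
# Number of beers for the 16 students in the table earlier in the lecture.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]


def padded_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits that hug the data.

    Hypothetical helper: pads the data range by pad_fraction on each side
    so that no point sits exactly on the edge of the plot.
    """
    low, high = min(values), max(values)
    pad = (high - low) * pad_fraction
    return low - pad, high + pad


def fraction_of_axis_used(values, limits):
    """Return the share of the axis length that the data actually spans."""
    low, high = limits
    return (max(values) - min(values)) / (high - low)


good_limits = padded_limits(beers)
print(good_limits, fraction_of_axis_used(beers, good_limits))

# A poorly chosen scale: an x axis running from 0 to 100 beers.
print(fraction_of_axis_used(beers, (0, 100)))
```

With the padded limits the data spans about 91% of the axis. On the 0 to 100 axis it spans only 8%, so all of the points would be squeezed into a narrow strip and the pattern would be hard to judge.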
Same data in all four plots.
In other words, if faced with this group of plots, you should be suspicious of most of them!
Outliers
An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).
In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.
Not an outlier: The upper right-hand point here is not an outlier of the relationship. It is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.
This point is not in line with the others, so it is an outlier of the relationship.
IQ score and Grade point average
Describe in words what this plot shows.
• Looking to see if there is a relationship between IQ score and GPA.
Describe the direction, shape, and strength. Are there outliers?
• Shape: linear
• Direction: positive
• Strength: appears somewhat weak
Outliers present? There appear to be outliers, but it is hard to say.
Are there outliers present?
The circled datapoints (and perhaps some of the others too) appear to be outliers. Still, it is hard to say. How do we decide?
Recall that on a scatterplot, we consider a datapoint to be an outlier if it is way off the “line”.
If the “regression” line (the line through the points) looks like the one here, then both IQ scores (circled) would almost certainly be considered outliers.
Are there outliers present?
If the regression line looks like the one drawn here, then certainly the lower circled datapoint (and probably some of the others nearby as well) would be considered outliers.
Are there outliers present?
Conversely, if the regression line looks like the one drawn here, then certainly the upper circled datapoint (and probably several of the others nearby as well) would be considered outliers. But the lower one would not be.
WHICH line, then, is the “correct” regression line?
Answer: Once again, we use a mathematical model to draw a regression line. We will discuss how to do so in our next lecture on scatterplots.
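The dependence of outlier status on the chosen line can be sketched in a few lines of code. Everything here is hypothetical and for illustration only: the six (IQ, GPA) points are made up and are not the data from the plot, the three candidate lines are hand-picked rather than fitted, and the cutoff of 2 GPA points for being "way off the line" is an arbitrary assumption.

```python
# Made-up (IQ, GPA) points for illustration; NOT the data from the plot.
# The first four lie along a line; the last two sit below and above it.
points = [(90, 5.0), (100, 6.0), (110, 7.0), (120, 8.0),
          (100, 3.0), (110, 10.0)]


def far_from_line(points, slope, intercept, threshold):
    """Return the points whose vertical distance from the line exceeds threshold.

    Hypothetical helper: the line y = slope * x + intercept is supplied by
    hand here, and the threshold is an arbitrary cutoff.
    """
    return [(x, y) for x, y in points
            if abs(y - (slope * x + intercept)) > threshold]


# Three hand-picked candidate lines with the same slope, different heights.
middle_line = far_from_line(points, slope=0.1, intercept=-4.0, threshold=2.0)
lower_line = far_from_line(points, slope=0.1, intercept=-5.5, threshold=2.0)
upper_line = far_from_line(points, slope=0.1, intercept=-2.5, threshold=2.0)

print("middle line flags:", middle_line)
print("lower line flags: ", lower_line)
print("upper line flags: ", upper_line)
```

The middle line flags both unusual points, the lower line flags only the high point, and the upper line flags only the low point. That is the same situation as in the three plots above, and it is why we need a principled way of choosing the line before we can decide which points are outliers.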