6 - 3 - Assumptions (12-13).srt

TRANSCRIPT

Hi. Welcome back. We're up to lecture five, segment three. The topic of this lecture, again, is correlation, and in this last segment I want to talk about some of the assumptions underlying a correlational analysis. We won't have time in this segment to cover all the assumptions in detail. We'll come back to them later in lecture six, and again at the end of the semester, when we revisit a lot of the assumptions underlying some of these common statistical procedures.

So in this segment we're going to talk about six assumptions. The first three are listed here. If we're looking at a Pearson product-moment correlation, little r, that's used for situations where you have two variables that are both continuous. For now, we're assuming that we have a normal distribution in both x and in y. It's not necessary, of course, that you have normal distributions to find associations, but for now, in this intro stats course, it's easiest to start with that assumption. We're also going to start with the simple assumption that the relationship is linear, and I'll show you that in a scatterplot. And the third one is this funny new word, homoscedasticity, which is best illustrated in a scatterplot, and I'll show you that in a moment.

There are other assumptions as well, and most intro stats courses or intro stats textbooks don't really emphasize them as much as I do. I emphasize these because this is an area of my research: how to properly measure constructs in psychology. And measurement is a really important issue if you're assessing correlations. So you need to know that you have reliable measures, that you have valid measures, and that you have random and representative samples. I'm not going to have time to talk about those three assumptions in this segment, but they're the main topics of the next lecture on measurement.

So, let's go back to the first three. Number one is that we have normal distributions for x and y. Well, how do we detect violations of that assumption? That's real easy: just go back to our lecture on distributions and summary statistics. All you have to do is plot histograms. Eyeball them. See if they're relatively normal. If it's hard to tell, then you could run summary statistics, see how those look, and see if they're normal enough to satisfy this assumption. That was covered in the lecture on distributions and histograms, in the lecture on summary statistics, and in the first lab.
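
For instance, here's a minimal sketch of those checks in R; the variables x and y are hypothetical stand-ins for whatever two continuous measures you're correlating:

    # Hypothetical stand-in data for two continuous measures.
    set.seed(1)
    x <- rnorm(100, mean = 50, sd = 10)
    y <- 0.5 * x + rnorm(100, sd = 5)

    # Plot histograms and eyeball them for rough normality.
    hist(x, main = "Histogram of x")
    hist(y, main = "Histogram of y")

    # If the histograms are hard to judge, look at summary statistics;
    # for a roughly normal variable the mean and median sit close together.
    summary(x)
    summary(y)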

The second assumption: for now, we're going to assume a linear relationship between x and y. Of course, there could be all sorts of relationships between x and y that are not linear, but for now we'll assume linear relationships, and we'll see that in scatterplots.

And finally, there's this assumption of homoscedasticity. Again, let me show you that in a scatterplot, but first the definition. Remember that in a scatterplot, all the dots represent individual cases, or individual subjects. The vertical distance between a dot and the regression line, or prediction line, is the prediction error for that individual, also known as the residual. The idea of homoscedasticity is that those residuals are not related to x, because if they were, then we might have some sort of confound in our study. Right? The residuals, the prediction errors, should just be chance errors; they shouldn't be systematic. If they were systematic, then the residuals would be related to x, and I'll show you examples of that in a moment. So the best way to look at this, as I've said, is in scatterplots, but again, what you want to look at is the vertical distance between each dot and the regression line.
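
Here's a minimal sketch of that residual check in R, reusing the hypothetical x and y from above:

    # Fit the regression of y on x and draw the scatterplot with the line.
    fit <- lm(y ~ x)
    plot(x, y, main = "y vs. x with regression line")
    abline(fit)

    # Homoscedasticity check: the residuals (vertical distances from the
    # line) plotted against x should look like random scatter around zero,
    # with no curve, trend, or fanning.
    plot(x, resid(fit), ylab = "Residual", main = "Residuals vs. x")
    abline(h = 0, lty = 2)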

The best, most classic illustration of these assumptions underlying correlation, and regression analysis for that matter, was developed by the statistician Frank Anscombe in 1973, and these examples are so classic and so well known that they've become known as Anscombe's Quartet. Let me show you what they look like. What Anscombe did, which is extremely clever, just so elegant, shows how critical it is to look at your scatterplots before you run a correlation analysis, so you know what you're getting. In all four of these data sets, he made it so that the correlation was exactly the same: the correlation in all four data sets is 0.82. So, a really strong relationship between x and y. In fact, the variance in x and the variance in y are exactly the same across all four data sets as well. It's very clever. But look at the pictures. Clearly, there are different things going on in each of these four cases.

This first one, in the upper left, is a scatterplot and a correlation that satisfies our assumptions for now. We have a normal distribution in x, a normal distribution in y, and a nice, linear relationship. And the prediction errors, if you look at the dots around the regression line, are pretty random across values of x. That can't be said of any of the other data sets in Anscombe's Quartet.

So if you look at the second one here, what you're seeing is not a linear relationship but a quadratic relationship. The values start out low, they go up, and then they start to dip down again at the higher end of x. It's a quadratic relationship between x and y, not a linear one. We wouldn't be able to detect that without looking at this scatterplot.

Look at the third one. You see this slight increase, with one dot that's a little bit off the regression line and really contributes to negative prediction error, which makes up for all the positive prediction error in the other data points in that data frame.

And then finally, this is one that's actually pretty common in psychology, and actually in neuroscience; a lot of neuroscientists try to do correlation analysis with really small samples, and they're starting to learn that they can't do that. This is a good example: you have all of your data points right here, and there's no relationship between x and y if you just look here, right? They all have the same x value, and they have a range of y values. Yet you've got this one extreme outlier, way up here, that's contributing to this correlation. It's driving it up to 0.82.

So again, if you just ran a correlational analysis in R, just by writing cor as you've learned in lab, you would get the exact same correlation coefficient for all four of these data sets. This just emphasizes how critical it is to just look at your data, know your data, eyeball it, and test these assumptions.
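
In fact, Anscombe's Quartet ships with R as the built-in data frame anscombe (columns x1 through x4 and y1 through y4), so you can check this yourself; the correlations all come out around 0.816, which rounds to the 0.82 quoted here:

    # Correlation for each of the four x-y pairs -- all about 0.816.
    sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                anscombe[[paste0("y", i)]]))

    # The variances match across the four data sets too.
    sapply(1:4, function(i) var(anscombe[[paste0("x", i)]]))
    sapply(1:4, function(i) var(anscombe[[paste0("y", i)]]))

    # But the four scatterplots tell four very different stories.
    par(mfrow = c(2, 2))
    for (i in 1:4) {
      x <- anscombe[[paste0("x", i)]]
      y <- anscombe[[paste0("y", i)]]
      plot(x, y, main = paste("Anscombe data set", i))
      abline(lm(y ~ x))
    }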

Do you have linear relationships, and do you have homoscedasticity? Those are essential when interpreting correlation coefficients.

Now, in case it was difficult to see these when I put all four of them together, I'm just going to walk through each one of them individually, very quickly. Again, this first one is a really pretty picture of a scatterplot, because what you see is that across the range of x you have some individuals who are below the regression line, then above, then below, then above and below again. It's just sort of random across the distribution of x. That's what we want to see. That's a homoscedastic relationship between x and y. So this satisfies the assumptions.

Again, here, this is clearly not a linear relationship. It looks quadratic. And we just see that by eyeballing it.

Again, with this one, if we look at the prediction errors, we have one really big prediction error here that's driving these points to fall right along the regression line, or a little above. So if we looked at the relationship between x and the prediction errors, we would see that there's something systematic: there's a relationship between those two. That's evidence of heteroscedasticity. It's a violation of the homoscedasticity assumption, and we wouldn't want to go ahead with a linear correlation analysis in this case.

And then finally, this is the easiest one to spot; this is a no-brainer. You look at your data and you clearly have this one extreme outlier. If you notice, I actually had to extend the scale out to 20 [LAUGH]; the x-axis on all the others ended at 15. I had to extend it out to 20 just to get that guy on the scatterplot, and that's clearly driving this positive correlation.

What's funny in real research is that a lot of researchers, when they're looking for a strong correlation, tend not to be bothered by points like that, because it's helping their cause. Right? They tend to get more bothered by, you know, points like this, if we're looking for a positive correlation. Like, people like me, on the verbal and [LAUGH] mathematical ability relationship. Right? It's very common to see researchers quickly spot those kinds of data points and discard them as outliers, but say, "Oh, this one supports my theory." That's a very bad thing to do, and as we get into multiple regression, we'll talk about actual procedures where you can assess whether something is a multivariate outlier or not, whether it's a multivariate outlier that helps your cause or hurts your cause.

So, to summarize the segment: there are a lot of assumptions going on when you're doing correlational analysis. This is why I started with the famous line, "correlation does not imply causation," because everyone knows that. But there's so much more to worry about, or be concerned about, when you're consuming correlational analyses or when you're conducting them. Here are just the three simple assumptions that we talked about: normal distributions in x and y, a linear relationship between x and y, and homoscedasticity. Then there are even bigger assumptions that we'll talk about in lecture six: reliability, validity, and sampling, which all fall under the umbrella of measurement issues, which is the topic of the next lecture.