exploring data (2008) - mdthinducollege.org · psychology): exploring data ... were some missing...

14
C8057 (Research Methods in Psychology): Exploring Data © Dr. Andy Field, 2008 Page 1 Exploring Data Assumptions of Parametric Data If you use a parametric test and certain assumptions are not met then the results are likely to be inaccurate. Therefore, it is very important that you check the assumptions before deciding which statistical test is appropriate. The assumptions of parametric tests based on the normal distribution are as follows: Normally distributed data: This assumption is tricky and misunderstood because it means different things in different contexts (see Field, 2009). In short, the rationale behind hypothesis testing relies on having something that is normally distributed: in some cases it’s the sampling distribution, in others the errors in the model and so on. If this assumption is not met then your statistical model or test is usually flawed in some way. The reason why this assumption is misunderstood is because people often believe that the data themselves need to be normally distributed; actually, the raw data rarely need to be normally distributed, but if they are then it usually means that the things that we actually want to be normally distributed are too. Confused? Well, it is confusing. In many statistical tests (e.g. the t‐test) we assume that the sampling distribution is normally distributed. This is a problem because we don’t have access to this distribution — we can’t simply look at its shape and see whether it is normally distributed. However, we know from the central limit theorem (see my book if you don’t know what that is) that if the sample data are approximately normal then the sampling distribution will be also. Therefore, people tend to look at their sample data to see if they are normally distributed. If so, then they have a little party to celebrate and assume that the sampling distribution (which is what actually matters) is also. We also know from the central limit theorem that in big samples the sampling distribution tends to be normal anyway — regardless of the shape of the data we actually collected (and the sampling distribution will tend to be normal regardless of the population distribution in samples of 30 or more). As our sample gets bigger then, we can be more confident that the sampling distribution is normally distributed. Homogeneity of variance: This assumption means that the variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples come from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable. Interval data: Data should be measured at least at the interval level. This means that the distance between points of your scale should be equal at all parts along the scale. For example, if you had a 10‐point anxiety scale, then the difference in anxiety represented by a change in score from 2 to 3 should be the same as that represented by a change in score from 9 to 10. Independence: This assumption is that data from different participants are independent, which means that the behaviour of one participant does not influence the behaviour of another. In repeated measures designs (in which participants are measured in more than one experimental condition), we expect scores in the experimental conditions to be non‐independent for a given participant, but behaviour between different participants should be independent. The assumptions of interval data and independent measurements are, unfortunately, tested only by common sense. These assumptions of homogeneity of variance and Normal Distribution can be tested more objectively. The remainder of this handout explain how to explore data and to test these assumptions. Using Histograms to Spot the Obvious Mistakes A biologist was worried about the potential health effects of music festivals. So, one year she went to the Download Music Festival (http://www.downloadfestival.co.uk ) and measured the hygiene of 810 concert goers over the three days of the festival. In theory each person was measured on each day but because it was difficult to track people down, there were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t worry it wasn’t licking the person’s armpit) that results in a score ranging between 0 (you smell like a rotting corpse that’s hiding up a skunk’s arse) and 5 (you smell of sweet roses on a fresh spring day). Now I know from bitter

Upload: others

Post on 24-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page1

ExploringData

AssumptionsofParametricData

If you use a parametric test and certain assumptions are not met then the results are likely to be inaccurate.Therefore, it isvery important thatyoucheck theassumptionsbeforedecidingwhichstatistical test isappropriate.Theassumptionsofparametrictestsbasedonthenormaldistributionareasfollows:

→ Normallydistributeddata:Thisassumptionistrickyandmisunderstoodbecauseitmeansdifferentthingsindifferent contexts (see Field, 2009). In short, the rationale behind hypothesis testing relies on havingsomethingthatisnormallydistributed:insomecasesit’sthesamplingdistribution,inotherstheerrorsinthemodelandsoon. Ifthisassumptionisnotmetthenyourstatisticalmodelortest isusuallyflawedinsomeway. The reason why this assumption is misunderstood is because people often believe that the datathemselvesneed tobenormallydistributed;actually, the rawdata rarelyneed tobenormallydistributed,butiftheyarethenitusuallymeansthatthethingsthatweactuallywanttobenormallydistributedaretoo.Confused? Well, it is confusing. In many statistical tests (e.g. the t‐test) we assume that the samplingdistributionisnormallydistributed.Thisisaproblembecausewedon’thaveaccesstothisdistribution—wecan’tsimplylookatitsshapeandseewhetheritisnormallydistributed.However,weknowfromthecentrallimittheorem(seemybookifyoudon’tknowwhatthatis)thatifthesampledataareapproximatelynormalthenthesamplingdistributionwillbealso.Therefore,peopletendtolookattheirsampledatatoseeiftheyare normally distributed. If so, then they have a little party to celebrate and assume that the samplingdistribution(whichiswhatactuallymatters)isalso.Wealsoknowfromthecentrallimittheoremthatinbigsamples the sampling distribution tends to be normal anyway— regardless of the shape of the data weactually collected (and the sampling distribution will tend to be normal regardless of the populationdistribution insamplesof30ormore).Asoursamplegetsbiggerthen,wecanbemoreconfidentthatthesamplingdistributionisnormallydistributed.

→ Homogeneity of variance: This assumptionmeans that the variances should be the same throughout thedata. Indesigns inwhichyou test severalgroupsofparticipants thisassumptionmeans thateachof thesesamplescomefrompopulationswiththesamevariance.Incorrelationaldesigns,thisassumptionmeansthatthevarianceofonevariableshouldbestableatalllevelsoftheothervariable.

→ Intervaldata:Datashouldbemeasuredatleastattheintervallevel.Thismeansthatthedistancebetweenpointsofyourscaleshouldbeequalatallpartsalongthescale.Forexample, ifyouhada10‐pointanxietyscale,thenthedifferenceinanxietyrepresentedbyachangeinscorefrom2to3shouldbethesameasthatrepresentedbyachangeinscorefrom9to10.

→ Independence:Thisassumptionisthatdatafromdifferentparticipantsare independent,whichmeansthatthebehaviourofoneparticipantdoesnotinfluencethebehaviourofanother.Inrepeatedmeasuresdesigns(in which participants are measured in more than one experimental condition), we expect scores in theexperimental conditions to be non‐independent for a given participant, but behaviour between differentparticipantsshouldbeindependent.

Theassumptionsofintervaldataandindependentmeasurementsare,unfortunately,testedonlybycommonsense.These assumptions of homogeneity of variance and Normal Distribution can be tested more objectively. Theremainderofthishandoutexplainhowtoexploredataandtotesttheseassumptions.

UsingHistogramstoSpottheObviousMistakes

Abiologistwasworriedaboutthepotentialhealtheffectsofmusicfestivals.So,oneyearshewenttotheDownloadMusicFestival (http://www.downloadfestival.co.uk)andmeasured thehygieneof810concertgoersover the threedays of the festival. In theory eachpersonwasmeasuredon eachday but because itwas difficult to track peopledown,thereweresomemissingdataondays2and3.Hygienewasmeasuredusingastandardisedtechnique(don’tworryitwasn’t lickingtheperson’sarmpit)thatresultsinascorerangingbetween0(yousmelllikearottingcorpsethat’s hiding up a skunk’s arse) and 5 (you smell of sweet roses on a fresh spring day). Now I know from bitter

Page 2: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page2

experiencethatsanitationisnotalwaysgreatattheseplaces(Readingfestivalseemsparticularlybad…)andsothisresearcherpredictedthatpersonalhygienewouldgodowndramaticallyoverthethreedaysofthefestival.ThedatafileiscalledDownloadFestival.sav.

ToplotahistogramweuseChartBuilder(seelastweek’shandout)whichisaccessedthroughthe menu.SelectHistograminthelistlabelledChoosefromtobringupthegalleryshowninFigure1.Thisgalleryhasfouriconsrepresentingdifferenttypesofhistogram,andyoushouldselecttheappropriateoneeitherbydouble‐clickingonit,orbydraggingitontothecanvasintheChartBuilder:

→ Simplehistogram:Usethisoptionwhenyoujustwanttoseethefrequenciesofscoresforasinglevariable(i.e.mostofthetime).

→ Stackedhistogram: Ifyouhadagroupingvariable (e.g.whethermenorwomenattendedthe festival)youcouldproduceahistograminwhicheachbarissplitbygroup.Intheexampleofgender,thisisagoodwaytocomparetherelativefrequencyofscoresacrossgroups(e.g.weretheremoresmellywomenthanmen?).

→ Frequency Polygon: This option displays the samedata as the simple histogramexcept that it uses a lineinsteadofbarstoshowthefrequency,andtheareabelowthelineisshaded.

→ PopulationPyramid:Likeastackedhistogramthisshowstherelativefrequencyofscoresintwopopulations.Itplotsthevariable(inthiscasehygiene)ontheverticalaxisandthefrequenciesforeachpopulationonthehorizontal:thepopulationsappearbacktobackonthegraph.Ifthebarseithersideofthedividinglineareequallylongthenthedistributionshaveequalfrequencies.

Figure1:Thehistogramgallery

Wearegoingtodoasimplehistogramsodouble‐clickontheiconforasimplehistogram(Figure1).TheChartBuilderdialogboxwillnowshowapreviewofthegraphinthecanvasarea.Atthemomentit’snotveryexciting(topofFigure)becausewehaven’ttoldSPSS/PASWwhichvariableswewanttoplot.Notethatthevariablesinthedataeditorarelistedontheleft‐handsideoftheChartBuilder,andanyofthesevariablescanbedraggedintoanyofthedropzones(spacessurroundedbybluedottedlines).

Ahistogramplots a single variable (x‐axis) against the frequencyof scores (y‐axis), soallweneed todo is select avariablefromthelistanddragitinto .Let’sdothisforthehygienescoresonday1ofthefestival.Clickon this variable in the list and drag it to as shown in Figure 2; you will now find the histogrampreviewedonthecanvas.Todrawthehistogramclickon .

TheresultinghistogramisshowninFigure3andthefirstthingthatshouldleapoutatyouisthatthereappearstobeonecasethat isverydifferenttotheothers.Allofthescoresappeartobesquisheduponeendofthedistributionbecause they are all less than5 (yieldingwhat is knownas a leptokurtic distribution!) except forone,whichhas avalue of 20! This is anoutlier: a score very different to the rest. Outliers bias themean and inflate the standarddeviationandscreeningdataisanimportantwaytodetectthem.What’soddaboutthisoutlieristhatithasascoreof20,whichisabovethetopofourscale(rememberourhygienescalerangedonlyfrom0‐5)andsoitmustbeamistake(or thepersonhadobsessivecompulsivedisorderandhadwashed themselves intoa stateofextremecleanliness).However,with810cases,howonearthdowefindoutwhichcaseitwas?Youcouldjustlookthroughthedata,butthatwouldcertainlygiveyouaheadacheandsoinsteadwecanuseaboxplot.

SimpleHistogram

StackedHistogram

FrequencyPolygonHistogra

m

PopulationPyramid

Page 3: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page3

Figure2:DefiningahistogramintheChartBuilder

Figure3

Boxplots

You encountered boxplots or box‐whisker diagrams in first year. At the centre of the plot is themedian, which issurroundedbyaboxthetopandbottomofwhicharethelimitswithinwhichthemiddle50%ofobservationsfall(theinterquartile range).Stickingoutof thetopandbottomof theboxaretwowhiskerswhichextendtothemostandleastextremescoresrespectively.

Access theChartBuilder ( ), selectBoxplot in the list labelledChoose from tobringup thegalleryshowninFigure4.Therearethreetypesofboxplotyoucanchoose:

→ Simpleboxplot:Usethisoptionwhenyouwanttoplotaboxplotofasinglevariable,butyouwantdifferentboxplots produced for different categories in thedata (for thesehygienedatawe couldproduce separateboxplotsformenandwomen).

→ Clustered boxplot: This option is the same as the simple boxplot except that you can select a secondcategorical variable onwhich to split thedata. Boxplots for this second variable are produced in differentcolours.Forexample,wemighthavemeasuredwhetherourfestival‐goerwasstaying inatentoranearby

ClickontheHygieneDay1variableanddragittothis

DropZone.

Page 4: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page4

hotel during the festival. We could produce boxplots not just for men and women, but within men andwomenwecouldhavedifferent‐colouredboxplots for thosewho stayed in tentsand thosewho stayed inhotels.

→ 1‐DBoxplot:Usethisoptionwhenyoujustwanttoseeaboxplotforasinglevariable.(Thisdiffersfromthesimpleboxplotonlyinthatnocategoricalvariableisselectedforthex‐axis.)

Figure4:Theboxplotgallery

In the data file of hygiene scores we also have information about the gender of the concert‐goer. Let’s plot thisinformation aswell. Tomake our boxplot of the day 1 hygiene scores formales and females, double‐click on thesimple boxplot icon (Figure 4), then from the variable list select the hygiene day 1 score variable and drag it into

andselectthevariablegenderanddragitto .ThedialogshouldnowlooklikeFigure5—notethatthevariablenamesaredisplayedinthedropzones,andthecanvasnowdisplaysapreviewofourgraph(e.g.therearetwoboxplotsrepresentingeachgender).Clickon to produce the graph.

Figure5:Completeddialogboxforasimpleboxplot Figure 6: Boxplot of hygiene scores on day 1 of theDownloadFestivalsplitbygender

TheresultingboxplotisshowninFigure6.Itshowsaseparateboxplotforthemenandwomeninthedata.Youmayrememberthatthewholereasonthatwegotintothisboxplotmalarkeywastohelpustoidentifyanoutlierfromourhistogram(ifyouhaveskippedstraighttothissectionthenyoumightwanttobacktrackabit).Theimportantthingtonoteisthattheoutlierthatwedetectedinthehistogramisshownupasanasterisk(*)ontheboxplotandnexttoitisthenumberofthecase(611)that’sproducingthisoutlier.(Wecanalsotellthatthiscasewasafemale.)Ifwegoto

thedataeditor (dataview),wecan locatethiscasequicklybyclickingon andtyping611 in thedialogboxthatappears.Thattakesusstraighttocase611.Lookingatthiscaserevealsascoreof20.02,whichisprobablyamistyping

SimpleBoxplot

ClusteredBoxplot

1‐DBoxplot

Page 5: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page5

of2.02.We’dhavetogobacktotherawdataandcheck.We’llassumewe’vecheckedtherawdataanditshouldbe2.02,soreplacethevalue20.02withthevalue2.02beforewecontinuethisexample.

Nowthatwe’veremovedthemistakelet’sre‐plotthehistogram.Whilewe’reatitweshouldplothistogramsforthedatafromday2andday3ofthefestivalaswell.

Figure 7 shows the resulting three histograms. The first thing to note is that the data fromday 1 look a lotmorehealthynowwe’veremovedthedatapointthatwasmis‐typed.Infactthedistributionisamazinglynormal‐looking:itisnicelysymmetricalanddoesn’tseemtoopointyorflat—thesearegoodthings!However,thedistributionsfordays2and3arenotnearlysosymmetrical. Infact,theybothlookpositivelyskewed.Thissuggeststhatgenerallypeoplebecamesmellierasthefestivalprogressed.Theskewoccursbecauseasubstantialminorityinsistedonupholdingtheirlevelsofhygiene(againstallodds!)overthecourseofthefestival(babywet‐wipesareindispensableIfind).However,theseskeweddistributionsmightcauseusaproblemifwewanttouseparametrictests.Inthenextsectionwe’lllookatwaystotrytoquantifytheskewnessandkurtosis(i.e.howpointyorflatthedistributionis)ofthesedistributions.

Figure7:HistogramsofthehygienescoresoverthethreedaysoftheDownloadFestival

Quantifyingnormalitywithnumbers

SkewandKurtosis

To further explore the distribution of the variables, we can use the frequencies command (). The main dialog box is shown in Figure 8. The variables in the data

editor are listedon the left‐hand side, and they canbe transferred to thebox labelledVariable(s) by clickingon a

variable(orhighlightingseveralwiththemouse)andthenclickingon .IfavariablelistedintheVariable(s)boxisselectedusingthemouse,itcanbetransferredbacktothevariablelistbyclickingonthearrowbutton(whichshouldnowbepointingintheoppositedirection).Bydefault,SPSS/PASWproducesafrequencydistributionofallscoresintableform.However,therearetwootherdialogboxesthatcanbeselectedthatprovideotheroptions.Thestatisticsdialogboxisaccessedbyclickingon ,andthechartsdialogboxisaccessedbyclickingon .

Page 6: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page6

Thestatisticsdialogboxallowsyoutoselectseveralwaysinwhichadistributionofscorescanbedescribed,suchasmeasures of central tendency (mean,mode,median),measures of variability (range, standard deviation, variance,quartile splits), measures of shape (kurtosis and skewness). To describe the characteristics of the data we shouldselect themean, mode, median, standard deviation, variance and range. To check that a distribution of scores isnormal,weneedtolookatthevaluesofkurtosisandskewness.Thechartsoptionprovidesasimplewaytoplotthefrequencydistributionofscores(asabarchart,apiechartorahistogram).We’vealreadyplottedhistogramsofourdatasowedon’tneed toselect theseoptions,butyoucoulduse theseoptions in futureanalyses.Whenyouhaveselectedtheappropriateoptions,returntothemaindialogboxbyclickingon .Onceinthemaindialogbox,clickon toruntheanalysis.

Figure8

SPSS/PASWOutput1showsthetableofdescriptivestatisticsforthethreevariablesinthisexample.Fromthistable,wecanseethat,onaverage,hygienescoreswere1.77(outof5)onday1ofthefestival,butwentdownto0.96and0.98onday2and3respectively.Theotherimportantmeasuresforourpurposesaretheskewnessandthekurtosis,bothofwhichhaveanassociatedstandarderror.

The values of skewness and kurtosis should be zero in a normal distribution. Positive values of skewnessindicateapile‐upofscoresontheleftofthedistribution,whereasnegativevaluesindicateapile‐upontheright.

Positivevaluesofkurtosisindicateapointydistributionwhereasnegativevaluesindicateaflatdistribution.

Thefurtherthevalueisfromzero,themorelikelyitisthatthedataarenotnormallydistributed.

However,theactualvaluesofskewnessandkurtosisarenot,inthemselves,informative.Instead,weneedtotakethevalueandconvertittoaz‐scorebysubtractingthemean(whichis,inbothcaseszero)anddividingbythestandarderror.

Page 7: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page7

Intheaboveequations,thevaluesofS(skewness)andK(kurtosis)andtheirrespectivestandarderrorsareproducedby SPSS/PASW. These z‐scores can be compared against known values for the normal distribution shown in Field(2009).Anabsolutevaluegreaterthan1.96issignificantatp<.05,above2.58issignificantatp<.01andabsolutevaluesaboveabout3.29aresignificantatp<.001.Largesampleswillgiverisetosmallstandarderrorsandsowhensamplesizesarebigsignificantvaluesarisefromevensmalldeviationsfromnormality.Inmostsamplesit’sOktolookforvaluesabove1.96,however, in largesamplesthiscriterionshouldbeincreasedto2.58or3.29andinverylargesamples,becauseoftheproblemofsmallstandarderrorsthatI’vedescribed,nocriterionshouldbeappliedatall!

UsingthevaluesinSPSS/PASWOutput1,calculatethez‐scoresforskewnessandKurtosisforeachdayoftheDownloadfestival.

YourAnswers:

SPSS/PASWOutput1

TheKolmogorov‐SmirnovTest

TheKolmogorov‐SmirnovandShapiro‐Wilk testscomparethescores inthesampletoanormallydistributedsetofscoreswiththesamemeanandstandarddeviation.

→ Ifthetestisnon‐significant(p>.05)ittellsusthatthedistributionofthesampleisnotsignificantlydifferentfromanormaldistribution(i.e.itisprobablynormal).

→ If,however, thetest issignificant (p< .05)thenthedistribution inquestion issignificantlydifferentfromanormaldistribution(i.e.itisnon‐normal).

These tests seem great: in one easy procedure they tell us whether our scores are normally distributed (nice!).However,theyhavetheirlimitationsbecausewithlargesamplesizesitisveryeasytogetsignificantresultsfromsmalldeviationsfromnormality.Thetakehomemessageisthatbyallmeansusethesetests,butplotyourdataaswellandtrytomakeaninformeddecisionabouttheextentofnon‐normality.

Page 8: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page8

The Kolmogorov‐Smirnov (K‐S from now on) test can be accessed through the explore command (

).Figure9showsthedialogboxesfortheexplorecommand.First,enteranyvariablesof interest in thebox labelledDependent List byhighlighting themon the left‐hand sideand transferringthembyclickingon (ordraggingthemacross).Forthisexample,selectthehygienescoresforeachday.It isalsopossibletoselectafactor(orgroupingvariable)bywhichtosplittheoutput(so,wecouldtransfergenderintotheboxlabelled Factor List and SPSS/PASW will produce exploratory analysis for each group—a bit like the split filecommand).Ifyouclickon adialogboxappears,butthedefaultoptionisfine.Themoreinterestingoptionforourpurposes isaccessedbyclickingon . Inthisdialogboxselecttheoption ,andthiswillproduceboththeK‐Stest.Bydefault,SPSS/PASWwillproduceboxplots(splitaccordingtogroupifafactorhasbeen specified)and stemand leafdiagramsaswell. If you clickon you’ll see that,bydefault, SPSS/PASW‘excludescaseslistwise’.Thismeansthatifacaseofdataismissingforanyofthevariableswehaveselecteditwillexcludethatcaseentirely!Thisisnotdesirableherebecauseforhygieneonday1(forexample)wewanttoseethedistributionofallpeople,not just thepeoplewhoweretestedonall threedays.So,changethisoptionto ‘Excludecasespairwise’ and then for each variableweanalyse, SPSS/PASWwill excludepeopleonly if there is notdata forthemon that particular variable. Click on to return to themain dialog box and then click to run theanalysis.

Figure9

AretheresultsoftheK‐Stestssurprisinggiventhehistogramswehavealreadyseen?

YourAnswers:

Page 9: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page9

ThefirsttableproducedbySPSS/PASWcontainsdescriptivestatistics(meanetc.)andshouldhavethesamevaluesasthetablesobtainedusingthefrequenciesprocedureearlier.The importanttable isthatoftheKolmogorov‐Smirnovtest(SPSS/PASWOutput2).Thistableincludestheteststatisticitself,thedegreesoffreedom(whichshouldequalthesamplesize)andthesignificancevalueofthistest.Rememberthatasignificantvalue(Sig. lessthan.05) indicatesadeviation fromnormality.TheK‐S test ishighlysignificant for the last twodistributions indicating that theyarenotnormal.

SPSS/PASWOutput2

The test statistic for the K‐S test is denoted byD and so we can report these results as: Thehygienescoresonday1,D(810)=0.03,ns,didnotsignificantlydeviatefromnormality;however,day2,D(264)=0.12,p< .001,andday3,D(123)=0.14,p< .001scoresweresignificantlynon‐normal.Thenumbersinbracketsarethedegreesoffreedom(df)fromthetable.

HomogeneityofVariance

Different statistical procedures have their own unique way to look for homogeneity of variance: in correlationalanalysis sucha regressionwe tend tousegraphs (see lectures/handoutsonMultipleRegression)and forgroupsofdatawetendtouseatestcalledLevene’stest.Levene’stestteststhehypothesisthatthevariancesinthegroupsareequal(i.e.thedifferencebetweenthevariancesiszero).Therefore,

→ IfLevene’stestissignificantatp≤.05thenwecanconcludethatthenullhypothesisisincorrectandthatthevariances are significantly different—therefore, the assumption of homogeneity of variances has beenviolated.

→ If,however, Levene’s test isnon‐significant (i.e.p > .05) thenwemustaccept thenullhypothesis that thedifferencebetweenthevariancesiszero—thevariancesareroughlyequalandtheassumptionistenable.

Although Levene’s test can be selected as an option inmany of the statistical tests that require it, it can also beexaminedwhenyou’reexploringdata (andstrictlyspeaking it’sbetter toexamine itnowthanwaituntilyourmainanalysis).However,aswiththeK‐Stest(andothertestsofnormality),whenthesamplessizeislarge,smalldifferencesingroupvariancescanproduceaLevene’stestthatissignificantwhenthevariancesarenotparticularlydifferent.

Whatistheassumptionofhomogeneityofvariance?

YourAnswers:

WecangetLevene’stestusingtheexploremenuthatweusedintheprevioussection.Forthisexample,wellusethechick flick data from last week (if you didn’t save your own file last week, you can find the data in the file

Page 10: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page10

ChickFlick.sav).Oncethedataareloaded,use toopenthedialogboxin Figure7. Transfer thearousal variable from the liston the left‐hand side to thebox labelledDependent List: by

clickingthe next tothisbox,andbecausewewanttosplit theoutputbythegroupingvariable tocomparethe

variances,selectthevariablegenderandtransferittotheboxlabelledFactorListbyclickingontheappropriate .Thenclickon toopentheotherdialogbox inFigure10.TogetLevene’stestweneedtoselectoneoftheoptionswhereitsaysSpreadvs.LevelwithLevene’stest.Ifyouselect theLevene’stestiscarriedoutontherawdata(agoodplacetostart).Whenyou’vefinishedwiththisdialogboxclickon toreturntothemainexploredialogboxandthenclick toruntheanalysis.

Figure10

SPSS/PASWOutput3

SPSS/PASWOutput3showstheresults. I’mnotgoingtodwellonthisbecausewewillcomeacrossLevene’stest infutureseminars,sowe’lljustnotethatLevene’stestisnonsignificant(valuesinthecolumnlabelledSig.aremorethan.05)indicatingthatthevariancesarenotsignificantlydifferent(i.e.theyaresimilarandthehomogeneityofvarianceassumptionistenable).

Levene’stestcanbedenotedwiththeletterFandtherearetwodifferentdegreesoffreedom.Assuchyoucanreportit,ingeneralform,asF(df1,df2)=value,sig.So,wecouldsaythevariancesareequal,F(1,38)=0.87,ns.

CorrectingproblemsintheData

Ifyourdataarenotnormallydistributed,therearethingsyoucandototrytocorrecttheproblem.Themainthingistransforming the data. Although some students often (understandably) think that transforming data sounds dodgy(thephrase‘fudgingyourresults’springstosomepeople’sminds!),infactitisn’tbecauseyoudothesamethingtoallof your data. As such, transforming the data won’t change the relationships between variables (the relativedifferences between people for a given variable stay the same), however, it does change the differences betweendifferentvariables(becauseitchangestheunitsofmeasurement).Therefore,evenifyouhaveonlyonevariablethathasaskeweddistribution,youshouldstilltransformanyothervariables inyourdataset ifyou’regoingtocompare

Page 11: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page11

differencesbetweenthatvariableandothersthatyouintendtotransform.So,forexample,withourhygienedata,wemightwanttolookathowhygienelevelschangedacrossthethreedays(thatis,comparethemeanonday1tothemeansondays2and3).

Thedataforday2and3wereskewedandneedtobetransformed,butbecausewemightlatercomparethedatatoscoresonday1,wewouldalsohavetotransformtheday1data(eventhoughitisn’tskewed).Ifwedon’tchangetheday1dataaswell,thenanydifferencesinhygienescoreswefindfromday1today2or3willsimplybeduetoustransformingonevariableandnottheothers.

Whenyouhavepositivelyskeweddata(aswedo)thatneedscorrecting,therearethreetransformationsthatcanbeuseful(andasyou’llseethesecanbeadaptedtodealwithnegativelyskeweddatatoo):

1. Log Transformation (log(Xi)): Taking the logarithm of a set of numbers squashes the right tail of thedistribution.Assuchit’sagoodwaytoreducepositiveskew.However,youcan’tgetaLogvalueofzeroornegativenumbers,soifyourdataincludeanyzerosorproducenegativenumbersyouneedtoaddaconstanttoallofthedatabeforeyoudothetransformation.Forexample,ifyouhavezerosinthedatathendolog(Xi+1),orifyouhavenegativenumbersaddwhatevervaluemakesthesmallestnumberinthedatasetpositive.

2. SquareRootTransformation(√Xi):Takingthesquarerootoflargevalueshasmoreofaneffectthantakingthesquarerootofsmallvalues.Consequently, taking thesquarerootofeachofyourscoreswillbringanylarge scores closer to the centre—rather like the log transformation. As such, this can be a usefulway toreduceapositively‐skeweddata;however,youstillhavethesameproblemwithnegativenumbers(negativenumbersdon’thaveasquareroot).

3. Reciprocal Transformation (1/Xi): Dividing 1 by each score also reduces the impact of large scores. Thetransformedvariablewillhavealowerlimitof0(verylargenumberswillbecomeclosetozero).Onethingtobear in mind with this transformation is that it reverses the scores in the sense that scores that wereoriginally large in thedatasetbecomesmall (close tozero)after the transformation,but scores thatwereoriginallysmallbecomebigafterthetransformation.Forexample,imaginetwoscoresof1and10,afterthetransformation theybecome1/1=1,and1/10=0.1: thesmall scorebecomesbigger than the largescoreafterthetransformation.However,youcanavoidthisbyreversingthescoresbeforethetransformation,byfindingthehighestscoreandchangingeachscoretothehighestscoreminusthescoreyou’relookingat.So,youdoatransformation1/(XHighest–Xi).

Anyoneofthesetransformationscanalsobeusedtocorrectnegativelyskeweddata,butfirstyou’dhavetoreversethescores(that is,subtracteachscorefromthehighestscoreobtained)—ifyoudothis,don’tforgettoreversethescoresbackafterwards!

UsingSPSS/PASW’sComputecommand

Thecomputecommandenablesustocarryoutvariousfunctionsoncolumnsofdatainthedataeditor.Sometypicalfunctionsareaddingscoresacrossseveralcolumns,takingthesquarerootofthescoresinacolumn,orcalculatingthe

meanofseveralvariables.Toaccessthecomputedialogbox,usethemousetoselect .TheresultingdialogboxisshowninFigure11;ithasalistoffunctionsontheright‐handside,acalculator‐likekeyboardinthecentreandablankspacethatI’velabelledthecommandarea.ThebasicideaisthatyoutypeanameforanewvariableinthearealabelledTargetVariableandthenyouwritesomekindofcommandinthecommandareatotellSPSS/PASWhowtocreatethisnewvariable.Youuseacombinationofexistingvariablesselectedfromthelistontheleftandnumericexpressions.So,forexample,youcoulduseitlikeacalculatortoaddvariables(i.e.addtwocolumnsinthedataeditortomakeathird).Therearehundredsofbuilt‐infunctionsthatSPSS/PASWhasgroupedtogether.Inthe dialog box it lists these groups in the area labelled Function group; upon selecting a function group, a list ofavailable functionswithin thatgroupwillappear in thebox labelledFunctionsandSpecialVariables. If youselectafunction, thenadescriptionof that functionappears in thegreybox indicated inFigure11.Youcanenter variablenames into the command area by selecting the variable required from the variables list and then clicking on .Likewise,youcanselectacertainfunctionfromthelistofavailablefunctionsandenteritintothecommandareabyclickingon .

The basic procedure is to first type a variable name in the box labelled Target Variable. You can then click onandanotherdialogboxappears,whereyoucangivethevariableadescriptive label,andwhereyoucan

specifywhetheritisanumericorstringvariable(seeyourhandoutfromweek1).Thenwhenyouhavewrittenyourcommand for SPSS/PASW to execute, click on to run the command and create the new variable. There are

Page 12: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page12

functions for calculatingmeans, standarddeviationsandsumsof columns.We’regoing touse thesquare rootandlogarithmfunctions,whichareusefulfortransformingdatathatareskewed(seeField,2009,formoredetail).

Figure11:Dialogboxforthecomputefunction

LogTransformation

Let’s return to our Download festival data. To transform these data, first open the main compute dialog box by

selecting .Enter thename logday1 into thebox labelledTargetVariableandthenclickon and give the variable a more descriptive name such as Log transformed hygiene scores for day 1 of

Download festival. In the listbox labelledFunctiongroup clickonArithmeticandthen in thebox labelledFunctionsandSpecialVariablesclickonLg10(thisisthelogtransformationtobase10,Lnisthenaturallog)andtransferittothecommandareabyclickingon .Whenthecommandistransferred,itappearsinthecommandareaas‘LG10(?)’andthequestionmark shouldbe replacedwitha variablename (which canbe typedmanuallyor transferred from thevariables list). So replace the questionmarkwith the variableday1 by either selecting the variable in the list anddraggingitacross,clickingon ,orjusttyping‘day1’wherethequestionmarkis.

There isavalueof0 in theoriginaldata (onday2),andthere isno logarithmof thevalue0.Toovercomethisweshouldaddaconstant toouroriginalscoresbeforewetakethe logof thosescores.Anyconstantwilldo,providedthatitmakesallofthescoresgreaterthan0.Inthiscaseourlowestscoreis0inthedatasetsowecansimplyadd1toallofthescoresandthatwillensurethatallscoresaregreaterthanzero.Todothis,makesurethecursorisstillinside the brackets and click on and then . The final dialog box should look like Figure 11. Note that theexpressionreadsLG10(day1+1);thatis,SPSS/PASWwilladdonetoeachoftheday1scoresandthentakethelogoftheresultingvalues.Clickon tocreateanewvariablelogday1containingthetransformedvalues.

Command area

Variable list

Categories of

functions

Functions within the selected category

Use this dialog box to select certain

cases in the data

Description of selected function

Page 13: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page13

Nowyouhaveagoatcreatingsimilarvariables logday2and logday3 for theday2andday3data!

SquareRootTransformation

Tousethesquareroottransformation,wecouldrunthroughthesameprocess,byusinganamesuchassqrtday1andselecting the SQRT(numexpr) function from the list. This will appear in the box labelled Numeric Expression: asSQRT(?),andyoucansimplyreplacethequestionmarkwiththevariableyouwanttochange—inthiscaseday1.ThefinalexpressionwillreadSQRT(day1).

Tryrepeatingthisforday2andday3tocreatevariablescalledsqrtday2andsqrtday3.

ReciprocalTransformation

Finally,ifwewantedtouseareciprocaltransformationonthedatafromday1,wecoulduseanamesuchasrecday1andthensimplyclickon andthen .Ordinarilyyouwouldselectthevariablenamethatyouwanttotransformfromthelistanddragitacross,clickon orjusttypethenameofthevariable.However,theday2datacontainazerovalueandifwetrytodivide1by0thenwe’llgetanerrormessage(youcan’tdivideby0).Assuchweneedtoaddaconstanttoourvariable justaswedidforthe logtransformation.Anyconstantwilldo,but1 isaconvenientnumberforthesedata.So, insteadofselectingthevariablewewanttotransform,clickon .Thisplacesapairofbrackets into the box labelledNumeric Expression; thenmake sure the cursor is between these two brackets andselectthevariableyouwanttotransformfromthelistandtransferitacrossbyclickingon (ortypethenameofthevariable manually). Now click on and then (or type + 1 using your keyboard). The box labelled NumericExpression should now contain the text 1/(day1 + 1). Click on to create a new variable containing thetransformedvalues.

Tryrepeatingthisforday2andday3tocreatevariablescalledrecday2andrecday3.

TheEffectofTransformingData

PlotHistogramsofyourtransformedvalues!

Let’s have a look at our newdistributions (Figure 12).Compare these with theuntransformed distributions inFigure 7. Now, you can seethat all three transformationshave cleaned up the hygienescores for day 2: the positiveskew is gone (the square roottransformation in particularhas been useful). However,

Page 14: Exploring data (2008) - mdthinducollege.org · Psychology): Exploring Data ... were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page14

becauseourhygienescoresonday 1 were more or lesssymmetrical to begin with,theyhavenowbecomeslightlynegatively skewed for the logand square roottransformation, and positivelyskewed for the reciprocaltransformation1! Ifwe’reusingscores from day 2 alone thenwe could use the transformedscores,however,ifwewanttolook at the change in scoresthen we’d have to weigh upwhether the benefits of thetransformation for the day 2scoresoutweighstheproblemsitcreatesintheday1scores—dataanalysiscanbefrustratingsometimes!

Figure12:Day1(left)andday22(right)hygienescoresfromtheDownloadFestival

MultipleChoiceTest

Gotohttp://www.uk.sagepub.com/field3e/MCQ.htmandtestyourselfonthemultiplechoicequestionsforChapter5. If youget anywrong, re‐read thishandout (or Field, 2009,Chapter5) anddo themagainuntil youget themallcorrect.

Thishandoutcontainsmaterialfrom:

Field,A.P.(2009).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(3rdEdition).London:Sage.

ThismaterialiscopyrightAndyField(2000,2005,2009).

1Thereversaloftheskewforthereciprocaltransformationisbecause,asImentionedearlier,thereciprocalhastheeffectofreversingthescores.