![Page 1: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/1.jpg)
Statistical Thinking
Based on C. J. Wild and M. Pfannkuch (1999). Statistical Thinking in Empirical Enquiry, International Statistical Review, 67(3):223-265.
+ Professor Matt Waite's notes
![Page 2: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/2.jpg)
Basic Ideas
• Thought processes involved in statistical problem solving
  • From problem formulation to conclusions
• A four-dimensional framework for statistical thinking in empirical enquiry
  • Investigative cycle
  • Interrogative cycle
  • Types of thinking
  • Dispositions
• Central element: “variation”
![Page 3: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/3.jpg)
Four-Dimensional Framework
![Page 4: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/4.jpg)
Dimension 1: The Investigative Cycle
• Concerned with abstracting and solving a statistical problem grounded in a larger “real” problem
• Based on the PPDAC model (Problem, Plan, Data, Analysis, Conclusions)
![Page 5: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/5.jpg)
Dimension 2: Types of Thinking
• Variation
  • Thinking which is statistical is concerned with learning and decision making under uncertainty
    • For the purposes of explanation, prediction, or control
![Page 6: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/6.jpg)
Dimension 2: More on Variation | Sources
![Page 7: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/7.jpg)
Dimension 2: More on Variation | Prediction, Explain, Control
![Page 8: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/8.jpg)
Dimension 2: Summary on Variation
• Special-cause vs. common-cause variation
  • Useful when looking for causes
• Explained vs. unexplained variation
  • Useful when exploring data and building a model for them
• Suppositions
  • Variation is an observable reality
  • Some variation can be explained; other variation cannot be explained on current knowledge
  • Random variation is the way in which statisticians model unexplained variation
  • This unexplained variation may in part or in whole be produced by the process of observation through random sampling
  • Randomness is a convenient human construct which is used to deal with variation in which patterns cannot be detected
![Page 9: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/9.jpg)
Correlation is NOT causation
![Page 10: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/10.jpg)
Dimension 3: The Interrogative Cycle
• Applies at macro levels
• Applies also at very detailed levels of thinking
• Recursive
  • Subcycles are initiated within major cycles
![Page 11: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/11.jpg)
Dimension 4: Dispositions
• When the authors become intensely interested in a problem or area, a heightened sensitivity and awareness develops towards information on the peripheries of our experience that might be related to the problem
  • People are most observant in areas they find most interesting
• Engagement intensifies each dispositional element
![Page 12: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/12.jpg)
Types of Analytics
• Descriptive
  • Describing characteristics or properties in the data
• Predictive
  • Predicting the types of outcomes given new sets of data, usually based on a classifier trained using labelled, existing datasets
• Prescriptive
  • Deciding on the best route, option, or decision to make given the data
![Page 13: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/13.jpg)
Types of Data
• Categorical (cf. Wikipedia)
  • A variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property
    • The blood type of a person: A, B, AB or O
    • The state that a person lives in
    • The political party that a voter might vote for
    • The type of a rock: igneous, sedimentary or metamorphic
  • Ordinal data?
• Numerical
  • Can be subdivided into discrete data (things that can be counted) and continuous data (values that can fall anywhere within a range).
  • # of children, age, scores, temperatures, etc.
![Page 14: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/14.jpg)
Descriptive Statistics
• There are three main groups of descriptives
  • The distribution
    • Works well with categorical data. How many of each thing are there?
  • The central tendency
    • Only works with numerical data. What are the mean, median and mode?
  • The dispersion
    • Only works with numerical data. How spread out is the data?
![Page 15: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/15.jpg)
Descriptive Statistics: Distribution
• Grouping and counting by categorical data: group and count by town, zip code, or something like that
  • Often called a frequency distribution
  • Histogram
• With numerical data, minimum and maximum values are useful
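A minimal sketch of the grouping-and-counting idea, assuming Python and its standard library. The town names and ages are invented purely for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical observations: the town each record came from.
towns = ["Lincoln", "Omaha", "Lincoln", "Kearney", "Omaha", "Lincoln"]

# Frequency distribution: how many of each category are there?
frequency = Counter(towns)
for town, count in frequency.most_common():
    print(f"{town}: {count}")

# For numerical data, the minimum and maximum bound the distribution.
ages = [23, 35, 41, 19, 52, 35]
lowest, highest = min(ages), max(ages)
print("min:", lowest, "max:", highest)
```

Plotting the counts as bars gives the histogram mentioned above.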
![Page 16: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/16.jpg)
Descriptive Statistics: Central Tendency
• Mean
  • Average or norm: add up all values to find a total, and then divide the total by the number of values
• Median
  • Middle value: sort all values into order, and the median is the middle value; if there are 2 values in the middle, find the mean of these two
• Mode
  • Most frequent value: count how many times each value appears; the mode is the value that appears the most
  • Can have more than one mode
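The three measures can be sketched with Python's standard `statistics` module (assuming Python 3.8 or later for `multimode`). The scores are made-up values chosen so the arithmetic is easy to follow by hand.

```python
import statistics

# Hypothetical scores, already easy to check by hand.
scores = [2, 3, 3, 5, 7, 10]

# Mean: add up all values (30) and divide by how many there are (6).
mean = statistics.mean(scores)

# Median: with an even count, average the two middle values (3 and 5).
median = statistics.median(scores)

# Mode: the most frequent value; multimode returns every tied value.
modes = statistics.multimode(scores)
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean, median, modes, tied_modes)
```

`multimode` is used rather than `mode` because it makes the "more than one mode" case explicit.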
![Page 18: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/18.jpg)
Descriptive Statistics: Dispersion
• Range
  • Difference between the lowest and highest values
  • Subject to extremes (e.g., outliers)
• Standard deviation
  • A measure of how far a set of scores typically lies from the mean
  • Subject to skewness in distribution
• For a Gaussian/normal distribution
  • About 68% of all values will be within 1 standard deviation of the mean
  • About 95% will be within 2 standard deviations
  • About 99.7% will be within 3 standard deviations
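A short sketch of range and standard deviation, plus a check of the normal-distribution coverage figures, assuming Python 3.8+ (`statistics.NormalDist`). The sample values and the helper name `share_within` are illustrative choices, not part of the slides.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample.
values = [4, 8, 15, 16, 23, 42]

# Range: highest minus lowest; a single outlier moves it a lot.
value_range = max(values) - min(values)

# Population standard deviation: typical distance from the mean.
spread = statistics.pstdev(values)

# Share of a normal distribution lying within k standard deviations
# of the mean, computed from the cumulative distribution function.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

The loop shows roughly 68% of values within 1 standard deviation, 95% within 2, and 99.7% within 3.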
![Page 19: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/19.jpg)
Dirty Data
• Missing data
  • Blanks in the database or spreadsheet.
  • Data missing from a period of time.
  • Missing states, counties, zip codes.
• Wrong data
  • Wrong type: numbers where there should be text and vice versa
  • Sharp curves: trends that continue normally and then suddenly jump in one year
  • Conflicting data within a dataset or across datasets (race, percentages, etc.)
• Unusable data
  • Non-standardized data
  • Inconsistent data
  • Abbreviations
  • Unit consistency
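A few of these problems can be flagged mechanically. The sketch below is one possible approach in plain Python; the rows, field names, and the two-letter state rule are all hypothetical, and real cleaning rules depend on the dataset.

```python
# Hypothetical rows with invented values, each showing one kind of dirt.
rows = [
    {"state": "NE", "year": 2019, "population": "100"},    # number stored as text
    {"state": "", "year": 2020, "population": 105},        # blank state
    {"state": "Nebraska", "year": 2021, "population": None},  # missing value, long name
]

def find_problems(row):
    """Return a list of human-readable issues found in one row."""
    problems = []
    # Missing data: blanks or None in any field.
    for field, value in row.items():
        if value is None or value == "":
            problems.append(f"missing {field}")
    # Wrong type: a count that arrived as text.
    if isinstance(row["population"], str) and row["population"] != "":
        problems.append("population stored as text")
    # Non-standardized data: assume two-letter state codes are the standard.
    if row["state"] and len(row["state"]) != 2:
        problems.append("state is not a two-letter abbreviation")
    return problems

report = [find_problems(row) for row in rows]
for row, problems in zip(rows, report):
    print(row["year"], problems)
```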
![Page 20: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/20.jpg)
Correlation
• Pearson correlation coefficient (or Pearson product-moment correlation coefficient)
  • It is a measure of how LINEARLY related two entities are.
  • How often is a change in A related to a change in B? And is that positive or negative?
![Page 21: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/21.jpg)
Correlation: For a population
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Standard deviation of X; standard deviation of Y
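For reference, the standard population definition, as given in the linked article, divides the covariance of X and Y by the product of their standard deviations. The symbols below follow the usual convention and are not taken from the slide text itself:

$$
\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \, \sigma_Y}
           = \frac{\operatorname{E}\!\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \, \sigma_Y}
$$

Here $\mu_X$ and $\mu_Y$ are the population means, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.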
![Page 22: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/22.jpg)
Correlation: For a sample
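For reference, the usual sample version replaces the population quantities with sums over the $n$ observed pairs $(x_i, y_i)$. The notation is the conventional one and is assumed here rather than copied from the slide:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means.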
![Page 23: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/23.jpg)
Correlation: What does it mean?
• It is based on a range from -1 to 1.
• 1 = perfect positive correlation
  • A goes up 1, B goes up 1 (more generally, B rises by a fixed amount for every unit A rises)
  • In the real world, almost never happens outside of a mistake
• 0 = no correlation at all
  • 0 rarely ever happens
  • NEAR zero happens all the time
• -1 = perfect negative correlation
  • A goes up 1, B goes down 1
  • It is just like 1: rare, probably a mistake
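The three reference points can be reproduced with a small hand-written Pearson function. This is a sketch in plain Python; the function name `pearson_r` and the data are illustrative. The third case is deliberately a perfect but non-linear relationship, to show that the coefficient only measures linear association.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of the sample formula).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator terms).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

a = [1, 2, 3, 4, 5]
rises_with_a = [2, 4, 6, 8, 10]           # B goes up whenever A goes up
falls_with_a = [10, 8, 6, 4, 2]           # B goes down whenever A goes up
curved = [(x - 3) ** 2 for x in a]        # depends on A, but not linearly

print(pearson_r(a, rises_with_a))
print(pearson_r(a, falls_with_a))
print(pearson_r(a, curved))
```

The first two calls give 1 and -1; the third gives 0 even though `curved` is fully determined by `a`.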
![Page 24: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/24.jpg)
Significance: t-test
• The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
• A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known
• When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution
• The t-test can be used, for example, to determine if two sets of data are significantly different from each other
https://en.wikipedia.org/wiki/Student%27s_t-test
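As an illustration of the "estimated scaling term", the sketch below computes a two-sample t statistic by hand. It assumes the unequal-variance (Welch) form, which is one common variant; the slides do not say which variant is intended. The function name `welch_t` and the sample values are made up.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (Welch)."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # Sample variances stand in for the unknown population variances;
    # this is the scaling term estimated from the data.
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 3.9, 4.6, 4.0]

t_statistic = welch_t(group_a, group_b)
print(t_statistic)
```

Turning the statistic into a p-value requires the t-distribution itself; a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` handles both steps.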
![Page 25: Statistical Thinking - Computer Science and Engineering](https://reader030.vdocuments.site/reader030/viewer/2022012703/61a52cebdd97de4f6a136b68/html5/thumbnails/25.jpg)
Significance: p-value & null hypothesis
• The p-value is used in the context of null hypothesis testing to quantify the idea of statistical significance of evidence
  • In essence, a claim is assumed valid if its counter-claim is improbable
• The only hypothesis that needs to be specified in this test and which embodies the counter-claim is referred to as the null hypothesis
  • i.e., the hypothesis to be nullified
• A result is said to be statistically significant if it allows us to reject the null hypothesis
  • The statistically significant result should be highly improbable if the null hypothesis is assumed to be true
• The rejection of the null hypothesis implies that the correct hypothesis lies in the logical complement of the null hypothesis
• Caveat: Unless there is a single alternative to the null hypothesis, the rejection of the null hypothesis does not tell us which of the alternatives might be the correct one
https://en.wikipedia.org/wiki/Student%27s_t-test
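One way to make "highly improbable if the null hypothesis is true" concrete is a permutation test. Note that this is a different technique from the t-test, swapped in here only because it needs nothing beyond the standard library. The null hypothesis is that the group labels do not matter; the p-value is estimated as the share of random relabelings that produce a difference in means at least as large as the observed one. The function name, default parameters, and data are all illustrative assumptions.

```python
import random

def permutation_p_value(sample_a, sample_b, rounds=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(rounds):
        # Under the null hypothesis the labels are interchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / rounds

# Hypothetical groups: clearly separated, then identical.
p_separated = permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
p_identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(p_separated, p_identical)
```

A small estimated p-value for the separated groups would lead to rejecting the null hypothesis; for the identical groups every relabeling is at least as extreme as the observed difference of zero, so there is no evidence against it.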