histogram - wikipedia, the free encyclopedia.pdf
TRANSCRIPT
-
Histogram
First described by: Karl Pearson
Purpose: To roughly assess the probability distribution of a given variable by depicting the frequencies of observations occurring in certain ranges of values
Histogram
From Wikipedia, the free encyclopedia
A histogram is a graphical representation of the distribution of data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson.[1] To construct a histogram, the first step is to "bin" the range of values, that is, divide the entire range of values into a series of small intervals, and then count how many values fall into each interval. A rectangle is drawn with height proportional to the count and width equal to the bin size, so that the rectangles abut each other. A histogram may also be normalized to display relative frequencies. It then shows the proportion of cases that fall into each of several categories, with the sum of the heights equaling 1. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are usually of equal size.[2] The rectangles of a histogram are drawn so that they touch each other to indicate that the original variable is continuous.[3]
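The construction described above (bin the range, count the values per bin, optionally normalize to relative frequencies) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `histogram_counts` and `relative_frequencies`, the sample values, and the convention that the top edge belongs to the last bin are assumptions, not part of the article.

```python
def histogram_counts(values, low, high, num_bins):
    """Count how many values fall into each of num_bins equal-width bins on [low, high]."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if v < low or v > high:
            continue  # values outside the binned range are ignored
        # Bins are half-open [edge, next_edge); the top edge is put in the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


def relative_frequencies(counts):
    """Normalize counts so that the bar heights sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical sample, chosen only for illustration.
sample = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.8, 4.0, 4.6, 5.0]
counts = histogram_counts(sample, low=1.0, high=5.0, num_bins=4)
freqs = relative_frequencies(counts)
print(counts)  # [2, 3, 2, 3]
print(freqs)
```

With four bins of width 1 on [1, 5], the ten sample values give counts of 2, 3, 2 and 3, and the relative frequencies are those counts divided by 10.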
Histograms give a rough sense of the density of the data, and are often used for density estimation: estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the lengths of the intervals on the x-axis are all 1, then a histogram is identical to a relative frequency plot.
A histogram can be thought of as a simplistic form of kernel density estimation, a technique that uses a kernel to smooth frequencies over the bins. Kernel smoothing yields a smoother probability density function, which will in general more accurately reflect the distribution of the underlying variable. The density estimate could be plotted as an alternative to the histogram, and is usually drawn as a curve rather than a set of boxes.
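To make the comparison with kernel density estimation concrete, here is a minimal sketch of a Gaussian kernel density estimate. The choice of a Gaussian kernel, the bandwidth value, the function name `gaussian_kde` and the sample are all assumptions made for illustration; the article does not prescribe a particular kernel or bandwidth.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical sample and bandwidth, chosen only for illustration.
sample = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.8, 4.0, 4.6, 5.0]
density = gaussian_kde(sample, bandwidth=0.5)
for x in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(x, round(density(x), 4))
```

Unlike the boxes of a histogram, the returned function is smooth in x, and it still integrates to 1 over the real line.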
A variable bin width histogram was introduced by Denby and Mallows (2009) (http://pubs.research.avayalabs.com/pdfs/ALR2007003paper.pdf). Examples of variable bin widths are shown with the Census Bureau data below.
Another alternative is the average shifted histogram (http://www.stat.rice.edu/~scottdw/stat550/HW/hw4/c05.pdf), which is fast to compute and gives a smooth curve estimate of the density without using kernels.
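The idea behind the average shifted histogram can be sketched as follows: build several histograms with the same bin width but with origins shifted by a fraction of that width, then average their density values. The code below is a deliberately naive illustration of that idea, not the fast algorithm from the linked reference; the function name `ash_density`, its parameters and the sample are assumptions.

```python
import math


def ash_density(sample, x, width, shifts, origin=0.0):
    """Average, at point x, the densities of `shifts` histograms with shifted origins."""
    n = len(sample)
    total = 0.0
    for s in range(shifts):
        start = origin + s * width / shifts  # origin of the s-th shifted grid
        bin_index = math.floor((x - start) / width)  # bin of that grid containing x
        count = sum(
            1 for v in sample if math.floor((v - start) / width) == bin_index
        )
        total += count / (n * width)  # density height of that histogram at x
    return total / shifts


# Hypothetical sample, chosen only for illustration.
sample = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.8, 4.0, 4.6, 5.0]
for x in (1.5, 2.5, 3.5, 4.5):
    print(x, round(ash_density(sample, x, width=1.0, shifts=4), 4))
```

Each shifted histogram is itself a density with total area 1, so their average also has area 1, while the steps become finer (here one quarter of the bin width).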
The histogram is one of the seven basic tools of quality control.[4]
Histograms are often confused with bar charts. A bar chart is a plot of categorical variables, and the discontinuity should be indicated by having gaps between the rectangles. Often this is neglected, which may lead to a bar chart being confused for a histogram.
Figure: An example histogram of the heights of 31 Black Cherry trees.
Contents
1 Etymology
2 Examples
3 Mathematical definition
3.1 Cumulative histogram
3.2 Number of bins and width
4 See also
5 References
6 Further reading
7 External links
Etymology
The etymology of the word histogram is uncertain. Sometimes it is said to be derived from the Greek histos 'anything set upright' (as the masts of a ship, the bar of a loom, or the vertical bars of a histogram) and gramma 'drawing, record, writing'. It is also said that Karl Pearson, who introduced the term in 1891, derived the name from "historical diagram".[5]
Examples
This is a toy example:
Bin Count
-3.5 9
-2.5 32
-1.5 109
-0.5 180
0.5 132
1.5 34
2.5 4
3.5 9
The terms used to describe the patterns in a histogram are: symmetric, skewed left or right, unimodal, bimodal or multimodal.
Figure panels: Symmetric, unimodal; Skewed right; Skewed left; Bimodal; Multimodal; Symmetric.
It is a good idea to plot the data using several different bin widths to learn more about it. Here is an example on tips given in a restaurant.
Figure: Tips using a $1 bin width, skewed right, unimodal.
Figure: Tips using a 10c bin width, still skewed right, multimodal with modes at whole-dollar and 50c amounts, indicating rounding, and also some outliers.
Here are a couple more examples.
Figure: Prices of houses sold in Ames in 2009, exhibiting some right skew.
Figure: Aces by players in a grand slam tennis tournament, faceted by gender. There are more aces in the men's game.
The U.S. Census Bureau found that there were 124 million people who work outside of their homes.[6] In their data on the time occupied by travel to work (Table 2 below), the absolute number of people who responded with travel times "at least 30 but less than 35 minutes" is higher than the numbers for the categories above and below it. This is likely due to people rounding their reported journey time. The problem of reporting values as somewhat arbitrarily rounded numbers is a common phenomenon when collecting data from people.
Figure: Histogram of travel time (to work), US 2000 census. Area under the curve equals the total number of cases. This diagram uses Q/width from the table.
Data by absolute numbers
Interval Width Quantity Quantity/width
0 5 4180 836
5 5 13687 2737
10 5 18618 3723
15 5 19634 3926
20 5 17981 3596
25 5 7190 1438
30 5 16369 3273
35 5 3212 642
40 5 4122 824
45 15 9200 613
60 30 6461 215
90 60 3435 57
This histogram shows the number of cases per unit interval as the height of each block, so that the area of each block is equal to the number of people in the survey who fall into its category. The area under the curve represents the total number of cases (124 million). This type of histogram shows absolute numbers, with Q in thousands.
Figure: Histogram of travel time (to work), US 2000 census. Area under the curve equals 1. This diagram uses Q/total/width from the table.
Data by proportion
Interval Width Quantity (Q) Q/total/width
0 5 4180 0.0067
5 5 13687 0.0221
10 5 18618 0.0300
15 5 19634 0.0316
20 5 17981 0.0290
25 5 7190 0.0116
30 5 16369 0.0264
35 5 3212 0.0052
40 5 4122 0.0066
45 15 9200 0.0049
60 30 6461 0.0017
90 60 3435 0.0005
This histogram differs from the first only in the vertical scale. The area of each block is the fraction of the total that each category represents, and the total area of all the bars is equal to 1 (the fraction meaning "all"). The curve displayed is a simple density estimate. This version shows proportions, and is also known as a unit area histogram.
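The two derived columns in the tables above (Q/width and Q/total/width) can be recomputed directly from the interval widths and quantities. The sketch below copies the figures from the tables; the variable names are illustrative assumptions.

```python
# (interval start in minutes, width in minutes, quantity Q in thousands),
# copied from the census tables above.
travel_time = [
    (0, 5, 4180), (5, 5, 13687), (10, 5, 18618), (15, 5, 19634),
    (20, 5, 17981), (25, 5, 7190), (30, 5, 16369), (35, 5, 3212),
    (40, 5, 4122), (45, 15, 9200), (60, 30, 6461), (90, 60, 3435),
]

total = sum(q for _, _, q in travel_time)  # total number of cases, in thousands

# Bar heights for the absolute-number histogram: cases per unit interval.
per_width = [q / w for _, w, q in travel_time]

# Bar heights for the unit-area histogram: proportion per unit interval.
density = [q / total / w for _, w, q in travel_time]

# The area of each bar is height * width; the areas of the second histogram sum to 1.
area = sum(d * w for d, (_, w, _) in zip(density, travel_time))

print(total)  # 124089
print([round(d, 4) for d in density])
print(area)
```

Dividing by the width is what makes the wide 45, 60 and 90 minute intervals comparable with the 5 minute ones: their bars are low even though their quantities are not small.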
In other words, a histogram represents a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies: the height of each is the average frequency density for the interval. The intervals are placed together in order to show that the data represented by the histogram, while exclusive, is also contiguous. (E.g., in a histogram it is possible to have two connecting intervals of 10.5-20.5 and 20.5-33.5, but not two connecting intervals of 10.5-20.5 and 22.5-32.5. Empty intervals are represented as empty and not skipped.)[7]
Figure: An ordinary and a cumulative histogram of the same data. The data shown is a random sample of 10,000 points from a normal distribution with a mean of 0 and a standard deviation of 1.
Mathematical definition
In a more general mathematical sense, a histogram is a function $m_i$ that counts the number of observations that fall into each of the disjoint categories (known as bins), whereas the graph of a histogram is merely one way to represent a histogram. Thus, if we let $n$ be the total number of observations and $k$ be the total number of bins, the histogram $m_i$ meets the following condition:
$$n = \sum_{i=1}^{k} m_i.$$
Cumulative histogram
A cumulative histogram is a mapping that counts the cumulative number of observations in all of the bins up to the specified bin. That is, the cumulative histogram $M_i$ of a histogram $m_j$ is defined as:
$$M_i = \sum_{j=1}^{i} m_j.$$
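As a small illustration, the cumulative histogram of the toy example's counts can be computed as a running sum. The variable names are assumptions; the counts are those listed in the Examples section.

```python
from itertools import accumulate

# Bin counts m_1..m_k of the toy example in the Examples section.
counts = [9, 32, 109, 180, 132, 34, 4, 9]

# Cumulative histogram: M_i is the sum of m_1..m_i.
cumulative = list(accumulate(counts))

print(cumulative)  # [9, 41, 150, 330, 462, 496, 500, 509]
```

The last cumulative value equals the total number of observations n, which here is 509.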
Number of bins and width
There is no "best" number of bins, and different bin sizes can reveal different features of the data. Grouping data is at least as old as Graunt's work in the 17th century, but no systematic guidelines were given[8] until Sturges's work in 1926.[9]
Using wider bins where the density is low reduces noise due to sampling randomness; using narrower bins where the density is high (so the signal drowns the noise) gives greater precision to the density estimation. Thus varying the bin width within a histogram can be beneficial. Nonetheless, equal-width bins are widely used.
Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different bin widths may be appropriate, so experimentation is usually needed to determine an appropriate width. There are, however, various useful guidelines and rules of thumb.[10]
The number of bins $k$ can be assigned directly or can be calculated from a suggested bin width $h$ as:
$$k = \left\lceil \frac{\max x - \min x}{h} \right\rceil.$$
The braces indicate the ceiling function.
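Converting a suggested bin width into a number of bins is a one-line calculation: divide the range of the data by the width and round up. A minimal sketch, with the function name `number_of_bins` and the sample values being assumptions:

```python
import math


def number_of_bins(values, width):
    """Number of bins of the given width needed to cover the range of the data."""
    # Note: for data with zero range this returns 0; callers may want at least 1.
    return math.ceil((max(values) - min(values)) / width)


print(number_of_bins([1.0, 2.5, 5.0], width=1.5))  # 3
print(number_of_bins([0, 10], width=2.5))          # 4
```

In the first call the range is 4 and the width is 1.5, so two bins would not cover the data and the result is rounded up to 3.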
Square-root choice
$$k = \sqrt{n},$$
which takes the square root of the number of data points in the sample (used by Excel histograms and many others).[11]
Sturges' formula
$$k = \lceil \log_2 n + 1 \rceil$$
Sturges' formula[9] is derived from a binomial distribution and implicitly assumes an approximately normal distribution. It implicitly bases the bin sizes on the range of the data and can perform poorly if $n$ is small.
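The two rules that give the number of bins directly from the sample size can be compared with a short sketch. The function names are assumptions, and rounding the square root up to an integer is an assumption too, since the rule as stated does not say how to round.

```python
import math


def sqrt_choice(n):
    """Square-root choice: k = sqrt(n), rounded up here (rounding is an assumption)."""
    return math.ceil(math.sqrt(n))


def sturges(n):
    """Sturges' formula: k = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n)) + 1


for n in (30, 100, 10000):
    print(n, sqrt_choice(n), sturges(n))
```

For n = 100 the square-root choice suggests 10 bins and Sturges' formula 8; for n = 10,000 they suggest 100 and 15, which shows how slowly the logarithmic rule grows with the sample size.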
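Scott's normal reference rule
$$h = \frac{3.5\,\hat{\sigma}}{n^{1/3}},$$
where $\hat{\sigma}$ is the sample standard deviation. Scott's normal reference rule[14] is optimal for random samples of normally distributed data, in the sense that it minimizes the integrated mean squared error of the density estimate.[8]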
Freedman–Diaconis' choice
The Freedman–Diaconis rule is:[15][8]
$$h = 2\,\frac{\operatorname{IQR}(x)}{n^{1/3}},$$
which is based on the interquartile range, denoted by IQR. It replaces $3.5\hat{\sigma}$ of Scott's rule with 2 IQR, which is less sensitive than the standard deviation to outliers in data.
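The difference in outlier sensitivity between the two bin-width rules can be illustrated with a sketch that uses Python's standard `statistics` module. The function names, the test data and the use of the module's default quartile method are assumptions; other quartile conventions give slightly different IQR values.

```python
import statistics


def scott_width(values):
    """Scott's rule: h = 3.5 * (sample standard deviation) / n^(1/3)."""
    return 3.5 * statistics.stdev(values) / len(values) ** (1 / 3)


def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


# Hypothetical data: the integers 1..100, then the same data plus one extreme value.
data = list(range(1, 101))
with_outlier = data + [10000]

print(scott_width(data), scott_width(with_outlier))
print(freedman_diaconis_width(data), freedman_diaconis_width(with_outlier))
```

Adding the single extreme value inflates the standard deviation, and with it Scott's bin width, many times over, while the interquartile range and the Freedman–Diaconis width barely change.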
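Choice based on minimization of an estimated L2[16] risk function
$$\underset{h}{\operatorname{arg\,min}} \; \frac{2\bar{m} - v}{h^2},$$
where $\bar{m}$ and $v$ are the mean and biased variance of a histogram with bin width $h$, $\bar{m} = \frac{1}{k}\sum_{i=1}^{k} m_i$ and $v = \frac{1}{k}\sum_{i=1}^{k} (m_i - \bar{m})^2$.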
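A simple way to apply this criterion is to evaluate the cost (2 × mean − variance) / h² for a set of candidate bin counts and keep the one with the lowest cost. The sketch below assumes equal-width bins spanning the range of the data and a hypothetical candidate list; the function names are illustrative and this is not the reference implementation of the cited method.

```python
def bin_counts(values, num_bins):
    """Counts and bin width for num_bins equal-width bins spanning the data range."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        index = min(int((v - low) / width), num_bins - 1)  # top edge -> last bin
        counts[index] += 1
    return counts, width


def l2_cost(values, num_bins):
    """Estimated L2 risk (2 * mean - variance) / h^2 for the given number of bins."""
    counts, width = bin_counts(values, num_bins)
    mean = sum(counts) / num_bins
    variance = sum((c - mean) ** 2 for c in counts) / num_bins  # biased: divides by k
    return (2 * mean - variance) / width ** 2


def best_num_bins(values, candidates):
    """Candidate number of bins with the smallest estimated cost."""
    return min(candidates, key=lambda k: l2_cost(values, k))


# Hypothetical data, chosen only for illustration.
data = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.8, 4.0, 4.6, 5.0]
print(best_num_bins(data, range(1, 8)))
```

The search is a plain scan over candidates, so it assumes the data are not all identical (the bin width must be positive) and that the candidate range is chosen by the user.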
Remark
A good reason why the number of bins should be proportional to $n^{1/3}$ is the following: suppose that the data are obtained as $n$ independent realizations of a bounded probability distribution with smooth density. Then the histogram remains equally rugged as $n$ tends to infinity. If $s$ is the width of the distribution (e.g., the standard deviation or the interquartile range), then the number of units in a bin (the frequency) is of order $nh/s$ and the relative standard error is of order $\sqrt{s/(nh)}$. Comparing to the next bin, the relative change of the frequency is of order $h/s$, provided that the derivative of the density is non-zero. These two are of the same order if $h$ is of order $s/n^{1/3}$, so that $k$ is of order $n^{1/3}$.
This simple cubic root choice can also be applied to bins with non-constant width.
See also
Data binning
Density estimation
Kernel density estimation, a smoother but more complex method of density estimation
Freedman–Diaconis rule
Image histogram
Pareto chart
Seven Basic Tools of Quality
V-optimal histograms
References
1. ^ Pearson, K. (1895). "Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material". Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 186: 343–414. Bibcode: 1895RSPTA.186..343P (http://adsabs.harvard.edu/abs/1895RSPTA.186..343P). doi:10.1098/rsta.1895.0010 (http://dx.doi.org/10.1098%2Frsta.1895.0010).
2. ^ Howitt, D. and Cramer, D. (2008) Statistics in Psychology. Prentice Hall.
3. ^ Charles Stangor (2011) "Research Methods For The Behavioral Sciences". Wadsworth, Cengage Learning. ISBN 9780840031976.
4. ^ Nancy R. Tague (2004). "Seven Basic Quality Tools" (http://www.asq.org/learnaboutquality/sevenbasicqualitytools/overview/overview.html). The Quality Toolbox. Milwaukee, Wisconsin: American Society for Quality. p. 15. Retrieved 2010-02-05.
5. ^ M. Eileen Magnello (December 2006). "Karl Pearson and the Origins of Modern Statistics: An Elastician becomes a Statistician" (http://www.rutherfordjournal.org/article010107.html). The New Zealand Journal for the History and Philosophy of Science and Technology. 1 volume. OCLC 682200824 (https://www.worldcat.org/oclc/682200824).
6. ^ US 2000 census (http://www.census.gov/prod/2004pubs/c2kbr33.pdf).
7. ^ Dean, S., & Illowsky, B. (2009, February 19). Descriptive Statistics: Histogram. Retrieved from the Connexions Web site: http://cnx.org/content/m16298/1.11/
8. ^ a b c Scott, David W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley.
9. ^ a b Sturges, H. A. (1926). "The choice of a class interval". Journal of the American Statistical Association: 65–66. JSTOR 2965501 (https://www.jstor.org/stable/2965501).
10. ^ e.g. § 5.6 "Density Estimation", W. N. Venables and B. D. Ripley, Modern Applied Statistics with S (2002), Springer, 4th edition. ISBN 0387954570.
11. ^ EXCEL 2007: Histogram (http://cameron.econ.ucdavis.edu/excel/ex11histogram.html)
12. ^ Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University (chapter 2 "Graphing Distributions", section "Histograms")
13. ^ Doane DP (1976) Aesthetic frequency classification. American Statistician, 30: 181–183
14. ^ Scott, David W. (1979). "On optimal and data-based histograms". Biometrika 66 (3): 605–610. doi:10.1093/biomet/66.3.605 (http://dx.doi.org/10.1093%2Fbiomet%2F66.3.605).
15. ^ Freedman, David; Diaconis, P. (1981). "On the histogram as a density estimator: L2 theory". Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57 (4): 453–476. doi:10.1007/BF01025868 (http://dx.doi.org/10.1007%2FBF01025868).
16. ^ Shimazaki, H.; Shinomoto, S. (2007). "A method for selecting the bin size of a time histogram" (http://www.mitpressjournals.org/doi/abs/10.1162/neco.2007.19.6.1503). Neural Computation 19 (6): 1503–1527. doi:10.1162/neco.2007.19.6.1503 (http://dx.doi.org/10.1162%2Fneco.2007.19.6.1503). PMID 17444758 (https://www.ncbi.nlm.nih.gov/pubmed/17444758).
Further reading
Lancaster, H. O. An Introduction to Medical Statistics. John Wiley and Sons. 1974. ISBN 0471512508
External links
Journey To Work and Place Of Work (http://www.census.gov/population/www/socdemo/journey.html) (location of census document cited in example)
Smooth histogram for signals and images from a few samples (http://www.mathworks.com/matlabcentral/fileexchange/30480histconnect)
Histograms: Construction, Analysis and Understanding with external links and an application to particle Physics. (http://quarknet.fnal.gov/toolkits/ati/histograms.html)
A Method for Selecting the Bin Size of a Histogram (http://2000.jukuin.keio.ac.jp/shimazaki/res/histogram.html)
Interactive histogram generator (http://www.shodor.org/interactivate/activities/histogram/)
Matlab function to plot nice histograms (http://www.mathworks.com/matlabcentral/fileexchange/27388plotandcomparenicehistogramsbydefault)
Dynamic Histogram in MS Excel (http://excelandfinance.com/histograminexcel/)
Histogram construction (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1) and manipulation (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_PowerTransformFamily_Graphs) using Java applets, and charts (http://www.socr.ucla.edu/htmls/SOCR_Charts.html) on SOCR
Retrieved from "http://en.wikipedia.org/w/index.php?title=Histogram&oldid=641376957"
Categories: Statistical charts and diagrams; Quality control tools; Estimation of densities; Nonparametric statistics
This page was last modified on 7 January 2015 at 08:42. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.