Introduction to descriptive...

Post on 14-Jul-2020


Introduction to Descriptive Statistics

17.871

Spring 2012

Key measures: describing data

              Moment                           Non-mean based measure
Center        Mean                             Mode, median
Spread        Variance (standard deviation)    Range, interquartile range
Skew          Skewness                         --
Peaked        Kurtosis                         --

Key distinction: population vs. sample notation

Population    Sample
Greeks        Romans
μ, σ, β       s, b

Mean

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Variance, Standard Deviation of a Population

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

Variance, S.D. of a Sample

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Degrees of freedom: the divisor n - 1
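A minimal sketch of the degrees-of-freedom point, assuming a made-up five-observation variable named x (neither the values nor the name come from the lecture). Stata's summarize reports the sample variance (divisor n - 1); rescaling by (n - 1)/n gives the population version.

```stata
* Toy data: the values and the variable name x are illustrative assumptions
clear
input x
2
4
4
5
10
end

* summarize stores the sample variance (divisor n-1) in r(Var)
summarize x
scalar n      = r(N)
scalar s2     = r(Var)

* Rescale to get the population variance (divisor n)
scalar sigma2 = s2 * (n - 1) / n

display "sample variance     = " s2
display "population variance = " sigma2
```

With these values the squared deviations sum to 36, so the two variances are 36/4 and 36/5.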

Binary data

$$\bar{X} = \text{prob}(X = 1) = \text{proportion of time } x = 1$$

$$s_x^2 = \bar{x}(1 - \bar{x}), \qquad s_x = \sqrt{\bar{x}(1 - \bar{x})}$$

Example of this, using today’s NBC News/Marist Poll in Michigan

Candidate            Pct.
Santorum             35
Romney               37
Paul                 13
Gingrich             8
[Unaccounted for]    [7]

gen santorum = 1 if candidate=="Santorum"

replace santorum = 0 if candidate~="Santorum"

The command summ santorum produces

Mean = .35

Var = .35(1-.35) = .2275

S.d. = .4769696
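The binary-data shortcut can be checked with a small sketch. It assumes a hypothetical data set of 100 respondents, 35 of whom back Santorum; the poll's actual sample size is not given in the slides. Note that summarize divides by n - 1, so its reported variance will sit slightly above p(1 - p).

```stata
* Hypothetical data set: 100 respondents, 35 of whom back Santorum
clear
set obs 100
generate santorum = (_n <= 35)

summarize santorum
scalar p = r(mean)

* Binary-data shortcut from the slide: variance = p(1-p)
display "mean         = " p
display "p(1-p)       = " p*(1 - p)
display "sqrt(p(1-p)) = " sqrt(p*(1 - p))

* summarize divides by n-1, so r(Var) is slightly larger than p(1-p)
display "r(Var)       = " r(Var)
```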

Normal distribution example

Examples: IQ, SAT, height

“No skew”, “Zero skew”, symmetrical

Mean = median = mode

[Figure: bell curve; y-axis: Frequency, x-axis: Value.]

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

Skewness: asymmetrical distribution

Examples: income, contributions to candidates, populations of countries, “residual vote” rates

“Positive skew”, “Right skew”

[Figure: right-skewed curve; y-axis: Frequency, x-axis: Value.]

[Figure: Distribution of the average $$ of dividends/tax return (in K’s). y-axis: Density; x-axis: dividends_pc (0 to 15). Labeled outlier: Hyde County, SD.]

[Figure: Fuel economy of cars for sale in the US. y-axis: Density; x-axis: var1 (0 to 150). Labeled outlier: Mitsubishi i-MiEV (which is supposed to be all electric).]

Skewness: asymmetrical distribution

Example: GPA of MIT students

“Negative skew”, “Left skew”

[Figure: left-skewed curve; y-axis: Frequency, x-axis: Value.]

[Figure: Placement of Republican Party on 100-point scale. y-axis: Density; x-axis: place on ideological scale - republican party (0 to 100).]

Skewness

[Figure: Placement of Republican Party on 100-point scale. y-axis: Density; x-axis: place on ideological scale - democratic party (0 to 100).]

Mean = 26.8; median = 25; mode = 25

Kurtosis

k > 3 leptokurtic

k = 3 mesokurtic

k < 3 platykurtic

[Figure: curves illustrating the three cases; y-axis: Frequency, x-axis: Value.]

                  Mean    s.d.    Skew.    Kurt.
Self-placement    55.1    26.4    -0.14    2.21
Rep. pty.         26.8    21.2     0.87    3.59
Dem. pty          74.7    21.8    -1.18    4.29

[Figures: three density plots of place on ideological scale (0 to 100): yourself, democratic party, republican party.]

Source: Cooperative Congressional Election Study, 2008

Normal distribution

Skewness = 0

Kurtosis = 3

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

More words about the normal curve

The z-score, or the “standardized score”

$$z = \frac{x - \bar{x}}{\sigma_x}$$
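A short sketch of standardizing a variable in Stata, using made-up values and hypothetical variable names. It computes the z-score by hand and with egen's std() function; both use the sample standard deviation that summarize reports.

```stata
* Toy data; values and variable names are illustrative assumptions
clear
input x
2
4
4
5
10
end

summarize x
* z-score by hand: subtract the mean, divide by the standard deviation
generate z = (x - r(mean)) / r(sd)

* egen's std() function produces the same standardized variable
egen z_egen = std(x)

* Both versions should have mean 0 and s.d. 1
summarize z z_egen
list x z z_egen
```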

Commands in STATA for univariate statistics

summarize varname

summarize varname, detail

histogram varname, bin() start() width() density/fraction/frequency normal

graph box varnames

tabulate
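To see these commands in action, here is a hedged sketch that runs them on the auto data set shipped with Stata. That data set and the variables mpg and foreign are stand-ins chosen for illustration; they are not the lecture's data.

```stata
* Shipped example data (not the lecture's data)
sysuse auto, clear

* Mean, s.d., min, max
summarize mpg

* Adds percentiles, variance, skewness, and kurtosis
summarize mpg, detail
display "skewness = " r(skewness) "   kurtosis = " r(kurtosis)

* Histogram with 2-mpg bins, counts on the y-axis, normal curve overlaid
histogram mpg, width(2) start(12) frequency normal

* Box plot of mpg
graph box mpg

* Frequency table of a categorical variable
tabulate foreign
```

The start() value must not exceed the smallest observed value, which is why it is set to the minimum of mpg here.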

Example of Florida voters

Question: does the age of voters vary by race?

Combine Florida voter extract files, 2008

gen new_birth_date=date(birth_date,"MDY")

gen birth_year=year(new_b)

gen age= 2010-birth_year
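The date conversion can be tried on a few made-up records. The sketch below reuses the slide's variable names, but the birth dates are invented, and it assumes the voter file stores them as month/day/year strings.

```stata
* Toy records; the date strings are made up for illustration
clear
input str10 birth_date
"07/04/1976"
"12/25/1950"
"02/29/1988"
end

* date() turns an MDY string into a Stata daily date (days since 01jan1960)
generate new_birth_date = date(birth_date, "MDY")
format new_birth_date %td

* year() extracts the calendar year from the daily date
generate birth_year = year(new_birth_date)
generate age = 2010 - birth_year

list
```

Strings that date() cannot parse become missing, which is one way implausible birth years can enter the data.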

Look at distribution of birth year

[Figure: histogram of birth year; y-axis: Density, x-axis: birth_year (1850 to 2000).]

Explore age by race

. table race if birth_year>1900,c(mean age)

----------------------
     race |  mean(age)
----------+-----------
        1 |   45.61229
        2 |   42.89916
        3 |    42.6952
        4 |   45.09718
        5 |   52.08628
        6 |   44.77392
        9 |   40.86704
----------------------

3 = Black
4 = Hispanic
5 = White

Graph birth year

. hist age if birth_year>1900

(bin=71, start=9, width=1.3802817)

[Figure: histogram of age; y-axis: Density, x-axis: age (0 to 100).]

Divide into “bins” so that each bar represents 1 year

. hist age if birth_year>1900,width(1)

[Figure: histogram of age with 1-year bins; y-axis: Density, x-axis: age (0 to 100).]

Add ticks at 10-year intervals

histogram totalscore, width(1) xlabel(-.2 (.1) 1)

[Figure: histogram of age with 1-year bins; y-axis: Density, x-axis: age, labeled every 10 years from 20 to 100.]

Superimpose the normal curve (with the same mean and s.d. as the empirical distribution)

hist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal

[Figure: histogram of age with normal curve overlaid; y-axis: Density, x-axis: age, labeled every 10 years from 20 to 100.]

. summ age if birth_year>1900,det

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           18              9
 5%           21             16
10%           24             16       Obs            12612114
25%           34             16       Sum of Wgt.    12612114

50%           48                      Mean           49.47549
                        Largest       Std. Dev.      19.01049
75%           63            107
90%           77            107       Variance       361.3986
95%           83            107       Skewness       .2629496
99%           91            107       Kurtosis       2.222442

Histograms by race

hist age if birth_year>1900&race>=3&race<=5,wid(1) xlabel(20 (10) 100) normal by(race)

[Figure: histograms of age with normal curves, one panel per race (3, 4, 5); y-axis: Density, x-axis: age, labeled every 10 years from 20 to 100. Legend: Density, normal age. Note: Graphs by race.]

3 = Black
4 = Hispanic
5 = White

Main issues with histograms

Proper level of aggregation

Non-regular data categories

Draw the previous graph with a box plot

graph box age if birth_year>1900

[Figure: box plot of age (y-axis 0 to 100). Annotations mark the upper quartile, median, and lower quartile; brackets mark the inter-quartile range (IQR) and 1.5 x IQR.]

Draw the box plots for the different races

graph box age if birth_year>1900&race>=3&race<=5,by(race)

[Figure: box plots of age (y-axis 0 to 100), one panel per race (3, 4, 5). Note: Graphs by race.]

3 = Black
4 = Hispanic
5 = White

Draw the box plots for the different races using “over” option

graph box age if birth_year>1900&race>=3&race<=5,over(race)

[Figure: box plots of age (y-axis 0 to 100) for races 3, 4, and 5, side by side in a single panel.]

3 = Black
4 = Hispanic
5 = White

A note about histograms with unnatural categories

From the Current Population Survey (2000), Voter and Registration Survey

How long (have you/has name) lived at this address?

-9    No Response
-3    Refused
-2    Don't know
-1    Not in universe
 1    Less than 1 month
 2    1-6 months
 3    7-11 months
 4    1-2 years
 5    3-4 years
 6    5 years or longer

Solution, Step 1: Map artificial category onto “natural” midpoint

-9    No Response          missing
-3    Refused              missing
-2    Don't know           missing
-1    Not in universe      missing
 1    Less than 1 month    1/24 = 0.042
 2    1-6 months           3.5/12 = 0.29
 3    7-11 months          9/12 = 0.75
 4    1-2 years            1.5
 5    3-4 years            3.5
 6    5 years or longer    10 (arbitrary)

recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)
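The recode step can be tried on toy data with one made-up response per code. This sketch adds the generate() option so the original codes are kept; writing the midpoints to a variable called longevity is an assumption, chosen because later slides plot a variable with that name.

```stata
* Toy responses covering every code; the data are made up for illustration
clear
input live_length
-9
-3
-2
-1
1
2
3
4
5
6
end

* Keep the original codes and write the midpoints to a new variable
recode live_length (min/-1 = .) (1 = .042) (2 = .29) (3 = .75) ///
    (4 = 1.5) (5 = 3.5) (6 = 10), generate(longevity)

list live_length longevity

* The four negative codes should now be missing
count if missing(longevity)
```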

Graph of recoded data

histogram longevity, fraction

[Figure: histogram of longevity; y-axis: Fraction (0 to .557134), x-axis: longevity (0 to 10).]

Density plot of data

Total area of last bar = .557

Width of bar = 11 (arbitrary)

Solve for: a = w h (or) .557 = 11h => h = .051

[Figure: density plot of longevity; x-axis: longevity (0 to 15).]

Density plot template

Category    Fraction    X-min    X-max    X-length    Height (density)
< 1 mo.     .0156       0        1/12     .082        .19*
1-6 mo.     .0909       1/12     ½        .417        .22
7-11 mo.    .0430       ½        1        .500        .09
1-2 yr.     .1529       1        2        1           .15
3-4 yr.     .1404       2        4        2           .07
5+ yr.      .5571       4        15       11          .05

* = .0156/.082
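The heights in the template follow from height = fraction / width. A minimal sketch that recomputes them from the fractions and bin edges in the table; the variable names are hypothetical, and 1/12 is entered as .0833333, so the first height differs from the slide's in the third decimal.

```stata
* Fractions and bin edges (in years) copied from the template table
clear
input str8 category fraction xmin xmax
"<1 mo."   .0156  0         .0833333
"1-6 mo."  .0909  .0833333  .5
"7-11 mo." .0430  .5        1
"1-2 yr."  .1529  1         2
"3-4 yr."  .1404  2         4
"5+ yr."   .5571  4         15
end

* Bar height is the fraction spread evenly over the bar's width
generate width  = xmax - xmin
generate height = fraction / width

list category fraction width height

* The bar areas (height x width) should add back up to the total fraction
generate area = height * width
summarize area
display "total area = " r(sum)
```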

Three words about pie charts: don’t use them

So, what’s wrong with them?

For non-time-series data, it is hard to compare groups; the eye is very bad at judging the relative size of circle slices.

For time-series data, it is hard to grasp cross-time comparisons.

Some words about graphical presentation

Aspects of graphical integrity (following Edward Tufte, Visual Display of Quantitative Information)

Main point should be readily apparent

Show as much data as possible

Write clear labels on the graph

Show data variation, not design variation

There is a difference between graphs in research and publication

[Figures: two versions of the age histograms by race, one marked “Publish OK” and one marked “Do not publish”. One version uses the raw panel titles 3, 4, 5 with the legend “Density, normal age” and the x-axis label age; the other uses the panel titles Black, Hispanic, White, and Total with the x-axis label Age. Both carry the note “Graphs by race”.]
