Sociology 5811: Lecture 2: Datasets and Simple Descriptive Statistics Copyright © 2005 by Evan Schofer Do not copy or distribute without permission



Sociology 5811: Lecture 2: Datasets and Simple Descriptive Statistics

Copyright © 2005 by Evan Schofer

Do not copy or distribute without permission


Announcements

• 1. Lab meets Monday 1:25, Blegen 440

• 2. Course lecture notes at:
– http://www.soc.umn.edu/~schofer
– Click on “Soc 5811”, go to “Course Files”


From Measurement to Datasets

• Suppose we:
– 1. Choose a unit of analysis
– 2. Choose a measurement strategy
– 3. Take measurements on relevant cases

• Result: We end up with sets of measurements on a group of cases

• Q: What next?

• A: Data is often organized in a spreadsheet:
– Rows contain all measurements on each case
– Columns reflect sets of measurements or “variables”


Datasets: Example

• Suppose we measured 5 people regarding views on gun control and gun ownership:

Person   Views on Gun Control   # Guns owned
1        Favor                  0
2        Oppose                 3
3        Favor                  0
4        Favor                  1
5        Oppose                 1

Rows contain all info on each person (a case)

Columns contain all measurements on a particular topic (a variable)
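One way to picture the rows-and-columns layout is a minimal Python sketch, assuming each case is stored as a dictionary and the dataset as a list of cases. The field names (`person`, `gun_control`, `guns_owned`) are made up for illustration and are not part of the original example.

```python
# Each dictionary is one row (a case); each key is one column (a variable).
dataset = [
    {"person": 1, "gun_control": "Favor", "guns_owned": 0},
    {"person": 2, "gun_control": "Oppose", "guns_owned": 3},
    {"person": 3, "gun_control": "Favor", "guns_owned": 0},
    {"person": 4, "gun_control": "Favor", "guns_owned": 1},
    {"person": 5, "gun_control": "Oppose", "guns_owned": 1},
]

# A row: all measurements on one case.
case_2 = dataset[1]

# A column: one measurement across all cases.
guns_owned = [row["guns_owned"] for row in dataset]

print(case_2)
print(guns_owned)
```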


From Measurement to Datasets

• Issue: To facilitate data analysis, it is best to enter data as numbers, rather than text
– Often called “coding” data

• Less good option: Use text words “Favor” and “Oppose” in our gun control dataset

• Better option: Convert “Favor” and “Oppose” to numeric values
– Example: 1 = favor, 0 = oppose

• Advantage: more computation options

• Disadvantage: Data is harder to interpret by eye


Datasets: Recoded Example

• In this dataset, “Favor” was recoded to 1, “Oppose” to zero.

Person   Views on Gun Control   # Guns owned
1        1                      0
2        0                      3
3        1                      0
4        1                      1
5        0                      1

Note that it is harder to visually determine the meaning of the variable. You have to remember what the numbers mean…
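The recoding step can be sketched in Python as follows. This is only an illustration, assuming the coding scheme from the slides (1 = favor, 0 = oppose) is kept in a small lookup table; the names `CODEBOOK` and `LABELS` are hypothetical.

```python
# Codebook: the mapping from text answers to numeric codes (1 = favor, 0 = oppose).
CODEBOOK = {"Favor": 1, "Oppose": 0}

views = ["Favor", "Oppose", "Favor", "Favor", "Oppose"]

# Recode every text answer to its numeric value.
coded_views = [CODEBOOK[answer] for answer in views]

# Invert the codebook so codes can be turned back into labels when reading output.
LABELS = {code: label for label, code in CODEBOOK.items()}
decoded = [LABELS[code] for code in coded_views]

print(coded_views)
print(decoded)
```

Keeping the codebook next to the data is one way to soften the disadvantage noted above: the numbers can always be translated back to their meaning.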


Review

• Measurement: The task of gathering information that characterizes or represents a social phenomenon

• Q: What is “Unit of Analysis”?
– Answer: The type of thing about which we are collecting information

• Q: What are 3 measurement scales? Examples?
• Nominal
• Ordinal
• Interval / Ratio


Review: Measurement Problems

• Problems that arose in survey given last class:

• Question 10: What transportation do you generally use to get to class?
– Answer: Both “car” and “public transportation”

• Question 9: How many miles away do you live?
– Answer: 4 blocks

• Question 6 (Liberal or conservative, from 1-10)
– Answer: “3 or 4”

• How many CDs do you own?
– Answer: “Over 100”


Today’s Class: Describing Information

• Tools for describing a single variable:

• List, Frequency lists, charts, histograms

• Characterization of “Typical” cases
– Ex: Mean (“average”), Mode, Median

• Characterization of Variation
– Ex: Min, Max, Variance, etc.


Listing Variables

• Lists: Values of a variable for all cases

• Looking at the “raw data”

• Report command in SPSS
– Or just look at data in the SPSS data editor

• Advantages:
– Easy
– Gives a rich description – you can see every case

• Disadvantages:
– Cumbersome for large datasets
– If data involves complex coding, you may not be able to interpret it visually


Frequency Lists

• Frequency Lists: Tables that show how many cases take on a particular value
– Also called “frequencies”, “frequency distributions”

• Examples:
– Congressional vote: How many “Yes” vs “No”?
– Social class: How many = low, middle, upper?
– Age: How many = 1 year old, 2 years, … 100 years?

• Relevant SPSS Command: Frequencies


Example from SPSS

• Note: Men coded as 1, Women coded as 2

GENDER            Freq.   %       Valid %   Cuml. %
1.00              6       33.3    35.3      35.3
2.00              11      61.1    64.7      100.0
Total             17      94.4    100.0
Missing: System   1       5.6
Total             18      100.0
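The columns of a frequency table like this can be reproduced with a short Python sketch. It is not SPSS code; it assumes the missing case is stored as `None`, and the function name `frequency_table` is hypothetical.

```python
from collections import Counter

# 6 men (coded 1), 11 women (coded 2), and 1 missing value (None).
gender = [1] * 6 + [2] * 11 + [None]

def frequency_table(values):
    """Return rows of (value, freq, percent, valid_percent, cumulative_percent)."""
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq
        rows.append((
            value,
            freq,
            round(100 * freq / len(values), 1),       # percent of all cases
            round(100 * freq / len(valid), 1),        # percent of non-missing cases
            round(100 * cumulative / len(valid), 1),  # running total of valid percent
        ))
    return rows

for row in frequency_table(gender):
    print(row)
```

The difference between the percent and valid percent columns is only the denominator: all 18 cases versus the 17 cases with a recorded value.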


Frequency Lists

• Advantages:
– Useful for large datasets
– Fairly rich description of data – once you get used to reading them…

• Disadvantages:
– Unlike a list, you can’t see which case is which or compare with other variables
– Best for nominal and some ordinal variables only
– Not useful if all values are unique, such as: rank orderings, many continuous variables


Visual Representations: Bar Charts

• “Bar Chart”
– Essentially a visual representation of a frequency list
– Height of bars represents number of cases
– For nominal & some ordinal variables only
• Again, rank orders and continuous measures don’t work

• “Pie Chart”
– Similar, but divides up a circle to show frequency

• All accessible within Frequencies menu
– Just click Chart button
– Or, look under Graphs menu


SPSS Bar Chart

[Figure: bar chart of GENDER. Horizontal axis: categories Missing, 1.00, 2.00. Vertical axis: Count, 0 to 12.]


A Similar Approach: Pie Chart

[Figure: pie chart of GENDER. Slices: 1.00, 2.00, Missing.]


Graphing Continuous Measures

• Issue: Continuous variables have an infinite number of possible unique values.
• Cases rarely have the exact same value
• Bar chart would have many bars of height 1
• What would you do about zeros?

• Solution: use “grouped data”
• Sets of similar values must be “grouped”
– Lumped together by constant intervals
– Note: Information is destroyed in the process

• Result: A “Histogram”
– Height of bar represents number of cases within a given range of values
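Grouping can be sketched in a few lines of Python. This is an illustration under stated assumptions: the ages are invented, the intervals start at multiples of the width (SPSS may center them differently), and `group_counts` is a hypothetical helper.

```python
from collections import Counter

def group_counts(values, width):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter()
    for v in values:
        lower = (v // width) * width   # lower bound of the interval containing v
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Hypothetical ages, invented for illustration.
ages = [18, 21, 22, 23, 29, 31, 34, 35, 41, 58]

print(group_counts(ages, 5))
```

After grouping, the three cases aged 21, 22, and 23 are indistinguishable: this is the information that is destroyed. Intervals with no cases (here 45 to 49 and 50 to 54) simply do not appear unless they are filled in with zeros.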


Histogram: Age (5-year interval)

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 20.0 to 90.0 in steps of 5. Vertical axis: Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533.]

This doesn’t mean that 200 cases are exactly 30 years old… Rather, 200 cases fall in the 5-year interval around age 30 (from 27.5 to 32.5)


Histograms: Interval Width

• Previous example: People were grouped by age, within 5-year intervals
– Bars represented ages 17.5-22.5, 22.5-27.5 and so on

• It is also possible to group people within 1-year intervals – or 50-year intervals
– Small interval = more bars in the histogram
– Wide interval = fewer bars in the histogram

• WARNING: Histograms look very different depending on how wide you set the intervals
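A rough Python sketch of how interval width changes the picture, using invented ages and a hypothetical helper `bar_count` that counts the non-empty bars for a given width:

```python
def bar_count(values, width):
    """Number of non-empty bars when values are grouped into intervals of this width."""
    return len({v // width for v in values})

# Hypothetical ages, invented for illustration.
ages = [18, 21, 22, 23, 29, 31, 34, 35, 41, 58]

for width in (1, 5, 20, 50):
    print(width, bar_count(ages, width))
```

With 1-year intervals every case gets its own bar; with 50-year intervals the same ten cases collapse into two bars.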


Histogram: Age (1-year interval)

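[Figure: histogram of AGE OF RESPONDENT with 1-year intervals. Horizontal axis: age, ticks at 20, 40, 60, 80. Vertical axis: Count, ticks at 10, 20, 30, 40.]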


Histogram: Age (20-year interval)

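[Figure: histogram of AGE OF RESPONDENT with 20-year intervals. Horizontal axis: age, ticks at 20, 40, 60, 80. Vertical axis: Count, ticks at 0, 200, 400, 600.]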


Histograms: Interval Width

• Changing the number of “bars” in the histogram alters the appearance of the graph
• Wide intervals/few bars result in greater simplification of data

• Suggestion
– 1. Try different intervals
• In SPSS, go to “interactive histogram”
– 2. Don’t over-interpret a crude histogram

• Another example: National Wealth
– Unit of analysis = country
– Variable = GDP per capita, a measure of wealth


Histogram: Wide Intervals

[Figure: histogram titled “National Wealth 1990”. Horizontal axis: Penn 56 RGDPCH 1990, bars labeled 1250.0, 4750.0, 8250.0, 11750.0, 15250.0, 18750.0. Vertical axis: 0 to 100. Std. Dev = 4915.68, Mean = 4810.4, N = 152.]


Histogram: Narrow Intervals

[Figure: histogram titled “National Wealth 1990”. Horizontal axis: Penn 56 RGDPCH 1990, bars labeled 0.0 to 20000.0 in steps of 2000. Vertical axis: Frequency, 0 to 50. Std. Dev = 4915.68, Mean = 4810.4, N = 152.]


Histograms

• Advantages:
• 1. Useful even for continuous measures
• 2. Preserves information on distribution of variable
• Both peaks and zeros are apparent

• Disadvantages:
• 1. Interval width can be a problem
• Too wide results in loss of information
• Too narrow results in too many bars – unreadable.


Interpreting Histograms: Age

• Try to interpret: What is this sample like?

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 40.0 to 90.0 in steps of 5. Vertical axis: Frequency, 0 to 400. Std. Dev = 10.74, Mean = 56.3, N = 1533.]


Interpreting Histograms: Age

• Try to interpret this histogram:

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 10.0 to 60.0 in steps of 5. Vertical axis: Frequency, 0 to 400. Std. Dev = 10.74, Mean = 26.3, N = 1533.]


Interpreting Histograms: Age

• Try to interpret this histogram:

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 20.0 to 100.0 in steps of 5. Vertical axis: Frequency, 0 to 70. Std. Dev = 23.06, Mean = 59.9, N = 1537.]


Measures of “Central Tendency”

• Often, it is important to assess the “typical” values of a variable

• Examples:
– We may wish to know how much money the typical family earns
– We may wish to know the age of the typical person in our dataset

• Solution: Conduct calculations to determine what values are “typical”

• However, this isn’t as easy as it sounds
– Consider some examples…


What is the “Center”?

[Figure: histogram titled “National Wealth 1990”. Horizontal axis: Penn 56 RGDPCH 1990, bars labeled 0.0 to 20000.0 in steps of 2000. Vertical axis: Frequency, 0 to 50. Std. Dev = 4915.68, Mean = 4810.4, N = 152.]


What is the “Center”?

[Figure: bar chart of GENDER. Horizontal axis: categories Missing, 1.00, 2.00. Vertical axis: Count, 0 to 12.]


The “Mode”

• The Mode = the value representing the largest number of cases -- called the “Modal” value

• Useful for Nominal, Ordinal variables

• Only useful for Continuous variables if you have grouped data into a histogram

• Otherwise, all values may very likely be unique

• Issue: Mode is not very helpful (even misleading) in certain circumstances

• Ex: If there are many peaks, or a single unusual one

• Ex: If the variable is distributed quite evenly.
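A small Python sketch of finding the mode, including the case where it is not helpful. The function name `modes` and the second dataset are invented for illustration; the gender counts are the 6 and 11 valid cases from the SPSS example.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

gender = [1] * 6 + [2] * 11          # valid cases only (missing case dropped)
unique_ages = [19, 23, 31, 44, 58]   # hypothetical: every value is unique

print(modes(gender))
print(modes(unique_ages))
```

For gender there is one clear modal value. For the ungrouped ages every value ties for first place, so the mode says nothing about what is typical.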


Mode: Example

[Figure: bar chart of GENDER. Horizontal axis: categories Missing, 1.00, 2.00. Vertical axis: Count, 0 to 12.]

Here, the mode is 2 (which corresponds to “female”)


Mode: Example

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 20.0 to 90.0 in steps of 5. Vertical axis: Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533.]

Here, the mode is 30 (though it might be different if the histogram had a different interval width)


Mode: Example

• In this case, the mode (45) is not helpful

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 20.0 to 100.0 in steps of 5. Vertical axis: Frequency, 0 to 70. Std. Dev = 23.06, Mean = 59.9, N = 1537.]


Median

• The Median = the value of the “middle case”
• Equal number of cases fall higher or lower

• Can be used for ordinal, continuous variables

• Advantages:
• 1. Not influenced by unusual peaks
• 2. Useful even in very even distributions

• Disadvantages:
• 1. Not useful for data spread in two distinct “clumps.”
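The median can be sketched in Python as below. One assumption goes beyond the slides: when the number of cases is even there is no single middle case, and this sketch follows the common convention of averaging the two middle values. The `two_clumps` data are invented.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

guns_owned = [0, 3, 0, 1, 1]                  # sorted: 0, 0, 1, 1, 3
two_clumps = [1, 1, 2, 2, 98, 99, 99, 100]    # hypothetical data in two clumps

print(median(guns_owned))
print(median(two_clumps))
```

The second result shows the disadvantage: the median lands between the two clumps, at a value that no case is anywhere near.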


Median Example

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: age, bars labeled 20.0 to 90.0 in steps of 5. Vertical axis: Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533.]

The median case is 42 years old. Half are older, half are younger!


Mean – “Average”

• The most well-known way of assessing the “middle”

• Calculated by adding values of all cases, then dividing by the total number of cases

• Advantages:
• Useful for continuous measures
• Not overly influenced by any single peak

• Disadvantages:
• Can be influenced by extreme values.
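The influence of extreme values is easy to see in a short Python sketch. The income figures are invented for illustration (in thousands of dollars).

```python
def mean(values):
    """Add the values of all cases, then divide by the number of cases."""
    return sum(values) / len(values)

incomes = [30, 35, 40, 45, 50]       # hypothetical incomes, in thousands
with_outlier = incomes + [1000]      # the same cases plus one extreme value

print(mean(incomes))
print(mean(with_outlier))
```

A single extreme case moves the mean from 40 to 200, a value larger than five of the six cases.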


Calculating the Mean: Variables

• Each column of a dataset is considered a variable

• We’ll refer to a column generically as “Y”

Person   # Guns owned (the variable “Y”)
1        0
2        3
3        0
4        1
5        1

Note: The total number of cases in the dataset is referred to as “N”. Here, N=5.


Equation of Mean: Notation

• Each case can be identified by a subscript
• Yi represents “ith” case of variable Y
• i goes from 1 to N
• Y1 = value of Y for first case in spreadsheet
• Y2 = value for second case, etc.
• YN = value for last case

Person   # Guns owned (Y)
1        Y1 = 0
2        Y2 = 3
3        Y3 = 0
4        Y4 = 1
5        Y5 = 1


Calculating the Mean

• Equation:

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$

• 1. Mean of variable Y represented by Y with a line on top – called “Y-bar”

• 2. Equals sign means equals: “is calculated by the following…”

• 3. N refers to the total number of cases for which there is data

• Summation (Σ) – will be explained next…


Equation of Mean: Summation

• Sigma (Σ): Summation
– Indicates that you should add up a series of numbers

$$\sum_{i=1}^{N} Y_i$$

The thing on the right is the item to be added repeatedly

The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.


Equation of Mean: Summation

• 1. Start with bottom: i = 1.
– The first number to add is Y-sub-1

$$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5$$

• 2. Then, allow i to increase by 1
– The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)

• 3. Keep adding numbers until i = N
– In this case N=5, so stop at 5
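The three steps map directly onto a loop. The Python sketch below uses the guns-owned variable from earlier (N = 5); the only wrinkle, noted in the comment, is that Python counts list positions from 0 rather than 1.

```python
Y = [0, 3, 0, 1, 1]   # guns owned, cases 1 through 5
N = len(Y)

total = 0
for i in range(1, N + 1):   # i = 1, 2, ..., N
    total += Y[i - 1]       # Python lists start at 0, so Y-sub-i is Y[i - 1]

print(total)
```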


Equation for the Mean: Example

Case   Num CDs
1      20   (Y1)
2      40   (Y2)
3      0    (Y3)
4      70   (Y4)

$$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + \dots + Y_N$$

• Variable: Number of CDs… How many CDs does a person own?


Equation of the Mean: Example

$$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 20 + 40 + 0 + 70 = 130$$

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \frac{1}{4}(130) = 32.5$$
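The same arithmetic can be checked in Python, using the four CD counts from the example (20, 40, 0, 70). The variable names are illustrative only.

```python
cds = [20, 40, 0, 70]   # Y1 through Y4: number of CDs owned by each case
N = len(cds)            # total number of cases

total = sum(cds)        # the summation: Y1 + Y2 + Y3 + Y4
y_bar = total / N       # the mean: (1 / N) times the sum

print(total, y_bar)
```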