CTSC Topics in Managing Data Seminar Series

Data Cleaning Part 1: Ranges and Distributions of Numeric Variables

Thank you for attending today’s seminar!

Please make sure to sign the registration sheet

Please complete an evaluation form and turn it in at the end of the seminar

Next Seminar: Wednesday, May 12, 2010, noon to 1 pm, Wolstein 6136

Topic: Questionnaire Design Part 3: Getting Data from Paper to Computer

Data Cleaning Part 1: Ranges and Distributions of Numeric Variables

Topics in Managing Data: A CTSC Seminar Series for Research Personnel

Sue Caban [email protected]
… Johnson [email protected]
… for Clinical Investigation
April 14, 2010

Topics in Managing Data: Data Cleaning Part 1: Numeric Variables

Introduction

• Housekeeping Items

• Topics in Managing Data Seminar Series

• Data Cleaning Seminars (n=4)

Housekeeping: During This Seminar …

• Think of yourself and the role you play in your research group

• Think of one or more of your research projects / studies that you support

• Think of ways that you work with data cleaning, and how this information applies to you

• Please ask questions!

Topics in Managing Data Seminar Series

• The CTSC (Clinical and Translational Science Collaborative) member institutions recognize the need for training in research data management and working with research data

• This is part of a series of monthly seminars that will cover a variety of topics related to working with research data
– Questionnaire Design
– Data Cleaning
– SQL Workshops / database design
– SAS Workshops / data management programming

Data Cleaning Seminar Series

• Data Cleaning seminars are designed to:
– Present basic principles and some specific examples
– Present techniques for preventing and cleaning up messy data

• “Data Cleaning” is a very broad topic
– April: Numeric Variables
– June: Missing Data
– October: Date and Time Data
– December: Character Data

Data Cleaning Seminar Series

• There is always some element of untidiness in any dataset

• Data cleaning can be particularly problematic in heterogeneous populations or with highly skewed data, e.g.,
– Study that includes both adult and child participants
– Study that includes participants with serious comorbidities, obesity, etc.

Overview: Today’s Material

• Data Cleaning Process

• Summary

• Questions and Discussion

Big Picture: Data Cleaning

• Most research data are collected and stored as numbers, which is what we analyze

• Most studies have a clear primary aim and statistical plan

• A study may have additional secondary and tertiary data points collected

Big Picture: Data Cleaning

• Use pragmatic judgment when choosing how aggressively to examine your data

– “Not all data points are created equal. Some are special, including those related to your study’s primary aims.”

– “All questions are equally irrelevant until they suddenly become of interest.”

• House cleaning for company: Having friends over vs. selling your house

Data Cleaning Process: Overview

• Determining range of expected values

• Prevention

• Diagnosis

• Treatment

Determining Range of Expected Values

• Every variable has an expected set of values

• Categorical Variables
– Set of discrete values into which responses should fall
– Include valid response codes
– Remember to include any “missing” codes you have defined
– Tip: Use a different code to distinguish “pending” or “flagged” responses from “permanently missing”
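
As a rough illustration of an expected set for a categorical variable, the sketch below lists the valid response codes together with the missing-data codes and reports anything outside that set. The question name `q1`, the valid codes 1 to 4, and the sample records are assumptions made for the example; the missing codes follow the -8 / -9 / -1 scheme used elsewhere in this seminar.

```python
# Hypothetical codebook for one categorical question (illustrative values).
VALID_RESPONSES = {1, 2, 3, 4}
MISSING_CODES = {
    -8: "Not Applicable",
    -9: "Permanently Missing",
    -1: "Problem (needs to be reviewed)",
}
# The expected set is the valid responses plus every defined missing code.
EXPECTED = VALID_RESPONSES | set(MISSING_CODES)


def unexpected_values(records, field):
    """Return (id, value) pairs whose value falls outside the expected set."""
    return [(r["id"], r[field]) for r in records if r[field] not in EXPECTED]


records = [
    {"id": 101, "q1": 3},
    {"id": 102, "q1": -9},   # permanently missing: expected
    {"id": 103, "q1": 7},    # not a defined code: unexpected
    {"id": 104, "q1": -1},   # flagged for review: expected, but still pending
]
flagged = unexpected_values(records, "q1")
print(flagged)
```

Keeping the "pending" code (-1) separate from "permanently missing" (-9) means the pending rows can be listed again later without re-checking everything.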

Determining Range of Expected Values

• Continuous Variables
– Have an expected range of values into which responses should fall
– Expected vs. Unexpected
  Plausible – but not expected

[Figure: histogram of htcm (height in cm); x-axis from 145.0 to 205.0, y-axis Percent from 0 to 20.0]

• Plausible but outlying value for height

Determining Range of Expected Values

• Continuous Variables
– Valid vs. Invalid
  Physiologically plausible – BMI of 5000
  What would it take to have a BMI of 5000?

Determining Range of Expected Values

• What would it take to have a BMI of 5000?
– Height: 12 inches
– Weight: 1024 pounds
– (only occurs in role playing games!)
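
The arithmetic behind the BMI-of-5000 example can be checked with the standard imperial BMI formula, BMI = 703 × weight (lb) / height (in)². The function name below is only for illustration.

```python
def bmi_from_imperial(weight_lb, height_in):
    """Body mass index from pounds and inches: 703 * lb / in^2."""
    return 703.0 * weight_lb / height_in ** 2


# The slide's example: 12 inches tall, 1024 pounds.
bmi = bmi_from_imperial(1024, 12)
print(round(bmi, 1))  # close to 5000
```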

Determining Range of Expected Values

• Continuous Variables
– Clinically / medically relevant
  Systolic BP of 150 may not be unexpected and may not be vastly different statistically from the rest of the sample, but may have clinical relevance

Determining Range of Expected Values

• Continuous Variables:

– Confirmed
– Plausible
– Busted

Determining Range of Expected Values

• Expected set of values can change conditionally

• Be sure to check for internal consistency between and among questions in your data (and across data sources)
– Questions within skip patterns
– Questions that do not apply to all respondents
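
One way to picture an expected range that changes conditionally: in a study with both adults and children, the expected height range depends on the age group. The sketch below assumes hypothetical field names (`age`, `htcm`) and made-up ranges; real limits would come from the study protocol.

```python
# Hypothetical expected height ranges in cm by age group (illustrative only).
EXPECTED_HEIGHT_CM = {"child": (50.0, 180.0), "adult": (140.0, 205.0)}


def age_group(age_years):
    """Classify a participant; the cut-off of 18 is an assumption."""
    return "adult" if age_years >= 18 else "child"


def height_out_of_range(record):
    """True when height falls outside the range expected for the age group."""
    low, high = EXPECTED_HEIGHT_CM[age_group(record["age"])]
    return not (low <= record["htcm"] <= high)


participants = [
    {"id": 1, "age": 34, "htcm": 172.0},
    {"id": 2, "age": 6, "htcm": 118.0},   # expected for a child
    {"id": 3, "age": 41, "htcm": 118.0},  # same height, unexpected for an adult
]
conditional_flags = [p["id"] for p in participants if height_out_of_range(p)]
print(conditional_flags)
```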

Prevention

How do we keep bad values out of the data set from the start?

• Design the form well
– Bubbles rather than write-in number for categorical responses
– Appropriate skip rules (e.g., instructions for males to skip section on menopause)

Prevention: Design Form Well: Before

Prevention: Design Form Well: After

Prevention

How do we keep bad values out of the data set from the start?

• Review the forms prior to data entry
– Look over the form at or soon after data collection
– Review and confirm sections that should be completed differentially

Prevention

How do we keep bad values out of the data set from the start?

• During data entry
– Database design: Choose correct data type
– Build checks into data entry system
  Drop-down lists of acceptable values
  Checks or constraints for ranges
  Skip patterns
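
A minimal sketch of checks built into the database itself, using SQLite through Python's standard `sqlite3` module. The table name `visit`, the column names, and the limits are assumptions for illustration; other database systems express CHECK constraints in much the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visit (
        id  INTEGER PRIMARY KEY,                    -- correct data type per column
        q1  INTEGER CHECK (q1 IN (1, 2, 3, 4)),     -- list of acceptable values
        bmi REAL    CHECK (bmi > 0 AND bmi < 100)   -- hypothetical validity limits
    )
""")

# A valid row is accepted.
conn.execute("INSERT INTO visit (id, q1, bmi) VALUES (1, 2, 24.3)")

# An impossible BMI is rejected at entry time.
rejected = False
try:
    conn.execute("INSERT INTO visit (id, q1, bmi) VALUES (2, 2, 5000)")
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
print(rejected, row_count)
```

The limits here are deliberately wider than the expected range: a hard constraint blocks invalid values, while plausible but unexpected values still get in and are picked up later during diagnosis. Note that a NULL value passes a CHECK constraint in SQLite.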

Prevention

How do we keep bad values out of the data set from the start?

• During data entry (continued)
– Use codes to differentiate types of missing responses
  -8: Not Applicable
  -9: Permanently Missing
  -1: Problem (Needs to be reviewed)
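
Numeric missing codes have to be set aside before any summary is computed, otherwise they are treated as real measurements. The sketch below uses the three codes above; the systolic blood pressure values are made up for the example.

```python
from statistics import mean

MISSING_CODES = {-8, -9, -1}


def split_missing(values):
    """Separate observed measurements from coded missing responses."""
    observed = [v for v in values if v not in MISSING_CODES]
    missing = [v for v in values if v in MISSING_CODES]
    return observed, missing


sbp = [118, 132, -9, 150, -1, 124]   # hypothetical systolic BP readings
observed, missing = split_missing(sbp)

naive_mean = mean(sbp)        # wrong: the codes drag the mean down
clean_mean = mean(observed)   # mean of real measurements only
print(naive_mean, clean_mean, missing)
```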

Diagnosis

• How do we identify bad values once they are in the database?

Diagnosis

1. Check for values outside of expected range.

• Examples:
– BMI expected to lie between 15 and 50
  Query the database for BMI values less than 15 or greater than 50.
– Response to question can be 1, 2, 3, 4
  Query for values that are not in that list.

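Diagnosis

1. Check for values outside of expected range.

• Implementation: SAS

    proc print data=d;
      var id bmi;
      /* SAS treats a missing numeric value as smaller than any number, */
      /* so rows with missing bmi also satisfy bmi < 15 */
      where bmi < 15 or bmi > 50;
    run;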

Diagnosis

1. Check for values outside of expected range.

• Implementation: SQL

    -- rows where bmi is NULL are not returned by these comparisons
    select id, bmi
    from d
    where bmi < 15 or bmi > 50;
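
For readers who want to try both example queries, here is a small self-contained sketch that runs them against an in-memory SQLite database through Python's `sqlite3` module. The table `d` and column `bmi` follow the slides; the column `q1` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id INTEGER, bmi REAL, q1 INTEGER)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?, ?)",
    [(1, 22.4, 1), (2, 51.7, 4), (3, 12.9, 2), (4, 30.1, 6), (5, None, 3)],
)

# Continuous variable: values outside the expected range 15 to 50.
bmi_out = conn.execute(
    "SELECT id, bmi FROM d WHERE bmi < 15 OR bmi > 50 ORDER BY id"
).fetchall()

# Categorical variable: values that are not in the list of valid responses.
q1_out = conn.execute(
    "SELECT id, q1 FROM d WHERE q1 NOT IN (1, 2, 3, 4) ORDER BY id"
).fetchall()

# NULLs are skipped by the range comparison, so look for them separately.
bmi_null = conn.execute("SELECT id FROM d WHERE bmi IS NULL").fetchall()

print(bmi_out, q1_out, bmi_null)
```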

Diagnosis

2. Check for values within the expected range that are still questionable.
– Frequencies
– High and Low values
– Statistical Outliers

Diagnosis

2. Check for values within the expected range that are still questionable.
– Frequencies
  • More useful for categorical responses than continuous ones
  • SAS: FREQ Procedure
  • SQL: GROUP BY
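
A frequency table with GROUP BY might look like the sketch below, again run on an in-memory SQLite database from Python. The table `d`, the column `q1`, and the sample rows are assumptions; the stray value 44 stands in for the kind of keying error a frequency table makes easy to spot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id INTEGER, q1 INTEGER)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 2), (4, 3), (5, 2), (6, 44), (7, 1)],
)

# One row per distinct response, with the number of times it occurs.
freq = conn.execute(
    "SELECT q1, COUNT(*) AS n FROM d GROUP BY q1 ORDER BY q1"
).fetchall()

for value, n in freq:
    print(value, n)
```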

Diagnosis

2. Check for values within the expected range that are still questionable.
– High and Low Values
  • Looking at the highest or lowest 10% of values.
  • SAS: UNIVARIATE Procedure
  • SQL: SELECT TOP n, ORDER BY
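
The same idea in plain Python: sort the records and pull the lowest and highest 10% for review. The field name `htcm` follows the histogram in this deck; the height values and the helper name `extremes` are made up. (In SQL, `SELECT TOP n` is the SQL Server and Access spelling; some other systems use `LIMIT n` with `ORDER BY` instead.)

```python
import math


def extremes(rows, field, percent=10):
    """Return the lowest and highest `percent` of rows, ordered by `field`."""
    ordered = sorted(rows, key=lambda r: r[field])
    n = max(1, math.ceil(len(ordered) * percent / 100))
    return ordered[:n], ordered[-n:]


heights = [162.0, 171.5, 158.2, 180.3, 149.0, 175.1, 166.4, 201.0, 169.9, 173.3,
           164.8, 177.6, 160.1, 168.7, 172.2, 183.0, 156.5, 170.4, 167.3, 174.9]
rows = [{"id": i + 1, "htcm": h} for i, h in enumerate(heights)]

lowest, highest = extremes(rows, "htcm")
print([r["htcm"] for r in lowest], [r["htcm"] for r in highest])
```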

Diagnosis

2. Check for values within the expected range that are still questionable.
– Statistical Outliers
  • “Extreme” observations based on the overall distribution of that variable.
  • Deviation from center (Mean or Median)
  • Visualization (histogram, scatter plot)
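
One common way to measure deviation from the center is a median-based score. The sketch below uses the median absolute deviation (MAD) with the often-quoted constant 0.6745 and cut-off 3.5; that particular rule is a swapped-in rule of thumb for illustration, not something the seminar prescribes, and the height data are made up.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose median-based score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


heights = [162.0, 171.5, 158.2, 180.3, 149.0, 175.1, 166.4, 201.0, 169.9, 173.3,
           164.8, 177.6, 160.1, 168.7, 172.2, 183.0, 156.5, 170.4, 167.3, 174.9]
outliers = mad_outliers(heights)
print(outliers)
```

A flagged value is only a candidate for review. A height of 201 cm is plausible, so it would be confirmed against the source rather than changed.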

Diagnosis

2. Check for values within the expected range that are still questionable.
– Statistical Outliers

[Figure: histogram of htcm (height in cm); x-axis from 145.0 to 205.0, y-axis Percent from 0 to 20.0]

Diagnosis

3. Bivariate checks of internal consistency

• Be sure to check for internal consistency between and among questions in your data (and across data sources)
– Questions within skip patterns
– Questions that do not apply to all respondents
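
A skip-pattern check compares two variables at once. The sketch below reuses the earlier example of males skipping the menopause section and the -8 "Not Applicable" code; the field names `sex` and `menopause_age` and the sample records are hypothetical.

```python
NOT_APPLICABLE = -8


def skip_pattern_violations(records):
    """Return ids of males whose menopause item is not coded Not Applicable."""
    return [
        r["id"]
        for r in records
        if r["sex"] == "M" and r["menopause_age"] != NOT_APPLICABLE
    ]


records = [
    {"id": 1, "sex": "F", "menopause_age": 51},
    {"id": 2, "sex": "M", "menopause_age": NOT_APPLICABLE},
    {"id": 3, "sex": "M", "menopause_age": 49},  # should have been skipped
]
violations = skip_pattern_violations(records)
print(violations)
```

A violation does not say which of the two fields is wrong: in record 3 either the sex or the menopause answer may have been entered incorrectly, so both go back to the source document.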

Treatment

• What to do when we find bad values

Is the value real?
– Yes: annotate
– No: correct

Treatment

• What to do when we find bad values
– Confirm that the value is real
  • Verify entry
    – Make sure database matches source doc
  • If possible, verify collection
    – Ask lab to confirm
    – Ask lab to rerun test
    – Ask participant to confirm
    – Check external sources if available
– Ethical issues?
– Bias?

Treatment

• What to do when we find bad values
– Confirm that the value is real
– Correct if possible
– Otherwise note the value is real

• Create a table of values that have been confirmed

• Anticipate that your analyst will ask you about it anyway!
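
A table of confirmed values can be as simple as a lookup keyed on participant, variable, and value, so that a confirmed value is not queried again while a changed value is. The structure and the entries below are assumptions for illustration.

```python
# Hypothetical log of values that were checked and found to be real.
confirmed = {
    (8, "htcm", 201.0): "Verified against source document",
}


def still_open(flags, confirmed):
    """Drop flags that match a confirmed (id, variable, value) entry."""
    return [f for f in flags if f not in confirmed]


# Flags produced by the range and outlier checks (made-up examples).
flags = [(8, "htcm", 201.0), (12, "bmi", 5000.0)]
open_flags = still_open(flags, confirmed)
print(open_flags)
```

Because the value is part of the key, a later edit to participant 8's height would be flagged again instead of being silently accepted.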

Summary

Questions and Discussion

References

Cody’s Data Cleaning Techniques Using SAS Software. Ron Cody. SAS Press, Cary, NC. 1999.

A revised second edition, published in 2008, is available.
