CTSC Topics in Managing Data Seminar Series
Data Cleaning Part 1: Ranges and Distributions of Numeric Variables
Thank you for attending today’s seminar!
Please make sure to sign the registration sheet
Please complete an evaluation form and turn it in at the end of the seminar
Next Seminar: Wednesday, May 12, 2010, noon to 1 pm, Wolstein 6136
Topic: Questionnaire Design Part 3: Getting Data from Paper to Computer
Data Cleaning Part 1: Ranges and Distributions of Numeric Variables
Topics in Managing Data: A CTSC Seminar Series for Research Personnel
Sue Caban [email protected] Johnson [email protected] for Clinical Investigation April 14, 2010
Topics in Managing Data: Data Cleaning Part 1: Numeric Variables
Introduction
• Housekeeping Items
• Topics in Managing Data Seminar Series
• Data Cleaning Seminars (n=4)
Housekeeping: During This Seminar …
• Think of yourself and the role you play in your research group
• Think of one or more of your research projects / studies that you support
• Think of ways that you work with data cleaning, and how this information applies to you
• Please ask questions!
Topics in Managing Data Seminar Series
• The CTSC (Clinical and Translational Science Collaborative) member institutions recognize the need for training in research data management and working with research data
• This is part of a series of monthly seminars that will cover a variety of topics related to working with research data
– Questionnaire Design
– Data Cleaning
– SQL Workshops / database design
– SAS Workshops / data management programming
Data Cleaning Seminar Series
• Data Cleaning seminars are designed to:
– Present basic principles and some specific examples
– Present techniques for preventing and cleaning up messy data
• “Data Cleaning” is a very broad topic
– April: Numeric Variables
– June: Missing Data
– October: Date and Time Data
– December: Character Data
Data Cleaning Seminar Series
• There is always some element of untidiness in any dataset
• Data Cleaning can be particularly problematic in heterogeneous populations or with highly skewed data, e.g.,
– Study that includes both adult and child participants
– Study that includes participants with serious comorbidities, obesity, etc.
Overview: Today’s Material
• Data Cleaning Process
• Summary
• Questions and Discussion
Big Picture: Data Cleaning
• Most research data are collected and stored as numbers, which is what we analyze
• Most studies have a clear primary aim and statistical plan
• A study may have additional secondary and tertiary data points collected
Big Picture: Data Cleaning
• Use pragmatic judgment when choosing how aggressively to examine your data
– “Not all data points are created equal. Some are special, including those related to your study’s primary aims.”
– “All questions are equally irrelevant until they suddenly become of interest.”
• House cleaning for company: Having friends over vs. selling your house
Data Cleaning Process: Overview
• Determining range of expected values
• Prevention
• Diagnosis
• Treatment
Determining Range of Expected Values
• Every variable has an expected set of values
• Categorical Variables
– Set of discrete values into which responses should fall
– Include valid response codes
– Remember to include any “missing” codes you have defined
– Tip: Use a different code to distinguish “pending” or “flagged” responses from “permanently missing”
Determining Range of Expected Values
• Continuous Variables
– Have an expected range of values into which responses should fall
– Expected vs. Unexpected
Plausible – but not expected
[Figure: histogram of htcm (height); x-axis 145.0 to 205.0, y-axis Percent from 0 to 20.0]
• Plausible but outlying value for height
Determining Range of Expected Values
• Continuous Variables
– Valid vs. Invalid
Is the value physiologically plausible? Example: BMI of 5000
What would it take to have a BMI of 5000?
Determining Range of Expected Values
• What would it take to have a BMI of 5000?
– Height: 12 inches
– Weight: 1024 pounds
– (only occurs in role playing games!)
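The height and weight above can be checked by hand. The slide does not say which formula it used; the sketch below assumes the standard imperial-unit BMI formula, in which the factor 703 converts lb/in² to kg/m².

```latex
\begin{align*}
\mathrm{BMI} &= 703 \times \frac{\text{weight (lb)}}{\text{height (in)}^{2}} \\
             &= 703 \times \frac{1024}{12^{2}} \\
             &= \frac{719{,}872}{144} \\
             &\approx 4999
\end{align*}
```

Under that assumption, a one-foot-tall participant weighing about half a ton gives a BMI of roughly 5000, which is why such a value can be treated as invalid rather than merely unexpected.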
Determining Range of Expected Values
• Continuous Variables
– Clinically / medically relevant
Systolic BP of 150 may not be unexpected and may not be vastly different statistically from the rest of the sample, but may have clinical relevance
Determining Range of Expected Values
• Continuous Variables:
– Confirmed
– Plausible
– Busted
Determining Range of Expected Values
• Expected set of values can change conditionally
• Be sure to check for internal consistency between and among questions in your data (and across data sources)
– Questions within skip patterns
– Questions that do not apply to all respondents
Prevention
How do we keep bad values out of the data set from the start?
• Design the form well
– Bubbles rather than write-in numbers for categorical responses
– Appropriate skip rules (e.g., instructions for males to skip section on menopause)
Prevention: Design Form Well: Before
Prevention: Design Form Well: After
Prevention
How do we keep bad values out of the data set from the start?
• Review the forms prior to data entry
– Look over the form at or soon after data collection
– Review and confirm sections that should be completed differentially
Prevention
How do we keep bad values out of the data set from the start?
• During data entry
– Database design: Choose correct data type
– Build checks into data entry system
Drop-down lists of acceptable values
Checks or constraints for ranges
Skip patterns
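The three kinds of checks listed above can be sketched as column and table constraints. This is a minimal illustration, not the presenters' system: the table name visit, the columns q1 and menopause_age, and the BMI limits of 5 and 150 are all hypothetical, chosen to bound valid values rather than merely expected ones.

```sql
-- Hypothetical data entry table; names and limits are illustrative only.
CREATE TABLE visit (
    id            INTEGER PRIMARY KEY,
    sex           CHAR(1) NOT NULL
                  CHECK (sex IN ('M', 'F')),          -- list of acceptable values
    q1            INTEGER
                  CHECK (q1 IN (1, 2, 3, 4)),         -- categorical response codes
    bmi           DECIMAL(6, 1)
                  CHECK (bmi BETWEEN 5 AND 150),      -- range of valid values
    menopause_age INTEGER,
    -- Skip pattern: males must leave the menopause item empty.
    CHECK (sex <> 'M' OR menopause_age IS NULL)
);

-- Accepted: every value passes its check.
INSERT INTO visit (id, sex, q1, bmi, menopause_age)
VALUES (1, 'F', 2, 24.3, 51);

-- Each of the following would be rejected by a constraint:
-- INSERT INTO visit (id, sex, q1, bmi, menopause_age)
-- VALUES (2, 'M', 2, 5000, NULL);   -- bmi outside the valid range
-- INSERT INTO visit (id, sex, q1, bmi, menopause_age)
-- VALUES (3, 'M', 7, 27.0, NULL);   -- q1 not in the list
-- INSERT INTO visit (id, sex, q1, bmi, menopause_age)
-- VALUES (4, 'M', 1, 27.0, 49);     -- skip pattern violated
```

A hard constraint blocks the value outright, so it probably suits the valid range better than the narrower expected range, where a plausible but unusual value should be flagged for review instead. If numeric missing codes are stored in the same column, the constraint would also need to allow them.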
Prevention
How do we keep bad values out of the data set from the start?
• During data entry (continued)
– Use codes to differentiate types of missing responses
-8: Not Applicable
-9: Permanently Missing
-1: Problem (Needs to be reviewed)
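One consequence of numeric missing codes is that they must be set aside before summarizing or range-checking a variable: a query such as bmi < 15 would otherwise return every coded record. The sketch below assumes the codes above are stored in the bmi column of the table d used elsewhere in these slides; the sample rows are made up.

```sql
-- Hypothetical sample: table d with coded missing values in bmi.
CREATE TABLE d (id INTEGER PRIMARY KEY, bmi DECIMAL(6, 1));
INSERT INTO d (id, bmi) VALUES
    (1, 22.4), (2, -9), (3, 31.0), (4, -8), (5, -1), (6, 27.5);

-- How many records fall into each missing-data category?
SELECT status, COUNT(*) AS n
FROM (
    SELECT CASE
             WHEN bmi = -8 THEN 'Not applicable'
             WHEN bmi = -9 THEN 'Permanently missing'
             WHEN bmi = -1 THEN 'Problem (needs review)'
             ELSE 'Observed'
           END AS status
    FROM d
) AS coded
GROUP BY status;

-- Summaries should skip the codes so they do not distort the result.
SELECT AVG(bmi) AS mean_bmi
FROM d
WHERE bmi NOT IN (-8, -9, -1);
```

The first query also gives a quick count of records still waiting for review (code -1).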
Diagnosis
• How do we identify bad values once they are in the database?
Diagnosis
1. Check for values outside of expected range.
• Examples:
– BMI expected to lie between 15 and 50
– Query the database for BMI values less than 15 or greater than 50.
– Response to question can be 1, 2, 3, 4
– Query for values that are not in that list.
Diagnosis
1. Check for values outside of expected range.
• Implementation:
SAS
/* List records with BMI outside the expected range (15 to 50). */
/* Missing BMI (.) sorts below any number in SAS, so it is listed too. */
proc print data=d;
  var id bmi;
  where bmi < 15 or bmi > 50;
run;
Diagnosis
1. Check for values outside of expected range.
• Implementation:
SQL
select id, bmi
from d
where bmi < 15 or bmi > 50;
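The categorical example from the earlier slide (a response that can be 1, 2, 3, or 4) can be queried in the same way. This is a sketch with a hypothetical column q1 and made-up rows.

```sql
-- Hypothetical sample: q1 should only contain 1, 2, 3, or 4.
CREATE TABLE d (id INTEGER PRIMARY KEY, q1 INTEGER);
INSERT INTO d (id, q1) VALUES
    (1, 1), (2, 4), (3, 5), (4, NULL), (5, 0);

-- List responses that are not in the valid set.
-- NULL NOT IN (...) evaluates to unknown, so empty responses
-- must be requested explicitly with IS NULL.
SELECT id, q1
FROM d
WHERE q1 NOT IN (1, 2, 3, 4)
   OR q1 IS NULL;
-- With the sample rows above this returns ids 3, 4, and 5.
```

Whether an empty response counts as a problem depends on the study. If it does not, drop the IS NULL line.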
Diagnosis
2. Check for values within the expected range that are still questionable.
– Frequencies
– High and Low values
– Statistical Outliers
Diagnosis
2. Check for values within the expected range that are still questionable.
– Frequencies
• More useful for categorical responses than continuous ones
Diagnosis
2. Check for values within the expected range that are still questionable.
– Frequencies
• SAS: FREQ Procedure
Diagnosis
2. Check for values within the expected range that are still questionable.
– Frequencies
• SQL: GROUP BY
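A frequency table with GROUP BY might look like the following. The column q1 and the sample rows are hypothetical.

```sql
-- Hypothetical sample of a categorical response.
CREATE TABLE d (id INTEGER PRIMARY KEY, q1 INTEGER);
INSERT INTO d (id, q1) VALUES
    (1, 1), (2, 2), (3, 2), (4, 3), (5, 4), (6, 4), (7, 4), (8, 7);

-- One row per distinct value, with the number of records holding it.
SELECT q1, COUNT(*) AS n
FROM d
GROUP BY q1
ORDER BY q1;
```

Unexpected codes (the 7 in this sample) and categories that are suspiciously rare or common are easy to spot in the result.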
Diagnosis
2. Check for values within the expected range that are still questionable.
– High and Low Values
• Looking at the highest or lowest 10% of values.
• SAS: UNIVARIATE Procedure
• SQL: SELECT TOP n, ORDER BY
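The SELECT TOP approach can be sketched as below. TOP is SQL Server (and Access) syntax, which is assumed here because the slide names it. Other engines typically use LIMIT or FETCH FIRST instead. The sample rows are made up.

```sql
-- Hypothetical sample: ten BMI values in table d.
CREATE TABLE d (id INTEGER PRIMARY KEY, bmi DECIMAL(6, 1));
INSERT INTO d (id, bmi) VALUES
    (1, 22.4), (2, 31.0), (3, 27.5), (4, 15.2), (5, 48.9),
    (6, 24.1), (7, 29.8), (8, 19.6), (9, 35.3), (10, 26.7);

-- Highest 10% of values (SQL Server syntax).
SELECT TOP 10 PERCENT id, bmi
FROM d
ORDER BY bmi DESC;

-- Lowest 10% of values.
SELECT TOP 10 PERCENT id, bmi
FROM d
ORDER BY bmi ASC;
```

With ten rows, each query returns a single record. For a fixed number of records instead of a percentage, TOP 5 (or any n) works the same way.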
Diagnosis
2. Check for values within the expected range that are still questionable.
– Statistical Outliers
• “Extreme” observations based on the overall distribution of that variable.
• Deviation from center (Mean or Median)
• Visualization (histogram, scatter plot)
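Deviation from the mean can be computed directly in a query. The sketch below flags heights more than 2 standard deviations from the mean. The cutoff of 2 is an arbitrary choice for illustration, not a rule from the seminar, and the rows are made up. The variable name htcm is the one shown on the histogram axis. The variance is built from AVG alone so that the query does not depend on an engine-specific standard deviation function.

```sql
-- Hypothetical sample of heights in centimeters.
CREATE TABLE d (id INTEGER PRIMARY KEY, htcm DECIMAL(5, 1));
INSERT INTO d (id, htcm) VALUES
    (1, 160.0), (2, 162.0), (3, 165.0), (4, 168.0),
    (5, 170.0), (6, 172.0), (7, 175.0), (8, 204.0);

-- Population variance computed as mean of squares minus square of mean.
WITH stats AS (
    SELECT AVG(htcm)                            AS mean_ht,
           AVG(htcm * htcm) - AVG(htcm) * AVG(htcm) AS var_ht
    FROM d
)
-- Compare squared deviation with (2 SD)^2 = 4 * variance,
-- which avoids needing a square root function.
SELECT d.id, d.htcm
FROM d
CROSS JOIN stats
WHERE (d.htcm - stats.mean_ht) * (d.htcm - stats.mean_ht) > 4 * stats.var_ht;
-- With the sample rows above this flags id 8 (204.0 cm).
```

In a small sample a single extreme value inflates the standard deviation, so a stricter cutoff such as 3 would miss the 204 cm record here. A flagged value is a candidate for review, not automatically an error.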
Diagnosis
2. Check for values within the expected range that are still questionable.
– Statistical Outliers
[Figure: histogram of htcm (height); x-axis 145.0 to 205.0, y-axis Percent from 0 to 20.0]
Diagnosis
3. Bivariate checks of internal consistency
• Be sure to check for internal consistency between and among questions in your data (and across data sources)
– Questions within skip patterns
– Questions that do not apply to all respondents
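A skip-pattern check compares two variables at once. The sketch below reuses the menopause example from the Prevention slides. The table survey and its columns are hypothetical, and it assumes a skipped item is stored as NULL. If a code such as -8 (Not Applicable) is used instead, the condition would test for that code.

```sql
-- Hypothetical sample: males should have skipped the menopause item.
CREATE TABLE survey (
    id            INTEGER PRIMARY KEY,
    sex           CHAR(1),
    menopause_age INTEGER
);
INSERT INTO survey (id, sex, menopause_age) VALUES
    (1, 'F', 51), (2, 'M', NULL), (3, 'M', 49), (4, 'F', NULL);

-- Records where the skipped section was answered anyway.
SELECT id, sex, menopause_age
FROM survey
WHERE sex = 'M'
  AND menopause_age IS NOT NULL;
-- With the sample rows above this returns id 3.
```

Either variable could be the one in error (the sex code or the menopause answer), so the source document should be checked before correcting.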
Treatment
• What to do when we find bad values
Is the value real?
– Yes: annotate
– No: correct
Treatment
• What to do when we find bad values
– Confirm that the value is real
• Verify entry
– Make sure database matches source document
• If possible, verify collection
– Ask lab to confirm
– Ask lab to rerun test
– Ask participant to confirm
– Check external sources if available
– Ethical issues?
– Bias?
Treatment
• What to do when we find bad values
– Confirm that the value is real
– Correct if possible
– Otherwise note the value is real
• Create a table of values that have been confirmed
• Anticipate that your analyst will ask you about it anyway!
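One way to keep such a table is sketched below. The table confirmed_values and its columns are hypothetical, as are the sample rows. The range check from the Diagnosis slides is rerun with confirmed values excluded, so only unresolved records come back.

```sql
-- Hypothetical study data and a table of values confirmed as real.
CREATE TABLE d (id INTEGER PRIMARY KEY, bmi DECIMAL(6, 1));
INSERT INTO d (id, bmi) VALUES
    (1, 22.4), (2, 52.3), (3, 61.0), (4, 13.9);

CREATE TABLE confirmed_values (
    id              INTEGER      NOT NULL,  -- participant id
    var_name        VARCHAR(32)  NOT NULL,  -- which variable was checked
    confirmed_value DECIMAL(10, 2) NOT NULL,
    note            VARCHAR(200),           -- how it was confirmed
    PRIMARY KEY (id, var_name)
);
INSERT INTO confirmed_values (id, var_name, confirmed_value, note)
VALUES (2, 'bmi', 52.3, 'Matches source document; verified with clinic');

-- Out-of-range BMI values that have not yet been confirmed.
SELECT d.id, d.bmi
FROM d
LEFT JOIN confirmed_values c
       ON c.id = d.id
      AND c.var_name = 'bmi'
      AND c.confirmed_value = d.bmi
WHERE (d.bmi < 15 OR d.bmi > 50)
  AND c.id IS NULL;
-- With the sample rows above this returns ids 3 and 4.
```

Matching on the value as well as the id means that if a confirmed value is later changed in the database, the record is flagged again. The note column gives the analyst an answer when the question comes up.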
References
Cody’s Data Cleaning Techniques Using SAS Software. Ron Cody. SAS Press, Cary, NC. 1999.
A revised second edition, published in 2008, is available.