1. preparing research datasets data request data cleaning dataset preparation documentation beverly...

27
1

Post on 20-Dec-2015

225 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

1

Page 2: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Preparing Research Datasets

Data Request Data Cleaning Dataset Preparation Documentation

Beverly Musick2

Page 3: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Research Data RequestA concept proposal is a detailed plan for a

research project and typically includes:

• The aims of the study• Associated hypotheses• Statistical analysis plan• Description of the cohort• Specific data/variables needed

(example Hunt Proposal)3

Page 4: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

4

Page 5: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Steps to Fulfill Research Data Request1. Identify and resolve questions regarding

requirements (may be repeated)2. Determine data source and variables

needed3. Define any derived variables4. Define cohort and subset of

visits/observations to be included5. Data cleaning6. Dataset preparation7. Documentation

5

Page 6: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Raw Data CleaningFor data that are stored in Access, Excel, or text files, data cleaning should begin with the original table, spreadsheet or file.

•Back-up the original data files.•Eliminate blank records and any records used for testing.•Locate duplicate records and resolve.•For numeric variables, identify outliers by sorting and reviewing the overall minimum and maximum. This is particularly useful for continuous variables such as dates, ages, weights etc.•For categorical variables such as gender or marital status, sorting will reveal invalid response codes or use of mixed case (f, F, m, M for gender).•Review the frequency of missing data when records are sorted. Does it make sense that x records have no value for variable y?

6

Page 7: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Raw Data to SAS DatasetsCreate a SAS program that converts the database file(s) to

permanent SAS dataset(s). • For Access or Excel files can use ‘Proc Import’

PROC IMPORT OUT= WORK.demog DATATABLE= "tblDEMOG"

DBMS=ACCESS REPLACE; DATABASE="I:\Projects\Kenya\CFAR\

cfar.mdb"; dbpwd=‘password' ;

RUN;

• For text files can write specific input statement data copd ; infile 'c:\kenya\hiv\copd.txt' ; input @1 patientid $9. @@ ;run ; 7

Page 8: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

SAS Dataset Creation

• Merge or append (concatenate) tables as necessary.

• Double-check the merging process by looking at the number of observations in each dataset before and after the merge.

• The number of records is dependent on the overlap among the datasets. This relationship should be known in advance and the expected outcome confirmed.

8

Page 9: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Merge Exampledata vis1 ; set h.visitDemo(keep=patient_id apptdate age weight height bmi

cd4) ; if patient_id in (1,2,3,4,5) ;run ;NOTE: There were 933654 observations read from the data set H.visitDemo.NOTE: The data set WORK.VISIT has 71 observations and 7 variables.

data vis2 ; set h.visitDemo(keep=patient_id apptdate clinic hgb sao2) ; if patient_id in (4,5,6) ;run ;NOTE: There were 933654 observations read from the data set H.visitDemo.NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.

data bothvis ; merge visit vis2 ; by patient_id apptdate ;run ;NOTE: There were 71 observations read from the data set WORK.VIS1.NOTE: There were 46 observations read from the data set WORK.VIS2.NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables. 9

Page 10: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

SAS Dataset Creation (cont.)

• Confirm that the total number of variables in the merged dataset is correct.

• The number should be the sum of all variables minus the (number of key fields*(number of datasets in merge minus 1)).

In the previous example: 7 + 5 – 2*(2-1) = 10

• If the number of variables is less than this, then you know that you have the same variable(s) in one or more of the datasets. This should be strictly avoided.

10

Page 11: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Merge Exampledata vis1 ; set h.visitDemo(keep=patient_id apptdate age weight height bmi

cd4) ; if patient_id in (1,2,3,4,5) ;run ;NOTE: There were 933654 observations read from the data set H.visitDemo.NOTE: The data set WORK.VISIT has 71 observations and 7 variables.

data vis2 ; set h.visitDemo(keep=patient_id apptdate clinic hgb sao2) ; if patient_id in (4,5,6) ;run ;NOTE: There were 933654 observations read from the data set H.visitDemo.NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.

data bothvis ; merge visit vis2 ; by patient_id apptdate ;run ;NOTE: There were 71 observations read from the data set WORK.VIS1.NOTE: There were 46 observations read from the data set WORK.VIS2.NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables. 11

Page 12: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

SAS Dataset Creation (cont.)Always review the SAS log ERROR, WARNING, and NOTE messages.

The following messages are often overlooked but do require action:

"NOTE: MERGE statement has more than one data set with repeats of BY values.“• This indicates that one or more of the datasets that you are trying to

merge contains multiple observations that are not uniquely distinguishable based on the variables listed in the by statement. Merged dataset will contain spurious and unexpected results. Further processing should not continue until this note has been resolved.

“NOTE: Variable _____ is uninitialized”“NOTE: Variable _____ has never been referenced”

• These indicate that the variable has not been properly defined. Many times a variable name has just been misspelled.

“NOTE: Character values have been converted to numeric values…”• This indicates that SAS has automatically converted a character

variable to numeric. Because unexpected results can occur, it’s best to do the conversion manually with the input or put function.

“WARNING: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results.”• This indicates that the by variable is not consistent across all data sets.

12

Page 13: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

SAS Dataset Creation (cont.)To create permanent datasets for analysis:

• Recode missing values used in the raw data tables/files to appropriate SAS missing values. For example, if 9's were used to indicate missing data for numeric fields in a data table then these should be converted to .'s.

• Calculate appropriate summary scores (ex. AUDIT-3, BMI)

• Calculate differences between dates such as time from enrollment to ART initiation.

• Label all calculated and created variables.

• Attach formats to the variable values where necessary.13

Page 14: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Cleaning Data in SASCreate a cleanup program.

• Generate frequencies, means, and univariates to better understand the dataset and to check for invalid data.

• Plot the data.

• For the numeric and date fields look at minimums and maximums to verify all values are within expected range.

• Locate duplicate records and resolve.

• Compare fields when appropriate (i.e. dob and age, confirm date of initial visit < date of follow-up).

14

Page 15: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Cleaning Data in SAS (cont.)

• Identify important fields such as summary scores and verify their values.

• Merge all longitudinal datasets to identify date inconsistencies, variable format inconsistencies, and to locate missing questionnaires.

• Merge cross-sectional (demographics) dataset with longitudinal datasets to identify subjects in one but not the other.

15

Page 16: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

SAS Program Files

• Save all logs and outputs from SAS programs especially when creating analysis datasets for publication

• Naming conventions – studyx.sas, studyx.log, studyx.lst

• Only the program that generates the permanent dataset should overwrite it.

• Never overwrite a permanent dataset (even with a proc sort) from any other program.

16

Page 17: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Documentation• Internally document SAS programs. At minimum

include file name, location, purpose, author, date, and revisions.

• May be helpful to include the names of any permanent SAS datasets created within the program

• All SAS printouts should have at least one title, which includes the project name (ex. title ‘Treatment Interruptions Analysis Dataset’ ;)

• It’s helpful to use the footnote option to display the path and file name of the SAS program on the listing (ex. options footnote ‘R:\AMPATH\Research\Braitstein\TxInt\TxInterrupt.sas’ ; )

17

Page 18: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Documentation (cont.)• If any variable values have been formatted,

include a copy of the “proc format” section in the documentation.

• Generate form keys.

• Provide a description of any variables included in the datasets that are not found on the form keys.

18

Page 19: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Summary Score DocumentationDetailed algorithms of how summary scores are calculated should include the following:

a. which variables are used to calculate which summary scores

b. which variables (if any) are recoded and howc. what is the minimum number of non-missing

items needed to calculate the scored. how are missing values addressed. Typically

when calculating a total or sum score the mean should be imputed for missing data. If the summary score is a mean itself then the missing data can be ignored. In both of these cases it is essential that c. above is followed and that summary scores are coded as missing if there is insufficient data to calculate.

e. what is the meaning of the score and how is it scaled. Indicate the possible range and how a high score differs from a low score. For example include something like “Higher score indicates more depression”.

19

Page 20: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Dataset Cover SheetNotes on Analysis Datasets

Project Name: Principal Investigator: Date of Original Data Request {please attach}: Datasets Created: Name and Location of (SAS) Program used to Generate Datasets: Creation Date: Created By: Biostatisticians: Cohort: Derived Variables (name of variable, coding, precise description): SAS Formats:proc format Preliminary Statistics:

20

Page 21: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

PracticumCreate a Data Cover Sheet for dataset created

during Programming Standards Practicum (male patients with at least 2 CD4’s)

21

Page 22: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

SAS General Notes

• If the study is longitudinal, at least two datasets are needed: one containing the demographics and other information which does not change over time; and one containing the data for multiple time points.

• Never put cross-sectional variables such as gender in the longitudinal dataset.

• Format all date fields with 4-digit year (ddmmyy10. or date9.)

• Choose data type numeric whenever possible.

22

Page 23: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Distributing SAS Datasets

• If possible, have another data manger review the datasets and documentation before distributing

• The following should be included:– The form keys– All appropriate SAS datasets (should have the extension .sas7bdat)– Dataset Cover Sheet– Latest Data Request Form– Any other documents or notes which would further explain the data.

23

Page 24: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Distributing SAS Datasets (cont.)

In most cases the following should not be distributed:

• Any Protected Health Information (PHI) such as study subject’s name, address, phone numbers, social security number, hospital id number. Date of birth should only be included if absolutely necessary. But usually age can be calculated and given instead.

• SAS generation programs. These often contain PHI.

24

Page 25: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

File Maintenance & ArchivingFor your own records at minimum, you should have:

• A copy of everything you give to the biostatistician or investigator and the date given.

• A copy of the log of all the SAS programs especially those that create any

permanent SAS datasets which were passed along to others

• Grant protocols, meeting notes, scoring algorithms, instructions for data entry, corrections made, etc.

• It may be helpful to maintain a subdirectory that exactly mirrors the subdirectory of the pc where the data is actually being entered. This subdirectory would include all the RDMS programs, format files, and tables.

• For longitudinal studies in particular, it is important to archive datasets and SAS programs/logs, which were used for analysis for abstracts, papers, grant proposals, and other publications.

25

Page 26: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

Data Managers Working with Investigators and Biostatisticians

• Attend study meetings• Date all documents and meeting notes• Comment on proposed study changes• Understand the statistical analysis plan• Review statistical reports (preferably before

presented to research team)• Review and critique abstracts/manuscriptsYour contribution is EXTREMELY important!

26

Page 27: 1. Preparing Research Datasets Data Request Data Cleaning Dataset Preparation Documentation Beverly Musick 2

27