A Beginner’s Quickstart Guide to Cleaning Data with Stata - Kirkwood P. Donavin

Stata Basics

1. Create a do-file: Open the Stata application and click the button labeled New Do-file Editor, or press CTRL+9. Below is a common way to begin your do-file. Copy and paste it into your new do-file.

/*Author: MyName

Notes: */

**************************************************************************

/* The following block of commands goes at the start of nearly all do-files */

clear

cd "C:\Users\kirkwood.donavin\Google Drive\DATA"

log using logs\INSERT_DOFILE_NAME.log, append

display "$S_DATE $S_TIME"

**************************************************************************

After creating your new do-file, save it as DATA_NAME_clean.do inside a folder called do_files (see below for file structure instructions). At the beginning of DATA_NAME_clean.do, set your current directory with the Stata command cd, using the address of another folder called DATA. Here is an example:

cd "C:\Users\kirkwood.donavin\Google Drive\DATA"

You will need to copy DATA’s address from the Windows Explorer address bar or from the Get Info window on a Mac. Paste the address within quotes following the command cd. Place the do_files folder in DATA. You also might want a folder called logs; create it and place it in DATA as well. When calling a file from one of these or other subfolders within the current directory, DATA, you must specify the subdirectory with the syntax subdirname\filename (see below).
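A possible refinement of the header above, offered as a sketch rather than as part of the original template: if a do-file stops partway through, its log stays open, and the next run halts at log using with a "log file already open" error. Putting capture log close before log using avoids this. The cd line is left out here so the block runs from whatever the current directory is; in a real do-file it would follow clear as in the template. DATA_NAME_clean.log is a placeholder name, and the path uses a forward slash, which Stata accepts on Windows as well as on a Mac.

```stata
* Sketch of a re-runnable do-file header (file names are placeholders)
clear
capture log close        // close a log left open by an earlier run; no error if none is open
capture mkdir logs       // create the logs folder if it does not exist yet
log using "logs/DATA_NAME_clean.log", append
display "$S_DATE $S_TIME"

* ... cleaning commands go here ...

log close
```

Closing the log at the end of the do-file means the file on disk is complete as soon as the run finishes.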

2. Opening and Saving a .dta file

.dta files are the Stata-specific data file type. Save uncleaned .dta files in a folder called raw and place this within your DATA folder. If the data set is called DATA_NAME_raw.dta, then you call it into Stata with the following command in your do-file:

use raw\DATA_NAME_raw.dta

You may save a cleaned data file in a folder called clean as the following:

save clean\DATA_NAME_clean.dta

And that’s it. You’re ready to use Stata now!
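When a do-file is run more than once, the two commands above can stop with an error: use refuses to load a file while changed, unsaved data are in memory, and save refuses to overwrite an existing file. The sketch below adds the clear and replace options to handle both cases. It is an illustration, not the author's code; the auto data set that ships with Stata stands in for a raw file so that the block runs on its own.

```stata
* Sketch: open and save with the clear and replace options.
* The auto data set (shipped with Stata) stands in for DATA_NAME_raw.dta.
capture mkdir raw
capture mkdir clean
sysuse auto, clear
save "raw/DATA_NAME_raw.dta", replace       // stand-in raw file, for this example only

use "raw/DATA_NAME_raw.dta", clear          // clear: drop whatever data are in memory first
* ... cleaning commands go here ...
save "clean/DATA_NAME_clean.dta", replace   // replace: overwrite the file from an earlier run
```

Only the cleaned file should be saved with replace in your own work. The raw file is written here solely to make the example self-contained; real raw data should never be overwritten.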

3. Useful Stata Commands: The following are generally useful when working in Stata, and you will benefit from using them while cleaning your data.

• do - In order to do anything in Stata, you must do it. Seems simple, right? Within your do-file, select the code you’d like to do and click the button called Execute Selection (do) or press CTRL+D. If you click the do button without selecting code, you will do all the code in the file.

• help - The command help [command] returns instructions for the use of any command, along with examples. Use this whenever you don’t understand how to use a command. Note that the brackets are not typed when using the code (e.g. help preserve).

• preserve & restore - The preserve command takes a snapshot of the data, so you can subsequently do anything you want to the data. The restore command then reverts the data to the state they were in at the time of preserve. This is handy for experimenting with code.

• rename & label - The command rename [varname] [newname] will rename [varname] to [newname] (e.g. rename d02 diar). The command label variable [varname] ["newlabel"], or simply label var [varname] ["newlabel"], will replace the variable label (e.g. label var diar "The number of times [NAME] has had diarrhea in the past three months"). The label allows us to be more descriptive regarding what the variable measures. For a survey measure, the label should be the question from the survey, verbatim.


Figure 1: File Structure

DATA
    raw
    clean
    do_files
    logs

• destring - The destring [varnames], replace command converts variables that are in string format (basically text) to a numeric format (so you can manipulate numbers). Some variables cannot be destringed as is because they contain non-numeric values; that just means some of the entries are text rather than numbers. You will have to find those strings, figure out whether they hold information you want to keep, and then convert them to numbers or missing values based on what you decide. You’ll have to do some critical thinking here.

• quietly - The quietly command hides any Stata output from the output window (main window). It is used before another Stata command, such as quietly destring . . . This may be handy if you are running a lot of code and don’t care to see all of the output (e.g. you are changing 50 variable names and labels).

• /*Commenting Out*/ - You will want to write many things in your Stata do-files that you don’t want Stata to try to use as code. Anything between /* and */ will be ignored by Stata. Use this to write your name at the top of your do-file and to record within the do-file any outliers or errors you find in the data.

• sort - The sort [varlist] command sorts the observations from least to greatest by the values of the listed variables. sort begins with the first variable written; within equal values of that variable, it sorts by the second variable written, and so forth.

• browse - The browse (or br) [varlist] [if] command shows, in the data browser, the values of the listed variables for observations that meet the if condition. For example, br ID hhmemberid fever if fever > 92 opens the data browser and shows the household ID, household member ID and number of cases of fever in the past 3 months for responses over 92 cases only (i.e. fever every day for the past 3 months).
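Several of the commands above can be tried together on a tiny data set. The block below is a minimal sketch: the four observations are made up for illustration, and the real() function (which returns missing when a string is not a number) is one possible way of locating the entries that stop destring from working, not necessarily the approach the author has in mind.

```stata
* Sketch on a tiny made-up data set; all values here are hypothetical.
clear
input ID hhmemberid str8 d02
3001 1 "2"
3001 2 "0"
3002 1 "unknown"
3002 2 "99"
end

* rename & label
rename d02 diar
label var diar "The number of times [NAME] has had diarrhea in the past three months"

* Find the entries that prevent destring from converting the variable:
* real() returns missing when a string is not a number.
list ID hhmemberid diar if missing(real(diar)) & diar != ""

* preserve & restore: experiment, then go back to the snapshot
preserve
replace diar = "" if diar == "unknown"   // decide that "unknown" becomes missing
quietly destring diar, replace           // now succeeds; empty strings become .
summarize diar
restore                                  // diar is a string containing "unknown" again

* sort by household, then by member within household
sort ID hhmemberid
```

Because the experiment sits between preserve and restore, the data in memory end up exactly as they were before it, which makes it safe to test a conversion before committing to it in the do-file.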

File Organization:

As mentioned previously, you will want one main folder, called what you like, such as DATA, and four subfolders: raw, clean, do_files and logs. Place the .dta files of raw data into your raw folder. Once you have cleaned a .dta file, save it in clean. Save your do-files and logs in their respective folders.
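The subfolders can also be created from within Stata instead of through the operating system. The following is a small sketch that assumes the current directory is already the main folder (DATA in this guide); capture is used so that a folder that already exists does not stop the do-file.

```stata
* Sketch: create the four subfolders inside the current directory (assumed to be DATA).
* capture suppresses the error that mkdir raises when a folder already exists.
capture mkdir raw
capture mkdir clean
capture mkdir do_files
capture mkdir logs
dir                      // list the contents of the current directory to confirm
```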

Outlier & Error Detection Strategies:

There are three strategies you may employ to find errors and outliers.

1. Summary Statistics: The Stata command summarize, or just su (sum, used below, also works), returns simple summary statistics about the variable (or variable list) that follows it. For instance, below I summarize the variable diar, the number of incidences of diarrhea individuals have experienced in the previous three months. Notice the maximum value is 99, which indicates an “unknown” response. But what is the actual maximum number of diarrhea incidences?

sum diar

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        diar |      3989    2.002507    13.18031          0         99

Next, an if statement has been added to restrict the summary to values of diar below 90, which removes the “unknown” responses. Nothing looks out of the ordinary here.


sum diar if diar<90

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        diar |      3917    .2195558    .8654623          0         14

Note that you may summarize all your variables at once by simply including their names in the variable list following sum (e.g. sum diar fever vomit . . . malaria). This will save time when using this first strategy, though the next two strategies will need to be done one variable at a time.
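The first strategy can be sketched on a few made-up values, with 99 as the "unknown" code as in this guide. The block also illustrates one point worth keeping in mind when writing if conditions: Stata treats a numeric missing value (.) as larger than any number, so a condition such as diar > 90 also selects missing values unless they are excluded explicitly.

```stata
* Sketch with made-up values; 99 is the "unknown" code.
clear
input diar fever
0 1
2 0
14 99
99 3
. 2
end

summarize diar fever                  // several variables at once
summarize diar if diar < 90           // drops the 99; the missing value is not below 90 either
display "largest real value of diar: " r(max)

* Missing values count as larger than any number in comparisons.
count if diar > 90                    // counts the 99 and the missing value
count if diar > 90 & !missing(diar)   // counts only the 99
```

summarize leaves its results behind in r(), which is why r(max) can be displayed immediately afterward.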

2. Histograms: The Stata command histogram, or simply hist, will create a histogram of the variable that follows it. Histograms show the density of values for a given variable. In Figure 2, I create histograms for diar. Note that the same technique as above is used in the second histogram to remove “unknown” responses. The second histogram stops at 15, indicating the maximum value is somewhere nearby (something you know from using summarize earlier). Neither histogram raises any concerns about errors or outliers for diar.

Figure 2: Histograms

[First panel: hist diar. Density (y-axis) against D02 (x-axis, 0 to 100).]

[Second panel: hist diar if diar < 90. Density (y-axis) against D02 (x-axis, 0 to 15).]
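The two histogram commands can be tried on made-up values as follows. The discrete and frequency options in the second command are additions for illustration and are not used in the figure above: discrete gives each integer value its own bar, and frequency labels the y-axis with counts instead of densities, which some readers find easier to interpret for count data such as diar.

```stata
* Sketch with made-up values; requires a Stata session that can draw graphs.
clear
input diar
0
0
0
1
2
3
14
99
end

histogram diar                                    // the 99 code stretches the x-axis
histogram diar if diar < 90, discrete frequency   // one bar per integer value, heights are counts
```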

3. Tabulation: The Stata command tabulate, or simply tab, is much like a histogram in that it returns densities of the values of a variable. However, no graph is produced. Instead, a table reports the variable values, their frequencies, and their proportion relative to all observations of the variable.

Notice below in Figure 3 that you do not need to restrict the values of diar to clearly view the data. You can see the minimum and maximum and their frequencies. You can see how many missing values this variable had, 11 (this requires the missing option). You can see that just under 90% of the responses indicated zero incidences of diarrhea in the past three months. This final strategy does not raise any concerns, so you are done evaluating this variable.
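The effect of the missing option can be seen on a handful of made-up values. This is a minimal sketch: without the option, tabulate leaves missing values out of the table entirely, so they are easy to overlook.

```stata
* Sketch with made-up values.
clear
input diar
0
0
1
2
99
.
end

tabulate diar              // missing values are left out of the table
tabulate diar, missing     // adds a row for missing values
count if missing(diar)     // the number of missing values on its own
```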

Cleaning Procedure:

Follow these steps to clean your section of the data.

1. Rename variables something intuitive of your choosing. The best names are a string of the shortest abbreviations you can think of without sacrificing clarity (e.g. mar_stat for a variable containing individual marital status).

2. Re-label variables in a completely descriptive manner to minimize confusion in others who may need to use the dataset. If a survey was used to collect the data, use the survey questions exactly as they were written.

3. Evaluate your section’s variables for errors and outliers using the above strategies (you may need to destring).

4. Record all possible outliers and errors within your do-file using a standard format such as the one below. This will make it easier for someone else to review your cleaning work.


Figure 3: Tabulation

tab diar, missing

        D02 |      Freq.     Percent        Cum.
------------+-----------------------------------
          . |         11        0.29        0.29
          0 |      3,334       89.05       89.34
          1 |        129        3.45       92.79
         10 |          1        0.03       92.82
         14 |          2        0.05       92.87
          2 |         70        1.87       94.74
          3 |         78        2.08       96.82
          4 |         27        0.72       97.54
          5 |         19        0.51       98.05
          6 |          1        0.03       98.08
          7 |          5        0.13       98.21
          8 |          1        0.03       98.24
         99 |         66        1.76      100.00
------------+-----------------------------------
      Total |      3,744      100.00

/* SUSPECTED OUTLIERS & ERRORS

ID - Member ID - Variable Number = error

8045 - n/a - diar = $1

16027 - 2 - diar = 90

*/

5. Check the data source for possible corrections to outliers and errors. If the data were collected with a survey, this includes looking at hard copies of the survey for the outlier or error in question. If the outlier/error is a data entry error, it may be corrected with the replace command. The following corrects a mis-entered data point in an example dataset for community #3, household #1, household member #1:

replace diar=2 if ID==3001 & hhmemberid==1

6. After checking for data entry error, you may remove possible outliers in a Stata do-file using the replace command and the “.” missing value symbol. The following removes a possibly erroneous data point in an example dataset for community #3, household #1, household member #1:

replace diar=. if ID==3001 & hhmemberid==1

Additionally, do this for any “unknown” responses in your variables (e.g. 99, 98, 999 etc.). Such values are interpreted by Stata as actual observations of the variable. For example, diar==99 indicates to Stata that the individual had diarrhea 99 times in the past three months, even though it was meant to indicate diarrhea occurred (or didn’t) an unknown number of times. Here is an example:

replace diar=. if diar==99

The above code replaces every value of diar that was recorded as unknown (99) on the survey with a missing value.
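Steps 5 and 6 can be sketched together on made-up data. The block below uses replace for a single correction, as above, and then mvdecode, a built-in Stata command that converts listed numeric codes to missing in several variables at once; using it here is a suggestion, not the author's method. The IDs, values and the choice of 98 and 99 as "unknown" codes are hypothetical. The closing assert lines are one way to confirm that no unknown codes remain, since assert stops the do-file if the condition is ever false.

```stata
* Sketch on made-up data; IDs, values and codes are hypothetical.
clear
input ID hhmemberid diar fever
3001 1 20 2
3001 2 0 99
3002 1 99 1
3002 2 3 98
end

* Step 5: correct a data entry error found on the hard copy
replace diar = 2 if ID == 3001 & hhmemberid == 1

* Step 6: recode the "unknown" codes to missing in several variables at once
mvdecode diar fever, mv(99 98)

* Check the result; assert halts the do-file if any unknown code is left
assert !inlist(diar, 98, 99)
assert !inlist(fever, 98, 99)
summarize diar fever
```

mvdecode should only be applied to variables where the listed codes cannot be real answers. For a variable such as age, 98 or 99 might be a legitimate value and would need to be checked against the survey first.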

General Notes:

• You should save extra copies of your data in a separate folder. If something is removed that shouldn’t be, you can get it back. The Most Important Thing: make sure to save all such replacements in your Stata do-file.

• To look at the data in a .dta file, press the Data Editor (Browse) button in the main screen. Please make a habit of opening the Browse option rather than Data Editor (Edit). We don’t want to accidentally make changes to the data set.
