STATA Boot Camp Day 3: Advanced Data Manipulation + Summaries Nate Mercaldo Sarah Fletcher Mercaldo Andrew Wiese

Upload: nathan-glenn

Post on 29-Jan-2016

STATA Boot Camp Day 3: Advanced Data Manipulation + Summaries

Nate Mercaldo
Sarah Fletcher Mercaldo
Andrew Wiese

Objectives

• Advanced Data Management/Manipulation
  – missing data
  – operators
  – a few cool functions to create new variables

• Multivariate Statistical Summaries
  – generate multi-way tables (counts, means, other summaries)
  – generate multivariate figures (+ modifying figure aesthetics)

• Reproducible Research (do files)

Setup

• Start log file!

• Load birth weight data (birthweight_v2.dta, or birthweight_v2_11.dta if not using Stata 14)

• Variables
  – bwt, low = child’s birth weight, indicator of low birth weight
  – age, race, smoke, height, weight = mother’s age, race, smoking status, height and weight

Exercise 1

• Generate bmi = weight (lb) / height (in)^2 * 703

• Summarize bmi
  – min, max, mean (sd), median (IQR)

• Create a histogram of bmi

Exercise 1

    variable |       min       max      mean        sd       p50       iqr
-------------+------------------------------------------------------------
         bmi |  .0007453  1716.137  106.1197  360.3435  22.39452  3.201514
--------------------------------------------------------------------------
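The Exercise 1 steps can be sketched as follows. This is only an illustration on a small, hypothetical data set typed in with input, so the numbers will not match the course output; the variable names height and weight follow the slides.

```stata
* Toy data (hypothetical values), not the course file
clear
input height weight
65 130
62 110
68 155
64 120
66 142
end

* BMI from pounds and inches: weight / height^2 * 703
generate bmi = weight / height^2 * 703

* One-row summary: min, max, mean (sd), median (IQR)
tabstat bmi, statistics(min max mean sd median iqr) columns(statistics)

* Histogram of bmi
histogram bmi
```

With the course data, the same three commands would be run after loading the .dta file instead of the input block.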

Huh?

• Anything odd about the generated summaries?
• Typically we receive data sets that have NOT been “cleaned”

. summarize height weight bmi

    Variable |   Obs       Mean   Std. Dev.        Min        Max
-------------+---------------------------------------------------
      height |   189   590.7354   2229.642         61       9999
      weight |   189   656.9788   2213.965        106       9999
         bmi |   189   106.1197   360.3435   .0007453   1716.137

Missing Values

• Stata codes numeric missing values as . (a period)

• Researchers code missing values as all sorts of things (e.g., -99, -9, 9, 99, NA, ?)

• Guesses on how missing values were coded in the lbw file?

• How can we replace these values with .?

Aside: Operators

• Arithmetic
  + (addition), - (subtraction), * (multiplication), / (division), ^ (power)

• Relational
  > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), == (equal), != (not equal)

• Logical
  & (and), | (or; the pipe key, look above the Enter key), and ! (not)

Missing Values

• How can we replace these values with .?

• Use logical operators (or logical expressions)!

generate height2 = height if height < 9999

generate weight2 = weight if weight < 9999

replace height = . if height == 9999
replace weight = . if weight == 9999
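As an alternative to one replace per variable, Stata's mvdecode command recodes a sentinel value to missing across a variable list. The sketch below assumes, as the slides suggest, that 9999 is the missing-value code; the three rows are hypothetical.

```stata
* Toy data (hypothetical): 9999 stands in for "missing"
clear
input height weight
65 130
9999 120
64 9999
end

* Recode the sentinel 9999 to Stata's system missing value (.)
mvdecode height weight, mv(9999)

* Arithmetic on a missing value yields missing, so bmi is . for rows 2 and 3
generate bmi = weight / height^2 * 703
list
```

Because missing values propagate through arithmetic, recomputing bmi after the recode drops the implausible values from the summary automatically.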

Missing Values

• Recompute bmi and summarize

    variable |       min       max      mean        sd       p50       iqr
-------------+------------------------------------------------------------
         bmi |  16.87565  28.52808  22.50763  2.037052  22.39452  2.842493
--------------------------------------------------------------------------

Other uses of relational / logical operators

• Restrict data to a specific group
  – list if age <= 15 | age >= 45 (list “at risk” mothers)
  – drop if smoke == 1 (remove smokers, BEWARE!)

• Generating new variables
  – See Exercise 2

• Compute summaries for a specific group
  – summarize bmi if smoke == 0

• Used A LOT when creating custom tables/figures
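One caveat when combining these operators with missing data: Stata treats a numeric missing value as larger than any number, so a condition such as age >= 45 is also true when age is missing. The sketch below uses four hypothetical observations to show the difference; adding !missing(age) is one common guard.

```stata
* Toy data (hypothetical); the last mother has a missing age
clear
input age smoke
14 0
22 1
47 0
.  1
end

* Missing (.) sorts above every number, so age >= 45 is true for it
count if age <= 15 | age >= 45

* Exclude missing ages explicitly
count if (age <= 15 | age >= 45) & !missing(age)
```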

Exercise 2

• Categorize BMI: generate a new variable called overweight
  – overweight: 25 <= bmi <= 30 (note: max bmi = 29 in this dataset)

• Summarize birth weight by overweight variable

overweight |   min    max      mean        sd    p50     iqr
-----------+------------------------------------------------
         0 |   709   4990  2964.701  698.1659   2977  1035.5
         1 |  1330   3941    2903.8  882.4416   2977    1723
-----------+------------------------------------------------
     Total |   709   4990  2959.598  712.6656   2977    1063
------------------------------------------------------------
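One possible way to build the overweight indicator is sketched below on hypothetical data. Note that the mathematical shorthand 25 <= bmi <= 30 should not be typed literally in Stata: it is evaluated left to right, so (25 <= bmi) becomes 0 or 1, which is always <= 30. The inrange() function, or bmi >= 25 & bmi <= 30, expresses the intended condition.

```stata
* Toy data (hypothetical), including one missing bmi
clear
input bmi bwt
22.4 2977
26.1 3100
28.5 2500
.    3300
19.0 3600
end

* inrange() is 1 when 25 <= bmi <= 30; keep missing bmi as missing
generate overweight = inrange(bmi, 25, 30) if !missing(bmi)

* Birth weight summaries by overweight status
tabstat bwt, statistics(min max mean sd median iqr) by(overweight)
```

The if !missing(bmi) qualifier is a design choice: without it, mothers with unknown BMI would be coded 0 (not overweight) rather than left missing.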

Functions to Generate New Variables

• Data > Create or change data > Create new variable (extended)

• Categorize continuous variables
  egen bmicat = cut(bmi), at(0,18.5,25,30,100) icodes

• Group variables
  egen racesmoke = group(race smoke)

• Create indicator/dummy variables
  quietly tabulate bmicat, generate(bmicat_)

table bmicat
table racesmoke race smoke
table bmi bmicat
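To make the behaviour of these three commands concrete, here is a small sketch on four hypothetical observations. It assumes the same cut points as the slide; the comments describe what each new variable should contain.

```stata
* Toy data (hypothetical values)
clear
input bmi race smoke
17.0 1 0
22.0 1 1
25.0 2 0
32.0 3 1
end

* Intervals are [0,18.5), [18.5,25), [25,30), [30,100); icodes numbers them 0,1,2,3
egen bmicat = cut(bmi), at(0,18.5,25,30,100) icodes

* One integer code per observed race/smoke combination
egen racesmoke = group(race smoke)

* One 0/1 indicator per level of bmicat: bmicat_1, bmicat_2, ...
quietly tabulate bmicat, generate(bmicat_)

list
```

Because the intervals are closed on the left, a BMI of exactly 25 falls in the third category (code 2).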

Multivariate Summaries

• Day 2: we looked at a lot of univariate (or marginal) summaries

• Generally we are more interested in multivariate summaries, say identifying factors associated with low birth weight infants.

• Using operators to compute summaries (by hand) can be tedious; it would be helpful to have Stata do all the heavy lifting (e.g., the cut() function of egen).

Multivariate Tabular Summaries

• Possible factors associated with low birth weight infants
  – age, smoke, bmi (bmicat)

• How can we summarize these variables by low?
  – Continuous: age, bmi [range, mean, sd, quantiles]
  – Categorical: smoking status and bmicat (frequencies/proportions)

• Statistics > Summaries, tables, and tests >
  – Summary and descriptive statistics
  – Other tables > Compact table of summary statistics

Multivariate Tabular Summaries

• Compact table of summary stats, Options (wide table)

tabstat age smoke bmi, statistics(mean sd median iqr) by(low) longstub

low     stats |       age     smoke       bmi
--------------+------------------------------
0        mean |  23.66154  .3384615  22.49244
           sd |  5.584522  .4750169  2.044822
          p50 |        23         0  22.29633
          iqr |         9         1  2.746094
--------------+------------------------------
1        mean |  22.30508  .5084746  22.54381
           sd |  4.511496  .5042195  2.038627
          p50 |        22         1  22.48364
          iqr |         6         1  2.973585
--------------+------------------------------

Multivariate Tabular Summaries

• Suppose we are interested in testing whether there is an association between smoking and low birth weight
  – Statistics > Summaries, tables and tests > Frequency tables > Two-way tables with measures of association

. tabulate smoke low, chi2

           |          low
     smoke |         0          1 |     Total
-----------+----------------------+----------
         0 |        86         29 |       115
         1 |        44         30 |        74
-----------+----------------------+----------
     Total |       130         59 |       189

          Pearson chi2(1) =   4.9237   Pr = 0.026

  – Statistics > Summaries, tables and tests > Classical tests of hypotheses

Exercise 3

• Compute the associations (tables and χ2) between smoking and low birth weight by race (hint: command from Day 1?)

-> race = 1

           |          low
     smoke |         0          1 |     Total
-----------+----------------------+----------
         0 |        40          4 |        44
         1 |        33         19 |        52
-----------+----------------------+----------
     Total |        73         23 |        96

          Pearson chi2(1) =   9.8556   Pr = 0.002
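One way to get a table and test per race group is the by prefix, sketched below. Since the course file is not reproduced here, the data are entered as cell counts: the race = 1 counts come from the table above, and the second group (coded 2 here, a hypothetical coding) pools all other races, obtained by subtracting the race = 1 counts from the overall smoke-by-low table. Frequency weights tell tabulate that n holds counts.

```stata
* Cell counts: race 1 from the slide; "race 2" = all other races pooled
clear
input race smoke low n
1 0 0 40
1 0 1 4
1 1 0 33
1 1 1 19
2 0 0 46
2 0 1 25
2 1 0 11
2 1 1 11
end

* n holds cell counts, so use frequency weights
bysort race: tabulate smoke low [fweight = n], chi2

* Overall table for comparison
tabulate smoke low [fweight = n], chi2
```

With the individual-level course data, the weight would be dropped: bysort race: tabulate smoke low, chi2.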

Multivariate Graphics

www.ats.ucla.edu/Stat/stata/library/GraphExamples/default.htm

[Figure: box plot and scatterplot examples, each shown in default and customized versions]

Examples

Stata can make lots of plots – but that does not mean you should!

http://www.surveydesign.com.au/tipsgraphs.html

Multivariate Plots

• Type of plot depends on the TYPES of variables
  – Categorical/Categorical
    • Tables
  – Categorical/Continuous
    • Box plots, histograms
  – Continuous/Continuous
    • Scatter/bubble plots

Multivariate Plots: Categorical / Continuous

– Box Plots
  • Graphics > Box plot > main variable = continuous, Categories Tab > Group 1 = categorical
  • graph box bwt, over(smoke)

– Histograms
  • Graphics > Histogram > main variable = continuous, By Tab > categorical
  • histogram bwt, frequency bin(10) by(smoke)

Multivariate Plots: Continuous/ Continuous

– Scatter plots (bubble plots = scatter plots with varying sizes of points)
  • Graphics > Twoway graph > Create > Basic Plots > Scatter: Y variable = continuous, X variable = continuous
  • twoway scatter bwt age, sort

– Other add-ons: lowess smoothers
  • Graphics > Twoway graph > Create > Advanced Plots > Lowess Line: Y variable = continuous, X variable = continuous
  • twoway (scatter bwt age, sort) (lowess bwt age)

Exercise 4

• Summarize the birth weight by smoking status and race
  – Create a boxplot of birth weight by smoking status
  – Create a boxplot of birth weight by race
  – Create a boxplot of birth weight by smoking status AND race

• Summarize maternal age and birth weight (as a group)
  – Create a scatter plot of age by birth weight
  – Add smoothers by smoking status (red: smoke=1, black: smoke=0)
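A possible approach to the two less obvious parts of this exercise is sketched below: repeating over() to get box plots by smoking status AND race, and overlaying one lowess smoother per smoking group. The data are simulated (hypothetical coefficients and seed) purely so the commands have something to draw; the /// line continuations require running this from a do file rather than the Command window.

```stata
* Simulated toy data (hypothetical), standing in for the course file
clear
set seed 2016
set obs 90
generate smoke = mod(_n, 2)
generate race  = 1 + mod(_n, 3)
generate age   = 15 + floor(30 * runiform())
generate bwt   = 3000 - 250 * smoke + 10 * (age - 25) + 400 * rnormal()

* Box plots of birth weight by smoking status AND race
graph box bwt, over(smoke) over(race) name(box_smoke_race, replace)

* Scatter plot with one lowess smoother per smoking group
twoway (scatter bwt age) ///
       (lowess bwt age if smoke == 0, lcolor(black)) ///
       (lowess bwt age if smoke == 1, lcolor(red)), ///
       legend(order(2 "Non-Smoker" 3 "Smoker")) name(scatter_lowess, replace)
```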

Exercise 4 [figure slide; plot not captured in transcript]

Exercise 4 [figure slide; plot not captured in transcript]

Can we improve the aesthetics of these plots?

Improving Aesthetics of a Plot

Plots are composed of points/symbols, lines, text, labels, legends, …

Stata defaults are fine for preliminary analyses or for homework, but modifications are needed for publications (or to reflect personal style).

- Provide examples of how to:
  - Add/modify text: titles, x-/y-axes, legends, …
  - Modify plotting symbols: color, size, symbol, …
  - Modify plotting lines: color, width, type, …
  - Modify colors: histograms, box plots, …

Modifying Aesthetics: Text

• Birth weight by maternal race and smoking status

Modifying Aesthetics: Text - CODE

• Birth weight by maternal race and smoking status

graph box bwt, over(racesmoke)

label define rslab 1 "White: NS" 2 "White: S" 3 "Black: NS" 4 "Black: S" 5 "Other: NS" 6 "Other: S"
label values racesmoke rslab
graph box bwt, over(racesmoke) ytitle("Birth Weight") title("Infant Birth Weight by Maternal Race and Smoking Status") subtitle("subtitle") caption("caption") note("note")

Modifying Aesthetics: Symbols

• Birth weight by maternal age and smoking status

http://www.stata.com/manuals13/g-3marker_options.pdf

Modifying Aesthetics: Symbols - CODE

• Birth weight by maternal age and smoking status

twoway scatter bwt age

twoway (scatter bwt age if smoke==0, mcolor(black) msize(small) msymbol(diamond)) (scatter bwt age if smoke==1, mcolor(red) msize(large)), legend(order(1 "Non-Smoker" 2 "Smoker"))

Note: it is NOT recommended to use all options simultaneously!

Modifying Aesthetics: Lines

• Birth weight by maternal age and smoking status

http://www.stata.com/manuals13/g-3line_options.pdf

Modifying Aesthetics: Lines - CODE

• Birth weight by maternal age and smoking status

twoway (scatter bwt age) (lowess bwt age)

twoway (scatter bwt age) (lowess bwt age if smoke==0, lcolor(black) lwidth(thin) lpattern(dash)) (lowess bwt age if smoke==1, lcolor(red) lwidth(thick))

Note: it is NOT recommended to use all options simultaneously!

(see .do file for code) http://www.stata.com/manuals13/g-3marker_options.pdf

Modifying Aesthetics: Colors

• Birth weight by maternal age and smoking status

Modifying Aesthetics: Colors - CODE

• Birth weight by maternal age and smoking status

histogram bwt
histogram bwt, bin(35) frequency fcolor(sandb) lcolor(lavender) lwidth(thick)

label define slab 1 "Smoker" 0 "Non-Smoker"
label values smoke slab
graph box bwt, over(smoke)
graph box bwt, over(smoke) box(1, fcolor(chocolate) lcolor(pink))
graph box bwt, over(smoke) scheme(s2mono)

(see .do file for code) http://www.stata.com/manuals13/g-3marker_options.pdf

Reproducible Research

• Do file
  – What is a do file?
    File that contains all code (w/ comments)
  – Benefits of a do file?
    Record of all data manipulations
    Record of everything you do to generate an analysis (summary, figure)
  – How do do files differ from log files?

• What if I told you that height was in cm and not inches? How long would it take you to redo all the analysis from today?
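To illustrate why a do file makes that question painless, here is a minimal sketch of one. The file names, the height_cm variable, and the three observations are hypothetical; the point is that the unit assumption lives on one line, so correcting it and rerunning the file regenerates every downstream result and the log.

```stata
* day3_demo.do (hypothetical file name), using toy data
capture log close
log using "day3_demo.log", replace text

clear
input height_cm weight
165.1 130
157.5 110
172.7 155
end

* Single place to fix the unit assumption: centimetres per inch
local cm_per_in 2.54
generate height = height_cm / `cm_per_in'

* Everything downstream reruns unchanged
generate bmi = weight / height^2 * 703
tabstat bmi, statistics(min max mean sd median iqr) columns(statistics)

log close
```

The do file holds the commands; the log file it opens records the commands together with their output, which is one way to think about the difference between the two.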