20130308 preparing data for modeling in r
Preparing data for modeling in R
2013-03-08 @HSPH
Kazuki Yoshida, M.D. MPH-CLE student
FREEDOM TO KNOW
Group Website is at:
http://rpubs.com/kaz_yos/useR_at_HSPH
Open RStudio
Create a new script and save it.
http://www.umass.edu/statdata/statdata/data/
lowbwt.dat
http://www.umass.edu/statdata/statdata/data/lowbwt.txt
http://www.umass.edu/statdata/statdata/data/lowbwt.dat
We will use the lowbwt dataset used in BIO213 Applied Regression for Clinical Research
NAME: LOW BIRTH WEIGHT DATA (LOWBWT.DAT)
KEYWORDS: Logistic Regression
SIZE: 189 observations, 11 variables
SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
DESCRIPTIVE ABSTRACT:
The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
NOTE:
This data set consists of the complete data. A paired data set created from this low birth weight data may be found in lowbwtm11.dat and a 3 to 1 matched data set created from the low birth weight data may be found in mlowbwt.dat.
http://www.umass.edu/statdata/statdata/data/lowbwt.txt
LIST OF VARIABLES:
Columns   Variable                                               Abbreviation
-----------------------------------------------------------------------------
2-4       Identification Code                                    ID
10        Low Birth Weight (0 = Birth Weight >= 2500g,           LOW
                            1 = Birth Weight < 2500g)
17-18     Age of the Mother in Years                             AGE
23-25     Weight in Pounds at the Last Menstrual Period          LWT
32        Race (1 = White, 2 = Black, 3 = Other)                 RACE
40        Smoking Status During Pregnancy (1 = Yes, 0 = No)      SMOKE
48        History of Premature Labor (0 = None, 1 = One, etc.)   PTL
55        History of Hypertension (1 = Yes, 0 = No)              HT
61        Presence of Uterine Irritability (1 = Yes, 0 = No)     UI
67        Number of Physician Visits During the First Trimester  FTV
          (0 = None, 1 = One, 2 = Two, etc.)
73-76     Birth Weight in Grams                                  BWT
-----------------------------------------------------------------------------
PEDAGOGICAL NOTES:
These data have been used as an example of fitting a multiple logistic regression model.
STORY BEHIND THE DATA:
Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.
References:
1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley, (1989).
lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", header = TRUE, skip = 4)
Load dataset from web
header = TRUE to pick up variable names
skip = 4 to skip the first 4 rows
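To see what header and skip do without depending on the web file, here is a minimal sketch that reads a made-up miniature file from text. The toy_lines content is hypothetical and only mimics the idea of a preamble followed by a header row; it is not the real lowbwt.dat.

```r
## Hypothetical miniature "file": 4 preamble lines, then a header row and data
toy_lines <- c(
  "NAME: toy data (made up for illustration)",
  "SIZE: 3 observations, 3 variables",
  "SOURCE: none, typed by hand",
  "NOTE: these four lines play the role of the preamble",
  "ID LOW AGE",
  "85 0 19",
  "86 0 33",
  "4 1 28"
)

## skip = 4 drops the preamble; header = TRUE turns the next line into names
toy <- read.table(text = toy_lines, header = TRUE, skip = 4)

names(toy)
nrow(toy)
str(toy)
```

If skip were too small, read.table would try to parse preamble lines as data, so it is worth checking names() and str() right after loading.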
lbw[c(10,39), "BWT"] <- c(2655, 3035)
“Fix” dataset
Replace data points to make the dataset identical to the BIO213 dataset
10th, 39th rows
BWT column
Lower case variable names
names(lbw) <- tolower(names(lbw))
Convert variable names to lower case
Put them back into variable names
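The two steps above (overwriting cells by row and column, then lower-casing the names) can be tried on a small data frame first. This is only a sketch; the toy data frame and its values are made up.

```r
## Hypothetical toy data frame with upper-case names
toy <- data.frame(ID = 1:3, BWT = c(2500, 2600, 2700))

## Rows 1 and 3 of the BWT column are replaced, in that order
toy[c(1, 3), "BWT"] <- c(2555, 2755)

## tolower() returns lower-case names; assigning them back renames the columns
names(toy) <- tolower(names(toy))

toy
```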
See overview
library(gpairs)
gpairs(lbw)
Recoding
Changing and creating variables
Why?
Different variable forms mean different modeling assumptions!
Variable form and assumption
• Continuous variables:
    • Linearity assumption
• Categorical variables:
    • No residual confounding assumption
Relabel race: 1, 2, 3 to White, Black, Other
lbw$race.cat <- factor(lbw$race, levels = 1:3, labels = c("White","Black","Other"))
Using this variable as continuous is meaningless!!
Take race variable
Order levels 1, 2, 3
Make 1 reference level
Label levels 1, 2, 3 as White, Black, Other
Create new variable named race.cat
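A small sketch of what factor() does with levels and labels, using a made-up vector of race codes rather than the real data:

```r
## Hypothetical race codes (1 = White, 2 = Black, 3 = Other)
race <- c(1, 3, 2, 1, 3)

## levels fixes the order (first level = reference); labels renames them
race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))

levels(race.cat)     # order of categories; the first is the reference
table(race.cat)      # counts per category
contrasts(race.cat)  # dummy coding a model would use by default
```

The contrasts() output shows that the reference level gets no dummy variable of its own, which is why it never appears as a coefficient name later.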
Dichotomize ptl
lbw$preterm <- factor(ifelse(lbw$ptl >= 1, "1+", "0"))
Change to categorical
If condition is true, then “1+”; if not (else), “0”
ifelse function gives either one of two values
condition: lbw$ptl >= 1
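As a quick check of how ifelse() works element by element, here is a sketch on a made-up vector of premature labor counts:

```r
## Hypothetical counts of previous premature labors
ptl <- c(0, 1, 0, 2, 3)

## ifelse() tests the condition for each element and picks one of two values
preterm <- factor(ifelse(ptl >= 1, "1+", "0"))

data.frame(ptl, preterm)
levels(preterm)
```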
Change 0,1 binary to No,Yes binary
lbw$smoke <- factor(ifelse(lbw$smoke == 1, "Yes", "No"))
lbw$ht <- factor(ifelse(lbw$ht == 1, "Yes", "No"))
lbw$ui <- factor(ifelse(lbw$ui == 1, "Yes", "No"))
lbw$low <- factor(ifelse(lbw$low == 1, "Yes", "No"))
equality is tested by ==, not =
if 1, return “Yes”
if not, return “No”
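When several 0/1 columns need the same treatment, the repetition can be avoided. The following is one possible sketch, not the method used on the slides: it applies factor() with explicit levels to each column of a made-up data frame. The names toy and yes_no_vars are hypothetical.

```r
## Hypothetical toy data with three 0/1 indicator columns
toy <- data.frame(smoke = c(0, 1, 1), ht = c(0, 0, 1), ui = c(1, 0, 0))

## Columns to convert
yes_no_vars <- c("smoke", "ht", "ui")

## Apply the same 0/1 -> No/Yes conversion to each listed column
toy[yes_no_vars] <- lapply(toy[yes_no_vars], function(x) {
  factor(x, levels = 0:1, labels = c("No", "Yes"))
})

str(toy)
```

One difference in behavior is worth knowing: ifelse(x == 1, "Yes", "No") turns any value other than 1 into "No", while factor(x, levels = 0:1, ...) turns values other than 0 and 1 into NA, which makes unexpected codes easier to spot.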
Cutting a continuous variable into categories
lbw$ftv.cat <- cut(lbw$ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
Intervals on the number line: (-Inf, 0] None, (0, 2] Normal, (2, Inf] Many
Breaks at: breaks = c(-Inf, 0, 2, Inf)
Label them as: labels = c("None","Normal","Many")
4 bounds for 3 categories
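To confirm which values fall into which category, the same breaks can be applied to the made-up values 0 to 6. By default cut() uses intervals that are open on the left and closed on the right, so 0 goes to "None" and 2 goes to "Normal".

```r
## Hypothetical visit counts covering every value from 0 to 6
ftv <- 0:6

## Same breaks and labels as above; intervals are (lower, upper] by default
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))

data.frame(ftv, ftv.cat)

## Without labels, the levels show the intervals themselves
levels(cut(ftv, breaks = c(-Inf, 0, 2, Inf)))
```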
Make “Normal” the reference level
lbw$ftv.cat <- relevel(lbw$ftv.cat, ref = "Normal")
“Normal” as reference level
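A minimal sketch of what relevel() changes, on a made-up factor: only the order of the levels moves, the data values stay the same.

```r
## Hypothetical factor with "None" as the first (reference) level
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))
levels(ftv.cat)

## Move "Normal" to the front so that it becomes the reference level
ftv.cat <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat)

## The observations themselves are unchanged
as.character(ftv.cat)
```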
within() allows direct use of variable names
lbw <- within(lbw, {
    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Categorize ftv (frequency of visit)
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## Dichotomize ptl
    preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0","1+"))

    ## Categorize smoke ht ui
    smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht <- factor(ht, levels = 0:1, labels = c("No","Yes"))
    ui <- factor(ui, levels = 0:1, labels = c("No","Yes"))
})
You can specify variables with variable name only. No need for lbw$
within() method
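The following sketch compares the lbw$ style and the within() style on a made-up data frame, to show that both give the same recoded variables. The names toy, a, and b are hypothetical.

```r
## Hypothetical toy data
toy <- data.frame(ptl = c(0, 1, 2), smoke = c(0, 1, 0))

## Style 1: refer to each variable with the data frame name and $
a <- toy
a$preterm <- factor(a$ptl >= 1, levels = c(FALSE, TRUE), labels = c("0", "1+"))
a$smoke   <- factor(a$smoke, levels = 0:1, labels = c("No", "Yes"))

## Style 2: within() evaluates the block inside the data frame,
## so bare variable names work; it returns a modified copy
b <- within(toy, {
  preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0", "1+"))
  smoke   <- factor(smoke, levels = 0:1, labels = c("No", "Yes"))
})

identical(a$preterm, b$preterm)
identical(a$smoke, b$smoke)
```

Because within() returns a copy, the result has to be assigned back (as in lbw <- within(lbw, ...)) for the changes to be kept.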
model formula
outcome ~ predictor1 + predictor2 + predictor3
formula
SAS equivalent: model outcome = predictor1 predictor2 predictor3;
age ~ zyg
In the case of t-test
continuous variable to be compared
grouping variable to separate groups
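As an illustration of the formula interface for a t-test, here is a sketch on simulated data. The data frame toy, its values, and the group labels "A" and "B" are made up; only the variable names age and zyg follow the formula above.

```r
## Simulated toy data: a continuous variable and a two-level grouping variable
set.seed(1)
toy <- data.frame(
  age = c(rnorm(10, mean = 30, sd = 5), rnorm(10, mean = 35, sd = 5)),
  zyg = factor(rep(c("A", "B"), each = 10))
)

## Formula style: continuous variable ~ grouping variable
res_formula <- t.test(age ~ zyg, data = toy)

## Equivalent call passing the two groups as separate vectors
res_vectors <- t.test(toy$age[toy$zyg == "A"], toy$age[toy$zyg == "B"])

res_formula
all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))
```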
Variable to be explained
Variable used to explain
Y ~ X1 + X2
linear sum
• .        All variables except for the outcome
• + X2     Add X2 term
• - 1      Remove intercept
• X1:X2    Interaction term between X1 and X2
• X1*X2    Main effects and interaction term
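One way to see what each formula operator does is to look at the columns of the design matrix that R builds from it. This is a sketch on a made-up data frame; model.matrix() is used here only for inspection and is not part of the slides.

```r
## Hypothetical toy data with one outcome and two predictors
toy <- data.frame(Y  = c(1, 2, 3, 4),
                  X1 = c(0, 1, 0, 1),
                  X2 = c(5, 6, 8, 7))

## Column names of the design matrix show which terms a formula creates
cols_dot    <- colnames(model.matrix(Y ~ ., data = toy))            # all other variables
cols_no_int <- colnames(model.matrix(Y ~ X1 + X2 - 1, data = toy))  # intercept removed
cols_colon  <- colnames(model.matrix(Y ~ X1:X2, data = toy))        # interaction only
cols_star   <- colnames(model.matrix(Y ~ X1 * X2, data = toy))      # main effects + interaction
cols_long   <- colnames(model.matrix(Y ~ X1 + X2 + X1:X2, data = toy))

cols_dot
cols_no_int
cols_colon
cols_star
identical(cols_star, cols_long)  # X1 * X2 is shorthand for X1 + X2 + X1:X2
```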
Y ~ X1 + X2 + X1:X2
Interaction term
Main effects Interaction
Y ~ X1 * X2
Interaction term
Main effects & interaction
Y ~ X1 + I(X2 * X3)
On-the-fly variable manipulation
New variable (X2 times X3) created on-the-fly and used
I(): inhibit formula interpretation. For math manipulation
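The effect of I() can be checked the same way, by inspecting the design matrix for a made-up data frame. Inside I(), the * sign is ordinary multiplication rather than the formula operator.

```r
## Hypothetical toy data
toy <- data.frame(Y  = c(1, 2, 3, 4),
                  X1 = c(0, 1, 0, 1),
                  X2 = c(5, 6, 8, 7),
                  X3 = c(2, 2, 3, 3))

## I(X2 * X3) creates a single column holding the product of X2 and X3
mm <- model.matrix(Y ~ X1 + I(X2 * X3), data = toy)

colnames(mm)
mm[, "I(X2 * X3)"]
toy$X2 * toy$X3
```

Without I(), X2 * X3 in a formula would expand to X2 + X3 + X2:X3 instead of one product column.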
lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
Fit a model
lm.full
See model object
Call: command repeated
Coefficient for each variable
summary(lm.full)
See summary
Call: command repeated
Model F-test
Residual distribution
Dummy variables created
R^2 and adjusted R^2
Coef/SE = t
ftv.catNone: No 1st trimester visit people compared to Normal 1st trimester visit people (reference level)
ftv.catMany: Many 1st trimester visit people compared to Normal 1st trimester visit people (reference level)
race.catBlack: Black people compared to White people (reference level)
race.catOther: Other people compared to White people (reference level)
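The naming of dummy variables and the relation Coef/SE = t can be verified on a small simulated example. Everything in this sketch (toy, x, grp, y, and the simulated effect sizes) is made up; it does not use the lbw data.

```r
## Simulated toy data: one continuous predictor and one 3-level factor
set.seed(42)
toy <- data.frame(
  x   = rnorm(30),
  grp = factor(rep(c("Normal", "None", "Many"), times = 10),
               levels = c("Normal", "None", "Many"))  # "Normal" is the reference
)
toy$y <- 1 + 2 * toy$x + ifelse(toy$grp == "None", -1, 0) + rnorm(30)

fit <- lm(y ~ x + grp, data = toy)

## Coefficient table from summary(): Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

## Dummy variables are named factor name + level; the reference has no row
rownames(coefs)

## The t value column is the estimate divided by its standard error
all.equal(coefs[, "t value"], coefs[, "Estimate"] / coefs[, "Std. Error"])
```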
confint(lm.full)
Confidence intervals
Lower boundary
Upper boundary
Confidence intervals
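For a linear model, the lower and upper boundaries from confint() are the estimate minus and plus a t quantile times the standard error. The sketch below checks this on simulated data; toy, fit, and the simulated coefficients are made up.

```r
## Simulated toy data and a simple linear model
set.seed(7)
toy <- data.frame(x = rnorm(25))
toy$y <- 3 + 0.5 * toy$x + rnorm(25)
fit <- lm(y ~ x, data = toy)

## 95% confidence intervals (the default level)
ci <- confint(fit)
ci

## Manual calculation: estimate +/- t quantile * standard error
est    <- coef(summary(fit))[, "Estimate"]
se     <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
manual <- cbind(est - t_crit * se, est + t_crit * se)

all.equal(unname(ci), unname(manual))
```

A different confidence level can be requested with the level argument, for example confint(fit, level = 0.90).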