20130308 Preparing data for modeling in R

Preparing data for modeling in R. 2013-03-08 @HSPH. Kazuki Yoshida, M.D., MPH-CLE student.


Page 1: 20130308 Preparing data for modeling in R

Preparing data for modeling in R

2013-03-08 @HSPH
Kazuki Yoshida, M.D., MPH-CLE student

FREEDOM TO KNOW

Page 2: 20130308 Preparing data for modeling in R

Group Website is at:

http://rpubs.com/kaz_yos/useR_at_HSPH

Page 3: 20130308 Preparing data for modeling in R

Open R Studio

Page 4: 20130308 Preparing data for modeling in R

Create a new script and save it.

Page 6: 20130308 Preparing data for modeling in R

lowbwt.dat

http://www.umass.edu/statdata/statdata/data/lowbwt.txt
http://www.umass.edu/statdata/statdata/data/lowbwt.dat

We will use the lowbwt dataset used in BIO213 Applied Regression for Clinical Research.

Page 7: 20130308 Preparing data for modeling in R

NAME: LOW BIRTH WEIGHT DATA (LOWBWT.DAT)
KEYWORDS: Logistic Regression
SIZE: 189 observations, 11 variables

SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.

DESCRIPTIVE ABSTRACT:

The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

NOTE:

This data set consists of the complete data. A paired data set created from this low birth weight data may be found in lowbwtm11.dat and a 3 to 1 matched data set created from the low birth weight data may be found in mlowbwt.dat.

http://www.umass.edu/statdata/statdata/data/lowbwt.txt

Page 8: 20130308 Preparing data for modeling in R

http://www.umass.edu/statdata/statdata/data/lowbwt.txt

LIST OF VARIABLES:

Columns   Variable                                                 Abbreviation
-------------------------------------------------------------------------------
2-4       Identification Code                                      ID
10        Low Birth Weight (0 = Birth Weight >= 2500g,             LOW
                            1 = Birth Weight < 2500g)
17-18     Age of the Mother in Years                               AGE
23-25     Weight in Pounds at the Last Menstrual Period            LWT
32        Race (1 = White, 2 = Black, 3 = Other)                   RACE
40        Smoking Status During Pregnancy (1 = Yes, 0 = No)        SMOKE
48        History of Premature Labor (0 = None, 1 = One, etc.)     PTL
55        History of Hypertension (1 = Yes, 0 = No)                HT
61        Presence of Uterine Irritability (1 = Yes, 0 = No)       UI
67        Number of Physician Visits During the First Trimester    FTV
          (0 = None, 1 = One, 2 = Two, etc.)
73-76     Birth Weight in Grams                                    BWT
-------------------------------------------------------------------------------

Page 9: 20130308 Preparing data for modeling in R

http://www.umass.edu/statdata/statdata/data/lowbwt.txt

PEDAGOGICAL NOTES: These data have been used as an example of fitting a multiple logistic regression model.

STORY BEHIND THE DATA: Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.

References:

1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley, (1989).

Page 10: 20130308 Preparing data for modeling in R

lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", header = TRUE, skip = 4)

Load dataset from web

header = TRUE to pick up variable names

skip = 4 to skip the first 4 rows
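The effect of header and skip can be tried without a network connection. The following is a minimal sketch, assuming a made-up text block (raw_text, a hypothetical stand-in with two preamble lines rather than four); it is not the actual lowbwt.dat file.

```r
## Hypothetical file contents: two preamble lines, a header row, then data rows
raw_text <- "Some title line
Another comment line
ID LOW AGE
85 0 19
86 0 33
87 1 20"

## skip = 2 drops the two preamble lines;
## header = TRUE reads the next line as variable names
toy <- read.table(text = raw_text, header = TRUE, skip = 2)

names(toy)  # variable names picked up from the header row
nrow(toy)   # number of data rows read
```

With the real file, the same idea applies with skip = 4, as on the slide.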

Page 11: 20130308 Preparing data for modeling in R

lbw[c(10,39), "BWT"] <- c(2655, 3035)

“Fix” dataset

Replace data points to make the dataset identical to the BIO213 dataset

10th, 39th rows

BWT column
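The row and column indexing used here can be checked on a small example. This is a sketch on a hypothetical toy data frame (toy, with invented values), not on the lbw data itself.

```r
## Hypothetical toy data frame standing in for lbw (values invented)
toy <- data.frame(ID = 1:5, BWT = c(2500, 2600, 2700, 2800, 2900))

## Rows 2 and 4 of the BWT column are overwritten, in that order
toy[c(2, 4), "BWT"] <- c(2655, 3035)

toy$BWT  # rows 1, 3 and 5 keep their original values
```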

Page 12: 20130308 Preparing data for modeling in R

Lower case variable names

names(lbw) <- tolower(names(lbw))

Convert variable names to lower case

Put them back into variable names

Page 13: 20130308 Preparing data for modeling in R

See overview

Page 14: 20130308 Preparing data for modeling in R

library(gpairs)
gpairs(lbw)

Page 15: 20130308 Preparing data for modeling in R
Page 16: 20130308 Preparing data for modeling in R

Recoding

Changing and creating variables

Page 17: 20130308 Preparing data for modeling in R

Why?

Page 18: 20130308 Preparing data for modeling in R

Different variable forms mean different modeling assumptions!

Page 19: 20130308 Preparing data for modeling in R

Variable form and assumption

• Continuous variables:
    • Linearity assumption
• Categorical variables:
    • No residual confounding assumption

Page 20: 20130308 Preparing data for modeling in R

Relabel race: 1, 2, 3 to White, Black, Other

lbw$race.cat <- factor(lbw$race, levels = 1:3, labels = c("White","Black","Other"))

Using this variable as continuous is meaningless!!

Take race variable

Order levels 1, 2, 3

Make 1 reference level

Label levels 1, 2, 3 as White, Black, Other

Create new variable named race.cat
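A quick way to see what factor() does with levels and labels is to run it on a short vector. The sketch below assumes a hypothetical vector of race codes (invented values, not the lbw data).

```r
## Hypothetical race codes (invented for illustration)
race <- c(3, 1, 2, 1, 3)

## levels gives the order of the codes; labels renames them in that order
race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))

levels(race.cat)        # the first level acts as the reference level in models
as.character(race.cat)  # each code replaced by its label
table(race.cat)         # counts per category
```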

Page 21: 20130308 Preparing data for modeling in R

Dichotomize ptl

lbw$preterm <- factor(ifelse(lbw$ptl >= 1, "1+", "0"))

Change to categorical

If condition is true, then “1+”

if not (else) “0”

ifelse function gives either one of two values

condition
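The element-by-element behavior of ifelse() can be seen on a short vector. This is a sketch with hypothetical ptl values (invented), following the same call as on the slide.

```r
## Hypothetical counts of previous premature labors (invented values)
ptl <- c(0, 1, 0, 3, 2)

## ifelse() checks the condition for each element and returns "1+" or "0"
preterm <- factor(ifelse(ptl >= 1, "1+", "0"))

levels(preterm)  # levels are sorted, so "0" comes first
table(preterm)
```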

Page 22: 20130308 Preparing data for modeling in R

Change 0,1 binary to No,Yes binary

lbw$smoke <- factor(ifelse(lbw$smoke == 1, "Yes", "No"))
lbw$ht <- factor(ifelse(lbw$ht == 1, "Yes", "No"))
lbw$ui <- factor(ifelse(lbw$ui == 1, "Yes", "No"))
lbw$low <- factor(ifelse(lbw$low == 1, "Yes", "No"))

equality is tested by ==, not =

if 1, return “Yes”

if not, return “No”

Page 23: 20130308 Preparing data for modeling in R

Cutting a continuous variable into categories

lbw$ftv.cat <- cut(lbw$ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))

Number line for ftv (0 1 2 3 4 5 6), intervals closed on the right:
(-Inf, 0] None, (0, 2] Normal, (2, Inf] Many

Breaks at: breaks = c(-Inf, 0, 2, Inf)

Label them as: labels = c("None","Normal","Many")

4 bounds for 3 categories
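To check which values fall into which category, cut() can be applied to the values 0 through 6. This is a small sketch using the same breaks and labels as the slide; the input vector is made up for illustration.

```r
## Every possible value on the number line from 0 to 6
ftv <- 0:6

## Intervals are closed on the right by default: (-Inf, 0], (0, 2], (2, Inf]
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))

data.frame(ftv, ftv.cat)  # side-by-side view of value and category
```

Here 0 lands in "None", 1 and 2 in "Normal", and 3 or more in "Many".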

Page 24: 20130308 Preparing data for modeling in R

Make “Normal” the reference level

lbw$ftv.cat <- relevel(lbw$ftv.cat, ref = "Normal")

“Normal” as reference level
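relevel() changes only the order of the levels, not the data values. A minimal sketch on a hypothetical factor (invented values):

```r
## Hypothetical factor with "None" as the first (reference) level
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))
levels(ftv.cat)

## Move "Normal" to the front so it becomes the reference level
ftv.cat2 <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat2)

## The values themselves are unchanged
as.character(ftv.cat2)
```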

Page 25: 20130308 Preparing data for modeling in R

within() allows direct use of variable names

Page 26: 20130308 Preparing data for modeling in R

lbw <- within(lbw, {

    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Categorize ftv (frequency of visit)
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## Dichotomize ptl
    preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0","1+"))

    ## Categorize smoke ht ui
    smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht    <- factor(ht, levels = 0:1, labels = c("No","Yes"))
    ui    <- factor(ui, levels = 0:1, labels = c("No","Yes"))
})

You can specify variables with variable name only. No need for lbw$

within() method
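The within() approach can be tried on a small data frame first. The sketch below assumes a hypothetical data frame (toy, invented values) with the original 0/1 coding. The last lines illustrate one thing to watch for: this style of recoding expects the 0/1 values, so applying it to a column that was already recoded to No/Yes produces missing values.

```r
## Hypothetical toy data frame standing in for lbw (values invented)
toy <- data.frame(race = c(1, 2, 3, 1), ptl = c(0, 1, 0, 2), smoke = c(0, 1, 1, 0))

toy <- within(toy, {
    ## Column names are used directly; no toy$ prefix is needed
    race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
    preterm  <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0", "1+"))
    smoke    <- factor(smoke, levels = 0:1, labels = c("No", "Yes"))
})

str(toy)

## Caution: recoding an already recoded column finds no 0/1 values, giving NA
recoded.twice <- factor(toy$smoke, levels = 0:1, labels = c("No", "Yes"))
all(is.na(recoded.twice))
```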

Page 27: 20130308 Preparing data for modeling in R

model formula

Page 28: 20130308 Preparing data for modeling in R

outcome ~ predictor1 + predictor2 + predictor3

formula

SAS equivalent: model outcome = predictor1 predictor2 predictor3;

Page 29: 20130308 Preparing data for modeling in R

age ~ zyg

In the case of t-test

continuous variable to be compared

grouping variable to separate groups

Variable to be explained

Variable used to explain
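As an illustration of the formula in a t-test, the sketch below uses a hypothetical data frame (twins, with invented ages). It assumes zyg is a two-level grouping variable, here taken to mean zygosity with made-up labels "MZ" and "DZ"; the slide does not define it.

```r
## Hypothetical data: ages in two groups (values invented for illustration)
twins <- data.frame(
    age = c(21, 25, 30, 28, 35, 40, 33, 38),
    zyg = factor(rep(c("MZ", "DZ"), each = 4))
)

## Left of ~ : continuous variable to compare; right of ~ : grouping variable
res <- t.test(age ~ zyg, data = twins)

res$estimate  # group means, ordered by factor level (DZ, then MZ)
res$p.value
```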

Page 30: 20130308 Preparing data for modeling in R

Y ~ X1 + X2

linear sum

Page 31: 20130308 Preparing data for modeling in R

• .       All variables except for the outcome
• + X2    Add X2 term
• - 1     Remove intercept
• X1:X2   Interaction term between X1 and X2
• X1*X2   Main effects and interaction term
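One way to see what each formula operator does is to look at the columns of the design matrix that R builds from the formula. This is a sketch on a hypothetical data frame (d, invented values); model.matrix() is used only to inspect the terms.

```r
## Hypothetical data frame (values invented for illustration)
d <- data.frame(Y  = c(1, 2, 3, 4, 5, 6),
                X1 = c(0, 1, 0, 1, 0, 1),
                X2 = c(2, 3, 5, 7, 11, 13))

## X1 * X2 expands to main effects plus the interaction
cols.star  <- colnames(model.matrix(Y ~ X1 * X2, data = d))
cols.colon <- colnames(model.matrix(Y ~ X1 + X2 + X1:X2, data = d))

## "." means all variables in the data except the outcome
cols.dot <- colnames(model.matrix(Y ~ ., data = d))

## "- 1" removes the intercept column
cols.noint <- colnames(model.matrix(Y ~ X1 - 1, data = d))

cols.star
cols.colon
cols.dot
cols.noint
```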

Page 32: 20130308 Preparing data for modeling in R

Y ~ X1 + X2 + X1:X2

Interaction term

Main effects: X1 + X2

Interaction: X1:X2

Page 33: 20130308 Preparing data for modeling in R

Y ~ X1 * X2

Interaction term

Main effects & interaction

Page 34: 20130308 Preparing data for modeling in R

Y ~ X1 + I(X2 * X3)

On-the-fly variable manipulation

New variable (X2 times X3) created on-the-fly and used

I(): Inhibit formula interpretation. For math manipulation
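The difference between I(X2 * X3) and a bare X2 * X3 shows up in the design matrix. The sketch below uses a hypothetical data frame (d, invented values).

```r
## Hypothetical data frame (values invented for illustration)
d <- data.frame(Y  = c(1, 2, 3, 4, 5, 6),
                X1 = c(0, 1, 0, 1, 0, 1),
                X2 = c(2, 3, 5, 7, 11, 13),
                X3 = c(1, 4, 1, 5, 9, 2))

## With I(): one new column holding the arithmetic product X2 times X3
mm.I <- model.matrix(Y ~ X1 + I(X2 * X3), data = d)
colnames(mm.I)

## Without I(): * is read as formula syntax (main effects plus interaction)
mm.star <- model.matrix(Y ~ X1 + X2 * X3, data = d)
colnames(mm.star)

## The I() column equals the element-wise product
prod.col <- unname(mm.I[, "I(X2 * X3)"])
prod.col
```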

Page 35: 20130308 Preparing data for modeling in R

lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm,
              data = lbw)

Fit a model

Page 36: 20130308 Preparing data for modeling in R

lm.full

See model object

Page 37: 20130308 Preparing data for modeling in R

Call: command repeated

Coefficient for each variable

Page 38: 20130308 Preparing data for modeling in R

summary(lm.full)

See summary

Page 39: 20130308 Preparing data for modeling in R

Call: command repeated

Model F-test

Residual distribution

Dummy variables created

R^2 and adjusted R^2

Coef/SE = t
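The pieces of the summary output can also be pulled out as numbers. This is a sketch on simulated data (toy, a hypothetical data frame with made-up variables y, x, and g), not the lbw model; it checks that the t value is the coefficient divided by its standard error and shows the dummy variables created for a factor.

```r
## Simulated toy data (hypothetical; not the lbw data)
set.seed(20130308)
n <- 60
toy <- data.frame(x = rnorm(n),
                  g = factor(rep(c("a", "b", "c"), length.out = n)))
toy$y <- 1 + 2 * toy$x + rnorm(n)

fit <- lm(y ~ x + g, data = toy)

## Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

## Dummy variables gb and gc are created; level "a" is the reference
rownames(coefs)

## Coef / SE = t
t.manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
cbind(t.manual, t.reported = coefs[, "t value"])

## R^2 and adjusted R^2
summary(fit)$r.squared
summary(fit)$adj.r.squared
```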

Page 40: 20130308 Preparing data for modeling in R

ftv.catNone: women with no 1st trimester visits compared to women with a "Normal" number of 1st trimester visits (1 to 2 visits, reference level)

ftv.catMany: women with many 1st trimester visits compared to women with a "Normal" number of 1st trimester visits (1 to 2 visits, reference level)

Page 41: 20130308 Preparing data for modeling in R

race.catBlack: Black women compared to White women (reference level)

race.catOther: women of other races compared to White women (reference level)

Page 42: 20130308 Preparing data for modeling in R

confint(lm.full)

Confidence intervals

Page 43: 20130308 Preparing data for modeling in R

Lower boundary

Upper boundary

Confidence intervals
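The lower and upper boundaries can be reproduced by hand from the estimates and standard errors. The sketch below uses the built-in cars dataset as a stand-in (an assumption for illustration; it is not the lbw model) and compares confint() with estimate plus or minus the t critical value times the standard error.

```r
## Stand-in model on the built-in cars dataset (not the lbw model)
fit <- lm(dist ~ speed, data = cars)

## Default is a 95% interval: columns are the lower and upper boundaries
ci <- confint(fit)
ci

## Manual calculation: estimate -/+ t critical value * standard error
est   <- coef(fit)
se    <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))
manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
manual
```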

Page 44: 20130308 Preparing data for modeling in R