the rise in multiple births in the u.s.: an analysis of a hundred-million birth records with r

39
Revolution Confidential The R ise in Multiple B irths in the U.S.: An Analysis of a Hundred-Million B irth R ecords with R Susan I. R anney, Ph.D. J S M 2012

Upload: revolution-analytics

Post on 25-May-2015

351 views

Category:

Technology


0 download

DESCRIPTION

Presentation by Sue Ranney of Revolution Analytics at JSM 2012, San Diego CA, Aug 1 2012. The Center for Disease Control and Prevention recently issued a report, widely cited in the popular press, on the increased incidence of multiple births in the United States over the last 30 years. Twin birth rates were extracted from annual birth data by a variety of mother's characteristics in order to examine this trend. Our research extends this analysis by applying multivariate analysis to individual-level data obtained from public-use data sets on all births in the United States from 1985 to 2009. We combine the data into a single, multi-year data file (an .xdf file easily accessed by R) containing over 100-million birth records. To analyze the relationship between parental characteristics and multiple birth pregnancies, we first change the unit of observation from the baby to the pregnancy in order to remove replicated observations of parents of multiples. Then, estimating a logistic regression on all of the remaining observations, we show that the trends in increased multiple births are more strongly associated with the age of father than the age of mother, and that controlling for ages, the relative incidence of multiple births for black mothers has been declining.

TRANSCRIPT

Page 1: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

T he R is e in Multiple B irths in the U.S .: A n A nalys is of a

Hundred-Million B irth R ec ords with R

S us an I. R anney, P h.D.

J S M 2012

Page 2: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential T he Tools

Open Source R Flexible, powerful, great graphics Great for prototyping, not very scalable

RevoScaleR (with Revolution R Enterprise) Efficient file format (.xdf) Functions of accessing/importing external data sets

(fixed format & delimited text, SPSS, SAS, ODBC) Very fast, parallelized, distributed analysis functions

(summary stats, crosstabs/cubes, linear models, kmeans clustering, logistic regression, glm)

2

Page 3: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential T he Data

Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm

“These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” – Joseph Adler, R in a Nutshell

3

Page 4: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential T he U.S . B irth Data (c ontinued)

Data for each year are contained in a compressed, fixed-format, text files

Typically 3 to 4 million records per file Variables and structure of the files sometimes change

from year to year, with birth certificates revised in1989 and 2003. Warnings:

NOTE: THE RECORD LAYOUT OF THIS FILE HAS CHANGED SUBSTANTIALLY. USERS SHOULD READ THIS DOCUMENT CAREFULLY. Reporting can differ state-to-state for any given year

4

Page 5: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential T he P roc es s

Basic “life cycle of analysis” Import and combine data 25 years Over 100 million obs.

Check and clean data Basic variable summaries Big data logistic/glm regressions

All on my laptop Option to distribute computations

to cluster

5

Page 6: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential T he Ques tion

6

CDC Report in Jan. 2012

Page 7: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential T he Ques tion

What accounts for the increase in multiple births in the United States? Can we separate out effects of mother’s and father’s ages, race/ethnicity? Examine time trends by sub-group, assumed to be associated with fertility treatment CDC finding: The older age of women at childbirth accounts for only about 1/3 of the rise in twinning over 30 years (but this mixes in the increased rate of “twinning” for older women)

7

Page 8: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Importing the U.S . B irth Data for Us e in R

8

Page 9: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential E xample of Differenc es for Different Years

To create common variables across years, use common names and new factor levels for ‘colInfo’ in rxImport function. For example:

For 1985: SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels = c("Male", "Female"), description = "Sex of Infant“)

For 2003: SEX = list(type="factor", start=436, width=1, levels=c("M", "F"), newLevels = c("Male", "Female"), description = "Sex of Infant”)

9

Page 10: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential C reating Trans formed Variables on Import

Use standard R syntax to create transformed variables.

For example, create a factor for Mom’s Age using the imported MAGER integer variable:

MomAgeR7 = cut(MAGER, breaks =c(0, 19, 24, 29, 34, 39, 98, 99), labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "Over 39", "Missing"))

Create binary variable for “IsMultiple” IsMultiple = DPLURAL_REC != 'Single'

10

Page 11: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential S teps for C omplete Import Lists for column information and transforms are

created for 3 base years: 1985, 1989, 2003 when there were very large changes in the structure of the input files Changes to these lists are made where

appropriate for in-between years A test script is run, importing only 1000

observations per year for a subset of years Full script is run, importing each year, sorting

according to key demographic characteristics, and appending to a master .xdf file

11

Page 12: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

E xamining and C leaning the B ig Data F ile

12

Page 13: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential E xamining B as ic Information Basic file information >rxGetInfo(birthAll) File name: C:\Revolution\Data\CDC\BirthUS85to09S.xdf Number of observations: 100672041

Number of variables: 50 Number of blocks: 215

Use rxSummary to compute summary statistics for continuous data and category counts for each of the factors (about 4 minutes on my laptop)

rxSummary(~., data=birthAll, blocksPerRead = 10)

13

Page 14: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential E xample of S ummary S tatis tic s

DadAgeR8 Counts

Under 20 3226602

20-24 15304803

25-29 23805056

30-34 23179418

35-39 13289015

40-44 4984146

Over 44 2140207

Missing 14742794

14

MomAgeR7 Counts

Under 20 11918891

20-24 25975642

25-29 28701398

30-34 22341530

35-39 9788753

Over 39 1945827

Missing 0

Page 15: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential His tograms by Year

Easily check for basic errors in data import (e.g. wrong position in file) by creating histograms by year – very fast (just seconds on my laptop) Example: Distribution of mother’s age by

year. Use F() to have the integer year treated as a factor.

rxHistogram(~MAGER| F(DOB_YY), data=birthAll, blocksPerRead = 10, layout=c(5,5))

15

Page 16: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential A ge of Mother Over T ime

16

Page 17: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential Drill Down and E xtrac t S ubs amples

Take a quick look at “older” fathers: rxSummary(~F(UFAGECOMB),

data=birthAll,

blocksPerRead = 10)

What’s going on with 89-year old Dads? Extract a data frame:

dad89 <- rxDataStep(

inData = birthAll,

rowSelection = UFAGECOMB == 89,

varsToKeep = c("DOB_YY", "MAGER",

"MAR", "STATENAT", "FRACEREC"),

blocksPerRead = 10)

17

Dad’s Age Counts 80 141 81 108 82 81 83 74 84 56 85 43 86 43 87 26 88 27 89 3327

Page 18: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential Year and S tate for 89-Year-Old F athers rxCube(~F(DOB_YY):STATENAT, data=dad89, removeZeroCounts=TRUE)

18

F_DOB_YY STATENAT Counts 1990 California 1 1999 California 1 2000 California 1 1996 Hawaii 1 1997 Louisiana 1 1986 New Jersey 1 1995 New Jersey 1 1996 Ohio 1 1989 Texas 3316 1990 Texas 1 2001 Texas 1 1985 Washington 1

Page 19: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Demographic s of Multiple B irths 1985-2009

19

Page 20: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential Unit of Obs ervation: Delivery

Appropriate unit of observation is the delivery (resulting in 1 or more live births) rather than the individual birth. Use 1/Plurality as probability weight Alternative: Look at nearby records to

compute the “Reported Delivery Birth Order” (RDPO), then select on only the 1st born in the delivery for the analysis

20

Page 21: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

21

Page 22: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

22

Page 23: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

23

Page 24: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Multiple B irths L ogis tic R egres s ion & P redic tions

24

Page 25: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential L ogis tic R egres s ion Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 +

FRaceEthn + MRaceEthn +

DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 + MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85

File name: C:\Revolution\Data\CDC\BirthUS85to09R.xdf

Probability weights: PluralWeight

Dependent variable(s): IsMultiple

Total independent variables: 118 (Including number dropped: 12)

Number of valid observations: 100672041

Number of missing observations: 0 -2*LogLikelihood: 14555447.8514

(Residual deviance on 100671935 degrees of freedom)

About 6 minutes on my laptop

25

Page 26: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

C ounts of Deliveries by all Demographic C ombinations by Year rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY),

data = birthAllR, pweights = "PluralWeight",

blocksPerRead = 10)

Under 10 seconds to compute on my laptop Resulting data frame has 50,400 rows

representing all the demographic combinations for each of the 25 years 44,661 have counts <= 1000 Provides input data for predictions and

weights for aggregating predictions

26

Page 27: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential P redic t and A ggregate by Year predOut <- rxPredict(logitObj, data = catAllDF) Create predictions for each detailed

demographic group Aggregate for each year using population

percentages for each detailed group for each year

Compare with actuals Perform “What if?” scenarios Scenario 1: Only change in demographic; no

“fertility-treatment” time trand

27

Page 28: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

28

Page 29: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Drill Down: L ook at P redictions for S pec ific G roups

Select out predicted values from prediction data frame for detailed groups: Age of mother and father Race/ethnicity of mother and father

29

Page 30: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

30

Page 31: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

31

Page 32: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

C onc lus ions from L ogis tic R egres s ion

Dads matter: Use of fertility treatment associated with older dads as well as older Mom’s

Hispanics show relatively small increase in multiples for both younger and older couples

Asian pattern similar to whites, but lower for all age groups

Black show similar increase in multiples for both younger and older couples Other Factors Related to Multiples? (Diet, high BMI,

other genetic factors)

32

Page 33: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

B ut How Many B abies ?

33

Page 34: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

G L M Tweedie Model: How Many B abies ?

When power parameter is between 1 and 2, Tweedie is a compound Poisson distribution Appropriate for data with positive data that

also includes a ‘clump’ of exact zeros Dependent variable: Number of additional

babies (Plurality – 1) Same independent variables

34

Page 35: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

35

Page 36: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

F urther R es earc h

Data management Further cleaning of data Import more variables Import more years (additional use of weights)

Multiple Births Analysis More variables (e.g., proxies for fertility treatment

trends) Investigation of sub-groups (e.g., young blacks) Improved computation of number of births per

delivery Other analyses with birth data

36

Page 37: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

S ummary of A pproac h “Unpredictability” of multiple births requires

large data set to have the power to capture effects

Significant challenges in importing and cleaning the data – using R and .xdf files makes it possible

Even with a huge data, “cells” of tables looking at multiple factors can be small

Using combined.xdf file, we can use individual-level analysis to examine conditional effects of a variety of factors

37

Page 38: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

R eferenc es Martin JA, Hamilton BE, Osterman MJK. Three

decades of twin births in the United States, 1980-2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.

Blondel B, Kaminiski, M. Trends in the Occurrence, Determinants, and Consequences of Multiple Births. Seminars in Perinatology. 26(4):239-49, 2002.

Vahratian, Anjel. Utilization of Fertility-Related Services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319.

38

Page 39: The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

Revolution Confidential

Revolution Confidential

T hank you! R-Core Team R Package Developers R Community Revolution R Enterprise Customers and Beta

Testers Colleagues at Revolution Analytics Contact: [email protected] 39