
PROC SURVEYCORR

Jessica Hampton

CCSU, New Britain, CT

September 2013

Introduction


Medical Expenditure Panel Survey (MEPS)

• Administered annually by the U.S. Department of Health and Human Services since 1996
  • Agency for Healthcare Research and Quality (AHRQ)
• Anonymity protected by removing individual identifiers from the public data files
• MEPS 2010 consolidated data file released September 2012
• Multiple components (household, insurance/employer, and medical provider)
• Household component (1,911 variables) covers the following topics:
  • Demographics
  • Household income
  • Employment
  • Diagnosed health conditions
  • Additional health status issues
  • Medical expenditures and utilization
  • Satisfaction with and access to care
  • Insurance coverage
• 18,692 observations remain after excluding out-of-scope records, negative person weights, and respondents under 18 or 65 and older
• Target population: U.S. civilian, noninstitutionalized population
• ~3% out of scope (birth/adoption, death, incarceration, living abroad)


MEPS Survey Design Methods

• MEPS is a representative but NOT a simple random sample of the population
• Person weights must be used to produce reliable population estimates
• Stratification:
  • By demographic variables such as age, race, sex, income, etc.
  • Goal is to maximize homogeneity within and heterogeneity between strata
  • Sometimes used to oversample certain groups that are under-represented in the general population or have characteristics relevant to the study
  • For example: blacks, Hispanics, and low-income households
• Clustering:
  • By geography in order to reduce survey costs; a random sample of the entire U.S. population is not feasible or cost-effective
  • Ignoring within-cluster correlation underestimates variance/error: two families in the same neighborhood are more likely to be similar demographically (for example, similar income)
  • Ideally clusters are spatially close for cost-effectiveness but as heterogeneous within as possible to keep variance reasonable
  • Multi-stage clustering used in MEPS: sample of counties >> sample of blocks >> individuals/households surveyed from block sample
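The cost of within-cluster correlation can be made concrete with the usual design-effect approximation. This is a standard textbook result (Kish), not something taken from the slides, and it assumes roughly equal cluster sizes; the symbols m (average cluster size) and rho (intracluster correlation) are introduced here for illustration only.

```latex
\mathrm{deff} \;\approx\; 1 + (m - 1)\,\rho
% Illustrative values only: m = 20 respondents per cluster and \rho = 0.05
% give deff \approx 1 + 19 \times 0.05 = 1.95, i.e. the variance is nearly
% twice what a simple random sample of the same size would give.
```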


Survey Design Considerations

• If person weights are ignored and one tries to generalize sample findings to the entire population, total numbers, percentages, or means are inflated for the groups that are oversampled and underestimated for others

• In regression analysis, ignoring person weights leads to biased coefficient estimates

• If sampling strata and cluster variables are ignored, means and coefficient estimates are unaffected, but standard error (or population variance) may be underestimated; that is, the reliability of an estimate may be overestimated

• Or when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not

• (Machlin, S., Yu, W., & Zodet, M., 2005)
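The last two points can be checked directly. A minimal sketch, assuming the PQI.MEPS_2010 data set and the design variables (VARSTR, VARPSU, PERWT10F) used in the rest of this deck: both steps should report the same weighted mean of TOTEXP10, but only the second accounts for strata and clusters when computing the standard error.

```sas
/* Weights only: the point estimate uses the person weights, but the
   standard error is computed as if there were no strata or clusters */
PROC SURVEYMEANS DATA=PQI.MEPS_2010 MEAN STDERR;
  WEIGHT PERWT10F;
  VAR TOTEXP10;
RUN;

/* Full design: same weighted mean, standard error reflects the
   stratified, clustered sample design */
PROC SURVEYMEANS DATA=PQI.MEPS_2010 MEAN STDERR;
  STRATA VARSTR;
  CLUSTER VARPSU;
  WEIGHT PERWT10F;
  VAR TOTEXP10;
RUN;
```

Comparing the two STDERR columns side by side shows how much the reliability of the estimate would be overstated if the design variables were left out.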

SAS Survey Procedures


• Intended for use with sample designs that may include unequal person weights, clustering, and stratification.

• PROC SURVEYMEANS estimates population totals, percentages, and means. Includes estimated variance, confidence intervals, and descriptive statistics.

• PROC SURVEYFREQ produces frequency tables, population estimates, percentages, and standard error.

• PROC SURVEYREG estimates regression coefficients by generalized least squares.

• PROC SURVEYLOGISTIC fits logistic regression models for discrete response (categorical) survey data by maximum likelihood.

• PROC SURVEYMEANS and PROC SURVEYREG available starting with SAS version 8.

• PROC SURVEYFREQ and PROC SURVEYLOGISTIC available starting with version 9.

• PROC SURVEYSELECT is used for drawing samples; it is not used in this project.


PROC SURVEYMEANS Syntax

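PROC SURVEYMEANS DATA=PQI.MEPS_2010;
  STRATA VARSTR;          /* variance estimation stratum */
  CLUSTER VARPSU;         /* variance estimation PSU (cluster) */
  WEIGHT PERWT10F;        /* final person weight, 2010 */
  DOMAIN INSCOV10;        /* separate estimates by insurance coverage group */
  VAR TOTEXP10 TOTSLF10;  /* total and self/family-paid expenditures */
RUN;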

PROC SURVEYMEANS Output

PROC SURVEYFREQ Syntax

PROC SURVEYFREQ DATA=PQI.MEPS_2010;
  STRATA VARSTR;
  CLUSTER VARPSU;
  WEIGHT PERWT10F;
  TABLES PRIEU10 PRING10 INSCOV10;
RUN;

PROC SURVEYFREQ Output

PROC SURVEYREG Syntax

PROC SURVEYREG DATA=PQI.MEPS_2010;
  STRATA VARSTR;
  CLUSTER VARPSU;
  WEIGHT PERWT10F;
  /* TARGET and VARi are macro variables; this step runs inside the macro shown later */
  MODEL &TARGET=&&VAR&I / SOLUTION;
  ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST
             FITSTATISTICS=FIT;
RUN;


PROC SURVEYLOGISTIC Syntax

PROC SURVEYLOGISTIC DATA=SASUSER.MEPS_2010;
  STRATA VARSTR;
  CLUSTER VARPSU;
  WEIGHT PERWT10F;
  MODEL TOTEXP_HIGH(EVENT='1') = AGE10X MARRIED--HISPANX POVLEV10--PHYACT53
        OBESE--ADSMOK42 ADINSA42--LOCATN_ER;
  ODS OUTPUT PARAMETERESTIMATES=WORK.PARAM;
RUN;


PROC SURVEYLOGISTIC/REG Output

Default output (similar to PROC LOGISTIC and PROC REG):
• Fit statistics (AIC, Schwarz criterion, R-square)
• Chi-squared tests of the global null hypothesis
• Degrees of freedom
• Coefficient estimates
• Standard errors of coefficient estimates and p-values
• Odds ratio point estimates
• 95% Wald confidence intervals

Does not include:
• Option for stepwise selection
• Chi-squared test of residuals/tabled residuals (assumptions of normality and equal variance do not apply)
• Diagnostics for influential observations/outliers (person weights)

PROC SURVEYCORR


Correlations

• Three approaches:
  • Unweighted PROC CORR
  • PROC CORR with person weights
  • “PROC SURVEYCORR” macro with PROC SURVEYREG:
    • Uses all survey design variables (strata/cluster/weight)
    • Iteratively runs simple regression models for each predictor variable
    • Builds table with r-squared, r, and p-values
    • Sorted by r
• Similar results for all three approaches
• PROC CORR output unwieldy with large # of predictor variables
• PROC CORR cannot use strata and cluster variables


PROC CORR

PROC CORR DATA=PQI.MEPS_2010 PLOTS=MATRIX RANK;
  VAR AGE10X WAGEP10X TTLP10X FAMINC10 POVLEV10 TOTSLF10 ERTEXP10 ERTOT10
      RXEXP10 OPTEXP10 OPTOTV10 OBVEXP10 OBTOTV10 IPTEXP10 IPNGTD10;
  WITH TOTEXP10;
  WEIGHT PERWT10F;  /* person weights only; no STRATA or CLUSTER statement in PROC CORR */
RUN;


Step 1: PROC SURVEYCORR

PROC SQL;
  SELECT NVAR INTO :NVAR
  FROM DICTIONARY.TABLES
  WHERE LIBNAME='PQI' AND MEMNAME='MEPS_2010';
QUIT;

• SQL dictionary tables used to select # of predictor variables in the dataset and store in macro variable.
• Note: Data set names stored in dictionary tables using all caps.
• # of predictor variables (nvar) = # of iterations SAS will use in DO LOOP later on in the program.



PROC CONTENTS DATA=PQI.MEPS_2010 OUT=CONTENTS NOPRINT;
RUN;

PROC SQL NOPRINT;
  SELECT NAME INTO :VAR1-:VAR76
  FROM WORK.CONTENTS;
QUIT;

• PROC CONTENTS used to obtain a list of predictor variable names
• List of variable names stored as macro variables using PROC SQL SELECT INTO statement
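The hard-coded VAR1-VAR76 range ties the program to one particular data set. A possible variation, offered as a sketch rather than as the author's method: read the names from DICTIONARY.COLUMNS, keep numeric columns only, skip the design variables, and let PROC SQL decide how many macro variables to create. The open-ended `:VAR1-` range assumes SAS 9.3 or later, and setting NVAR from SQLOBS here would replace the count taken from DICTIONARY.TABLES.

```sas
/* Sketch: variable names and count in one step, no hard-coded upper bound */
PROC SQL NOPRINT;
  SELECT NAME INTO :VAR1-
  FROM DICTIONARY.COLUMNS
  WHERE LIBNAME='PQI' AND MEMNAME='MEPS_2010'
    AND TYPE='num'                                   /* numeric predictors only */
    AND UPCASE(NAME) NOT IN ('VARSTR','VARPSU','PERWT10F');
  %LET NVAR=&SQLOBS;                                 /* number of names stored */
QUIT;
```

With this version the loop bound and the list of names always agree, and character identifiers such as DUPERSID never reach the regression step.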



PROC SQL;
  CREATE TABLE SURVEYCORR
    (PARAMETER CHAR(15),
     R_SQUARE  CHAR(8),
     R         NUM(8),
     PROBT     NUM(8));
QUIT;

• Create empty table to store data
• Output from PROC SURVEYREG will be inserted one row at a time



%MACRO CORR(TARGET=);
  PROC SURVEYREG DATA=PQI.MEPS_2010;
    STRATA VARSTR;
    CLUSTER VARPSU;
    WEIGHT PERWT10F;
    MODEL &TARGET=&&VAR&I / SOLUTION;
    ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST FITSTATISTICS=FIT;
  RUN;

• First part of macro
• PROC SURVEYREG uses survey design variables in strata, cluster, and weight statements
• Optional ODS OUTPUT statement stores parameter estimates, fit statistics, and other information created when the model runs



  PROC SQL;
    INSERT INTO SURVEYCORR
    SELECT
       PARAMETER
      ,CVALUE1 AS R_SQUARE
      ,SIGN(ESTIMATE) * SQRT(INPUT(CVALUE1, 8.)) AS R
      ,PROBT AS PVALUE
    FROM FIT
        ,PARAMETER_EST
    /* UPCASE makes the match independent of the case used in the ODS tables */
    WHERE UPCASE(LABEL1) = "R-SQUARE"
      AND UPCASE(PARAMETER) = UPCASE("&&VAR&I");
  QUIT;
%MEND CORR;

• R-square value extracted from FitStatistics output with PROC SQL
• P-value and sign of estimated regression coefficient from ParameterEstimates
• Square root function to get correlation coefficient
• Sign of regression coefficient = direction of correlation (-/+) with target
• Target variable input as a parameter when the macro is called
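The conversion from R-square to r relies on a standard identity for simple (one-predictor) linear regression, stated here as background rather than taken from the slides. The symbol for the slope estimate is introduced for illustration.

```latex
R^2 = r^2
\quad\Longrightarrow\quad
r = \mathrm{sign}(\hat{\beta}_1)\,\sqrt{R^2}
% Check against the output table in this deck: R^2 = 0.687 for IPTEXP10
% gives r = \sqrt{0.687} \approx 0.829, taking the slope to be positive.
```

The identity holds only when the model has a single predictor, which is why the macro fits one simple regression per variable.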



%MACRO LOOP;
  %DO I=1 %TO &NVAR;
    %CORR(TARGET=PUBAT10X);
  %END;
%MEND LOOP;

%LOOP;  /* invoke the loop defined above */

• Call the macro
• Input desired target variable as parameter
• Iterate for each predictor variable (NVAR times)
• Each time macro is run, new row inserted in table SURVEYCORR



PROC SQL;
  CREATE TABLE PQI.SURVEYCORR AS
  SELECT
     PARAMETER
    ,R_SQUARE
    ,R FORMAT BEST6.4
    ,PROBT AS PVALUE FORMAT PVALUE6.4
    ,CASE WHEN PROBT <= 0.05 THEN "YES" ELSE "NO" END AS SIGNIFICANT_95
  FROM SURVEYCORR
  WHERE PARAMETER NOT IN ('DUPERSID','VARSTR','VARPSU','PERWT10F')
  ORDER BY ABS(R) DESC;
QUIT;

• Use PROC SQL to:
  • Format results
  • Sort by correlation size
  • Exclude survey design variables from tabulated output


PROC SURVEYCORR Output

parameter     r-square   r       p-value   significance (95% C.L.)
TOTEXP10      1.000      1.000   <0.0001   yes
IPTEXP10      0.687      0.829   <0.0001   yes
TOTEXP_HIGH   0.287      0.536   <0.0001   yes
IPNGTD10      0.270      0.520   <0.0001   yes
OBVEXP10      0.228      0.477   <0.0001   yes
RXEXP10       0.206      0.454   <0.0001   yes
OBTOTV10      0.158      0.398   <0.0001   yes
OPTEXP10      0.121      0.348   <0.0001   yes
TOTSLF10      0.116      0.340   <0.0001   yes
ADAPPT42      0.089      0.298   <0.0001   yes

Conclusions


Recommendations/Conclusions

• Only 4 SAS Survey Procedures
  • No PROC SURVEYCORR
• PROC CORR:
  • Accepts person weights, but
  • No strata/cluster variables
  • Significance levels (p-values) may be less accurate with complex survey designs
• Iterative approach with PROC SURVEYREG
  • Can get r and p for large # of predictor variables
  • Output tabled and ranked
• For categorical variables:
  • Either reformat to numeric first
  • Or use CLASS statement in PROC SURVEYREG
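A minimal sketch of the CLASS option, assuming the same data set and design variables as above. INSCOV10 and TOTEXP10 are used only as illustrative choices of a categorical predictor and a target.

```sas
/* Sketch: categorical predictor handled with CLASS instead of recoding */
PROC SURVEYREG DATA=PQI.MEPS_2010;
  STRATA VARSTR;
  CLUSTER VARPSU;
  WEIGHT PERWT10F;
  CLASS INSCOV10;                         /* treat coverage category as classification variable */
  MODEL TOTEXP10 = INSCOV10 / SOLUTION;
  ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST FITSTATISTICS=FIT;
RUN;
```

With a CLASS variable the parameter estimates table holds one row per level rather than a single slope, so the sign-times-square-root step in the macro would need adjusting; R-square from the fit statistics remains usable as a measure of association.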

References


• Carrington, W. J., Eltinge, J. L., & McCue, K. (2000). An Economist’s Primer on Survey Samples. Working Paper no. 00-15. Suitland, MD: Center for Economic Studies, U.S. Bureau of the Census, October 2000. Retrieved from ftp://tigerline.census.gov/ces/wp/2000/CES-WP-00-15.pdf January 15, 2013.

• Cohen, J.W., & Rhoades, J.A. (2009). Group and Non-Group Private Health Insurance Coverage, 1996 to 2007: Estimates for the U.S. Civilian Noninstitutionalized Population under Age 65. Medical Expenditure Panel Survey (MEPS) Statistical Brief #267. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://meps.ahrq.gov/data_files/publications/st267/stat267.pdf

• DiJulio, B., & Claxton, G. (2010). Comparison of Expenditures in Nongroup and Employer-Sponsored Insurance: 2004-2007. Kaiser Family Foundation, Menlo Park, CA. Retrieved from http://www.kff.org/insurance/snapshot/chcm111006oth.cfm

• Kaiser Family Foundation (2008). How Non-Group Health Coverage Varies with Income. Menlo Park, CA. Retrieved from http://www.kff.org/insurance/upload/7737.pdf

• Machlin, S., & Yu, W. (2005). MEPS Sample Persons In-Scope for Part of the Year: Identification and Analytic Considerations. April 2005. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://www.meps.ahrq.gov/survey_comp/hc_survey/hc_sample.shtml


• Machlin, S., Yu, W., & Zodet, M. (2005). Computing Standard Errors for MEPS Estimates. January 2005. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://www.meps.ahrq.gov/survey_comp/standard_errors.jsp

• Medical Expenditure Panel Survey (MEPS). (2012). MEPS HC-138: 2010 Full Year Consolidated Data File. Rockville, MD: Agency for Healthcare Research and Quality (AHRQ), September 2012. Retrieved from http://meps.ahrq.gov/data_stats/download_data/pufs/h138/h138doc.pdf September 27, 2012.

• Medical Expenditure Panel Survey (MEPS). (2012). MEPS HC-138: 2010 Full Year Consolidated Data Codebook. Rockville, MD: Agency for Healthcare Research and Quality (AHRQ), August 30, 2012. Retrieved from http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H138 September 27, 2012.

• Medical Expenditure Panel Survey (MEPS). MEPS-HC Panel Design and Collection Process. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://www.meps.ahrq.gov/survey_comp/hc_data_collection.jsp

• Medical Expenditure Panel Survey (MEPS). Data Use Agreement. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://meps.ahrq.gov/mepsweb/data_stats/data_use.jsp


• O’Neill, J., & O’Neill, D. (2009). Who are the uninsured? An Analysis of America’s Uninsured Population, Their Characteristics, and Their Health. Employment Policies Institute, Washington, D.C.

• SAS Institute Inc. (2008). SAS/STAT 9.2 User’s Guide. Chapter 14: Introduction to Survey Sampling and Analysis Procedures. Pp. 259-270. Cary, NC: SAS Institute Inc. Retrieved from http://support.sas.com/documentation/cdl/en/statugsurveysamp/61762/PDF/default/statugsurveysamp.pdf on January 15, 2013.

• Trish, E., Damico, A., Claxton, G., Levitt, L., & Garfield, R. (2011). A Profile of Health Insurance Exchange Enrollees. Kaiser Family Foundation, Menlo Park, CA. Retrieved from http://www.kff.org/healthreform/upload/8147.pdf
