PROC SURVEYCORR
Jessica Hampton
CCSU, New Britain, CT
September 2013
Introduction
Medical Expenditure Panel Survey (MEPS)
• Administered annually by the U.S. Department of Health and Human Services since 1996
  • Agency for Healthcare Research and Quality (AHRQ)
• Anonymity protected by removing individual identifiers from the public data files
• MEPS 2010 consolidated data file released September 2012
• Multiple components (household, insurance/employer, and medical provider)
• Household component (1,911 variables) covers the following topics:
  • Demographics
  • Household income
  • Employment
  • Diagnosed health conditions
  • Additional health status issues
  • Medical expenditures and utilization
  • Satisfaction with and access to care
  • Insurance coverage
• 18,692 records after excluding out of scope, negative person weights, under 18, and 65+
  • U.S. civilian, noninstitutionalized population
  • ~3% out of scope (birth/adoption, death, incarceration, living abroad)
MEPS Survey Design Methods
• MEPS is a representative but NOT a random sample of the population
• Person weights must be used to produce reliable population estimates
• Stratification:
  • By demographic variables such as age, race, sex, income, etc.
  • Goal is to maximize homogeneity within and heterogeneity between strata
  • Sometimes used to oversample certain groups under-represented in the general population or with interesting characteristics relevant to study
    • For example: blacks, Hispanics, and low-income households
• Clustering:
  • By geography in order to reduce survey costs; it is not feasible or cost-effective to do a random sample of the entire population of the U.S.
  • Within-cluster correlation means variance/error is underestimated if clustering is ignored: two families in the same neighborhood are more likely to be similar demographically (for example, similar income)
  • Desire clusters spatially close for cost effectiveness but as heterogeneous within as possible for reasonable variance
  • Multi-stage clustering used in MEPS:
    • sample of counties >> sample of blocks >> individuals/households surveyed from block sample
Survey Design Considerations
• If person weights are ignored and one tries to generalize sample findings to the entire population, total numbers, percentages, or means are inflated for the groups that are oversampled and underestimated for others
• In regression analysis, ignoring person weights leads to biased coefficient estimates
• If sampling strata and cluster variables are ignored, means and coefficient estimates are unaffected, but standard error (or population variance) may be underestimated; that is, the reliability of an estimate may be overestimated
• Or when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not
• (Machlin, S., Yu, W., & Zodet, M., 2005)
SAS Survey Procedures
• Intended for use with sample designs that may include unequal person weights, clustering, and stratification.
• PROC SURVEYMEANS estimates population totals, percentages, and means. Includes estimated variance, confidence intervals, and descriptive statistics.
• PROC SURVEYFREQ produces frequency tables, population estimates, percentages, and standard error.
• PROC SURVEYREG estimates regression coefficients by generalized least squares.
• PROC SURVEYLOGISTIC fits logistic regression models for discrete response (categorical) survey data by maximum likelihood.
• PROC SURVEYMEANS and PROC SURVEYREG available starting with SAS version 8.
• PROC SURVEYFREQ and PROC SURVEYLOGISTIC available starting with version 9.
• PROC SURVEYSELECT performs sampling and is not used in this project.
PROC SURVEYMEANS Syntax
PROC SURVEYMEANS DATA=PQI.MEPS_2010;
STRATA VARSTR;
CLUSTER VARPSU;
WEIGHT PERWT10F;
DOMAIN INSCOV10;
VAR TOTEXP10 TOTSLF10;
RUN;
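As a rough illustration of the survey design considerations discussed earlier, the sketch below estimates the same mean two ways: once with PROC MEANS, ignoring the design, and once with PROC SURVEYMEANS. It assumes the same PQI.MEPS_2010 data set and design variables used throughout this presentation; TOTEXP10 is chosen only for illustration. One would expect the point estimates to differ because of the person weights, and the design-based standard error to be the more trustworthy of the two.

```sas
/* Naive estimate: no weights, no strata, no clusters */
PROC MEANS DATA=PQI.MEPS_2010 MEAN STDERR;
    VAR TOTEXP10;
RUN;

/* Design-based estimate: person weights plus strata and cluster variables */
PROC SURVEYMEANS DATA=PQI.MEPS_2010 MEAN STDERR;
    STRATA VARSTR;
    CLUSTER VARPSU;
    WEIGHT PERWT10F;
    VAR TOTEXP10;
RUN;
```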
PROC SURVEYMEANS Output
PROC SURVEYFREQ Syntax
PROC SURVEYFREQ DATA=PQI.MEPS_2010;
STRATA VARSTR;
CLUSTER VARPSU;
WEIGHT PERWT10F;
TABLES PRIEU10 PRING10 INSCOV10;
RUN;
PROC SURVEYFREQ Output
PROC SURVEYREG Syntax
PROC SURVEYREG DATA=PQI.MEPS_2010;
STRATA VARSTR;
CLUSTER VARPSU;
WEIGHT PERWT10F;
MODEL &TARGET=&&VAR&I /SOLUTION;
ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST
FITSTATISTICS=FIT;
RUN;
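The syntax above uses macro references (&TARGET, &&VAR&I) that are only resolved inside the macro described later. A minimal sketch of a single, concrete run may make the ODS OUTPUT data sets easier to inspect. It assumes TOTEXP10 as the target and RXEXP10 as the predictor, both picked only for illustration.

```sas
/* One simple regression with the survey design variables */
PROC SURVEYREG DATA=PQI.MEPS_2010;
    STRATA VARSTR;
    CLUSTER VARPSU;
    WEIGHT PERWT10F;
    MODEL TOTEXP10 = RXEXP10 / SOLUTION;
    ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST
               FITSTATISTICS=FIT;
RUN;

/* Look at the saved tables to see which columns hold R-square, estimate, and p-value */
PROC PRINT DATA=FIT;
RUN;

PROC PRINT DATA=PARAMETER_EST;
RUN;
```

Printing the two data sets is a quick way to confirm column names such as LABEL1, CVALUE1, ESTIMATE, and PROBT before relying on them in later steps.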
PROC SURVEYLOGISTIC Syntax
PROC SURVEYLOGISTIC DATA=SASUSER.MEPS_2010;
STRATA VARSTR;
CLUSTER VARPSU;
WEIGHT PERWT10F;
MODEL TOTEXP_HIGH(EVENT='1')=AGE10X MARRIED--HISPANX POVLEV10--PHYACT53 OBESE--ADSMOK42
ADINSA42--LOCATN_ER;
ODS OUTPUT PARAMETERESTIMATES=WORK.PARAM;
RUN;
PROC SURVEYLOGISTIC/REG Output
Default output (similar to PROC LOGISTIC and PROC REG):
• fit statistics (AIC, Schwarz criterion, R-square)
• chi-squared tests of the global null hypothesis
• degrees of freedom
• coefficient estimates
• standard error of coefficient estimates and p-values
• odds ratio point estimates
• 95% Wald confidence intervals
Does not include:
• Option for stepwise selection
• chi-squared test of residuals/tabled residuals (assumptions of normality and equal variance do not apply)
• influential obs/outliers (person weights)
PROC SURVEYCORR
Correlations
• Three approaches
  • Unweighted PROC CORR
  • PROC CORR with person weights
  • “PROC SURVEYCORR” macro with PROC SURVEYREG:
    • Uses all survey design variables (strata/cluster/weight)
    • Iteratively runs simple regression models for each predictor variable
    • Builds table with r-squared, r, and p-values
    • Sorted by r
• Similar results for all three approaches
• PROC CORR output unwieldy with large # of predictor variables
• PROC CORR cannot use strata and cluster variables
PROC CORR
PROC CORR DATA=PQI.MEPS_2010 PLOTS=MATRIX RANK;
VAR AGE10X WAGEP10X TTLP10X FAMINC10 POVLEV10 TOTSLF10 ERTEXP10 ERTOT10 RXEXP10 OPTEXP10 OPTOTV10 OBVEXP10 OBTOTV10 IPTEXP10 IPNGTD10;
WITH TOTEXP10;
WEIGHT PERWT10F;
RUN;
Step 1: PROC SURVEYCORR
PROC SQL;
SELECT NVAR INTO :NVAR
FROM DICTIONARY.TABLES
WHERE LIBNAME='PQI' AND MEMNAME='MEPS_2010';
QUIT;
• SQL dictionary tables used to select # of predictor variables in the dataset and store in macro variable.
• Note: Data set names stored in dictionary tables using all caps.
• # of predictor variables (nvar) = # of iterations SAS will use in DO LOOP later on in the program.
Step 2: PROC SURVEYCORR
PROC CONTENTS DATA=PQI.MEPS_2010 OUT=CONTENTS NOPRINT;
RUN;
PROC SQL NOPRINT;
SELECT NAME INTO :VAR1-:VAR76
FROM WORK.CONTENTS;
QUIT;
• PROC CONTENTS used to obtain a list of predictor variable names
• List of variable names stored as macro variables using PROC SQL SELECT INTO statement
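The upper bound VAR76 is hard-coded to this particular data set. One possible variation, sketched here under the assumption of SAS 9.3 or later (where an open-ended INTO range is accepted), reads the names from DICTIONARY.COLUMNS and lets PROC SQL decide how many macro variables to create. This is an alternative to the approach shown in Steps 1 and 2, not the author's code.

```sas
PROC SQL NOPRINT;
    /* Open-ended range: creates VAR1, VAR2, ... for as many rows as are returned */
    SELECT NAME INTO :VAR1-
    FROM DICTIONARY.COLUMNS
    WHERE LIBNAME='PQI' AND MEMNAME='MEPS_2010';

    /* SQLOBS holds the number of rows returned by the SELECT above */
    %LET NVAR=&SQLOBS;
QUIT;

/* Write the count and the first variable name to the log as a sanity check */
%PUT NOTE: &NVAR variables found; the first is &VAR1;
```

With this variation the variable count and the list of names come from the same query, so they cannot drift apart if columns are added to or dropped from the data set.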
Step 3: PROC SURVEYCORR
PROC SQL;
CREATE TABLE SURVEYCORR
(PARAMETER CHAR(15), R_SQUARE CHAR(8), R NUM(8), PROBT NUM(8));
QUIT;
• Create empty table to store data
• Output from PROC SURVEYREG will be inserted one row at a time
Step 4: PROC SURVEYCORR
%MACRO CORR(TARGET=);
PROC SURVEYREG DATA=PQI.MEPS_2010;
STRATA VARSTR;
CLUSTER VARPSU;
WEIGHT PERWT10F;
MODEL &TARGET=&&VAR&I /SOLUTION;
ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST FITSTATISTICS=FIT;
RUN;
• First part of macro
• PROC SURVEYREG uses survey design variables in strata, cluster, and weight statements
• Optional ODS OUTPUT statement stores parameter estimates, fit statistics, and other information created when the model runs
Step 5: PROC SURVEYCORR
PROC SQL;
INSERT INTO SURVEYCORR
SELECT
PARAMETER
,CVALUE1 AS R_SQUARE
,SIGN(ESTIMATE)* SQRT(INPUT(CVALUE1,8.)) AS R
,PROBT AS PVALUE
FROM FIT
,PARAMETER_EST
WHERE UPCASE(LABEL1) = "R-SQUARE" /* UPCASE guards against a mixed-case label; string comparison is case-sensitive */
AND PARAMETER = "&&VAR&I";
QUIT;
%MEND CORR;
• R-square value extracted from FitStatistics output with PROC SQL
• P-value and sign of estimated regression coefficient from ParameterEstimates
• Square root function to get correlation coefficient
• Sign of regression coefficient = direction of correlation (-/+) with target
• Target variable input as a parameter when the macro is called
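The conversion from R-square to r can be checked by hand. The sketch below applies the same SIGN and SQRT expression to an R-square of 0.687, the value reported for IPTEXP10 in the output table; the slope value of 2.5 is a made-up positive number used only to supply a sign.

```sas
DATA _NULL_;
    R_SQUARE = 0.687;   /* R-square reported for IPTEXP10 in the output table */
    ESTIMATE = 2.5;     /* hypothetical positive slope, used only for its sign */

    /* Same expression as in the macro: the slope sign gives the direction of r */
    R = SIGN(ESTIMATE) * SQRT(R_SQUARE);

    PUT R= 6.3;         /* writes r to the log, rounded to three decimals */
RUN;
```

The value written to the log should be close to the r of 0.829 shown for IPTEXP10. The shortcut only holds for simple regression with a single predictor, which is why the macro fits one model per variable.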
Step 6: PROC SURVEYCORR
%MACRO LOOP;
%DO I=1 %TO &NVAR;
%CORR(TARGET=PUBAT10X);
%END;
%MEND LOOP;
%LOOP;
• Call the macro
• Input desired target variable as parameter
• Iterate for each predictor variable (NVAR times)
• Each time macro is run, new row inserted in table SURVEYCORR
Step 7: PROC SURVEYCORR
PROC SQL;
CREATE TABLE PQI.SURVEYCORR AS
SELECT
PARAMETER
,R_SQUARE
,R FORMAT BEST6.4
,PROBT AS PVALUE FORMAT PVALUE6.4
,CASE WHEN PROBT <=0.05 THEN "YES" ELSE "NO" END AS SIGNIFICANT_95
FROM SURVEYCORR
WHERE PARAMETER NOT IN ('DUPERSID','VARSTR','VARPSU','PERWT10F')
ORDER BY ABS(R) DESC;
QUIT;
• Use PROC SQL to:
  • Format results
  • Sort by correlation size
  • Exclude survey design variables from tabulated output
PROC SURVEYCORR Output
parameter     r-square   r       p-value   significance (95% C.L.)
TOTEXP10      1.000      1.000   <0.0001   yes
IPTEXP10      0.687      0.829   <0.0001   yes
TOTEXP_HIGH   0.287      0.536   <0.0001   yes
IPNGTD10      0.270      0.520   <0.0001   yes
OBVEXP10      0.228      0.477   <0.0001   yes
RXEXP10       0.206      0.454   <0.0001   yes
OBTOTV10      0.158      0.398   <0.0001   yes
OPTEXP10      0.121      0.348   <0.0001   yes
TOTSLF10      0.116      0.340   <0.0001   yes
ADAPPT42      0.089      0.298   <0.0001   yes
Conclusions
Recommendations/Conclusions
• Only 4 SAS Survey Procedures
  • No PROC SURVEYCORR
• PROC CORR accepts person weights, but
  • No strata/cluster variables
  • Significance level (p values) may be less accurate with complex survey designs
• Iterative approach with PROC SURVEYREG
  • Can get r and p for large # of predictor variables
  • Output tabled and ranked
• For categorical variables:
  • Either reformat to numeric first
  • Or use CLASS statement in PROC SURVEYREG
References
• Carrington, W. J., Eltinge, J. L., & McCue, K. (2000). An Economist’s Primer on Survey Samples. Working Paper no. 00-15. Suitland, MD: Center for Economic Studies, U.S. Bureau of the Census, October 2000. Retrieved from ftp://tigerline.census.gov/ces/wp/2000/CES-WP-00-15.pdf January 15, 2013.
• Cohen, J.W., & Rhoades, J.A. (2009). Group and Non-Group Private Health Insurance Coverage, 1996 to 2007: Estimates for the U.S. Civilian Noninstitutionalized Population under Age 65. Medical Expenditure Panel Survey (MEPS) Statistical Brief #267. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://meps.ahrq.gov/data_files/publications/st267/stat267.pdf
• DiJulio, B., & Claxton, G. (2010). Comparison of Expenditures in Nongroup and Employer-Sponsored Insurance: 2004-2007. Kaiser Family Foundation, Menlo Park, CA. Retrieved from http://www.kff.org/insurance/snapshot/chcm111006oth.cfm
• Kaiser Family Foundation (2008). How Non-Group Health Coverage Varies with Income. Menlo Park, CA. Retrieved from http://www.kff.org/insurance/upload/7737.pdf
• Machlin, S., & Yu, W. (2005). MEPS Sample Persons In-Scope for Part of the Year: Identification and Analytic Considerations. April 2005. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://www.meps.ahrq.gov/survey_comp/hc_survey/hc_sample.shtml
• Machlin, S., Yu, W., & Zodet, M. (2005). Computing Standard Errors for MEPS Estimates. January 2005. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://www.meps.ahrq.gov/survey_comp/standard_errors.jsp
• Medical Expenditure Panel Survey (MEPS). (2012). MEPS HC-138: 2010 Full Year Consolidated Data File. Rockville, MD: Agency for Healthcare Research and Quality (AHRQ), September 2012. Retrieved from http://meps.ahrq.gov/data_stats/download_data/pufs/h138/h138doc.pdf September 27, 2012.
• Medical Expenditure Panel Survey (MEPS). (2012). MEPS HC-138: 2010 Full Year Consolidated Data Codebook. Rockville, MD: Agency for Healthcare Research and Quality (AHRQ), August 30, 2012. Retrieved from http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H138 September 27, 2012.
• Medical Expenditure Panel Survey (MEPS). MEPS-HC Panel Design and Collection Process. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://www.meps.ahrq.gov/survey_comp/hc_data_collection.jsp
• Medical Expenditure Panel Survey (MEPS). Data Use Agreement. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from http://meps.ahrq.gov/mepsweb/data_stats/data_use.jsp
• O’Neill, J., & O’Neill, D. (2009). Who are the uninsured? An Analysis of America’s Uninsured Population, Their Characteristics, and Their Health. Employment Policies Institute, Washington, D.C.
• SAS Institute Inc. (2008). SAS/STAT 9.2 User’s Guide. Chapter 14: Introduction to Survey Sampling and Analysis Procedures. Pp. 259-270. Cary, NC: SAS Institute Inc. Retrieved from http://support.sas.com/documentation/cdl/en/statugsurveysamp/61762/PDF/default/statugsurveysamp.pdf January 15, 2013.
• Trish, E., Damico, A., Claxton, G., Levitt, L., & Garfield, R. (2011). A Profile of Health Insurance Exchange Enrollees. Kaiser Family Foundation, Menlo Park, CA. Retrieved from http://www.kff.org/healthreform/upload/8147.pdf