
A Predictive Model of Inquiry to Enrollment

Cullen F. Goenner, PhD
Department of Economics
University of North Dakota

Kenton Pauls
Director of Enrollment Services
University of North Dakota

Issues Facing Enrollment Managers

Finding new “markets”
  Increasing tuition
  Declining population (ND)
  Increasing competition

Need to attract a particular type of student
  Diversity/Quality

Data-driven analysis
  Accountability

Questions we will answer today

What is predictive modeling?

How does one build a predictive model?

How can predictive modeling be used by institutions of higher education to improve enrollment?

What is Predictive Modeling?

Predictive modeling uses statistical/econometric methods to quantitatively predict the future behavior of individuals.

Steps include:
  Data collection on the subject of interest
  Build the model based on data analysis
  Predictions made out of sample
  Model validation/testing

College Choice

3-stage process - Hossler and Gallagher (1987)

Predisposition/aspiration for higher education
  Encouragement, coursework, and interest

Search of potential schools
  Counselors, campus contacts, program availability

Selection
  SES, ability, fit, geography

Factors Influencing Choice

Economic perspective: Education is an investment in human capital
  Cost vs. benefit calculus

Psychological perspective: Need of self to find sense of belonging and fulfillment of needs.

Sociological perspective: Social interaction dictated by societal/family norms.

Existing Empirical Work

Search Choice
  Applications: DesJardins, Dundar, and Hendel (1999); Weiler (1994)
  Interest (SAT scores sent): Toutkoushian (2001)

Existing Models of Enrollment Choice

Model a student’s binary choice to enroll at a particular college while controlling for a student’s characteristics.

Logistic models used, conditional on students having:
  Applied: Bruggink and Gambhir (1996); Thomas, Dawes, and Reznik (2001)
  Been admitted: DesJardins (2002); Leppel (1993)

Our Predictive Model

Builds on the models of DesJardins (2002) and Thomas, Dawes, Reznik (2001)

Focus here is on prediction of enrollment of students that inquired of our institution.

“Inquiry model” is relevant because:
  Time of information exchange, opinion formation
  Allows for early intervention in a student’s decision-making process (target marketing)

Inquiry Model Challenges

Data collection
  Data are already collected on those who apply or are admitted.
  Typically not collected for inquiries.

Quality of data
  Applicants provide detailed data describing themselves (demographic data, test scores, HSGPA, etc.), which are not available for most student inquiries.

Types of Inquiries We Recorded

Return of information card
Attendance of college fair
Campus visit
Contact via e-mail
Contact via phone
Referral from faculty, coach, or alumni
ACT automatically submitted

How these data were captured

Enrollment Services Prospective Student Network relational database (ESPSN)

Customized system SQL 2000/Visual Basic

Information Collected From Information Request Card

Name
High school attended
Major of interest (if any)
Address

Lacks the demographic data that are typical of application records and used in most predictive models.

Geodemography

Process of attaching demographic characteristics to geographic characteristics.

Notion is that “Birds of a Feather Flock Together”, i.e. individuals living in the same neighborhood will tend to have similar behavior patterns.

Ex: Neighborhoods homogenous in terms of household income, occupations, family size, and purchases.

Implementation

US Census data aggregated to zip code level

“Geodemographic” variables considered for our model specification:
  College age demographic
  Population
  Average income
  White demographic
  Median age
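The attachment step can be sketched as a lookup keyed on zip code. This is only an illustration: the zip codes, values, and the `attach_geodemographics` helper are hypothetical, and the field names simply reuse the predictor names from the variable tables in this deck.

```python
# Hypothetical zip-level census aggregates (illustrative values only).
ZIP_DEMOGRAPHICS = {
    "00001": {"colldemo": 0.41, "totalpop": 52000, "medage": 29.5,
              "whitedem": 0.90, "avginc": 48000.0},
    "00002": {"colldemo": 0.33, "totalpop": 18000, "medage": 36.1,
              "whitedem": 0.85, "avginc": 61000.0},
}

GEO_FIELDS = ["colldemo", "totalpop", "medage", "whitedem", "avginc"]


def attach_geodemographics(inquiry, zip_table):
    """Return a copy of the inquiry record with zip-level variables added.

    Inquiries whose zip code is missing from the table get None for each
    geodemographic variable, so they can be handled explicitly later.
    """
    demo = zip_table.get(inquiry.get("zip"))
    enriched = dict(inquiry)
    for field in GEO_FIELDS:
        enriched[field] = demo[field] if demo else None
    return enriched


# Hypothetical inquiry records built from information-card fields.
inquiries = [
    {"name": "Student A", "zip": "00001", "major": "aviation"},
    {"name": "Student B", "zip": "99999", "major": None},
]
enriched = [attach_geodemographics(rec, ZIP_DEMOGRAPHICS) for rec in inquiries]
```

Returning a copy keeps the raw inquiry records unchanged, which makes it easier to re-run the join when the census table is updated.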

Building the model

Binary choice model: Model whether students, who inquire of UND, either enroll or do not enroll.

15,827 students made inquiries for Fall 2003 enrollment. Of these students 2067 actually enrolled.

Logistic regression model used.
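As a reminder of what the logistic model computes, here is a minimal sketch. The intercept, coefficients, and the example inquiry are hypothetical and are not estimates from our model; only the inquiry and enrollment counts come from the figures quoted above.

```python
import math


def enroll_probability(intercept, coefs, x):
    """Logistic model: P(enroll = 1 | x) = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    index = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-index))


# Share of Fall 2003 inquiries that enrolled (2067 of 15,827).
base_rate = 2067 / 15827

# Hypothetical coefficients and one hypothetical inquiry, with and without a visit.
coefs = {"visit": 1.0, "contacts": 0.2}
p_no_visit = enroll_probability(-2.0, coefs, {"visit": 0, "contacts": 1})
p_visit = enroll_probability(-2.0, coefs, {"visit": 1, "contacts": 1})
```

With a positive coefficient on `visit`, the predicted probability rises when the visit count goes from 0 to 1, and the output always stays strictly between 0 and 1.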

Candidate Control Variables

Type and frequency of contact
Geographic
Academic
Geodemographic
Interaction effects

Contact Variables

Predictor Description

contacts Number of inquiries

autoact 1 if automatically submitted ACT score; 0 otherwise

visit Number of campus visits

referral 1 if referred by faculty, coach, alumni; 0 otherwise

www 1 if inquiry made by internet; 0 otherwise

phone 1 if inquiry made by phone; 0 otherwise

Geographic Variables

Predictor Description

distance Distance in miles from our institution

hystate Resident of MN or ND

hyschool Historically high yield school

compete Distance in miles to closest regional competitor

dist1 Distance between 100-300 miles



dist4 Distance greater than 1000 miles

Academic/Geodemographic

Predictor Description

acadint 1 if academic interest expressed; 0 otherwise

aviation 1 if academic interest is aviation; 0 otherwise

colldemo % of population who completed some college

totalpop Total population of zip code

medage Median age of zip code

whitedem % of population white (Non-Hispanic)

avginc Average income in dollars of zip code

Interaction Terms

vismile # of visits x distance

avitmile Aviation x Distance

aviatinc Aviation x Average income

incmile1 Average income x Distance 1

incmile2 Average income x Distance 2




Model Specification

Researchers typically assume their model specification is the true model which generates the data.

Difficult to justify a priori the choice of variables to include in model, given each by design is theoretically relevant.

With k candidate variables there are 2^k different linear models one could consider.

Consider the case in which several models {M1, … MK} are theoretically possible.

Basing inference on the results of a single model is risky.

Bayesian model averaging (BMA) allows us to account for this type of uncertainty.
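The size of the model space follows from each candidate variable being either in or out of a model. A small sketch of the count, with a brute-force enumeration for a toy case; the three variable names in the example are just picked from the predictor tables for illustration.

```python
from itertools import combinations


def count_models(k):
    """Number of distinct subsets of k candidate variables."""
    return 2 ** k


def all_subsets(variables):
    """Enumerate every subset of the candidate variables, including the empty one."""
    subsets = []
    for size in range(len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets


n_models_30 = count_models(30)  # 1,073,741,824 candidate models
toy_models = all_subsets(["visit", "distance", "avginc"])
```

Enumeration is feasible for a handful of variables but not for 30, which is why a pruning rule is needed later.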

BMA

The posterior distribution of the parameters given the data, in the presence of model uncertainty, is a weighted average of the posterior distributions under each of the K models, with weights equal to the posterior model probabilities P(M_k | D):

$$P(\beta \mid D) = \sum_{k=1}^{K} P(\beta \mid M_k, D)\, P(M_k \mid D) \qquad (1)$$

Posterior Model Probability is

$$P(M_k \mid D) = \frac{P(D \mid M_k)\, P(M_k)}{\sum_{l=1}^{K} P(D \mid M_l)\, P(M_l)} \qquad (2)$$

where P(D | M_k) is the likelihood and P(M_k) is the prior probability that model M_k is the true model, given that one of the K models is the true model.

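Posterior Model Probability

Assuming a non-informative prior, P(M_1) = … = P(M_K) = 1/K,

$$P(M_k \mid D) \approx \frac{\exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_k\right)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_l\right)} \qquad (3)$$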
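The BIC-based weights with equal prior model probabilities can be sketched in a few lines. The BIC values below are hypothetical, and `posterior_model_probs` is an illustrative helper, not part of the S-PLUS software used for the actual analysis.

```python
import math


def posterior_model_probs(bics):
    """Approximate P(M_k | D) from BIC values under equal prior model probabilities.

    Subtracting the smallest BIC before exponentiating leaves the ratios
    unchanged and avoids numerical underflow for large BIC values.
    """
    best = min(bics)
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical BIC values for three candidate models (smaller is better).
probs = posterior_model_probs([1000.0, 1002.0, 1010.0])
```

A BIC difference of 2 between two models corresponds to a posterior probability ratio of exp(-1), roughly 0.37, in favor of the model with the smaller BIC.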

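The posterior mean and variance summarize the effects of the parameters on the dependent variable. Raftery (1995) reports

$$E(\beta_1 \mid D, \beta_1 \neq 0) = \sum_{k \in A_1} \hat{\beta}_1(k)\, P(M_k \mid D)$$

$$\mathrm{Var}(\beta_1 \mid D, \beta_1 \neq 0) = \sum_{k \in A_1} \left[\mathrm{Var}(k) + \hat{\beta}_1(k)^2\right] P(M_k \mid D) - E(\beta_1 \mid D, \beta_1 \neq 0)^2 \qquad (9)$$

where $\hat{\beta}_1(k)$ and $\mathrm{Var}(k)$ are the maximum likelihood estimate and its variance under model k, and the summation is over the set $A_1$ of models that include $\beta_1$.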
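A minimal sketch of how the posterior mean and variance of one coefficient could be assembled from per-model estimates. One assumption to flag: the model probabilities are rescaled to sum to one over the models that include the variable, which is how this sketch handles the conditioning on the coefficient being nonzero. The probabilities and estimates are hypothetical.

```python
def bma_summary(models):
    """Inclusion probability, posterior mean, and posterior variance for one coefficient.

    `models` is a list of (posterior_prob, beta_hat, var_hat) tuples, with
    beta_hat and var_hat set to None for models that exclude the variable.
    """
    included = [(p, b, v) for p, b, v in models if b is not None]
    incl_prob = sum(p for p, _, _ in included)  # Pr(beta != 0 | D)
    if incl_prob == 0:
        return 0.0, None, None
    # Assumption: weights are rescaled to sum to one over the including models.
    mean = sum(p / incl_prob * b for p, b, _ in included)
    second_moment = sum(p / incl_prob * (v + b ** 2) for p, b, v in included)
    return incl_prob, mean, second_moment - mean ** 2


# Hypothetical posterior model probabilities and estimates for three models.
models = [(0.5, 1.2, 0.04), (0.3, 1.0, 0.09), (0.2, None, None)]
incl_prob, post_mean, post_var = bma_summary(models)
```

The variance combines the within-model variances with the spread of the estimates across models, so it is larger than a single-model variance whenever the models disagree.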

BMA Implementation

SPlus function bic.logit – performs BMA on logistic regression models.

With 30 regressors, the summation in equation (1) runs over more than 1 billion (2^30) models.

To manage summation we use Occam’s window.

Occam’s Window

Exclude models that predict the data sufficiently less well than the best model. Predictions are based on the PMP of each model. Models in A′ are included:

$$A' = \left\{ M_k : \frac{\max_l \mathrm{PMP}_l}{\mathrm{PMP}_k} \le C \right\}$$
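The pruning rule can be sketched as a filter on posterior model probabilities. The cutoff C = 20 and the PMP values are hypothetical (the slides do not state the value of C that was used), and renormalizing the retained probabilities is an assumption of this sketch.

```python
def occams_window(pmp, c=20.0):
    """Keep models whose PMP is within a factor c of the best model's PMP.

    Returns the retained models with probabilities renormalized to sum to one.
    """
    best = max(pmp.values())
    # best / p <= c, written without division so a zero PMP cannot raise an error.
    kept = {name: p for name, p in pmp.items() if p * c >= best}
    total = sum(kept.values())
    return {name: p / total for name, p in kept.items()}


# Hypothetical posterior model probabilities for four models.
pmp = {"M1": 0.60, "M2": 0.30, "M3": 0.09, "M4": 0.01}
window = occams_window(pmp)
```

Here M4 is dropped because the best model is 60 times more probable, which exceeds the cutoff of 20.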

Results

26 models supported by the data

Model with highest PMP receives 21% of total.

Variables that receive strong support for inclusion include:
  Geographic: Distance, HY State, HY School, Competitor distance
  Geodemog: College Age, Average Income
  Contacts: Number, Campus visit, Referral

Table 3: Results of BMA Applied to Prediction of Enrollment

Predictor        Mean β|D     Std Error β|D   Pr(β≠0|D) (%)

Contact
contacts         0.1969       0.0299          100
autoact          0.0191       0.0690          8.3
visit            1.3386       0.0827          100
referral         1.7240       0.0745          100
www              0.0147       0.0665          5.6
phone            0.1650       0.1901          47.5

Geographic
distance         -0.0040      0.0004          100
hystate          0.7726       0.1213          100
hyschool         0.9491       0.0819          100
compete          0.0033       0.0004          100
dist1            0            0               0
dist2            0.0155       0.0723          5.2
dist3            0            0               0
dist4            0            0               0

Geodemographic
colldemo         2.8015       0.5395          100
totalpop         2.80E-07     1.34E-06        4.9
medage           0            0               0
whitedem         0.0578       0.2178          7.9
avginc           8.59E-06     1.49E-06        100

Academic
acadint          0.2725       0.0803          97.8
aviation         0.1871       0.2478          38.1

Interaction
vismile          0.0016       0.0002          100
avitmile         0.0005       0.0004          63.1
aviatinc         0            0               0
incmile1         0            0               0
incmile2         4.10E-07     1.59E-06        7.4
incmile3         0            0               0
incmile4         0            0               0
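One way to read the posterior means in Table 3 is as odds ratios, since exp(β) in a logistic model is the multiplicative change in the odds of enrolling for a one-unit increase in a predictor. The sketch below does this for a few predictors. It is a rough reading aid only: it ignores the interaction terms (for example vismile), so the figure for visit applies only at zero distance.

```python
import math

# Posterior mean coefficients copied from Table 3 (selected predictors).
posterior_means = {
    "contacts": 0.1969,
    "visit": 1.3386,
    "referral": 1.7240,
    "hystate": 0.7726,
    "hyschool": 0.9491,
}

# exp(beta): multiplicative change in the odds of enrolling per one-unit
# increase in the predictor, holding the other terms fixed.
odds_ratios = {name: math.exp(beta) for name, beta in posterior_means.items()}

# Predictors ordered from largest to smallest odds ratio.
ranked = sorted(odds_ratios, key=odds_ratios.get, reverse=True)
```

Because the model's intercept is not reported in the table, these coefficients cannot be turned into predicted probabilities here, only into relative changes in the odds.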

Out of Sample Predictive Performance

Split the data into two equal parts:
  First part used to build/estimate the model
  Second part used to test the model’s predictions.

Outcome (enrollment) is binary, while our model generates a probability estimate.

What is a successful prediction?

Greene (2001): there is no “correct” choice for the probability cutoff. A typical value is 0.5.

Tradeoff in cutoff choice: a lower cutoff raises the share of actual enrollees who are predicted to enroll (sensitivity), at the expense of more inquiries who are predicted to enroll but do not (a higher false positive rate).
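The cutoff tradeoff can be seen on a toy example. The scores and outcomes below are hypothetical and chosen only to show the direction of the effect; `rates_at_cutoff` is an illustrative helper.

```python
def rates_at_cutoff(scores, enrolled, cutoff):
    """Return (sensitivity, false positive rate) when predicting 'enroll' if score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, enrolled) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, enrolled) if s >= cutoff and y == 0)
    positives = sum(enrolled)
    negatives = len(enrolled) - positives
    return tp / positives, fp / negatives


# Hypothetical predicted probabilities and actual outcomes (1 = enrolled).
scores = [0.9, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05, 0.02]
enrolled = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

high = rates_at_cutoff(scores, enrolled, 0.5)   # stricter cutoff
low = rates_at_cutoff(scores, enrolled, 0.25)   # more lenient cutoff
```

Lowering the cutoff from 0.5 to 0.25 raises sensitivity in this toy data from 0.50 to 0.75, while the false positive rate rises from about 0.17 to 0.50.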

Predictive Performance: Classification

                          Actual Outcome
Prediction                Enrolled       Did not Enroll    TOTAL
Predicted to Enroll       370 (36%)      194 (2.8%)        564
Predicted not to Enroll   657 (64%)      6693 (97%)        7350
TOTAL                     1027           6887              7914

Predictive performance

89% of observations correctly classified
  Specificity: 97%
  Sensitivity: 36%

ROC curve describes the relation between sensitivity and 1 - specificity (false positive rate)
  Area under ROC curve = 0.87
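The summary rates can be recomputed directly from the four cells of the classification table. The counts below are taken from that table; the variable names are just conventional labels.

```python
# Counts from the classification table (test half of the data).
tp, fp = 370, 194    # predicted to enroll: enrolled / did not enroll
fn, tn = 657, 6693   # predicted not to enroll: enrolled / did not enroll

total = tp + fp + fn + tn
accuracy = (tp + tn) / total       # share of observations correctly classified
sensitivity = tp / (tp + fn)       # share of actual enrollees predicted to enroll
specificity = tn / (tn + fp)       # share of non-enrollees predicted not to enroll
false_positive_rate = 1 - specificity
```

Rounded to whole percentages, these reproduce the 89%, 36%, and 97% figures reported on the slide.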

Another Predictive Performance Method

MODEL SCORE RANGES

Score    Total       Total      Total Not   Percent of   Accumulating    Accumulating Count    Accumulated %
Range    Estimates   Enrolled   Enrolled    Enrolled     % of Enrolled   of Enrolled Records   of Total
                                                                         within Ranges         Estimates
Total    15,412      1,893      13,519
1.00     24          20         4           1%                           24
0.90     221         158        63          8%           9%              245
0.80     278         202        76          11%          20%             523
0.70     217         136        81          7%           27%             740
0.60     217         140        77          7%           35%             957
0.50     319         153        166         8%           43%             1,276
0.40     434         209        225         11%          54%             1,710
0.30     592         225        367         12%          66%             2,302
0.20     1,048       247        801         13%          79%             3,350                 22%
0.10     3,231       292        2,939       15%
0.00     8,831       111        8,720       6%

79% of enrolled found within 22% of entire population (scores >= 0.2)

Focused efforts without compromising enrollment numbers

Efficiency implications
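The 79% and 22% figures can be checked by accumulating the per-range counts from the score-range table, starting with the highest scores. The counts are taken from that table; `cumulative_shares` is an illustrative helper.

```python
# Per-range counts from the score-range table, highest scores first (1.00 down to 0.00).
total_estimates = [24, 221, 278, 217, 217, 319, 434, 592, 1048, 3231, 8831]
total_enrolled = [20, 158, 202, 136, 140, 153, 209, 225, 247, 292, 111]


def cumulative_shares(counts):
    """Running share of the overall total, accumulated from the first range onward."""
    overall = sum(counts)
    running, shares = 0, []
    for count in counts:
        running += count
        shares.append(running / overall)
    return shares


contacted = cumulative_shares(total_estimates)  # share of all scored inquiries
captured = cumulative_shares(total_enrolled)    # share of all enrolled students

# Index 8 corresponds to the 0.20 range, i.e. all scores >= 0.2.
share_contacted_at_02 = contacted[8]
share_captured_at_02 = captured[8]
```

Reading the two lists side by side gives the same picture as the table: each additional score range adds more inquiries to contact but a shrinking share of eventual enrollees.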

Practical Applications

Effective regional market segmentation Targeted tele-counseling efforts Special projects

Regional Market Segmenting

Target Marketing and Segmentation
  Prospect names purchased based on zip code.
  Establish a predictive “score” for all zip codes in the US based on census-level data

What the data indicated (WA)

Where enrolled students came from (WA)

83% of enrolled WA students fell within top scoring zips over three years

Direct Mail Names Purchases

Prior years: very open search criteria
  MN, CO, SD, MT

This year: much more restrictive, to get deeper into broader markets
  Only key zips
  CO, WA, OR, AZ, IL, MN, etc.

WA Search Names - 2003

WA Search Names - 2004

Targeted Tele-Counseling Efforts

Student calling program
  Top 20% of all model scores identified
  Fluid number excluding applicants
  Prompt student to take action
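Building the calling list can be sketched as a rank-and-filter step. The student IDs and scores are hypothetical, and the order of operations (take the top 20% of all scores first, then drop applicants) is an assumption about how the list was built.

```python
import math


def calling_list(scores, applicants, top_share=0.20):
    """IDs in the top `top_share` of model scores, with applicants removed."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = math.ceil(len(ranked) * top_share)
    return [sid for sid in ranked[:n_top] if sid not in applicants]


# Hypothetical student IDs and model scores.
scores = {"s1": 0.91, "s2": 0.15, "s3": 0.62, "s4": 0.08, "s5": 0.33,
          "s6": 0.05, "s7": 0.71, "s8": 0.02, "s9": 0.27, "s10": 0.11}
to_call = calling_list(scores, applicants={"s7"})
```

Because applicants drop out of the list as they apply, the number of students to call changes over time even though the top-20% threshold is fixed.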

Special Projects

Limited funds but targeted initiatives
  Focus on as many of the top-scoring students as possible
  Postcards, brochures, etc.

Possible Future Research

Cluster analysis for better market segmentation

Study of marginal effects

Thank You!

Questions?
