Post on 20-Dec-2015
A Predictive Model of Inquiry to Enrollment
Cullen F. Goenner, PhD, Department of Economics, University of North Dakota
Kenton Pauls, Director of Enrollment Services, University of North Dakota
Issues Facing Enrollment Managers
Finding new “markets”
Increasing tuition
Declining population (ND)
Increasing competition
Need to attract a particular type of student: diversity/quality
Data-driven analysis
Accountability
Questions we will answer today
What is predictive modeling?
How does one build a predictive model?
How can predictive modeling be used by institutions of higher education to improve enrollment?
What is Predictive Modeling?
Predictive modeling uses statistical/econometric methods to quantitatively predict the future behavior of individuals.
Steps include:
Data collection on the subject of interest
Build the model based on data analysis
Predictions made out of sample
Model validation/testing
College Choice
3-stage process - Hossler and Gallagher (1987)
Predisposition/aspiration for higher education: encouragement, coursework, and interest
Search of potential schools: counselors, campus contacts, program availability
Selection: SES, ability, fit, geography
Factors Influencing Choice
Economic perspective: education as an investment in human capital; cost vs. benefit calculus.
Psychological perspective: need of self to find a sense of belonging and fulfillment of needs.
Sociological perspective: social interaction dictated by societal/family norms.
Existing Empirical Work
Search choice
Applications: DesJardins, Dundar, Hendel (1999); Weiler (1994)
Interest (SAT scores sent): Toutkoushian (2001)
Existing Models of Enrollment Choice
Model a student’s binary choice to enroll at a particular college while controlling for a student’s characteristics.
Logistic models used
Conditional on students having:
Applied: Bruggink and Gambhir (1996); Thomas, Dawes, and Reznik (2001)
Admitted: DesJardins (2002); Leppel (1993)
Our Predictive Model
Builds on the models of DesJardins (2002) and Thomas, Dawes, Reznik (2001)
Focus here is on prediction of enrollment of students that inquired of our institution.
“Inquiry model” is relevant because:
Inquiry is the time of information exchange and opinion formation
It allows for early intervention in a student’s decision-making process (target marketing)
Inquiry Model Challenges
Data collection: data are already collected on those who apply or are admitted, but typically not for inquiries.
Quality of data: applicants provide detailed data describing themselves (demographic data, test scores, HSGPA, etc.), which are not available for most student inquiries.
Types of Inquiries We Recorded
Return of information card
Attendance at college fair
Campus visit
Contact via e-mail
Contact via phone
Referral from faculty, coach, or alumni
ACT score automatically submitted
How these data were captured
Enrollment Services Prospective Student Network relational database (ESPSN)
Customized system (SQL 2000/Visual Basic)
Information Collected From Information Request Card
Name
High school attended
Interested major (if any)
Address
Lacks the demographic data typical of application records and used in most predictive models.
Geodemography
Process of attaching demographic characteristics to geographic characteristics.
Notion is that “Birds of a Feather Flock Together”, i.e. individuals living in the same neighborhood will tend to have similar behavior patterns.
Ex: Neighborhoods homogeneous in terms of household income, occupations, family size, and purchases.
Implementation
US Census data aggregated to zip code level
“Geodemographic” variables considered for our model specification:
College age demographic
Population
Average income
White demographic
Median age
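The zip-code attachment described above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the `ZIP_PROFILE` table, the zip code "00001", the record fields, and every number in it are hypothetical placeholders, not census figures or the ESPSN schema.

```python
# Hypothetical zip-level attributes; the numbers are placeholders, not census data.
ZIP_PROFILE = {
    "00001": {"colldemo": 0.30, "totalpop": 10000, "medage": 35.0,
              "whitedem": 0.90, "avginc": 50000.0},
}

def attach_geodemographics(inquiry, zip_profile):
    """Return a copy of an inquiry record with its zip code's attributes added.

    Inquiries whose zip code is missing from the lookup are returned unchanged,
    so the caller can decide how to treat them.
    """
    enriched = dict(inquiry)
    enriched.update(zip_profile.get(inquiry.get("zip"), {}))
    return enriched

record = attach_geodemographics({"name": "A. Student", "zip": "00001"}, ZIP_PROFILE)
```

Every inquiry from the same zip code receives the same attribute values, which is the "birds of a feather" assumption in practice.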
Building the model
Binary choice model: model whether students who inquire of UND enroll or do not enroll.
15,827 students made inquiries for Fall 2003 enrollment. Of these students, 2,067 actually enrolled.
Logistic regression model used.
Candidate Control Variables
Type and frequency of contact
Geographic
Academic
Geodemographic
Interaction effects
Contact Variables
Predictor Description
contacts Number of inquiries
autoact 1 if automatically submitted ACT score; 0 otherwise
visit Number of campus visits
referral 1 if referred by faculty, coach, alumni; 0 otherwise
www 1 if inquiry made by internet; 0 otherwise
phone 1 if inquiry made by phone; 0 otherwise
Geographic Variables
Predictor Description
distance Distance in miles from our institution
hystate Resident of MN or ND
hyschool Historically high yield school
compete Distance in miles to closest regional competitor
dist1 Distance between 100-300 miles
dist2 Distance between 300-500 miles
dist3 Distance between 500-1000 miles
dist4 Distance greater than 1000 miles
Academic/Geodemographic
Predictor Description
acadint 1 if academic interest expressed; 0 otherwise
aviation 1 if academic interest is aviation; 0 otherwise
colldemo % of population who completed some college
totalpop Total population of zip code
medage Median age of zip code
whitedem % of population white (Non-Hispanic)
avginc Average income in dollars of zip code
Interaction Terms
vismile # of visits x distance
avitmile Aviation x Distance
aviatinc Aviation x Average income
incmile1 Average income x Distance 1
incmile2 Average income x Distance 2
incmile3 Average income x Distance 3
incmile4 Average income x Distance 4
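A minimal sketch of how the distance bands and interaction terms could be built from the raw predictors. Two things are assumptions here, not stated in the talk: the bands are treated as half-open intervals with distances under 100 miles as the omitted baseline, and "Distance 1" through "Distance 4" are read as the dist1 through dist4 indicators.

```python
def distance_band(miles):
    """Return dist1..dist4 indicators for a distance in miles.

    Assumption: bands are half-open, e.g. dist1 covers 100 <= miles < 300,
    and distances under 100 miles are the omitted baseline.
    """
    return {
        "dist1": int(100 <= miles < 300),
        "dist2": int(300 <= miles < 500),
        "dist3": int(500 <= miles < 1000),
        "dist4": int(miles >= 1000),
    }

def interaction_terms(visits, miles, aviation, avginc):
    """Build the interaction predictors as products of their components."""
    bands = distance_band(miles)
    terms = {
        "vismile": visits * miles,
        "avitmile": aviation * miles,
        "aviatinc": aviation * avginc,
    }
    # Assumption: incmileN is average income times the distN indicator.
    for i in range(1, 5):
        terms[f"incmile{i}"] = avginc * bands[f"dist{i}"]
    return terms

# Hypothetical inquiry: 2 visits, 350 miles away, aviation interest, $40,000 zip income.
example = interaction_terms(2, 350.0, 1, 40000.0)
```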
Model Specification
Researchers typically assume their model specification is the true model which generates the data.
It is difficult to justify a priori which variables to include in the model, given that each is theoretically relevant by design.
With k candidate variables there are 2^k different linear models one could consider.
Consider the case in which several models {M1, …, MK} are theoretically possible.
Basing inference on the results of a single model is risky.
Bayesian model averaging (BMA) allows us to account for this type of uncertainty.
BMA
The posterior distribution of the parameters given the data, in the presence of model uncertainty, is a weighted average of the posterior distributions under each of the K models, with weights equal to the posterior model probabilities P(Mk | D).
$$P(\beta \mid D) = \sum_{k=1}^{K} P(\beta \mid M_k, D)\, P(M_k \mid D) \tag{1}$$
The posterior model probability is
$$P(M_k \mid D) = \frac{P(D \mid M_k)\, P(M_k)}{\sum_{l=1}^{K} P(D \mid M_l)\, P(M_l)} \tag{2}$$
where P(D | Mk) is the likelihood and P(Mk) is the prior probability that model Mk is the true model, given that one of the K models is the true model.
Posterior Model Probability
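Assuming a non-informative prior, P(M1) = … = P(MK) = 1/K,
$$P(M_k \mid D) \approx \frac{\exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_k\right)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_l\right)} \tag{3}$$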
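The BIC weighting in equation (3) can be sketched in a few lines. This assumes the usual convention that a smaller BIC indicates a better-supported model; the BIC values below are hypothetical.

```python
import math

def pmp_from_bic(bics):
    """Posterior model probabilities from BIC values under equal model priors.

    Subtracting the smallest BIC before exponentiating leaves the ratios
    unchanged and avoids underflow for large BIC values.
    """
    best = min(bics)
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BIC values for three candidate models.
probs = pmp_from_bic([1000.0, 1002.0, 1010.0])
```

Only differences in BIC matter: a model whose BIC is 2 higher than the best model's receives exp(-1) times its weight.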
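The posterior mean and variance summarize the effects of the parameters on the dependent variable. Raftery (1995) reports
$$\begin{aligned}
E(\beta_1 \mid D, \beta_1 \neq 0) &= \sum_{M_k \in A_1} \hat{\beta}_1(k)\, P(M_k \mid D) \\
\mathrm{Var}(\beta_1 \mid D, \beta_1 \neq 0) &= \sum_{M_k \in A_1} \left[\mathrm{Var}_1(k) + \hat{\beta}_1(k)^2\right] P(M_k \mid D) - E(\beta_1 \mid D, \beta_1 \neq 0)^2
\end{aligned} \tag{9}$$
where \hat{\beta}_1(k) and Var_1(k) are the MLE of β1 and its variance under model k, and the summation is over the set A_1 of models that include β1.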
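A minimal sketch of the model-averaged mean and variance for one coefficient, following the sums over the models that include it. The data structure and the two-model example are hypothetical. The weights are applied exactly as supplied; whether they should first be rescaled to sum to one over the models containing the coefficient is left to the caller.

```python
def bma_summary(models, name):
    """Model-averaged mean and variance of one coefficient.

    `models` is a list of dicts with keys "pmp" (posterior model probability),
    "coef" (name -> MLE) and "var" (name -> variance of the MLE). Only models
    that contain `name` enter the sums. Also returns the summed PMP of those
    models, i.e. the posterior probability that the coefficient is nonzero.
    """
    included = [m for m in models if name in m["coef"]]
    mean = sum(m["coef"][name] * m["pmp"] for m in included)
    second_moment = sum((m["var"][name] + m["coef"][name] ** 2) * m["pmp"]
                        for m in included)
    inclusion_prob = sum(m["pmp"] for m in included)
    return mean, second_moment - mean ** 2, inclusion_prob

# Hypothetical two-model example; both models contain the coefficient "visit".
models = [
    {"pmp": 0.6, "coef": {"visit": 1.0}, "var": {"visit": 0.04}},
    {"pmp": 0.4, "coef": {"visit": 2.0}, "var": {"visit": 0.09}},
]
mean, variance, prob = bma_summary(models, "visit")
```

The averaged variance exceeds either model's own variance here because it also reflects the disagreement between the two estimates.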
BMA Implementation
The S-PLUS function bic.logit performs BMA on logistic regression models.
30 regressors imply a summation in equation (1) over more than 1 billion (2^30) models.
To manage the summation we use Occam’s window.
Occam’s Window
Exclude models that predict the data sufficiently less well than the best model. Predictive support is measured by the PMP of each model. Models in A’ are included:
$$A' = \left\{ M_k : \frac{\max_l \mathrm{PMP}_l}{\mathrm{PMP}_k} \le C \right\}$$
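Occam's window amounts to a filter on the ratio between the best model's PMP and each model's PMP. The sketch below uses hypothetical PMPs and a hypothetical cutoff C = 20; the value of C used in the talk is not stated.

```python
def occams_window(pmps, c):
    """Indices of models whose PMP is within a factor `c` of the best model's."""
    best = max(pmps)
    return [k for k, p in enumerate(pmps) if p > 0 and best / p <= c]

def renormalize(pmps, kept):
    """Rescale the retained models' PMPs so they sum to one."""
    total = sum(pmps[k] for k in kept)
    return {k: pmps[k] / total for k in kept}

# Hypothetical PMPs for five models.
pmps = [0.50, 0.30, 0.15, 0.04, 0.01]
kept = occams_window(pmps, c=20)
weights = renormalize(pmps, kept)
```

With these numbers the last model is dropped, because the best model is 50 times more probable, and the remaining four are reweighted.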
Results
26 models supported by the data
Model with highest PMP receives 21% of total.
Variables that receive strong support for inclusion include:
Geographic: Distance, HY State, HY School, Competitor distance
Geodemog: College Age, Average Income
Contacts: Number, Campus visit, Referral
Table 3: Results of BMA Applied to Prediction of Enrollment
Predictor        Mean β|D     Std Error β|D   Pr(β≠0|D) (%)
Contact
contacts         0.1969       0.0299          100
autoact          0.0191       0.0690          8.3
visit            1.3386       0.0827          100
referral         1.7240       0.0745          100
www              0.0147       0.0665          5.6
phone            0.1650       0.1901          47.5
Geographic
distance         -0.0040      0.0004          100
hystate          0.7726       0.1213          100
hyschool         0.9491       0.0819          100
compete          0.0033       0.0004          100
dist1            0            0               0
dist2            0.0155       0.0723          5.2
dist3            0            0               0
dist4            0            0               0
Geodemographic
colldemo         2.8015       0.5395          100
totalpop         2.80E-07     1.34E-06        4.9
medage           0            0               0
whitedem         0.0578       0.2178          7.9
avginc           8.59E-06     1.49E-06        100
Academic
acadint          0.2725       0.0803          97.8
aviation         0.1871       0.2478          38.1
Interaction
vismile          0.0016       0.0002          100
avitmile         0.0005       0.0004          63.1
aviatinc         0            0               0
incmile1         0            0               0
incmile2         4.10E-07     1.59E-06        7.4
incmile3         0            0               0
incmile4         0            0               0
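Because the model is logistic, a coefficient can be read as a multiplicative change in the odds of enrolling. The sketch below exponentiates two posterior means from Table 3, treating them as point estimates. The visit coefficient is left out on purpose, since its effect also depends on distance through vismile.

```python
import math

# Posterior means from Table 3 for two contact predictors without interactions.
POSTERIOR_MEAN = {"contacts": 0.1969, "referral": 1.7240}

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds of enrolling when a predictor
    rises by `delta` units and the other predictors are held fixed."""
    return math.exp(beta * delta)

ratios = {name: odds_ratio(b) for name, b in POSTERIOR_MEAN.items()}
```

Under these estimates a referral multiplies the odds of enrolling by roughly 5.6, and each additional inquiry by about 1.2.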
Out of Sample Predictive Performance
Split the data into two equal parts:
First part used to build/estimate the model
Second part used to test the model’s predictions
Outcome (enrollment) is binary, while our model generates a probability estimate.
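The two-way split can be sketched as follows. The talk does not say how records were assigned to each half; a seeded random shuffle is assumed here, and integer ids stand in for the inquiry records.

```python
import random

def split_in_half(records, seed=0):
    """Shuffle records reproducibly and split them into two near-equal parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Record ids stand in for the 15,827 inquiries.
estimation, holdout = split_in_half(range(15827))
```

With an odd number of records the second part is one record larger than the first.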
What is a successful prediction?
Greene (2001): there is no “correct” choice of probability cutoff. A typical value is 0.5.
Tradeoff in cutoff choice: a lower cutoff increases the share of actual enrollees who are predicted to enroll (sensitivity), at the expense of more non-enrollees being predicted to enroll (false positive rate).
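The cutoff tradeoff can be seen on a toy example. The scores and outcomes below are hypothetical and chosen only to make the effect visible.

```python
def rates_at_cutoff(scores, enrolled, cutoff):
    """Sensitivity and false positive rate when predicting 'enroll' for score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, enrolled) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, enrolled) if s >= cutoff and y == 0)
    positives = sum(enrolled)
    negatives = len(enrolled) - positives
    return tp / positives, fp / negatives

# Hypothetical scores and outcomes (1 = enrolled), for illustration only.
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
enrolled = [1, 1, 0, 1, 0, 0, 0, 0]
high = rates_at_cutoff(scores, enrolled, 0.5)
low = rates_at_cutoff(scores, enrolled, 0.25)
```

Moving the cutoff from 0.5 to 0.25 catches the third enrollee but also flags one more non-enrollee, so both rates rise together.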
Predictive Performance: Classification
                          Actual Outcome
Prediction                Enrolled      Did not Enroll   TOTAL
Predicted to Enroll       370 (36%)     194 (2.8%)       564
Predicted not to Enroll   657 (64%)     6693 (97%)       7350
TOTAL                     1027          6887             7914
Predictive performance
89% of observations correctly classified
Specificity: 97%
Sensitivity: 36%
ROC curve describes the relation between sensitivity and 1 - specificity (false positive rate)
Area under ROC curve = 0.87
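The summary rates follow directly from the four cells of the classification table, as this short check shows.

```python
# Counts from the classification table (holdout sample of 7,914 inquiries).
TP, FP = 370, 194    # predicted to enroll: enrolled / did not enroll
FN, TN = 657, 6693   # predicted not to enroll: enrolled / did not enroll

total = TP + FP + FN + TN
accuracy = (TP + TN) / total
sensitivity = TP / (TP + FN)   # share of actual enrollees predicted to enroll
specificity = TN / (TN + FP)   # share of non-enrollees predicted not to enroll
false_positive_rate = 1 - specificity
```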
Another Predictive Performance Method
MODEL SCORE RANGES
Score range   Estimates   Enrolled   Not enrolled   % of enrolled   Cum. % of enrolled   Cum. records   Cum. % of estimates
All           15,412      1,893      13,519         -               -                    -              -
1.00          24          20         4              1%              -                    24             -
0.90          221         158        63             8%              9%                   245            -
0.80          278         202        76             11%             20%                  523            -
0.70          217         136        81             7%              27%                  740            -
0.60          217         140        77             7%              35%                  957            -
0.50          319         153        166            8%              43%                  1,276          -
0.40          434         209        225            11%             54%                  1,710          -
0.30          592         225        367            12%             66%                  2,302          -
0.20          1,048       247        801            13%             79%                  3,350          22%
0.10          3,231       292        2,939          15%             -                    -              -
0.00          8,831       111        8,720          6%              -                    -              -
79% of enrolled found within 22% of entire population (scores >= 0.2)
Focused efforts without compromising enrollment numbers
Efficiency implications
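The 79%-within-22% figure can be reproduced from the score-range counts. The sketch assumes each score label marks the lower bound of its range, which the talk does not state explicitly.

```python
# (score range label, total estimates, total enrolled) from the score-range table,
# ordered from the highest-scoring range to the lowest.
BINS = [
    (1.00, 24, 20), (0.90, 221, 158), (0.80, 278, 202), (0.70, 217, 136),
    (0.60, 217, 140), (0.50, 319, 153), (0.40, 434, 209), (0.30, 592, 225),
    (0.20, 1048, 247), (0.10, 3231, 292), (0.00, 8831, 111),
]

def capture_at(bins, min_score):
    """Share of all records contacted, and share of all enrollees reached,
    when only score ranges at or above `min_score` are targeted."""
    all_records = sum(n for _, n, _ in bins)
    all_enrolled = sum(e for _, _, e in bins)
    records = sum(n for s, n, _ in bins if s >= min_score)
    enrolled = sum(e for s, _, e in bins if s >= min_score)
    return records / all_records, enrolled / all_enrolled

share_contacted, share_enrolled = capture_at(BINS, 0.20)
```

Calling `capture_at` with other thresholds shows how much of the enrolled group would be missed if outreach were narrowed further.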
Practical Applications
Effective regional market segmentation
Targeted tele-counseling efforts
Special projects
Regional Market Segmenting
Target Marketing and Segmentation
Prospect names purchased based on zip code.
Establish a predictive “score” for all zip codes in the US based on census-level data
What the data indicated (WA)
Where enrolled students came from (WA)
83% of enrolled WA students fell within top scoring zips over three years
Direct Mail Name Purchases
Prior years: very open search criteria (MN, CO, SD, MT)
This year: much more restrictive, to get deeper into broader markets
Only key zips: CO, WA, OR, AZ, IL, MN, etc.
WA Search Names - 2003
WA Search Names - 2004
Targeted Tele-Counseling Efforts
Student calling program
Top 20% of all model scores identified
Fluid number excluding applicants
Prompt student to take action
Special Projects
Limited funds but targeted initiatives
Focus on as many of the top-scoring students as possible
Postcards, brochures, etc.
Possible Future Research
Cluster analysis for better market segmentation
Study of marginal effects
Thank You!
Questions?