evolution of regression ols to gps to mars
1Salford Systems ©2013
Evolution of Regression: From Classical Least Squares to
Regularized Regression to Machine Learning Ensembles
Covering MARS®, Generalized PathSeeker®,
TreeNet® Gradient Boosting and Random Forests®
A Brief Overview of the 4-Part Webinar at www.salford-systems.com
May 2013
Dan Steinberg
Mikhail Golovnya
Salford Systems
Full Webinar Outline
Webinar Part 1
• Regression Problem – quick overview
• Classical Least Squares – the starting point
• RIDGE/LASSO/GPS – regularized regression
• MARS – adaptive non-linear regression splines
Webinar Part 2
• CART Regression tree – quick overview
• Random Forest decision tree ensembles
• TreeNet Stochastic Gradient Boosted Trees
• Hybrid TreeNet/GPS (trees and regularized regression)
Regression
• Regression analysis is at least 200 years old
o The most used predictive modeling technique (including logistic regression)
• American Statistical Association reports 18,900 members
o Bureau of Labor Statistics reports more than 22,000 statisticians in 2008
• Many other professionals involved in the sophisticated analysis of data are not included in these counts
o Statistical specialists in marketing, economics, psychology, bioinformatics
o Machine Learning specialists and ‘Data Scientists’
o Database professionals involved in data analysis
o Web analytics, social media analytics, text analytics
• Few of these other researchers will call themselves statisticians
o but may make extensive use of variations of regression
• One reason for the popularity of regression: it is effective
Regression Challenges
• Preparation of data – errors, missing values, etc.
o Largest part of a typical data analysis (modelers often report 80% of their time)
o Missing values are a huge headache (listwise deletion of rows)
• Determining which predictors to include in the model
o Textbook examples typically have 10 predictors available
o In practice hundreds, thousands, even tens and hundreds of thousands are available
• Transformation or coding of predictors
o Conventional approaches: logarithm, power, inverse, etc.
o Required to obtain a good model
• High correlation among predictors
o With increasing numbers of predictors this complication becomes more serious
More Regression Challenges
• Obtaining “sensible” results (correct signs, no wild outcomes)
• Detecting and modeling important interactions
o Typically never done because too difficult
• “Wide” data has more columns than rows
• Lack of external knowledge or theory to guide modeling as more topics are modeled
Boston Housing Data Set
• Concerns the housing values in the Boston area
• Harrison, D. and D. Rubinfeld. Hedonic Prices and the Demand for Clean Air.
o Journal of Environmental Economics and Management, v5, 81-102, 1978
• Combined information from 10 separate governmental and educational sources to produce the data set
• 506 census tracts in City of Boston for the year 1970
o Goal: study relationship between quality of life variables and property values
o MV median value of owner-occupied homes in tract ($1,000’s)
o CRIM per capita crime rates
o NOX concentration of nitric oxides (parts per 10 million), proxy for air pollution generally
o AGE percent built before 1940
o DIS weighted distance to centers of employment
o RM average number of rooms per house
o LSTAT % lower status of population (without some high school and male laborers)
o RAD index of accessibility to radial highways
o CHAS borders Charles River (0/1)
o INDUS percent of acreage non-retail business
o TAX property tax rate per $10,000
o PT pupil teacher ratio
o ZN proportion of neighborhood zoned for large lots (>25K sq ft)
Ten Data Sources Organized
• US Census (1970)
• FBI (1970)
• MIT Boston Project
• Metropolitan Area Planning Commission (1972)
• Voigt, Ivers, and Associates (1965) (Land Use Survey)
• US Census Tract Maps
• Massachusetts Dept of Education (1971-1972)
• Massachusetts Tax Payer’s Foundation (1970)
• Transportation and Air Shed Simulation Model, Ingram et al., Harvard University Dept of City and Regional Planning (1974)
• A. Schnare: An Empirical Analysis of the Dimensions of Neighborhood Quality. Ph.D. Thesis. Harvard. (1974)
• An excellent example of creative data blending
• Also an excellent example of careful model construction
• Authors emphasize the quality (completeness) of their data
Least Squares Regression
• LS – ordinary least squares regression
o Discovered by Legendre (1805) and Gauss (1809)
o Solve problems in astronomy using pen and paper
o Statistical foundation by Fisher in 1920s
o 1950s – use of electro-mechanical calculators
• The model is always of the form
Response = A + B1X1 + B2X2 + B3X3 + …
• The response surface is a hyper-plane!
• A – the intercept term
• B1, B2, B3, … – parameter estimates
• A usually unique combination of values exists which minimizes the mean squared error of predictions on the learn sample
• Experimental approach to model building
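The linear model form on this slide can be sketched in a few lines. This is a minimal illustration assuming NumPy and synthetic data (not the Boston data); the function and variable names are made up for the example.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (A, B): intercept and slopes minimizing learn-sample MSE."""
    # Prepend a column of ones so the intercept A is estimated with the slopes.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def predict(A, B, X):
    # The response surface is a hyper-plane: A + B1*X1 + B2*X2 + ...
    return A + X @ B

# Synthetic learn sample with known coefficients (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

A, B = fit_least_squares(X, y)
learn_mse = np.mean((y - predict(A, B, X)) ** 2)
```

Note that `learn_mse` is measured on the same records used for fitting, which is why the later slides insist on a separate test partition.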
Transformations In Original Paper (For Historical Reference)
• RM number of rooms in house: $RM^2$
• NOX raised to power p, experiments on value: $NOX^p$
• DIS, RAD, LSTAT entered as logarithms of predictor
• Regression in paper is run on ln(MV)
• Considerable experimentation undertaken
• No train/test methodology
• Classical Regression agrees very closely with paper on reported coefficients and $R^2$=.81 (same w/o logging MV)
• Converting predictions back from logs yields MSE=15.77
• Note that this is learn sample only; no testing performed
Classical Regression Results
• 20% random test partition
• Out of the box regression
• No attempt to perfect
• Test MSE=27.069
BATTERY PARTITION: Rerun 80/20 Learn/Test 100 times
Note partition sizes are constant
All three partitions change each cycle
Mean MSE=23.80
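The idea of rerunning the learn/test split many times can be sketched as follows. This is not the SPM BATTERY PARTITION implementation, only a hedged illustration with NumPy, plain least squares, and synthetic data; `repeated_partition_mse` is a hypothetical helper name.

```python
import numpy as np

def repeated_partition_mse(X, y, n_repeats=100, test_fraction=0.2, seed=0):
    """Refit least squares on fresh random learn/test splits; return test MSEs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))   # partition sizes stay constant
    design = np.column_stack([np.ones(n), X])
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(n)           # a new random partition each cycle
        test, learn = order[:n_test], order[n_test:]
        coef, *_ = np.linalg.lstsq(design[learn], y[learn], rcond=None)
        scores.append(np.mean((y[test] - design[test] @ coef) ** 2))
    return np.array(scores)

# Synthetic data with the same number of rows as the Boston data (506).
rng = np.random.default_rng(1)
X = rng.normal(size=(506, 3))
y = 1.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=506)

scores = repeated_partition_mse(X, y)
mean_test_mse = scores.mean()
```

The spread of `scores` shows how much a single 20% test partition can mislead, which is the motivation for reporting the mean over many partitions.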
Least Squares Regression on Raw Boston Data
• 414 records in the learn sample
• 92 records in the test sample
• Good agreement L/T:
o LEARN MSE = 27.455
o TEST MSE = 26.147
• Used MARS in forward stepwise LS mode to generate this model
3-variable Solution
[Figure: coefficients of the 3-variable solution: -0.597, +5.247, -0.858]
Motivation for Regularized Regression (1960s and 1970s)
• Unsatisfactory regression results when modeling physical processes
o Coefficients changed dramatically with small changes in data
o Some coefficients judged to be too large
o Appearance of coefficients with “wrong sign”
o Severe with substantial correlations among predictors (multicollinearity)
• Solution: Hoerl and Kennard (1970), “Ridge Regression”
• Earlier version just for stabilization of coefficients (1962)
o Initially poorly received by statistics profession
Regression Formulas
• X matrix of potential predictors (N×K)
• Y column: the target or dependent variable (N×1)
• Estimated $b = (X'X)^{-1} X'y$ (standard formula)
• Ridge $b = (X'X + rI)^{-1} X'y$
• Simplest version: constant added to diagonal elements of the X’X matrix
• $r = 0$ yields usual LS
• $r = \infty$ yields degenerate model $b = 0$
• Need to find r that yields best generalization error
• Observe that there is a potentially distinct “solution” for every value of the penalty term r
• Varying r traces a path of solutions
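The ridge formula and the path of solutions can be checked numerically. The sketch below assumes NumPy, synthetic data, and centered variables in place of an intercept; the grid of r values is arbitrary.

```python
import numpy as np

def ridge_coefficients(X, y, r):
    """Closed-form ridge estimate b = (X'X + rI)^(-1) X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + r * np.eye(k), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 2] + 0.05 * rng.normal(size=100)   # two highly correlated predictors
y = X @ np.array([1.0, 0.0, 2.0, -1.0]) + rng.normal(size=100)
# Centering both X and y stands in for an explicit intercept term.
y = y - y.mean()
X = X - X.mean(axis=0)

b_ls = ridge_coefficients(X, y, 0.0)              # r = 0 reproduces least squares
# One solution per penalty value: together they trace a path.
path = {r: ridge_coefficients(X, y, r) for r in [0.0, 0.1, 1.0, 10.0, 100.0, 1e6]}
lengths = [np.linalg.norm(b) for b in path.values()]
```

The length of the coefficient vector in `lengths` shrinks as r grows and is close to zero at the largest r, matching the degenerate model at r = ∞. Choosing among the solutions still requires test data or cross-validation.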
Ridge Regression
• “Shrinkage” of regression coefficients towards zero
• If zero correlation among all (standardized) predictors then shrinkage will be uniform over all coefficients (same percentage)
• If predictors correlated then while the length of the coefficient vector decreases some coefficients might increase (in absolute value)
• Coefficients intentionally biased but yields both more satisfactory estimates and superior generalization
o Better performance (test MSE) on previously unseen data
• Coefficients much less variable even if biased
• Coefficients will typically be closer to the “truth”
Ridge Regression Features
• Ridge frequently fixes the wrong sign problem
• Suppose you have K predictors which happen to be exact copies of each other
• RIDGE will give each a coefficient equal to 1/K times the coefficient that would be given to just one copy in a model
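The exact-copies property can be verified with the closed-form ridge estimate. In this sketch (NumPy, synthetic data, illustrative names) the single-copy model uses penalty r/K, which is the setting in which the 1/K relationship holds exactly; when r is small relative to x'x the single-copy coefficient is close to the plain least squares one.

```python
import numpy as np

def ridge_coefficients(X, y, r):
    """Closed-form ridge estimate b = (X'X + rI)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(size=50)

K, r = 4, 2.0
copies = np.column_stack([x] * K)                 # K exact copies of one predictor
b_copies = ridge_coefficients(copies, y, r)       # least squares would be singular here
b_single = ridge_coefficients(x[:, None], y, r / K)[0]
```

All entries of `b_copies` are equal, and each equals `b_single / K`: ridge spreads the effect evenly across the copies instead of failing or picking one arbitrarily.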
Ridge Regression vs OLS
[Figure: Ridge Regression and Classical Regression compared]
Ridge: Worse on training data but much better on test data
Without test data must use Cross-Validation to determine how much to shrink
RIDGE TEST MSE=21.36
Lasso Regularized Regression
• Tibshirani (1996): an alternative to RIDGE regression
• Least Absolute Shrinkage and Selection Operator
• Desire to gain the stability and lower variance of ridge regression while also performing variable selection
• Especially in the context of many possible predictors, looking for a simple, stable, low predictive variance model
• Historical note: Lasso was inspired by related work (1993) by Leo Breiman (of CART and RandomForests fame), the ‘non-negative garrote’
• Breiman’s simulation studies showed the potential for improved prediction via selection and shrinkage
Regularized Regression - Concepts
• Any regularized regression approach tries to balance model performance and model complexity
• λ – regularization parameter, to be estimated
o λ = ∞ Null model, zero coefficients (maximum possible penalty)
o λ = 0 LS solution (no penalty)
• LS Regression: minimize Mean Squared Error
• Regularized Regression: minimize Mean Squared Error + λ × Model Complexity
• Model Complexity is measured as:
o Ridge: sum of squared coefficients
o Lasso: sum of absolute coefficients
o Compact: number of coefficients
Regularized Regression: Penalized Loss Functions
• RIDGE penalty $\sum_j b_j^2$ (squared b)
• LASSO penalty $\sum_j |b_j|$ (absolute value of b)
• COMPACT penalty $\sum_j |b_j|^0$ (count of non-zero b’s)
• General penalty $\sum_j |b_j|^r$, $0 \le r \le 2$
• RIDGE does no selection but Lasso and Compact select
• Power on b is called the “elasticity” (0, 1, 2)
• Penalty to be estimated is a constant multiplying one of the above functions of the b vector
• Intermediate elasticities can be created: e.g. we could have a 50/50 mix of RIDGE and LASSO yielding an elasticity of 1.5
LASSO Features
• With highly correlated predictors the LASSO will tend to pick just one of them for model inclusion
• Dispersion of b greater than for RIDGE
• Unlike AIC and BIC model selection methods, which penalize after the model is built, these penalties influence the b’s
• A convenient trick for estimating models with regularization $\sum_j |b_j|^r$ is a weighted average of any two of the major elasticities 0, 1, and 2, e.g.:
• $w \sum_j b_j^2 + (1-w) \sum_j |b_j|$ (the “elastic net”)
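The three major penalties and the weighted-average trick are easy to compute directly. A small sketch, assuming NumPy; the function names and the example coefficient vector are illustrative only.

```python
import numpy as np

def penalty(b, elasticity):
    """Sum over coefficients of |b_j| ** elasticity for the cases 0, 1 and 2."""
    b = np.asarray(b, dtype=float)
    if elasticity == 0:
        return float(np.count_nonzero(b))          # Compact: count of non-zero b's
    return float(np.sum(np.abs(b) ** elasticity))  # 1 = Lasso, 2 = Ridge

def elastic_net_penalty(b, w):
    """Weighted average w * Ridge + (1 - w) * Lasso, with 0 <= w <= 1."""
    return w * penalty(b, 2) + (1.0 - w) * penalty(b, 1)

b = [0.0, 0.5, -2.0, 0.0]
ridge, lasso, compact = penalty(b, 2), penalty(b, 1), penalty(b, 0)
mixed = elastic_net_penalty(b, 0.5)   # the 50/50 mix of Ridge and Lasso
```

For this vector the Ridge penalty is 4.25, the Lasso penalty 2.5, the Compact penalty 2, and the 50/50 mix 3.375. The full penalized loss multiplies one of these by the penalty constant and adds it to the mean squared error.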
Computational Challenge
• For a given regularization (e.g. LASSO) find the optimal penalty on the b term
• Find the best regularization from the family $\sum_j |b_j|^r$
• Potentially very many models to fit
Computing Regularized Regressions - 1
• Earliest versions of regularized regressions required considerable computation as the penalty parameter is unknown and must be estimated
• Lasso was originally computed by starting with no penalty and gradually increasing the penalty
o So start with ALL vars in the model
o Gradually tighten the noose to squeeze predictors out
o Infeasible for problems with thousands of possible predictors
• Need to solve a quadratic programming problem to optimize the Lasso solution for every penalty value
Computing Regularized Regressions - 2
• Work by Friedman and others introduced very fast forward stepping approaches
• Start with maximum penalty (no predictors)
• Progress forward with stopping rule
o Dealing with millions of predictors possible
• Coordinate gradient descent methods (next slides)
• Will still want test sample or cross-validation for optimization
• Generalized PathSeeker: full range of regularization from Compact to Ridge (elasticities from 0 through 2)
• Glmnet in R: partial range of regularization from Lasso to Ridge (elasticities from 1 to 2)
GPS Algorithm
• Start with NO predictors in model
• Seek the path $b(\nu)$ of solutions as function of penalty strength
• Define $p_j(\nu) = dP/db_j$, marginal change in Penalty
• Define $g_j(\nu) = -dR/db_j$, marginal change in Loss
• Define $q_j(\nu) = g_j(\nu)/p_j(\nu)$, ratio (benefit/cost)
• Find $\max_j |q_j(\nu)|$ to identify coefficient to update ($j^*$)
• Update $b_{j^*}$ in the direction of the sign of $q_{j^*}$
• $-dR/db_j$ requires computing inner products of current residual with available predictors
o Easily parallelizable
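The update rule on this slide can be sketched as a short loop. This is a simplified illustration, not Friedman's full GPS algorithm or the SPM implementation: it assumes squared-error loss, a fixed step size, an elastic-net style penalty whose derivative is taken with respect to |b_j|, and elasticities in [1, 2) only. All names are hypothetical.

```python
import numpy as np

def gps_path_sketch(X, y, elasticity=1.0, step=0.01, n_steps=1500):
    """Simplified path seeker for squared-error loss; returns all visited b vectors.

    Assumption: penalty P(b) = sum((e - 1) * b_j**2 / 2 + (2 - e) * |b_j|)
    with 1 <= e < 2, so dP/d|b_j| = (e - 1) * |b_j| + (2 - e).
    """
    n, k = X.shape
    b = np.zeros(k)                        # start with NO predictors in the model
    path = [b.copy()]
    for _ in range(n_steps):
        residual = y - X @ b
        # g_j = -dR/db_j for R = (mean squared error) / 2: inner products of
        # the current residual with each available predictor.
        g = X.T @ residual / n
        p = (elasticity - 1.0) * np.abs(b) + (2.0 - elasticity)
        q = g / p                          # benefit/cost ratio
        j = int(np.argmax(np.abs(q)))      # coefficient to update (j*)
        b[j] += step * np.sign(q[j])       # small move in the direction of sign(q_j*)
        path.append(b.copy())
    return np.array(path)

# Synthetic data: only predictors 0 and 2 matter (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

path = gps_path_sketch(X, y)
b_final = path[-1]
```

Each row of `path` is one candidate model. Early rows have few non-zero coefficients and later rows approach the least squares solution, so a test sample can be used to pick a row.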
How to Forward Step
• At any stage of model development choose between
o Adding a new variable to the model
o Updating an existing model variable coefficient
• Step sizes are small: initial coefficients for any model are very small and are updated in very small increments
• This explains why the Ridge elasticity can have solutions with less than all the variables
o Technically ridge does not select variables, it only shrinks
o In practice it can only add one variable per step
Regularized Regression – Practical Algorithm
• Start with the zero-coefficient solution
• Look for best first step which moves one coefficient away from zero
o Reduces Learn Sample MSE
o Increases Penalty as the model has become more complex
• Next step: Update one of the coefficients by a small amount
o If the selected coefficient was zero, a new variable effectively enters into the model
o If the selected coefficient was not zero, the model is simply updated

Introducing New Variable
Variable   Current Model   Next Model
X1         0.0             0.0
X2         0.0             0.0
X3         0.2             0.2
X4         0.0             0.1
X5         0.4             0.4
X6         0.5             0.5
X7         0.0             0.0
X8         0.0             0.0

Updating Existing Model
Variable   Current Model   Next Model
X1         0.0             0.0
X2         0.0             0.0
X3         0.2             0.3
X4         0.0             0.1
X5         0.4             0.4
X6         0.5             0.5
X7         0.0             0.0
X8         0.0             0.0
Path Building Process
• Elasticity Parameter – controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive
o Elasticity = 2 – fast approximation of Ridge Regression, introduces variables as quickly as possible and then jointly varies the magnitude of coefficients – lowest degree of compression
o Elasticity = 1 – fast approximation of Lasso Regression, introduces variables sparingly letting the current active variables develop their coefficients – good degree of compression versus accuracy
o Elasticity = 0 – fast approximation of Best Subset Regression, introduces new variables only after the current active variables were fully developed – excellent degree of compression but may lose accuracy
[Figure: Variable Selection Strategy. Zero Coefficient Model (λ = ∞), a variable is added, sequence of 1-variable models, a variable is added, sequence of 2-variable models, a variable is added, sequence of 3-variable models, …, Final OLS Solution (λ = 0)]
Points Versus Steps
• Each path (elasticity) will have a different number of steps
• To facilitate model comparison among different paths, the Point Selection Strategy extracts a fixed collection of models into the points grid
o This eliminates some of the original irregularity among individual paths and facilitates model extraction and comparison
[Figure: Point Selection Strategy. The steps of Path 1, Path 2 and Path 3, each running from the Zero Solution to the OLS Solution, are mapped onto a common grid of points 1 to 10]
LS versus GPS
• GPS (Generalized PathSeeker) introduced by Jerome Friedman in 2008 (Fast Sparse Regression and Classification)
• Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of coefficients
• The optimal model of any desirable size can then be selected based on its performance on the TEST sample
[Figure: OLS Regression. A sequence of linear models (1-variable model, 2-variable model, 3-variable model, …) is optimized on the Learn Sample (X1, X2, X3, X4, X5, X6, …); the Test Sample is ignored]
[Figure: GPS Regression. A large collection of linear models (paths) with varying coefficients (1-variable models, 2-variable models, 3-variable models, …) is learned from the Learn Sample; the optimal model is selected on the Test Sample]
Paths Produced by SPM GPS
• Example of 21 paths with different variable selection strategies
Path Points on Boston Data
• Each path uses a different variable selection strategy and separate coefficient updates
[Figure: Path Development, panels for Point 30, Point 100, Point 150, Point 190]
GPS on Boston Data
3-variable Solution
• 414 records in the learn sample
• 92 records in the test sample
• 15% performance improvement on the test sample
o GPS TEST MSE = 22.669
o LS TEST MSE = 26.147
[Figure: LS reference values marked on the plot: coefficients +5.247, -0.858, -0.597; LS MSE 26.147]
Sentinel Solutions Detail
• Along the path followed by GPS for every elasticity we identify the solution (coefficient vector) best for each performance measure
• No attention is paid to model size here so you might still prefer to select a model from the graphical display
Regularized Logistic Regression
All the same GPS ideas apply
Specify Logistic Binary Analysis
Specify optimality criterion
How To Select a Best Model
• Regularized regression was originally invented to help modelers obtain more intuitively acceptable models
• Can think of the process as a search engine generating predictive models
• User can decide based on
o Complexity of model
o Acceptability of coefficients (magnitude, signs, predictors included)
• Clearly can be set to automatic mode
• Criterion could well be performance on test data
Key Problems with GPS
• Still a linear regression!
• Response surface is still a global hyper-plane
• Incapable of discovering local structure in the data
• Develop non-linear algorithms that build response surface locally based on the data itself
o By trying all possible data cuts as local boundaries
o By fitting first-order adaptive splines locally
o By exploiting regression trees and their ensembles
From Linear to Non-linear
• Classical regression and regularized regression build globally linear models
• Further accuracy can be achieved by building locally linear models connected to each other at boundary points called knots
• Function is known as a spline
• Each separate region of data represented by a “basis function” (BF)
[Figure: two plots of MV against LSTAT, a global linear fit and a localized fit with knots]
Finding Knots Automatically
• Stage-wise knot placement process on a flat-top function
[Figure: panels Data, True Function, True Knots, Knot 1 through Knot 6]
MARS Algorithm
• Multivariate Adaptive Regression Splines
• Introduced by Jerome Friedman in 1991
o (Annals of Statistics 19 (1): 1-67) (earlier discussion papers from 1988)
• Forward stage:
o Add pairs of BFs (direct and mirror pair of basis functions represents a single knot) in a step-wise regression manner
o The process stops once a user specified upper limit is reached
• Backward stage:
o Remove BFs one at a time in a step-wise regression manner
o This creates a sequence of candidate models of declining complexity
• Selection stage:
o Select optimal model based on the TEST performance (modern approach)
o Select optimal model based on GCV criterion (legacy approach)
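A single forward step for one predictor can be sketched as a search over candidate knots, each represented by a direct and mirror pair of basis functions. This is only an illustration of the idea, not the MARS implementation: it assumes NumPy, one predictor, one knot, and synthetic data with a known kink at 10.

```python
import numpy as np

def hinge_pair(x, knot):
    """Direct and mirror basis functions for one knot: max(0, x-t), max(0, t-x)."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

def best_single_knot(x, y):
    """Try every interior data value as a knot; keep the one with lowest learn MSE."""
    best = None
    for knot in np.unique(x)[1:-1]:
        direct, mirror = hinge_pair(x, knot)
        design = np.column_stack([np.ones_like(x), direct, mirror])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        mse = np.mean((y - design @ coef) ** 2)
        if best is None or mse < best[0]:
            best = (mse, knot, coef)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 40.0, size=300)
# Synthetic target: steep decline up to x = 10, then flat (true knot at 10).
y = 30.0 - 2.0 * np.minimum(x, 10.0) + rng.normal(size=300)

mse, knot, coef = best_single_knot(x, y)
```

The real forward stage repeats this search over all predictors and keeps adding pairs until the user-specified limit is reached; the backward stage then prunes basis functions one at a time.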
Non-linear Response Surface
• MARS automatically determined transition points between various local regions
• This model provides major insights into the nature of the relationship
• Observe in this model NOX appears linearly
200 Replications Learn/Test Partition
• Models were repeated with 200 randomly selected 20% test partitions
• GPS shows marginal performance improvement but much smaller model
• MARS shows dramatic performance improvement
[Figure: Distribution of TEST MSE across runs for Regression, GPS and MARS]
Combining MARS and GPS
• Use MARS as a search engine to break predictors into ranges reflecting differences in relationship between target and predictors
• MARS also handles missing values with missing value indicators and interactions for conditional use of a predictor (only when not missing)
• Allow the MARS model to be large
• GPS can then select basis functions and shrink coefficients
• We will see that this combination of the best of both worlds will also apply to ensembles of decision trees
Running Score: Test Sample MSE
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS Regression Splines | 14.663 | 15.91 | 14.12
GPS Lasso/Regularized | 21.361 | 21.11 | 23.15
Regression Tree
Out of the box results, no tuning of controls
9 regions (terminal nodes)
Test MSE=17.296
Regression Tree Representation of a Surface
High Dimensional Step function
Should be at a disadvantage relative to other tools. Can never be smooth.
But always worth checking
Regression Tree Partial Dependency Plot
[Figure panels: LSTAT, NOX]
Use model to simulate impact of a change in predictor
Here we simulate separately for every training data record and then average
For CART trees this is essentially a step function
May only get one “knot” in graph if variable appears only once in tree
See appendix to learn how to get these plots
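The simulation behind a partial dependency plot can be sketched independently of any particular modeling tool. The code below assumes NumPy and uses a made-up step-function predictor as a stand-in for a fitted tree; `partial_dependence` and `toy_predict` are hypothetical names.

```python
import numpy as np

def partial_dependence(predict, X, column, grid):
    """Average prediction over all records with one predictor forced to each grid value."""
    averages = []
    for value in grid:
        X_sim = X.copy()
        X_sim[:, column] = value           # simulate the change for every record
        averages.append(predict(X_sim).mean())
    return np.array(averages)

# Hypothetical stand-in for a fitted tree: a step function in column 0 plus a
# linear effect of column 1. Any fitted model's predict function could be used.
def toy_predict(X):
    return np.where(X[:, 0] < 10.0, 30.0, 15.0) + 2.0 * X[:, 1]

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 40, 500), rng.normal(size=500)])
grid = np.array([5.0, 9.0, 11.0, 30.0])
pd_values = partial_dependence(toy_predict, X, column=0, grid=grid)
```

Because the stand-in model splits on column 0 only once, the resulting curve has a single jump of 15 between the grid values 9 and 11, which is the "one knot" situation described on the slide.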
Running Score
Method | 20% random | Parametric Bootstrap | Repeated 100 20% Partitions
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagger Mechanism
• Generate a reasonable number of bootstrap samples
o Breiman started with numbers like 50, 100, 200
• Grow a standard CART tree on each sample
• Use the unpruned tree to make predictions
o Pruned trees yield inferior predictive accuracy for the ensemble
• Simple voting for classification
o Majority rule voting for binary classification
o Plurality rule voting for multi-class classification
o Average predicted target for regression models
• Will result in a much smoother range of predictions
o Single tree gives same prediction for all records in a terminal node
o In bagger records will have different patterns of terminal node results
• Each record likely to have a unique score from ensemble
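The bagger mechanism for regression can be sketched with a small hand-written tree. This is not CART itself: the tree below is a bare-bones least-squares splitter with a minimum leaf size standing in for "unpruned", and the data are synthetic. Names such as `grow_tree` and `bagged_predict` are illustrative.

```python
import numpy as np

def grow_tree(X, y, min_leaf=5):
    """Grow an unpruned regression tree; returns nested tuples."""
    best = None
    if len(y) >= 2 * min_leaf:
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:
                left = X[:, j] <= t
                n_left = left.sum()
                if n_left < min_leaf or len(y) - n_left < min_leaf:
                    continue
                sse = (((y[left] - y[left].mean()) ** 2).sum()
                       + ((y[~left] - y[~left].mean()) ** 2).sum())
                if best is None or sse < best[0]:
                    best = (sse, j, t, left)
    if best is None:
        return ("leaf", y.mean())          # terminal node predicts the node mean
    _, j, t, left = best
    return ("split", j, t, grow_tree(X[left], y[left], min_leaf),
            grow_tree(X[~left], y[~left], min_leaf))

def predict_tree(tree, X):
    if tree[0] == "leaf":
        return np.full(len(X), tree[1])
    _, j, t, left_tree, right_tree = tree
    out = np.empty(len(X))
    left = X[:, j] <= t
    out[left] = predict_tree(left_tree, X[left])
    out[~left] = predict_tree(right_tree, X[~left])
    return out

def bagged_predict(X_learn, y_learn, X_new, n_trees=25, seed=0):
    """Average the predictions of trees grown on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(y_learn)
    total = np.zeros(len(X_new))
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
        total += predict_tree(grow_tree(X_learn[rows], y_learn[rows]), X_new)
    return total / n_trees

rng = np.random.default_rng(1)
def make_data(n):
    X = rng.uniform(-3.0, 3.0, size=(n, 2))
    return X, np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n)

X_learn, y_learn = make_data(200)
X_test, y_test = make_data(200)
single = predict_tree(grow_tree(X_learn, y_learn), X_test)
bagged = bagged_predict(X_learn, y_learn, X_test)
bagged_mse = np.mean((y_test - bagged) ** 2)
```

The single tree can return at most one distinct value per terminal node, while the bagged scores take many more distinct values, which is the smoothing effect the slide describes.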
Bagger Partial Dependency Plot
[Figure panels: LSTAT, NOX]
Averaging over many trees allows for a more complex dependency
Opportunity for many splits of a variable (100 large trees)
Jaggedness may reflect existence of interactions
Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RandomForests: Bagger on Steroids
• Leo Breiman was frustrated by the fact that the bagger did not perform better. Convinced there was a better way
• Observed that trees generated by bagging across different bootstrap samples were surprisingly similar
• How to make them more different?
• Bagger induces randomness in how the rows of the data are used for model construction
• Why not also introduce randomness in how the columns are used for model construction
• Pick a random subset of predictors as candidate predictors – a new random subset for every node
• Breiman was inspired by earlier research that experimented with variations on these ideas
• Breiman perfected the bagger to make RandomForests
Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RF Defaults | 8.286 | 12.84
Stochastic Gradient Boosting (TreeNet)
• SGB is a revolutionary data mining methodology first introduced by Jerome H. Friedman in 1999
• Seminal paper defining SGB released in 2001
o Google Scholar reports more than 1600 references to this paper and a further 3300 references to a companion paper
• Extended further by Friedman in major papers in 2004 and 2008 (Model compression and rule extraction)
• Ongoing development and refinement by Salford Systems
o Latest version released 2013 as part of SPM 7.0
• TreeNet/Gradient boosting has emerged as one of the most used learning machines and has been successfully applied across many industries
• Friedman’s proprietary code in TreeNet
Trees incrementally revise predictions
• First tree grown on original target. Intentionally “weak” model
• 2nd tree grown on residuals from first. Predictions made to improve first tree
• 3rd tree grown on residuals from model consisting of first two trees
[Figure: Tree 1 + Tree 2 + Tree 3]
Every tree produces at least one positive and at least one negative node. Red reflects a relatively large positive and deep blue reflects a relatively large negative node. Total “score” for a given record is obtained by finding the relevant terminal node in every tree in the model and summing across all trees
Gradient Boosting Methodology: Key points
• Trees are usually kept small (2-6 nodes common)
o However, should experiment with larger trees (12, 20, 30 nodes)
o Sometimes larger trees are surprisingly good
• Updates are small (downweighted). Update factors can be as small as .01, .001, .0001.
o Do not accept the full learning of a tree (small step size, also GPS style)
o Larger trees should be coupled with slower learn rates
• Use random subsets of the training data in each cycle. Never train on all the training data in any one cycle
o Typical is to use a random half of the learn data to grow each tree
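The three key points (small trees, downweighted updates, random half of the data per cycle) fit into a short sketch. This is not TreeNet or Friedman's code: it assumes squared-error loss, two-node trees (stumps) written by hand with NumPy, and synthetic data; all names are illustrative.

```python
import numpy as np

def fit_stump(X, y):
    """Two-node regression tree: best single split by squared error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lo, hi = y[left].mean(), y[~left].mean()
            sse = ((y[left] - lo) ** 2).sum() + ((y[~left] - hi) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, lo, hi)
    return best[1:]

def predict_stump(stump, X):
    j, t, lo, hi = stump
    return np.where(X[:, j] <= t, lo, hi)

def boost(X, y, n_trees=150, learn_rate=0.1, subsample=0.5, seed=0):
    """Each small tree is fit to current residuals on a random half of the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    start = y.mean()
    score = np.full(n, start)
    stumps = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=int(subsample * n), replace=False)
        stump = fit_stump(X[rows], (y - score)[rows])    # grow tree on residuals
        score += learn_rate * predict_stump(stump, X)    # accept only part of the update
        stumps.append(stump)
    return start, learn_rate, stumps

def predict_boost(model, X):
    start, learn_rate, stumps = model
    # Total score: starting value plus the downweighted sum across all trees.
    return start + learn_rate * sum(predict_stump(s, X) for s in stumps)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200)

model = boost(X, y)
baseline_mse = np.mean((y - y.mean()) ** 2)
boosted_mse = np.mean((y - predict_boost(model, X)) ** 2)
```

The number of trees is a tuning choice here; in practice it would be picked on a test partition or by cross-validation, as the running score slides note.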
Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RF Defaults | 8.286 | 12.84
RF PREDS=6 | 8.002 | 12.05
TreeNet Defaults | 7.417 | 8.67 | 11.02
Using cross-validation on learn partition to determine optimal number of trees and then scoring the test partition with that model: TreeNet MSE=8.523
Vary HUBER Threshold: Best MSE=6.71
Vary threshold where we switch from squared errors to absolute errors
Optimum when the 5% largest errors are not squared in loss computation
Yields best MSE on test data. Sometimes LAD yields best test sample MSE.
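The switch from squared to absolute errors can be written out directly. This sketch uses the standard Huber loss with NumPy; the helper that sets the threshold from a quantile of the absolute residuals is an assumption about how "the 5% largest errors" might be operationalized, not a description of TreeNet's internals.

```python
import numpy as np

def huber_loss(residuals, threshold):
    """Squared error below the threshold, absolute (linear) error above it."""
    r = np.abs(residuals)
    quadratic = 0.5 * r ** 2
    linear = threshold * (r - 0.5 * threshold)
    return np.where(r <= threshold, quadratic, linear)

def threshold_from_quantile(residuals, fraction_not_squared=0.05):
    """Pick the threshold so the largest errors (e.g. top 5%) are not squared."""
    return np.quantile(np.abs(residuals), 1.0 - fraction_not_squared)

residuals = np.array([-0.5, 0.2, 1.0, -3.0, 10.0])
losses = huber_loss(residuals, threshold=2.0)
cutoff = threshold_from_quantile(np.arange(1.0, 101.0))
```

With a threshold of 2.0 the residual of 10 contributes 18 rather than the 50 it would under half squared error, so a few large errors no longer dominate the fit.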
Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RF Defaults | 8.286 | 12.84
RF PREDS=6 | 8.002 | 12.05
TreeNet Defaults | 7.417 | 8.67 | 11.02
TreeNet Huber | 6.682 | 7.86 | 11.46
TN Additive | 9.897 | 10.48
If we had used cross-validation to determine the optimal number of trees and then used those to score test partition the TreeNet Default model MSE=8.523
References MARS
• Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March).
• Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108.
• De Veaux R.D., Psichogios D.C., and Ungar L.H. (1993), A Comparison of Two Nonparametric Estimation Schemes: MARS and Neural Networks, Computers & Chemical Engineering, Vol. 17, No. 8.
References Regularized Regression
• Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12, 55-67.
• Friedman, Jerome. H. Fast Sparse regression and Classification. http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf
• Friedman, J. H., and Popescu, B. E. (2003). Importance sampled learning ensembles. Stanford University, Department of Statistics. Technical Report. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc. B. 58, 267-288.
References Regression via Trees
• Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, CRC Press.
• Breiman, L (1996), Bagging Predictors, Machine Learning, 24, 123-140
• Breiman, L. (2001) Random Forests. Machine Learning. 45, pp 5-32.
• Friedman, J. H. Greedy function approximation: A gradient boosting machine http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf Ann. Statist. Volume 29, Number 5 (2001), 1189-1232.
• Friedman, J. H., and Popescu, B. E. (2003). Importance sampled learning ensembles. Stanford University, Department of Statistics. Technical Report. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
What’s Next
• Visit our website for the full 4-hour video series
• https://www.salford-systems.com/videos/tutorials/the-evolution-of-regression-modeling
o 2 hours methodology
o 2 hours hands-on running of examples
o Also other tutorials on CART, TreeNet gradient boosting
• Download no-cost 60-day evaluation
o Just let the Unlock Department know you participated in the on-demand webinar series
• Contains many capabilities not present in open source renditions
o Largely the source code of the inventor of today’s most important data mining methods: Jerome H. Friedman
o We started working with Friedman in 1990 when very few people were interested in his work
Salford Predictive Modeler SPM
• Download a current version from our website http://www.salford-systems.com
• Version will run without a license key for 10 days
• For more time request a license key [email protected]
• Request configuration to meet your needs
o Data handling capacity
o Data mining engines made available