The SAS script containing the SuperLearner macro actually contains 4 main macros: %SuperLearner, %_SuperLearner, %CVSuperLearner, and %_CVSuperLearner.
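/* Option 1: load the macro directly from GitHub (requires internet access) */
FILENAME slgh URL "https://cirl-unc.github.io/SuperLearnerMacro/super_learner_macro.sas";
%INCLUDE slgh;
/* Option 2: load the macro from a local copy of the repository */
FILENAME slgh "C:/path/to/SuperLearnerMacro-XXXX/super_learner_macro.sas";
%INCLUDE slgh;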
Some examples of using the %SuperLearner macro are available here
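Stacking is based on what Wolpert refers to as a set of ‘level-0’ models and a ‘level-1’ model, each indexed by its own parameters and fit in some study sample S, where:
Level-0: a set of candidate models, each of which predicts Y from X
Level-1: a model that combines the level-0 predictions into a single prediction of Y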
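The two levels can be written compactly as follows. This is only a sketch: the symbols f_m, g, \beta_m, \alpha and M are assumed here for illustration and may differ from the notation used in the cited papers; only the sample S and the level-0/level-1 terminology come from the text above.

```latex
\begin{align*}
\text{Level-0:}\quad & \hat{Y}_m = f_m(X;\, \beta_m), \qquad m = 1, \ldots, M \\
\text{Level-1:}\quad & \hat{Y}_{sl} = g(\hat{Y}_1, \ldots, \hat{Y}_M;\, \alpha)
\end{align*}
```

For example, if the level-1 model is fit by non-negative least squares, g is a weighted sum of the level-0 predictions, \sum_{m} \alpha_m \hat{Y}_m, with each \alpha_m \geq 0.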
The parameterization of the macro is based loosely on this notation. Each level-0 model is referred to as a ‘learner’ in the super learner library. A call to super learner is structured as follows:
%SuperLearner(
Y=,
X=,
library=,
indata=,
preddata=,
outdata=sl_out,
dist=GAUSSIAN,
method=NNLS,
id=,
by=,
intvars=,
binary_predictors=,
ordinal_predictors=,
nominal_predictors=,
continuous_predictors=,
weight=,
trtstrat=false,
folds=10
);
Macro parameters include the following:
Y: [value = variable name] the target variable, or outcome
X: [value = blank, or a space separated list of variable names] predictors of Y on the right side of the level-0 models. Note that this is a convenience function for the individual [coding]_predictors macro variables. The macro will make a guess at whether each predictor in X is continuous, categorical, or binary. (OPTIONAL but at least one of the X or [coding]_predictors - binary_predictors, ordinal_predictors, nominal_predictors, continuous_predictors - parameters must be specified, as described below). If X is specified and any one of the [coding]_predictors has a value, the macro will generate an error.
library: [value = a space separated list of learners] the names of the m level-0 models (e.g. glm lasso cart). A single learner can be used here if you only wish to know the cross-validated expected loss (e.g. mean-squared error). See all available default learners here and how to construct new learners here
indata: [value = an existing dataset name] the dataset used for analysis that contains Y and all predictors (and weight variables, if needed)
preddata: [OPTIONAL value = a dataset name] the validation dataset: a dataset that contains all predictors (and possibly Y) and is not used in model fitting, but in which predictions are made for each learner and for super learner
outdata: [value = a dataset name; default: sl_out] an output dataset that will contain all predictions as well as all variables and observations in the indata and preddata datasets
dist: [value = one of: GAUSSIAN,BERNOULLI; default GAUSSIAN] Super learner can be used to make predictions of a continuous (assumed gaussian in some learners) or a binary variable. Use GAUSSIAN for all continuous variables and BERNOULLI for all binary variables. Nominal/categorical variables currently not supported.
method: [value = one of: NNLS, CCNLS, OLS, NNLOGLIK, CCLOGLIK, LOGLIK, CCLAE, NNLAE, LAE, CCRIDGE, NNRIDGE, RIDGE, BNNLS, BCCNLS, BOLS, BNNLOGLIK, BCCLOGLIK, BLOGLIK, BCCLAE, BNNLAE, BLAE, BCCRIDGE, BNNRIDGE, BRIDGE, BCCLASSO, BNNLASSO, BLASSO; default NNLS] the method used to estimate the coefficients of the level-1 model. Methods are indexed by [prefix][suffix], where the prefix sets constraints and the suffix sets the mean model form
id: [OPTIONAL value = variable name] a variable that uniquely identifies clusters or individuals within a dataset where there are possibly multiple records per cluster/individual (e.g. discrete hazard estimation).
by: [OPTIONAL value = variable name] a by variable in the usual SAS usage. A separate super learner fit will be made for each level of the by variable (only one by variable is allowed, unlike typical “by” statements).
intvars: [OPTIONAL value = variable name] an intervention variable that is included in the list of predictors. This is a convenience function that will make separate predictions with the intvars variable set to 1 and to 0 (with all other predictors remaining at their observed levels)
binary_predictors: [value = blank, or a space separated list of variable names] advanced specification of predictors: a space separated list of binary predictors (OPTIONAL but at least one of the X or [coding]_predictors parameters must be specified)
ordinal_predictors: [value = blank, or a space separated list of variable names] advanced specification of predictors: a space separated list of ordinal predictors (OPTIONAL but at least one of the X or [coding]_predictors parameters must be specified)
nominal_predictors: [value = blank, or a space separated list of variable names] advanced specification of predictors: a space separated list of nominal predictors (OPTIONAL but at least one of the X or [coding]_predictors parameters must be specified)
continuous_predictors: [value = blank, or a space separated list of variable names] advanced specification of predictors: a space separated list of continuous predictors (OPTIONAL but at least one of the X or [coding]_predictors parameters must be specified)
weight: [OPTIONAL value = a variable name] a variable containing weights representing the relative contribution of each observation to the fit (a.k.a. case weights). Not all learners will respect non-integer weights, so weights will either be ignored or truncated by some procedures.
trtstrat: [value = true, false; DEFAULT: false] convenience function. If this is set to true and intvars is specified, then all fits will be stratified by levels of intvars. Levels 0,1 only.
folds: [value = integer ; default: 10] number of cross-validation folds to use.
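As an illustration of how these parameters fit together, here is a minimal sketch of a call on simulated data. The dataset names (train, valid), the variable names (y, a, c1, c2) and the simulated relationships are hypothetical; the library value reuses the learner names given as an example above (glm lasso cart), which are assumed to be available in your SAS installation.

```sas
/* Load the macro definitions (same statements as shown earlier in this document) */
FILENAME slgh URL "https://cirl-unc.github.io/SuperLearnerMacro/super_learner_macro.sas";
%INCLUDE slgh;

/* Hypothetical data: continuous outcome y, binary predictor a,
   continuous predictors c1 and c2. The first 500 rows are used for
   fitting (train); the last 100 are held out for prediction (valid). */
DATA train valid;
  CALL STREAMINIT(1234);
  DO i = 1 TO 600;
    a  = RAND('BERNOULLI', 0.5);
    c1 = RAND('NORMAL');
    c2 = RAND('UNIFORM');
    y  = 1 + 0.5*a + c1 - 2*c2 + RAND('NORMAL');
    IF i <= 500 THEN OUTPUT train;
    ELSE OUTPUT valid;
  END;
  DROP i;
RUN;

/* Fit super learner on train and predict in both train and valid.
   X= lets the macro guess the type of each predictor. */
%SuperLearner(
  Y=y,
  X=a c1 c2,
  library=glm lasso cart,
  indata=train,
  preddata=valid,
  outdata=sl_out,
  dist=GAUSSIAN,
  method=NNLS,
  folds=10
);

/* List the variables in the output dataset to find the prediction columns */
PROC CONTENTS DATA=sl_out;
RUN;
```

The names of the prediction variables are not assumed here; PROC CONTENTS is used to discover them in sl_out.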
%_SuperLearner is a version of the %SuperLearner macro for advanced users. It may be somewhat faster due to reduced error checking, and it offers finer-grained control. If the %SuperLearner macro completes successfully, it will give some example code that can be run with %_SuperLearner. Of note, %_SuperLearner does no checking or correction of parameter syntax and its parameter arguments are case-sensitive, so a call that works in %SuperLearner may cause an error in %_SuperLearner.
One main difference is that %_SuperLearner will make no guesses about variable types for X, so use of the [coding]_predictors parameters is required for correct specification. See the source code for documentation of additional options.
The %CVSuperLearner macro is used to estimate the cross-validated expected loss of super learner itself. It does not produce predictions! This gives an idea about whether super learner is the appropriate learner to use in a given scenario, and allows some choice between parameters of the super learner model, such as the method (e.g. NNLS vs. CCNLS).
Options repeated from %SuperLearner (see definitions given above):
Y, X, binary_predictors, ordinal_predictors, nominal_predictors, continuous_predictors, weight, indata, dist, library, method
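A comparison of two level-1 methods with %CVSuperLearner might look like the sketch below. It uses only the options listed above; the dataset name (cvdat), the variable names and the simulated data are hypothetical, and the learner names are the ones given as an example earlier in this document. The form of the printed results is not assumed here.

```sas
/* Load the macro definitions (same statements as shown earlier in this document) */
FILENAME slgh URL "https://cirl-unc.github.io/SuperLearnerMacro/super_learner_macro.sas";
%INCLUDE slgh;

/* Hypothetical data: continuous outcome y, binary predictor a,
   continuous predictors c1 and c2 */
DATA cvdat;
  CALL STREAMINIT(4321);
  DO i = 1 TO 500;
    a  = RAND('BERNOULLI', 0.5);
    c1 = RAND('NORMAL');
    c2 = RAND('UNIFORM');
    y  = 2 + a - c1 + 0.5*c2 + RAND('NORMAL');
    OUTPUT;
  END;
  DROP i;
RUN;

/* Cross-validated expected loss of super learner with method=NNLS.
   Predictor types are given explicitly instead of using X=. */
%CVSuperLearner(
  Y=y,
  binary_predictors=a,
  continuous_predictors=c1 c2,
  indata=cvdat,
  dist=GAUSSIAN,
  library=glm lasso cart,
  method=NNLS
);

/* Same data and library with method=CCNLS, for comparison */
%CVSuperLearner(
  Y=y,
  binary_predictors=a,
  continuous_predictors=c1 c2,
  indata=cvdat,
  dist=GAUSSIAN,
  library=glm lasso cart,
  method=CCNLS
);
```

The method with the lower cross-validated expected loss would be the natural candidate to pass to %SuperLearner when producing predictions.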
%_CVSuperLearner is a version of the %CVSuperLearner macro for advanced users. It may be somewhat faster due to reduced error checking, and it offers finer-grained control. See the source code for further tuning options.
See the Troubleshooting help
D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Report, Division of Biostatistics, University of California, Berkeley, 2007.
E. C. Polley and M. J. van der Laan. Super learner in prediction. Report, Division of Biostatistics, University of California, Berkeley, 2010.
This work was only possible with valuable advice and beta testing from the following people: Stephen R Cole, Jessie K Edwards, Katie M O'Brien, Eric Polley, Marie Stoner, Jennifer Winston, and many others.