Skip to content

Adding a new model to rstanarm

Jonah Gabry edited this page Apr 8, 2018 · 9 revisions

Update: please see the developer notes at mc-stan.org/rstanarm/dev-notes/ for more recent instructions than those on this page.

Introduction

Adding a new model to the rstanarm package is not difficult, but it is tedious to make all the necessary changes to all the required files that are spread throughout the rstanarm package. The following attempts to enumerate all these details, using an instrumental variables model as an example.

Steps

Step 0: Create a branch

Create a git branch for your work. If you use the command line version of git, it would be something like

git checkout master
git checkout -b feature/ivreg

where the name of the branch depends on the R function you are trying to emulate in Step 1.

Step 1: Identify the R function to emulate

The first step is to locate an existing function in R or an R package that you want to emulate in rstanarm. In fact, rstanarm abides by the principle that the stan_foo function should specify the same likelihood of the data as the foo function. Moreover, the core arguments of the foo function that are necessary to specify the likelihood of the data should be arguments of the stan_foo function with the same names, defaults, and ordering. In this example, we want to emulate the ivreg function in the AER package, whose arguments are

> args(AER::ivreg)
function (formula, instruments, data, subset, na.action, weights, 
    offset, contrasts = NULL, model = TRUE, y = TRUE, x = FALSE, 
    ...) 

Consequently, when I create a stan_ivreg.R function in the R/ subdirectory of the rstanarm package, I start with

stan_ivreg <- function (formula, instruments, data, subset, na.action, weights, 
                        offset, contrasts = NULL, model = TRUE, y = TRUE, x = FALSE, 
                        ...) # there will be more arguments

In the rstanarm package, we position the ... to divide the arguments that we adopt from the emulated function from the arguments we add, which often pertain to the priors and cannot be abbreviated when called. In addition, the ... serves an important purpose because any additional arguments are passed on to the underlying fitting function in the rstan package, which is usually sampling. For example, the user could request 8 chains by calling stan_ivreg with chains = 8 and chains will be passed onto sampling.

Step 2: Parse the data in exactly the same way as the emulated function

In order to guarantee that the likelihood of the data is the same in stan_ivreg as in AER::ivreg, it is necessary for stan_ivreg to parse the data exactly like AER::ivreg does. The easiest way to do so is often to call a minimal version of the emulated function, and this approach is all-but-necessary when the emulated function calls unexported functions to parse the data. Consequently, the first few lines of stan_ivreg employ the usual R trick of creating a call, changing the symbol name of the function to be evaluated, and NULLifying unnecessary arguments.

  mc <- match.call(expand.dots = FALSE)
  mc$model <- mc$y <- mc$x <- TRUE
  # NULLify any Stan specific arguments in mc now
  mc[[1L]] <- quote(AER::ivreg)
  if (!requireNamespace("AER")) stop("the AER package is needed by 'stan_ivreg'")
  mf <- eval(mc, parent.frame())

At this point the mf object is the list documented in the Value section of help(ivreg, package = "AER"), so, I can use the x, y, and other elements of the mf list inside stan_ivreg.

In this case, the Stan program will need the data in a somewhat different format, so further processing will be necessary. However, it is good for x and y to be available in the original format for post-estimation functions.

In the case of AER::ivreg, there is a closed-form solution involving matrix algebra only, so it is computationally cheap to call AER::ivreg. In some cases, the emulated function will use numerical optimization to find the optimum, in which case it is best to call the emulated function with zero iterations if possible, which was done in the stan_glmer function. If so, there is probably some internal function that defines the log-likelihood of the data (perhaps multiplied by -2 or something) that you can refer to when implementing your Stan program in Step 4. However, if you call a function in another package then you need to add that package --- in this case AER --- to the Suggests section of the DESCRIPTION file of rstanarm.

Step 3: Hand-off to a .fit workhorse function

Many model-fitting functions in the stats package that come with R have a high-level version of the function that takes a formula, a data.frame, etc. and a low-level version of the function suffixed with .fit that is more intended for programmers and inputs an outcome vector, a design matrix, etc. For example, there is glm and glm.fit. The same convention is adopted in the rstanarm package, even if the function being emulated does not have an exported .fit counterpart. Thus, I create a stan_ivreg.fit function in a stan_ivreg.fit.R file in the R/ directory of the rstanarm package whose initial arguments are the same as those in AER::ivreg:

stan_ivreg.fit <- function (x, y, z, weights, offset, ...) { # there will be more arguments
  stopifnot(is.matrix(X), is.vector(y), is.matrix(z), length(y) == nrow(X), nrow(X) == nrow(Z))
}

The stan_ivreg.fit function will ultimately call the appropriate fitting function in the rstan package and return a stanfit object to the calling function, which is typically stan_ivreg. If there were no ivreg.fit function in the AER package, we have more freedom to specify the required form of the arguments to the stan_ivreg.fit function.

Step 4: Write (or edit) the Stan program

At this point, we are ready to write a Stan program for this model, which will define what objects the stan_ivreg.fit function will need to pass to the data block. While it is preferable to add to an existing .stan program if possible, in this case a new ivreg.stan file is needed because there is a matrix of endogenous variables (whereas the previously written .stan programs only take a vector or a one-dimensional array for the outcome). Some general rules for writing Stan programs that go into the rstanarm package are

  1. Put the Stan program into the exec/ subdirectory of the rstanarm package and use a .stan extension. In this case, it is ivreg.stan.
  2. As much as possible, utilize custom functions that are defined in the functions block of the Stan function because custom functions can (and must) be tested in R using the expose_stan_functions function in the rstan package. See Step 12 for details on that.
  3. As much as possible, utilize include directives in the Stan program that point to re-usable chunks of Stan syntax, such as #include 'make_eta.stan' which forms the linear predictor in a generalized linear model.
  4. If you need to make a new re-usable chunk of Stan syntax, put it under inst/chunks and use a .stan extension.
  5. Try to make the Stan program as general as possible, sacrificing readability if necessary.
  6. Calculate the sample average of the posterior predictive distribution in generated quantities.

Step 5: Create a stanreg object

The stan_ivreg.fit function will return an object of S4 class stanfit which is defined in the rstan package. The stan_ivreg function needs to take the stanfit object and use it to create a stanreg object which is an S3 class. This should be similar to the stanreg object produced by stan_glm and / or similar to the object produced by the emulated function. In particular, its minimal list elements for are documented in help("stanreg-objects", package = "rstanarm").

Step 6: Add prior options to the R functions

Go back and add arguments to the main R functions after the ... that input prior information about the parameters in the .stan program you wrote in Step 4. The ideal here is to have one hyperparameter that either takes a reasonable default value or can easily be required of the user, which conveys some weakly informative prior information. Beyond that you can add more prior options for advanced users, which may require some iteration between Step 4 and Step 6. Some priors are already implemented and documented in R/priors.R.

In the case of stan_ivreg and stan_ivreg.fit ...

Step 7: Modify the print method for a stanreg object

There is a print method for a stanreg object that uses branch logic to customize the output depending on what model-fitting function estimated it. In general, the output of print.stanreg should resemble the output when print or summary is called on an object produced by the emulated function but there can be less or more output.

Step 8: Deal with predict and / or posterior_predict

The predict function can be used when the model is estimated by maximum (penalized) likelihood and utilizes the point estimates of the coefficients. The posterior_predict method is preferable and is used when the model is estimated by MCMC or ADVI and utilizes all the draws. There is now a posterior_linpred function that is exposed but mostly used internally to produce the posterior (not predictive) distribution of the linear predictor, which can then be used to generate the predictive distribution. There are helper functions to be written, but it should mostly be clear from the existing R/posterior_predict.R file.

Step 9: Deal with log_lik and loo

The log_lik function returns the log-likelihood for each observation, rather than summing them up over the entire dataset. This allows the log-likelihood to be evaluated for newdata and with a set of draws from the posterior distribution. In turn, this allows for the Leave-One-Out (LOO) Information Criterion to be evaluated in a memory efficient way. Again, the helper functions in R/loo.R should provide enough guidance as to how to implement this for a new model.

Step 10: Reimplement appropriate post-estimation functions

The R package that contains the emulated function will often have post-estimation methods that users will be accustomed to calling. Thus, where appropriate, those same methods should be reimplemented in the rstanarm package. In the case of AER::ivreg, candidate methods can be found by executing

methods(class = "ivreg")
 [1] anova        bread        estfun       hatvalues    model.matrix predict      print       
 [8] summary      terms        vcov      

Of these, anova, bread, estfun, and hatvalues only make sense in a frequentist context where there is no prior information and can probably be ignored for the purposes of rstanarm. We have already covered the predict and print methods. There is already a general summary.stanreg method so a customized one is usually not necessary. That leaves model.matrix and terms, which can be implemented simply by calling the appropriate functions in the AER package, and vcov, which is calculated by the internal stanreg function.

Step 11: Write documentation

We use the roxygen2 package, so the documentation is embedded in the .R files. See the documentation for the roxygen2 package. If you need to use or create any snippets, they are in the man-roxygen/ subdirectory. This form of documentation allows you to embed examples, but remember that the examples for each function should execute in five seconds or less. This usually requires that the examples utilize ADVI rather than MCMC or (less preferably) have to be embedded in a \dontrun block.

Step 12: Write unit tests

The tests/testthat directory contains unit tests. The test_stan_functions.R file compiles all the user-defined Stan functions in rstanarm, so be sure not to duplicate the name of an existing user-defined Stan function, which can be prevented by placing any user-defined Stan functions that are used in multiple Stan programs into inst/chunks/common_functions.stan. Add to test_stan_functions.R to ensure that the user-defined Stan functions you wrote produce the same output as their R counterparts.

You should create another file in tests/testthat called test_foo.R that tests the new functions more holistically. We often try to use optimization and no prior information here in order to compare the results to an existing frequentist R function. However, the results are often more different than one would think. Although it is suboptimal for unit-testing, it is a good idea to combine several things that you expect to pass into a single call in order to conserve time when calling R CMD check.

Step 13: Write a vignette

There are many examples in the vignettes/ subdirectory. We use the html_vignette format for output. Here we emphasize MCMC almost exclusively. Thus, the unit tests (ideally) involve optimization, the examples (reluctantly) involve ADVI, and the vignettes involve MCMC. There are some child chunks in the vignettes directory that you can include in your vignette.

Step n: Write the rest of this wiki page