Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added regression syntax design doc #27

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
256 changes: 256 additions & 0 deletions designs/0026-stan-regression-syntax.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
- Feature Name: Stan Regression Syntax
- Start Date: 2020-09-21
- RFC PR:
- Stan Issue:

# Summary
[summary]: #summary

The goal is to add a regression syntax to Stan similar to that offered by the high level interfaces (rstanarm and brms, and historically lm/glmer/lme4, etc.).

# Motivation
[motivation]: #motivation

Adding a regression syntax directly to Stan should make it easy to:

1. Develop high level Stan interfaces that expose regression-like syntaxes without depending on external design matrix or sparse matrix libraries

2. Express likelihood evaluations in a way that can be automatically parallelized and sent to GPUs/other accelerators

3. Manipulate design matrices directly in Stan to do things like integrate out random variables

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

The basic multilevel regression syntax from rstanarm/brms/inla/lme4/etc. is written like:

```response ~ linear_model```

The response variable is the variable being modeled, the `~` indicates this expression is going to form a regression, and the `linear_model` is a representation of all the covariates, factors, and variables that go into predicting the response on the left.

Right off the bat this syntax is a little different than Stan code because there is no distribution. For instance, a normal likelihood term in Stan looks like:

```y ~ normal(mu, sigma)```

The regression model above leaves the distribution undefined, and usually the `linear_model` only defines a certain part of the model. In terms of Stan syntax this might look like:

```response ~ ?(linear_model, ?)```

where the `?` terms are undefined.

Different regression software packages make different assumptions about the choice of likelihood and link functions and how extra parameters of the model are defined. This is true too for priors, which are either inferred automatically or specified separately from the linear model.

Stan already provides a way to express priors, likelihoods, and link functions though, and so the key to a regression syntax in Stan is a mechanism to support the `linear_model` part of the regression.

## Linear Models

brms breaks down the linear part of the model into a few terms:

```linear_model = population_level_terms + group_level_terms + smooth_terms```

This proposal is going to ignore the `smooth_terms` part of the regression (not because they aren't interesting, just because it is a big enough project to work on the first two terms).

The responses in multilevel regression models are arranged into groups. That is why there are both population level terms and group level terms -- the population level terms account for the overall effects and the group level effects account for the group level variation.

A basic regression might be:
```
y ~ 1 + x + (1 | group) + (x | group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 1 in the random effects part what would the generated name be? Just like [group name]_intercept?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would depend on the package, but I'd expect something like that (and for it to be a vector with the number of elements of the group).

```

The `1 + x` term are the population level effects. The `1` indicates that there should be a population level intercept in the model and the `x` term indicates a population level slope term should be fit.

The `(1 | group) + (x | group)` terms are the group level effects. The `(1 | group)` term indicates there should be a group level intercept (a different intercept term for every group) and the `(x | group)` term indicates there should be group level slopes (a different slope term for every group).

The regression implicitly defines population level intercept and slope parameters, intercept and slope parameters for each group, and any hierarhical parameters to go along with these. The implicitly defined variables is something that a Stan syntax would not have; all parameters are explicitly defined in Stan.

Though the statistical dichotomy in the problem are the two types of terms, the population level effects and the group level effects, in terms of computation, evaluating a linear model like this splits into a dense matrix vector multiply and a sparse matrix vector multiply. The population intercept and continuous population effects can be handled efficiently in a dense matrix vector multiply, and the population level factors variables and all the group level terms can be handled efficiently in a sparse matric vector multiply. This is how the linear model is formulated in the [lme4 paper](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf).

There are numerous other syntaxes in various regression packages in R. For instance, `(1 + x | group)` implies a correlated prior on the group level slopes and intercepts in lme4, but `(1 + x || group)` does not. `x - 1` means do not fit a population level intercept. For a syntax in Stan, where parameters must be explicitly defined, this extra syntax will not be necessary.

One thing not mentioned so far are the higher order groupings. Terms like `(x | group1:group2)` indicate that the slopes should vary according to the unique combination of `group1` and `group2` labels on each response (so a combined grouping of `group1` and `group2`). The aims of this proposal can be met with a syntax that supports the terms in the regression:

```
1 + x + (1 + x | group1) + (1 + x | group1:group2)
```

## Possible Implementation

The above syntax are already supported in Stan. `brms` and `rstanarm` already use Stan code directly. The problem is, while this syntax is fairly routine Stan code, it does not make it easy to meet the goals laid out in the beginning.

1. Interface developers are still dependent on design matrix libraries (that may differ across languages)

2. While perhaps efficient for a CPU, the code is unlikely to be efficient for a GPU or accelerator

3. The cleanest way to express the sparse matrix-vector multiply in Stan are for loops which do not make it easy to get design matrices out.

For clarity, the example model from above can be written in Stan syntax. Here is the regression model:
```
y ~ 1 + x + (1 + x | group1) + (1 + x | group1:group2)
```

Assume there are `N` responses, `G1` unique values of group 1, and `G2` values of group 2.

The data for the Stan model would look like:

```
int N; // number of responses
int G1; // number of unique members of group 1
int G2; // number of unique members of group 1
vector[N] y; // responses
vector[N] x; // covariates
int<lower=1, upper=G1> group1_idx[N]; // group 1 membership of the responses
int<lower=1, upper=G2> group2_idx[N]; // group 2 membership of the responses
```

The parameters would look like:

```
real intercept; // population level intercept
real slope; // population level slopes
vector[G1] group1_intercepts; // group 1 intercepts
vector[G1] group1_slopes; // group 1 slopes
matrix[G1, G2] group12_intercepts; // group1:group2 intercepts
matrix[G1, G2] group12_slopes; // group1:group2 slopes
```

The linear model can either be evaluated in a loop:

```
vector[N] linear_model;
for(n in 1:N) {
linear_model[n] = intercept + slope * x[n] +
group1_intercepts[group1_idx[n]] + group1_slopes[group1_idx[n]] * x[n] +
group12_intercepts[group1_idx[n], group2_idx[n]] + group12_slopes[group1_idx[n], group2_idx[n]] * x[n];
}
```

or it can be written in a vectorized form:

```
vector[N] linear_model = intercept + slope * x +
group1_intercepts[group1_idx] + group1_slopes[group1_idx] .* x +
diagonal(group12_intercepts[group1_idx, group2_idx]) +
diagonal(group12_slopes[group1_idx, group2_idx]) .* x;
}
```

The `diagonal(...)` terms are a very inefficient way to encode the combined groupings -- in all practical cases these would be written as single group terms by actually creating a new single group index to represent the combined `group1:group2` terms. The diagonal terms suffice for demonstration though.

There's really nothing to do more with the population intercept and slope. For everything else, a basic syntax that would be suitable for Stan would be:

```
covariate .* (parameters | group_idx)
```

where:

1. `group_idx` is the group index of the responses. It would be an integer array of length `N` in the range `[1, G]` assuming `G` groups.
2. `parameters` are the group level coefficients. It would be a length `G` vector-like variable.
3. `covariate` is an optional, length `N` vector-like set of covariates.

In the simplest case where `group_idx` is a 1d array of integers of length N, `parameters` is a vector of length `G`, and `covariate` is a length `N` vector of coefficients this would be equivalent to the Stan code here:

```
covariate .* parameters[group_idx]
```

This is not anything beyond what is already possible, but with a few variations on the syntax it can be.

First, this syntax can support multiple-group indexing. Instead of writing loops, or doing an inefficient `covariate .* diagonal(parameters[group1, group2])`, or refactoring the two groups into one, the multiple group membership could be handled by:

```
covariate .* (parameters | group1, group2)
```

where `parameters` is now a 2d variable. To get the design matrices, the `parameters` variable could be removed.

So if the following returns a length `N` vector:
```
covariate .* (parameters | group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always going to be a .*? If so I think this would be easier to work with in the compiler and C++ code if we directly put the covariates in the () like

(covariate, parameters | group)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that seems good. I am nervous the .* there was actually more specialized than a .* in the wild might be, if that makes sense. The Fabular thing that @enetsee linked above does something like:

(covariate {parameters} | group)

Not opposed to something else either.

```

Then this would return an `N` by `G` design matrix:
```
design_matrix dmat = covariate .* (1 | group)
```

And then this design matrix could be cast to a `sparse_matrix` (once these are in Stan):
```
sparse_matrix smat = mat;
```

With the sparse matrix, the matrix vector product could be evaluated directly to get the equivalent of `covariate .* (parameters | group)`:

```
smat * parameters
```

But the design matrix could also get a `operator()` so that this was equivalent to:

```
dmat(parameters)
```

It might seem awkward to do this with call syntax, but this might be clearer for multiple group design matrices. For instance:

```
design_matrix dmat2 = (1 | group1, group2);
```

In this case `coefficients_matrix` would need compressed to a vector to make the sparse matrix vector product make sense, whereas `dmat2(coefficients_matrix)` can mean the effective matrix vector product without worrying about flattening `coefficients_matrix`. An extra function `flatten` could be provided to do this reordering if necessary.

With this syntax, the regression from above could be rewritten:
```
vector[N] mu = intercept + slope * x +
(group1_intercepts | group1_idx) + x .* (group1_slopes | group1_idx) +
(group12_intercepts | group1_idx, group2_idx) +
x .* (group12_slopes | group1_idx, group2_idx);
Comment on lines +204 to +207
Copy link
Collaborator

@SteveBronder SteveBronder Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this I'm not really sure exactly how we'd optimize this on the backend. I kind of like something similar to your original proposal for a hiearchy() function for something like

vector[N] mu = intercept + slope * x + 
  hierarchy(f, group1_idx, group2_idx, group12_intercepts, x, group12_slopes)

Where f is a function with a signature like vector f(real intercept, matrix x, vector slope) that would be applied to each group (in this case the group is the combo of groups 1 and 2).

I wonder if we could do anything useful if we have a group_index type in the stan language? Or could we give matrices an index? Maybe something like

group_index[N] group_idx1;
group_index[N] group_idx2;
// Anything after first two dims for a matrix are indices?
matrix[N, M, group_idx1, group_idx2] X;
// Or maybe a tag?
matrix[N, M] X index(group_idx1, group_idx2);

Though I think the issue with what I have here is that we would have to pay for a sort / unsort if the data was not presorted so we could iterate over the groups nicely

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure exactly how we'd optimize this on the backend

I imagined this would be something recognizable by the compiler. Like, "oh, I have a bunch of design matrices, better lump them together and turn them into a big hierarchy statement".

But also I think @t4c1 likes expressions and I could see that working too, so it kinda all comes together at runtime. As long as it gets into a big hierarchy-like statement it's good.

group_index type in the stan language
we would have to pay for a sort / unsort if the data

Is the goal here to express a sparseness or to express the ordering of the matrix to make accesses efficient in some way?

```

All of the individual design matrix terms can be combined and evaluated at once. It should be possible to take advantage of group indices being reused in multiple terms (to avoid reading them twice).

To get the overall sparse design matrix for this expression, it would be necessary to have an equivalent of `bind_cols` for the design matrices.

```
design_matrix dmat = bind_cols((1 | group1_idx), x .* (1 | group1_idx),
(1 | group1_idx, group2_idx), x .* (1 | group1_idx, group2_idx));
vector[N] mu = dmat(group1_intercepts, group1_slopes, group12_intercepts, group12_slopes);
```

The equivalent sparse matrix and vector could be constructed with some extra code:
```
sparse_matrix Z = dmat;
vector[N] mu = Z * flatten(group1_intercepts, group1_slopes, group12_intercepts, group12_slopes);
```

And that's that. This is a regresion syntax for Stan that does not depend on outside design matrix libraries for Stan, can be written in a way that makes it (hopefully) possible to compile large expressions that can be passed off to accelerators, and allows design matrices, if necessary, to be extracted.

## Appendix: Multiple part regressions

It is natural to write some regressions in multiple parts. For instance, election responses might be modeled per state, with each state being in a certain region of the country. This looks like a smaller regression model feeding into a larger one:

```
vector[G] state_effect = (region_effect | region);
vector[N] mu = (state_effect | state);
```

This could be written as nested design matrix expressions:
```
vector[N] mu = ((region_effect | region) | state)
```

Or this could be broken out to make the design matrices more explicit:
```
vector[N] mu = (1 | state)((region_effect | region))
```

Taking this one step further:
```
vector[N] mu = (1 | state)((1 | region)(region_effect))
```

And then finally there is a combined design matrix:
```
design_matrix dmat = (1 | state)((1 | region));
vector[N] mu = dmat(region_effect);
```