Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added regression syntax design doc #27

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bbbales2
Copy link
Member

@bbbales2 bbbales2 commented Sep 26, 2020

I've been thinking about some sort of regression syntax in Stan. It's kind of a follow up to Speeding up hierarchical models. When I say regression syntax I'm talking about the interfaces (brms/rstanarm/lme4/glmer/lm/...) that take in something like the R regression syntax (y ~ 1 + x + (1 | group) + ...) and do something with it.

It's probably something that has come up before, but there's just so much there it's hard to talk about it without writing a lot down. So I tried to write down the bare minimum justification and implementation of a regression syntax in Stan.

Stan is written to do regressions and more, so at face value this doesn't seem that important but I'm trying to hit three points:

  1. Make it possible to avoid design matrix libraries outside of Stan
  2. Package up big computation to send it off to a GPU/accelerator
  3. Build design matrices if you need them

Now that I got that written out I have some questions hoooray:

  1. @jgabry @paul-buerkner @bgoodri does avoiding external design matrices buy us anything from an interface-writing perspective? I thought it would when I wrote this, but I've never actually written a high level regression interface and so I have my doubts.

  2. @SteveBronder @t4c1 @rok-cesnovar is this syntax enough to pack things up into big, GPU-friendly calculations?

  3. @dpsimpson is it actually useful to build out design matrices in the way I've written here? I remember the sparse stuff you talked about we do linear algebra on them, but maybe this syntax isn't enough to build the matrices you need and you're left having to pass them in from the outside anyway.

Rendered markdown is here

@bbbales2 bbbales2 marked this pull request as draft September 26, 2020 23:08
@seantalts
Copy link
Member

Tagging @enetsee who is also interested in this topic!

@dpsimpson
Copy link

dpsimpson commented Sep 28, 2020 via email

@t4c1
Copy link
Contributor

t4c1 commented Sep 28, 2020

So this is just new syntax for something that is already possible? While I am more familiar with the existing syntax I guess that is fine if that syntax is closer to majority of Stan users.

I have to say I do not see what is the point of the design matrix stuff. To me it just looks like forcing sparse algebra, which will probably be slower that dense version. Are there any advantages in terms of usability?

Anyway, the dense should work fine with GPUs. With kernel generator it won't be hard to implement - we can just write something very similar to the vectorized form of the model. Although the CPU implementation will be hard to optimize, due to the version of Eigen we are using not supporting indexing with vectors. On the other hand we do not have any GPU implementations for sparse operations yet.

@dpsimpson
Copy link

dpsimpson commented Sep 28, 2020 via email

@enetsee
Copy link
Member

enetsee commented Sep 28, 2020

@seantalts - thanks for the nudge!

@bbbales2, as Sean mentioned, I'm very interested in this topic and is a use case I had in mind when we were looking at the structure of the compiler. Rather than adding this as new syntax to Stan the language, the idea was to have an alternative frontend just for regression models which elaborated to core Stan or the intermediate representation we use in the compiler (i.e. we have an additional step where we recover anything that was left implicit in the terser language of regression).

There is some really interesting work by Andy Gordon on a 'regression calculus' here (which is just a fancy way of describing a core language for regression along with its meaning) which I was keen on extending to add BRMS like functionality. I spoke with Andy Gordon about this briefly at StanCon in Cambridge but I haven't had too much time to follow up of late.

There was also a presentation at StanCon of another front end language, Chronikis which was targeted at time series models and again, elaborated to Stan.

In terms of motivation, I think we have to recognize that syntax is important: it facilitates the expression of recurring patterns and allows us to restrict expressiveness. With specialized front-end we can also introduce stricter semantics allowing us to pick up errors more easily. The very fact that BRMS has a large user base is evidence of the utility of this approach.

I would be very keen to get involved with this from a language design perspective.

@paul-buerkner
Copy link

I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as posterior_predict and log_lik (using our R names right now) for these models directly in Stan, which is otherwise a little ugly to do in the generated quantities block.

There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in paul-buerkner/brms#489). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot.

Copy link
Collaborator

@SteveBronder SteveBronder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple little comments. My main Qs are mostly around how we would use this syntax to optimize things on the backend. @t4c1 if you check out the lmer paper they go into details on why the sparse matrix representation can be more efficient for large numbers of random effects. Though I do wonder if there is a way we can keep this dense while being efficient

https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf


A basic regression might be:
```
y ~ 1 + x + (1 | group) + (x | group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 1 in the random effects part what would the generated name be? Just like [group name]_intercept?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would depend on the package, but I'd expect something like that (and for it to be a vector with the number of elements of the group).


So if the following returns a length `N` vector:
```
covariate .* (parameters | group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always going to be a .*? If so I think this would be easier to work with in the compiler and C++ code if we directly put the covariates in the () like

(covariate, parameters | group)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that seems good. I am nervous the .* there was actually more specialized than a .* in the wild might be, if that makes sense. The Fabular thing that @enetsee linked above does something like:

(covariate {parameters} | group)

Not opposed to something else either.

Comment on lines +204 to +207
vector[N] mu = intercept + slope * x +
(group1_intercepts | group1_idx) + x .* (group1_slopes | group1_idx) +
(group12_intercepts | group1_idx, group2_idx) +
x .* (group12_slopes | group1_idx, group2_idx);
Copy link
Collaborator

@SteveBronder SteveBronder Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this I'm not really sure exactly how we'd optimize this on the backend. I kind of like something similar to your original proposal for a hiearchy() function for something like

vector[N] mu = intercept + slope * x + 
  hierarchy(f, group1_idx, group2_idx, group12_intercepts, x, group12_slopes)

Where f is a function with a signature like vector f(real intercept, matrix x, vector slope) that would be applied to each group (in this case the group is the combo of groups 1 and 2).

I wonder if we could do anything useful if we have a group_index type in the stan language? Or could we give matrices an index? Maybe something like

group_index[N] group_idx1;
group_index[N] group_idx2;
// Anything after first two dims for a matrix are indices?
matrix[N, M, group_idx1, group_idx2] X;
// Or maybe a tag?
matrix[N, M] X index(group_idx1, group_idx2);

Though I think the issue with what I have here is that we would have to pay for a sort / unsort if the data was not presorted so we could iterate over the groups nicely

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure exactly how we'd optimize this on the backend

I imagined this would be something recognizable by the compiler. Like, "oh, I have a bunch of design matrices, better lump them together and turn them into a big hierarchy statement".

But also I think @t4c1 likes expressions and I could see that working too, so it kinda all comes together at runtime. As long as it gets into a big hierarchy-like statement it's good.

group_index type in the stan language
we would have to pay for a sort / unsort if the data

Is the goal here to express a sparseness or to express the ordering of the matrix to make accesses efficient in some way?

@andrewgelman
Copy link

andrewgelman commented Sep 30, 2020 via email

@SteveBronder
Copy link
Collaborator

@bob-carpenter

@bbbales2
Copy link
Member Author

bbbales2 commented Oct 4, 2020

And obviously we can add cool things like GPs (dense) and spatial models
(often sparse), and smoothing splines (often sparse), and time series
models (sparse). It would be ideal if we could keep as much of the brms
notation as possible for these complex models...

@dpsimpson Yeah I like the y ~ gp(x) sorta thing that brms does.

Separately I would like some sort of Stan library system with a package manager. I think that's the only practical way we're going to get all the fancy design matrices and stuff that would need built for splines, approx GPs, etc in Stan models. Pumping that stuff through Stan math would be too slow development-wise. I think if we had a more formal/centralized way for people to share their code, we'd see a lot of interesting code getting shared (and avoid everyone having to reinvent things, digging in forums for code snippets, etc.).

So this is just new syntax for something that is already possible?
Are there any advantages in terms of usability?

@t4c1 Yeah, having the design matrices and being able to evaluate/use them efficiently is a plus. Part of the goal is usability and part of the goal is to make it easier for the compiler/runtime to recognize that a bunch of calculations are related. A little of both.

There is some really interesting work by Andy Gordon on a 'regression calculus'
another front end language, Chronikis

@enetsee The Fabular thing was interesting. Yes this is very much what I want to see in Stan. I like how they talk about breaking the regressions out into different levels. I also like the variables being explicit, but I'm probably not considering some advantage of them being implicit.

I had a look at Chronikis. I was kinda thinking about domain-specific Stan packages (alternative frontends like you're saying, and I talked to @imadmali about one). The problem with front end languages is I don't know compiler stuff and so the extent of compiling I want to do is Python format-strings sorta code generation.

The discourse thread at the top (this one) has sortof a brute-force giant function that we could use for evaluating the right hand side of a hierarchical regression. This would be something we could compile a higher level language to.

The downsides to this are it's not terribly easy to read the syntax, we don't get the design matrices themselves, and there's no hierarchy of regressions like in the regression calculus paper. I don't want to clutter the Stan language up with too much extra syntax, but I also wouldn't want to limit it to a super-conservative syntax either (cuz the intention is lots of people happily use Stan and the interface languages -- not filter everyone from Stan to interfaces).

There are obvious challenges of course but also a few perhaps not as obvious ones

@paul-buerkner Could we talk about the workflow stuff on this sometime? I am rather scared of forgetting something here. Maybe in a couple weeks after the 2.25 release (on the 19th)? @enetsee or @andrewgelman or anyone else if you want to come to this, that's great. I'll update the design doc with whatever we talk about.

@paul-buerkner
Copy link

paul-buerkner commented Oct 5, 2020 via email

@bbbales2
Copy link
Member Author

bbbales2 commented Oct 26, 2020

@paul-buerkner @enetsee @andrewgelman are you free any time in the next couple weeks to talk about the not-so-obvious challenges here?

Edit: I'm available basically anytime. If there's no obvious other preferable time, before or after the Stan meeting Thursday is good.

@paul-buerkner
Copy link

I would be flexible on most days as well. Feel free to suggest a time.

@bbbales2
Copy link
Member Author

@paul-buerkner let's plan on before the Stan meeting then. That's 2PM UTC Thursday (10AM EST) if I'm doing my math correctly.

@enetsee if you want to come and that doesn't work, pick another time. Sounds like we're both flexible. I'll e-mail a call link before then.

@paul-buerkner
Copy link

paul-buerkner commented Oct 26, 2020

If you want to talk this week, would it also work after the Stan meeting?

@bbbales2
Copy link
Member Author

After is good too, so let's say 4PM UTC, noon EST

@andrewgelman
Copy link

Hi, just curious what happened with this?

@bbbales2
Copy link
Member Author

@andrewgelman had a meeting and I never updated the design doc.

There were some things there worth writing down, but the thing that was nagging me when I was trying to update this were unresolved problems with defining parameters that might have different sizes as a function of data.

Problems like, you have an interaction:

parameters {
  matrix[N_ages, N_educs] alpha0;
}

And then you have a model where in your data you don't have at least one of all the ages and educations.

I sorta got stuck here. And if you don't solve this problem you still leave a lot of annoying shuffling of stuff to the outside.

@bbbales2
Copy link
Member Author

I'm not sure I talk about this problem in the design doc, but I think I am assuming that it's solved in some way, and that seems like a big assumption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants