
Add Laplace approximation as algorithm #16

Open
wants to merge 8 commits into base: master
Conversation

@bbbales2 (Member) commented Mar 8, 2020

This is a proposal to add a Laplace approximation algorithm to Stan.

Rendered output is here

Edit: I just totally changed the text here. A lot has changed, since the algorithm I initially put up was just wrong.

@bbbales2 (Member Author)

@jgabry can you look through this?

@bbbales2 changed the title from "Added proposal for adding Hessian of log density in optimizing output for cmdstan" to "Added proposal for supporting computing draws from posterior Laplace approximation when optimizing in cmdstan" on Mar 15, 2020
@bbbales2 (Member Author) commented Mar 15, 2020

Now this just mirrors the Rstan way of doing this. The prototype was relatively easy to implement.

If someone can approve that this is vaguely the right way to go (@jgabry, @avehtari or anyone), then I'll talk to people about interfaces and how this should go in.

Prototype code is there and can be run.


`laplace_draws` - The number of draws to take from the posterior approximation. By default this is zero, and no Laplace approximation is done.

`laplace_diag_shift` - A value added to the diagonal of the Hessian approximation to fix small near-singularities (defaults to zero).
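As a rough sketch of how `laplace_diag_shift` would enter the covariance computation (the NumPy helper below is illustrative only; `hessian` is assumed to be the Hessian of the log density at the unconstrained mode, and the exact sign convention for the shift may differ from the proposal):

```python
import numpy as np

def laplace_covariance(hessian, laplace_diag_shift=0.0):
    # Covariance of the normal approximation is the inverse of the negative
    # Hessian of the log density at the mode. The shift is added to the
    # diagonal here so a nearly singular matrix still inverts cleanly.
    neg_hessian = -hessian + laplace_diag_shift * np.eye(hessian.shape[0])
    return np.linalg.inv(neg_hessian)
```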


laplace_diag_shift -> laplace_diag_jitter or laplace_diag_add or laplace_add_diag?

Comment on lines 50 to 56
The new output would look like:

```
# stan_version_major = 2
...
# refresh = 100 (Default)
lp__,b.1,b.2
```


can you add lg__, i.e., the log density with respect to the approximation? Then it would be easier to use the Pareto k diagnostic and do importance resampling.

@bbbales2 (Member Author) Mar 16, 2020


I can evaluate both the true (edit: unnormalized) log density and the log density of the normal approximation. Do you want just one, or both?


Both lp__ and lg__, as in CmdStan ADVI and in RStan ADVI and Laplace.
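As a rough sketch of how the lp__ and lg__ columns could then feed importance resampling (the NumPy helper and its name are illustrative, not part of the proposal):

```python
import numpy as np

def importance_resample(draws, lp, lg, n_resample, rng=np.random.default_rng()):
    # draws: array of approximate-posterior draws, one row per draw
    # lp: lp__ column (unnormalized target log density per draw)
    # lg: lg__ column (log density of the normal approximation per draw)
    log_w = lp - lg
    w = np.exp(log_w - np.max(log_w))   # subtract the max for numerical stability
    w /= w.sum()
    idx = rng.choice(len(draws), size=n_resample, replace=True, p=w)
    return draws[idx]
```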

Comment on lines 79 to 82
Another design would be to print the Hessian on the unconstrained space and let users handle the sampling and the parameter transformation. The issue here is that there is no good way for users to do these parameter transformations outside of certain interfaces (at least Rstan, maybe PyStan).

Another design would be to print a Hessian on the constrained space and let users handle the sampling. In this case users would also be expected to handle the constraints, and I don't know how that would work practically (rejection sampling, maybe?).


Reading this I realized that it is not mentioned before what the draws are. I assume that the proposed approach is to draw from a multivariate normal and then transform to the constrained space? If so, it would be good to explicitly write what the values are for the "Draws from the Laplace approximation" in the new output.

@bbbales2 (Member Author)


Will add more.

If uopt is the unconstrained optimum, we're doing draws in pseudo-code like:

multivariate_normal(mean = uopt, cov = -inverse(hessian(uopt)))
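In slightly more concrete form (a sketch only, assuming uopt is the unconstrained mode and `hessian` is a callback returning the Hessian of the log density there):

```python
import numpy as np

def unconstrained_laplace_draws(uopt, hessian, n_draws, rng=np.random.default_rng()):
    # cov = -inverse(hessian(uopt)), as in the pseudo-code above
    cov = -np.linalg.inv(hessian(uopt))
    return rng.multivariate_normal(mean=uopt, cov=cov, size=n_draws)
```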


And then transform to the constrained space?

# Prior art
[prior-art]: #prior-art

Rstan does a version of this already.


This leaves it unclear if the RStan version is different from the design here.

@bbbales2 (Member Author)


I should check that it's actually the same. @bgoodri is this actually the same?

@avehtari

I now reviewed only the design. After it's fixed, I can review code, too.

@bgoodri

bgoodri commented Mar 16, 2020 via email

@bob-carpenter (Collaborator)

In terms of the right way to go, autodiff is better (in both speed and precision) where supported, but it doesn't exist for some of our higher-order solvers like the ODE integrators.

…out how constrained samples are generated as response to reviews
@bbbales2 (Member Author) commented Mar 17, 2020

@avehtari this should be available to test here:

git clone --recursive --branch=feature/design-doc-16-optimize-hessian https://github.com/stan-dev/cmdstan.git cmdstan-hessian

Currently I only implemented stuff for the lbfgs optimizer. Could you give this a go and check me for statistical correctness in terms of what I'm spitting out?

There are some software design issues:

  1. Need to share calculations between all the different optimizers (lbfgs, bfgs, and Newton)
  2. If we get errors on log_p evaluation, what to do
  3. If we get errors on generated quantities evaluation, what to do
  4. Do we print the Laplace draws in the output file? If we do, we have a CSV with two different formats of output. The optimization output has fewer columns than the sampling output, and they are also different things.

I'd prefer to print a warning and skip the sample for 2 and 3 instead of blowing up. In that case, if you request N posterior draws you might only get M < N actual draws. Blowing up is bad because then you have to redo the whole optimization (and things might still blow up). I think if we do rejection sampling to make sure we get N posterior draws, we'd be in danger of people writing models that never finish.

Edit: I guess options for 4 are:

a. Leave the different outputs in the same file
b. Get rid of the actual MAP estimate. Just print samples from Laplace approximation
c. Make a laplace_file argument and put the draws in there

@bgoodri I'm currently just too lazy to write and test the hessian_auto equivalent that uses vars. If someone wants to yell at me I guess I'd get unlazy and go do it.

@bob-carpenter I think he was just saying to finite-difference the reverse-mode gradients to get the Hessian. That makes a lot of sense, but I just wanted to lean on what was available in stan-dev/math instead of writing my own.

@bgoodri

bgoodri commented Mar 17, 2020 via email

@bbbales2 (Member Author)

Sigh, I guess I'll eat my vegetables. stan/math/rev/functor/finite_diff_hessian_auto.hpp or something incoming, but I should still be able to talk through the other issues before I get this done.

@@ -13,7 +13,16 @@ When computing a MAP estimate, the Hessian of the log density can be used to con

I have a Stan model that is too slow to practically sample all the time. Because optimization seems to give reasonable results, it would be nice to have the normal approximation to the posterior to give some sense of the uncertainty in the problem as well.

An approximate posterior covariance comes from computing the inverse of the Hessian of the negative log density.
It is standard to compute a normal approximation to the posterior covariance comes from the inverse of the Hessian of the negative log density.


remove "comes" (I would remove it myself, but I don't have edit rights. It would be useful to have edit rights for design-docs repo)

```
constrained_sample = constrain(unconstrained_sample)
```

We can output unnormalized log densities of the actual model and the approximate model to compute importance sampling diagnostics and estimates.


We can output unnormalized log densities in the unconstrained space of the true posterior and the approximate posterior

@avehtari

Do we print the Laplace draws in the output file? If we do, we have a CSV with two different formats of output. The optimization output has fewer columns than the sampling output, and they are also different things.

Change the output for the optimization also when there are no approximate draws? If I remember correctly, ADVI has the modal value as the first row and, if the number of draws is > 0, the rest of the rows are draws. optimizing could follow that.

@betanalpha

I believe that the discussion is missing a critical point -- the current optimizer behavior is not appropriate for computing posterior Laplace approximations. The RStan reference implementation has never been correct.

Remember that the current optimizer doesn't include Jacobian corrections, so instead of computing a maximum of the posterior density it's actually computing the maximum of a penalized likelihood function. The Hessian at this penalized maximum likelihood could be used for defining confidence intervals (although because of the penalty functions they wouldn't have clear coverage properties) but the samples wouldn't have any meaning in this context.

In order to compute a Laplace approximation the optimizer would have to run with the Jacobian corrections turned on so that the maximum posterior density and the Hessian of the right posterior density are computed. This is sufficiently different behavior that it would, at the very least, warrant its own method rather than trying to force it into an inappropriate penalized maximum likelihood context.
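To make the distinction concrete, here is a small illustration (the Gamma example and the SciPy optimization are mine, purely for illustration): for a Gamma(a, b) density in sigma > 0 written in terms of x = log(sigma), the Jacobian-adjusted log density gains a +x term and its maximum corresponds to sigma = a/b, while dropping the Jacobian term recovers the constrained-space mode (a - 1)/b.

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 3.0, 2.0

def neg_lp_no_jacobian(x):
    # Gamma(a, b) log density evaluated at sigma = exp(x), no Jacobian term
    return -((a - 1.0) * x - b * np.exp(x))

def neg_lp_with_jacobian(x):
    # same, plus the log-Jacobian of sigma = exp(x), which is +x
    return -((a - 1.0) * x - b * np.exp(x) + x)

mode_no_jac = np.exp(minimize_scalar(neg_lp_no_jacobian).x)      # (a - 1) / b = 1.0
mode_with_jac = np.exp(minimize_scalar(neg_lp_with_jacobian).x)  # a / b = 1.5
```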

@avehtari

maximum of the posterior density it's actually computing the maximum of a penalized likelihood function.

Can you say this in terms of unconstrained and constrained space? We want the mode in the unconstrained space. Do we get that or something else (I've seen the explanation before, but keep forgetting enough to be uncertain)?

but the samples wouldn't have any meaning in this context.

When using importance sampling they have meaning as draws from a proposal distribution. It can be that this is not the optimal proposal distribution, but they have meaning, and we have diagnostics for when that proposal distribution is bad. Of course we would like to have the best proposal distribution.

@betanalpha

betanalpha commented Mar 18, 2020 via email

@bob-carpenter (Collaborator)

To put it another way, we do the optimization on the unconstrained space, but turn off the Jacobian adjustment for the change of variables. That's easy enough to turn on from the model class---it's just a template parameter. So maybe we need another flag on the service interface for optimization for +/- Jacobian.

I'm also with @betanalpha that we don't want to change current behavior, especially by automatically returning something much larger.

@bbbales2 (Member Author)

The current code computes the mode on the constrained space, not the unconstrained space.

I wasn't thinking about that. Good catch. I guess these are just different functions with and without the adjustments.

I just think that it would be much clearer to have a new route with this functionality that wraps the internal optimizer and adds all of this post processing.

Sounds like this would be better suited as a new algorithm? That way there's no point estimate/sample output confusion either -- the Laplace algorithm only samples from a Laplace approximation to the posterior.

@bob-carpenter (Collaborator)

So we'd have (1) MLE without Jacobian correction, with std errors from the Hessian on the unconstrained scale translated back to constrained, and (2) Laplace approximation using Jacobian correction, with a set of draws returned.

It'd be nice to retrieve the parameters of the normal for the Laplace approximation, as it'd let you generate more draws in the future. But that's very much a secondary consideration compared to getting the draws out in our standard interface.

@bbbales2 (Member Author)

Laplace approximation using Jacobian correction

I think we'd only do this, however we end up doing it interface-wise. At least, this is what I meant to do, so that is my goal.

@avehtari

The current code computes the mode on the constrained space, not the unconstrained space.

Thanks Mike, now I remember we have discussed this same issue several times before! It seems I just keep forgetting it because the mode and Hessian in the unconstrained space are so much more natural for me. It would be great if we can now also have the mode and Hessian in the unconstrained space.

@bbbales2 (Member Author)

This popped up over here: https://github.com/stan-dev/pystan/issues/694

I still need it for my own purposes too (currently just using a version of cmdstan with this hacked in).

Would a new algorithm be appropriate? So algorithms would be sampling, optimization, variational, laplace? If so I'll rewrite the design doc with that in mind and we can iterate on it.

@andrewgelman

andrewgelman commented Apr 16, 2020 via email

@bbbales2 changed the title from "Added proposal for supporting computing draws from posterior Laplace approximation when optimizing in cmdstan" to "Add Laplace approximation as algorithm" on Apr 20, 2020
@bbbales2 (Member Author)

@avehtari @betanalpha @bgoodri this is ready for comments again.

I rewrote it as a new algorithm since the first thing was just wrong. I think the differences @betanalpha pointed out between what we want to do here and optimization are enough that I kinda don't want to put them together, even if the gears are basically the same. I dunno. Opinions welcome.

It's kinda awkward that it'll share all the same parameters as optimization but be a different algorithm, but I guess that's just interface overhead.

I think @bgoodri is right and this should be written with finite differences of gradients instead of just finite differences. I was working with this code hacked together on another project and this was important (and not easily diagnosed). It makes me want to do the full second order autodiff, but we can't with the higher order algorithms.

@bob-carpenter (Collaborator)

Speaking of Hessians, we should also change the testing framework to use finite diffs of gradients rather than double finite diffs for Hessians. I think it could improve the tolerances dramatically.
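A minimal sketch of the finite-differences-of-gradients approach both comments refer to (the step size and the generic `grad` callback are illustrative, not the stan-dev/math implementation):

```python
import numpy as np

def hessian_from_gradients(grad, x, eps=1e-5):
    # Central finite differences of an analytic (e.g. reverse-mode autodiff)
    # gradient; typically far more accurate than double finite differences
    # of the log density itself.
    d = len(x)
    H = np.empty((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (grad(x + e) - grad(x - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)   # symmetrize away finite-difference asymmetry
```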

@avehtari

how to handle errors evaluating the log density

why not then output NA, -Inf, or Inf based on what the log density evaluation returns?

An alternative that seems appealing at first is printing out the mode and
hessian so that it would be possible for people to make their own posterior
draws.

This would be useful if people want more draws without needing to run the optimizer and Hessian computation again, but it's different behavior and would need to be considered together with variational output.

@bob-carpenter (Collaborator)

how to handle errors evaluating the log density

why not then output NA, -Inf, or Inf based on what the log density evaluation returns?

Errors raise exceptions, so there's not a return value. We can also have NaN propagate through (there's no NA in Stan like in R, but presumably we're talking about not-a-number).

@bbbales2 (Member Author)

why not then output NA, -Inf, or Inf based on what the log density evaluation returns?

I don't think there's a way to write these to csv files. Maybe we already have a convention for it though?

This would be useful if people want more draws without needing to run the optimizer and Hessian computation again

It's a drawback. Getting draws is somewhat expensive since we evaluate the model log density, so there's that.

@bbbales2 (Member Author)

Lookin' for some feeeedback. Otherwise I'll just do it.

The actual new code for this won't be much, but it'll be a lot of interface bits.

avehtari previously approved these changes Apr 30, 2020
@betanalpha

betanalpha commented Apr 30, 2020 via email

@bob-carpenter (Collaborator)

@betanalpha --- what would you like to see in the way of validation for something like this?

@betanalpha

betanalpha commented Apr 30, 2020 via email

@avehtari

avehtari commented May 1, 2020

I had thought this was more like a fix for the previous algorithm, but I also now realize the design document does not mention diagnostics. The Laplace algorithm itself is not novel and there is extensive literature demonstrating when it works and when it fails. The PSIS paper https://arxiv.org/abs/1507.02646 describes the diagnostics we can use, in the same way as we have diagnostics for MCMC. We can compute Monte Carlo standard errors, and we have diagnostics for when the Laplace approximation fails and when MCSEs are not reliable.
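As a rough sketch of the diagnostic workflow this enables, assuming the output carries lp__ and lg__ for each approximate draw (using ArviZ's psislw here is my assumption about one available PSIS implementation, not part of the design):

```python
import numpy as np
import arviz as az

def pareto_k_check(lp, lg):
    # lp, lg: lp__ and lg__ columns from the approximate draws as 1-D arrays
    log_ratios = lp - lg                          # target minus approximation
    smoothed_log_w, khat = az.psislw(log_ratios)  # Pareto-smoothed importance sampling
    # khat <= 0.7 is the usual threshold for the importance weights (and hence
    # the Laplace approximation) being reliable enough to use.
    return smoothed_log_w, khat
```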

@betanalpha

betanalpha commented May 1, 2020 via email

@andrewgelman

andrewgelman commented May 1, 2020 via email
