Add KDE bandwidth selectors using biased or unbiased cross-validation #2384

sethaxen · 2024-09-19T20:32:44Z

Description

As discussed on Slack, the existing bandwidth selection methods for kde oversmooth for draws from multimodal distributions with well-separated modes. This PR adds new bw options "ucv" and "bcv", which used unbiased or biased LOO-CV to select the bandwidth.

Example

import numpy as np
import matplotlib.pyplot as plt
import arviz as az
az.style.use('arviz-doc')

rng = np.random.default_rng(123)
x = np.concatenate([rng.normal(0, 1, 1000), rng.normal(60, 1, 1000)])
fig, ax = plt.subplots()
az.plot_kde(x, ax=ax, plot_kwargs={"color": "k"}, label="default")
az.plot_kde(x, ax=ax, bw='ucv', plot_kwargs={"color": "C0"}, label="UCV")
az.plot_kde(x, ax=ax, bw='bcv', plot_kwargs={"color": "C1"}, label="BCV")
ax.set_xlabel("x")
ax.set_ylabel("Density")
ax.legend()

Checklist

Does the PR follow official
PR format?
Has included a sample plot to visually illustrate the changes? (only for plot-related functions)
Is the new feature properly documented with an example?
Does the PR include new or updated tests to cover the new feature (using pytest fixture pattern)?
Is the code style correct (follows pylint and black guidelines)?
Is the new feature listed in the New features
section of the changelog?

📚 Documentation preview 📚: https://arviz--2384.org.readthedocs.build/en/2384/

sethaxen · 2024-09-20T19:08:17Z

Here are some example plots showing how KDEs with UCV and BCV bandwidths differ from those with the default bandwidth. In general, both methods seem to work pretty well when the original density does not have bounded support. BCV tends to smooth a little less than the default, while UCV smoothes even less. Personally I would still prefer the default or maybe BCV over UCV in almost every case except for the multimodal one in the OP:

In terms of performance, the CV-based bandwidth selection methods are significantly faster than the default, since _bw_isj is quite slow, but probably also because they tend to select lower bandwidths, which allows for fewer convolutions with the kernel.

In [1]: import arviz as az

In [2]: import numpy as np

In [3]: x = np.random.normal(0, 1, 10_000);

In [4]: %time [az.kde(x) for _ in range(1_000)];
CPU times: user 2.52 s, sys: 9.99 ms, total: 2.53 s
Wall time: 2.55 s

In [5]: %time [az.kde(x, bw='scott') for _ in range(1_000)];
CPU times: user 106 ms, sys: 0 ns, total: 106 ms
Wall time: 106 ms

In [6]: %time [az.kde(x, bw='ucv') for _ in range(1_000)];
CPU times: user 544 ms, sys: 228 μs, total: 544 ms
Wall time: 543 ms

In [7]: %time [az.kde(x, bw='bcv') for _ in range(1_000)];
CPU times: user 577 ms, sys: 73 μs, total: 577 ms
Wall time: 577 ms

sethaxen · 2024-09-20T19:49:27Z

Remaining pylint errors seem to be in code not touched in this PR

sethaxen · 2024-09-22T16:08:20Z

I re-made the plots in #2384 (comment) using the explicit algorithm in the reference (i.e. without histogram binning) to see if the results were very different, and for all but log-normal, the bandwidth from the explicit algorithm is indistinguishable from the one implemented here. In that case, explicit UCV looks like BCV.

aloctavodia · 2024-09-23T12:43:17Z

It seems UCV and BCV, can be quite noisy compared to "default" except when distribution are normal-like..., so their main use case could be multimodal densities? Did you compared against bw="isj"?

aloctavodia · 2024-09-23T12:45:18Z

arviz/stats/density_utils.py

 }


-def _get_bw(x, bw, grid_counts=None, x_std=None, x_range=None):
+def _get_bw(x, bw, grid_counts=None, bin_width=None, x_std=None, x_range=None):  # pylint: disable=too-many-positional-arguments


maybe we can disable "too-many..." globally. It's popping up in many places.

sethaxen · 2024-09-23T16:25:31Z

It seems UCV and BCV, can be quite noisy compared to "default" except when distribution are normal-like..., so their main use case could be multimodal densities?

I suspect it's more that the UCV and BCV criteria don't use any boundary correction, and when the distribution is bounded with high density near the bounds, then for large bandwidths the mean integrated squared error (MISE) is very high near the bound, while smaller bandwidths allow a sharper transition from low density on the wrong side of the bound to high density on the right side of the bound. So this explains why the bw that minimizes this objective for Normal, beta, t, and GMM produces smooth KDEs, while for exponential and uniform it looks undersmoothed. (log-normal looking weird is due to histogram binning, not due to the objective itself).

I've worked out the equivalent objective for boundary correction using reflection. Reflection is not technically equivalent to what we do, but I suspect it would produce similar results. I'm working out efficient evaluation of the objective now.

Did you compared against bw="isj"?

Not directly, but I will.

I also located the source for how the R stats module computes selects the UCV and BCV-based bandwidths, and they seem to use the identical approach to this PR, except they set the lower bound for the optimization to 0.1 * bw_max. See https://github.com/wch/r-source/blob/24e4ab74c81594bdbf9e05fd60b60cff2efe4215/src/library/stats/R/bandwidths.R#L107-L147 and https://github.com/wch/r-source/blob/24e4ab74c81594bdbf9e05fd60b60cff2efe4215/src/library/stats/src/bandwidths.c#L39-L72. Any differences between our results and theirs should be due to differences in binning and in the optimization algorithm.

sethaxen added 14 commits September 19, 2024 21:47

Add cross-validation-based bandwidth selection

69dfc83

Pass bin width to bandwidth functions

2831780

Make function more modular

b7b1df0

Document cv bandwidth methods

d2eae90

Add example using UCV bandwidth

2ea80f1

Document that ucv and bcv are acceptable values for bw

276479d

Fix bugs in cv score computation

0207f14

Use histogram utility function

ca30ec7

Add tests for density utils

c2515da

Reorganizing existing histogram test with other density utils tests

d26eb17

Make corrections to docstrings

76ac63c

Add missing test import

1332338

Fix pylint error

3588378

Remove unused imports

485b38a

sethaxen added 2 commits September 20, 2024 21:15

Disable pylint checks

51626c0

Update CHANGELOG.md

8947d60

sethaxen changed the title ~~[WIP] Add KDE bandwidth selectors using biased or unbiased cross-validation~~ Add KDE bandwidth selectors using biased or unbiased cross-validation Sep 20, 2024

sethaxen marked this pull request as ready for review September 20, 2024 19:17

sethaxen requested a review from aloctavodia September 20, 2024 19:18

Simplify code by using correlate

c85b86d

aloctavodia reviewed Sep 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add KDE bandwidth selectors using biased or unbiased cross-validation #2384

Add KDE bandwidth selectors using biased or unbiased cross-validation #2384

sethaxen commented Sep 19, 2024 •

edited

Loading

sethaxen commented Sep 20, 2024 •

edited

Loading

sethaxen commented Sep 20, 2024

sethaxen commented Sep 22, 2024 •

edited

Loading

aloctavodia commented Sep 23, 2024

aloctavodia Sep 23, 2024 •

edited

Loading

sethaxen commented Sep 23, 2024

Add KDE bandwidth selectors using biased or unbiased cross-validation #2384

Are you sure you want to change the base?

Add KDE bandwidth selectors using biased or unbiased cross-validation #2384

Conversation

sethaxen commented Sep 19, 2024 • edited Loading

Description

Example

Checklist

sethaxen commented Sep 20, 2024 • edited Loading

sethaxen commented Sep 20, 2024

sethaxen commented Sep 22, 2024 • edited Loading

aloctavodia commented Sep 23, 2024

aloctavodia Sep 23, 2024 • edited Loading

Choose a reason for hiding this comment

sethaxen commented Sep 23, 2024

sethaxen commented Sep 19, 2024 •

edited

Loading

sethaxen commented Sep 20, 2024 •

edited

Loading

sethaxen commented Sep 22, 2024 •

edited

Loading

aloctavodia Sep 23, 2024 •

edited

Loading