Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notion of "distance" or "scale" for indexes and selection #5453

Open
WardBrian opened this issue Jun 8, 2021 · 8 comments
Open

Notion of "distance" or "scale" for indexes and selection #5453

WardBrian opened this issue Jun 8, 2021 · 8 comments

Comments

@WardBrian
Copy link
Contributor

Is your feature request related to a problem? Please describe.

I've been using xarray with atmospheric data given on pressure levels. This data is best thought of in log(pressure) for computation, but it is stored and displayed as standard numbers. I would love if there was some way to have data.sel(lev=4, method='nearest') return the nearest data in log-space without having to have my index stored in the log space.

E.g, currently

>>> a = xr.DataArray(['a', 'b', 'c'], dims='lev', coords=[('lev', [0.1, 10, 100])])
>>> a.sel(lev=2, method='nearest')
<xarray.DataArray ()>
array('a', dtype='<U1')
Coordinates:
    lev      float64 0.1

but in the scientific sense underlying this data, the closer point was actually 'b' at 10, not 'a' at 0.1.

In general, one can imagine situations where the opposite is true (storing data in log-space for numerical accuracy, but wanting a concept of 'nearest' which is the standard linear sense), or a desire for arbitrary scaling.

Describe the solution you'd like
The simplest solution I can imagine is to provide a preprocessor argument to the sel function which operates over numpy values and is used before the call to get_loc.

e.g.

>>> a.sel(lev=2, method='nearest', pre=np.log)
<xarray.DataArray ()>
array('b', dtype='<U1')
Coordinates:
    lev      float64 10.0

I believe this can be implemented by wrapping both index and label_value here with a call to the preprocess function (assuming the argument is only desired alongside the 'method' kwarg):

indexer = index.get_loc(
label_value, method=method, tolerance=tolerance
)

Describe alternatives you've considered
I'm not sure how this would relate to the ongoing work on #1603, but one solution is to include a concept of the underlying number line within the index api. The end result is similar to the proposed implementation, but it would be stored with the index rather than passed to the sel method each time. This may keep the sel api simpler if this feature was only available for a special ScaledIndex class or something like that.

One version of this could also be used to set reasonable defaults when plotting, e.g. if a coordinate has a log numberline then it could set the x/yscale to 'log' by default when plotting over that coordinate.

@WardBrian
Copy link
Contributor Author

It appears enough has been done in the explicit indexing refactor that this is now possible!

Using xarray 2024.6.0:

import xarray as xr
import numpy as np

# some example data
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 100]})
ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
print(ds)

# We create a class which is a subclass of xarray.Index
# https://docs.xarray.dev/en/latest/internals/how-to-create-custom-index.html

from xarray.core.indexes import PandasIndex, IndexSelResult, normalize_label, get_indexer_nd

# However, the built-in PandasIndex is _almost_ what we want,
# so we subclass it to steal most of the functionality
class LogIndexer(PandasIndex):
  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)

  # Except for .sel, where we can inject our custom logic!
  def sel(self, labels: dict, method=None, tolerance=None) -> IndexSelResult:
    # based on the code in https://github.com/pydata/xarray/blob/ce5130f39d780cdce87366ee657665f4a5d3051d/xarray/core/indexes.py#L745
    if method == "nearest_log":
      assert len(labels) == 1
      coord_name, label = next(iter(labels.items()))
      label_array = normalize_label(label, dtype=self.coord_dtype)
      indexer = get_indexer_nd(np.log(self.index), np.log(label_array), method="nearest", tolerance=tolerance)
      return IndexSelResult({self.dim: indexer})
    return super().sel(labels, method, tolerance)

# Now we need to make a dataset that uses this custom index
ds_log = ds.drop_indexes('x').set_xindex('x', LogIndexer)
print(ds_log)

# x has values 10 and 100, to make it easy to observe the differences here:
print("normal:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest").x.data)
print("log:\t", ds_log.sel(x=[10,20,30,40,50,60], method="nearest_log").x.data)

# prints:
# normal:  [ 10  10  10  10  10 100]
# log:     [ 10  10  10 100 100 100]

@dcherian
Copy link
Contributor

I think we'd happily bundle that in Xarray, seems common enough.

cc @benbovy

@WardBrian
Copy link
Contributor Author

To hew closer to my original suggestion, one could also accept a pointwise transform in the constructor, and then nearest could just apply that to both the index coordinates and the requested positions.

E.g., the set_xindex call would look like set_xindex("foo", PandasIndex, scale=np.log). This is similar to a suggestion given here: #7964 (comment)

This is more general (e.g. you could also do something like np.exp), but you lose the ability to use both nearest and nearest_log at the same time. For my own usages, either choice would be fine

@benbovy
Copy link
Member

benbovy commented Aug 14, 2024

Yes this would be nice to have this functionality built-in Xarray.

Another option could be to have some TransformIndex meta-index that would basically wrap any Xarray index and apply an (inverse) user-defined transformation on label data before (after) executing the logic in the wrapped index. I.e., something very much like the unit-aware PintIndex implemented in xarray-contrib/pint-xarray#163.

This is even more general and could be useful also for indexes implementing alignment with some tolerance (e.g., #4489), although that might seem a little far-fetched.

you lose the ability to use both nearest and nearest_log at the same time.

It is still possible to duplicate the coordinate so each have an index working on a different scale, e.g., "foo" and "foo_log").

@WardBrian
Copy link
Contributor Author

WardBrian commented Aug 14, 2024

I'd be happy to open a PR on either approach if there's a clear favorite

Edit: I may be misunderstanding, but it's not clear to me that a meta-index would work here, since the transformation also needs to be applied to the coordinate inside the index, not just the labels coming in to the sel call.

@benbovy
Copy link
Member

benbovy commented Aug 14, 2024

Yes you are right, I didn't look carefully enough at your example and top comment.

@WardBrian
Copy link
Contributor Author

Given that:

  1. Is there still interest in bundling the nearest_log functionality into xarray?
  2. Would it be preferred to be done as part of the implementation of PandasIndex, or as a wrapper class as above?
  3. Would it be preferred to implement this in such a way where the "log" nature is baked in, or would something like the set_xindex("foo", PandasIndex, scale=np.log) be preferred?

@benbovy
Copy link
Member

benbovy commented Aug 15, 2024

It depends on whether we want to have this feature for free also in all indexes built on top of PandasIndex (most index types I've seen so far are either subclassing or encapsulating PandasIndex) vs. we want to keep PandasIndex as simple as possible (i.e., consistent with pandas and no extra option).

From a broader perspective it seems that the flexible distance / scale discussed here might be useful for both selection and alignment so maybe it is preferable to address it outside of PandasIndex? Something like a NearestIndex class in which we could later implement other "nearest" specific functionality like #4489 and add index options for tolerance, scale transform, etc.

I'm curious what others think, though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants