Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose indexers #32

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

willirath
Copy link
Contributor

@willirath willirath commented Dec 20, 2020

This splits the .sel method into two steps (getting the positional indexers), and actually doing the selection.

The aim of this is to provide an easy way of obtaining the indices for use in downstream applications like horizontal interpolation etc.

  • Add method for directly querying the index (.query)
  • Tests for the new logic (unchanged coverage is not a good metric here)
  • [ ] Add query keyword arguments and pass them to the underlying index
  • Add examples:
    • Using the .query() method for calculating distances between target and selected points
    • [ ] Using the indexers for interpolation (on a structured grid)
    • [ ] Using the indexers for a two-step process with step one just getting the nodes for a polyline and step two querying all intermediate points.

Test on binder: https://mybinder.org/v2/gh/willirath/xoak/expose-indexers?filepath=doc%2Fexamples

@willirath
Copy link
Contributor Author

@benbovy This is not ready yet and mostly meant for discussing if and how we'd expose indexers. But feel free to have a look and leave comments as soon as you see fit.

@benbovy
Copy link
Member

benbovy commented Dec 20, 2020

The aim of this is to provide an easy way of obtaining the indices for use in downstream applications like horizontal interpolation etc.

I agree that it's valuable to provide users with some API so that they can do something with the indices.

Actually, I think that we could rename get_indexer to query and generalize so that it more-or-less matches scipy and sklearn indexes' query methods, here returning xarray object(s) instead of numpy array(s).

A possible signature would look like:

ds.xoak.query(
    indexers: Dict,
    return_distance: bool = True,
    query_kwargs: Optional[Dict] = None,
    **indexer_kwargs
)

(or alternatively expand query_kwargs instead of indexer_kwargs, or expand none of them).

Likewise, we could later extend the API with other common methods like query_radius, etc., with optional support by xoak's index adapters...

@willirath
Copy link
Contributor Author

I'll reduce this to just providing a query method for now. Passing arbitrary kwargs adds a lot of complexity because we loose an easy way of knowing about the structure of the data returned by then indexer, then.

@willirath willirath requested a review from koldunovn March 1, 2021 11:35
@benbovy
Copy link
Member

benbovy commented Mar 9, 2021

Should we wait for this PR to be merged before moving the repository to xarray-contrib?

@willirath
Copy link
Contributor Author

willirath commented Mar 9, 2021 via email

@willirath
Copy link
Contributor Author

Doc build fails with unrelated "Command killed due to excessive memory consumption". The rest is (finally) good for a review.

I've decided to skip the more elaborate examples for now.

@willirath willirath marked this pull request as ready for review August 1, 2021 11:11
@willirath willirath requested review from benbovy and removed request for koldunovn August 1, 2021 11:11
Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great @willirath! Overall that looks good to me, I've just left a few minor comments.

@@ -64,6 +72,34 @@ def query_brute_force(dataset, dataset_dims_shape, indexer, indexer_dims_shape,
return dataset.isel(indexers=pos_indexers)


def query_brute_force_indexers(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could avoid duplicating this function and have query_brute_force simply return pos_indexers.

@@ -224,16 +224,13 @@ def _get_pos_indexers(self, indices, indexers):

return pos_indexers

def sel(
def query(
self, indexers: Mapping[Hashable, Any] = None, **indexers_kwargs: Any
) -> Union[xr.Dataset, xr.DataArray]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return type is rather Dict[Hashable, Variable].

It would be nice if query would eventually return the distances too. Maybe returning Tuple[DataArray, DataArray] would make sense instead of trying to return both indexers and distances in the same xarray object. This is closer to what the underlying indexes usually return. Also DataArray (even without any coordinate) is more common than Variable. We can save this for later, though.

@@ -16,6 +16,14 @@ def test_s2point(geo_dataset, geo_indexer, geo_expected):
xr.testing.assert_equal(ds_sel.load(), geo_expected.load())


def test_s2point_via_query(geo_dataset, geo_indexer, geo_expected):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unnecessarily adds a lot of tests via the parametrized fixtures, but it's probably fine for now.

@@ -13,6 +13,15 @@ def test_scipy_kdtree(xyz_dataset, xyz_indexer, xyz_expected):
xr.testing.assert_equal(ds_sel.load(), xyz_expected.load())


def test_scipy_kdtree_via_indexer(xyz_dataset, xyz_indexer, xyz_expected):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_scipy_kdtree_via_indexer(xyz_dataset, xyz_indexer, xyz_expected):
def test_scipy_kdtree_via_query(xyz_dataset, xyz_indexer, xyz_expected):

@@ -7,3 +7,4 @@ Examples
introduction
dask_support
custom_indexes
query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pooch is missing in environment.yml to load the xarray tutorial dataset in this example notebook.

@@ -38,10 +38,13 @@ def test_sklearn_balltree_options():
assert ds.xoak._index._index_adapter._index_options == {'leaf_size': 10}


def test_sklearn_geo_balltree(geo_dataset, geo_indexer, geo_expected):
def test_sklearn_geo_balltree(geo_dataset, geo_indexer, geo_expected, geo_expected_indexers):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split this into two functions like for the other indexes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants