
[Refactoring] N-dimensional Kriging #31

Closed
4 of 9 tasks
rth opened this issue Dec 20, 2016 · 12 comments
Comments

@rth
Contributor

rth commented Dec 20, 2016

Just a few thoughts about possible code refactoring in PyKrige.

Currently PyKrige has separate code that implements 2D and 3D kriging (e.g. in ok.py and ok3d.py). This results in code duplication and makes maintenance and adding new features more difficult (they need to be added to every single file). Besides, 1D kriging is not implemented, and it would be nice to have some 1D examples as they are easier to visualize.

In addition, PR #24 adds a scikit-learn API to PyKrige on top of the existing UniversalKriging and OrdinaryKriging methods.

A possible solution to remove the current code duplication would be to refactor UniversalKriging and OrdinaryKriging to work in N dimensions. The simplest way of doing this would be to use something along the lines of the scikit-learn API, which would also remove the need for an additional wrapper. The general API could be something like,

class OrdinaryKriging(pykrige.compat.BaseEstimator):
    def __init__(self, **kriging_options):
        ...
    def fit(self, X, y):
        """X is an array [n_samples, n_dimensions] and y is an array [n_samples]."""
        ...
    def predict(self, X):
        """X is an array [n_samples, n_dimensions].
        The equivalent of the current execute(style="points", ...)
        """
        ...
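
For illustration, a hypothetical usage sketch of this proposed API (the class and constructor arguments above are proposals, not the current PyKrige interface):

import numpy as np

# Hypothetical usage of the proposed class (names and options are illustrative only):
X = np.array([[0.3, 1.2], [2.0, 0.5], [1.1, 3.4]])   # [n_samples, n_dimensions]
y = np.array([0.47, 1.12, 0.83])                     # [n_samples]

ok = OrdinaryKriging(variogram_model="linear")       # proposed constructor
ok.fit(X, y)                                         # fit the variogram / store the data
z = ok.predict(np.array([[1.0, 1.0], [2.5, 2.5]]))   # point-style prediction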

The case of execute(style="masked", ...) could be done by supporting masked arrays for predict(X), while the case execute(style="grid", ...) can be done with a helper function,

def points2grid(points):
    """points: a list of 1D coordinate arrays, e.g. [zpts, ypts, xpts]"""
    grids = np.meshgrid(*points, indexing='ij')
    return np.concatenate([grid.flatten()[:, None] for grid in grids], axis=1)

which is mostly what is done internally by execute at present.
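
For example (a hedged sketch, reusing the hypothetical ok estimator and the numpy import from above):

# Hypothetical usage: build a regular 2D prediction grid and predict on it
gridx = np.arange(0.0, 5.0, 0.5)
gridy = np.arange(0.0, 5.0, 0.5)
X_grid = points2grid([gridy, gridx])   # [n_grid_points, 2]
z_grid = ok.predict(X_grid)            # equivalent of execute(style="grid", ...)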

This would break backward compatibility though, so a major version change would be needed (PyKrige v2).

Update: the refactoring could follow the steps below,

  • in core.py merge adjust_for_anisotropy and adjust_for_anisotropy_3d into a single private function _adjust_for_anisotropy(X, center, scaling, angle), where X is a [n_samples, n_dim] array and all the others are lists of floats, and which returns the X_modified array; see the sketch after this list. (PR ND refactoring of adjust_for_anisotropy #33)
  • in core.py merge the initialize_variogram_model and initialize_variogram_model_3d into a single private function (PR Variogram Refactorization and Improvements #47 )
    _initialize_variogram_model(X, y,  variogram_model, variogram_model_parameters, variogram_function, nlags, weight)
    
  • in core.py merge the krige and krige_3d into a single private function _krige(X, y, coords, variogram_function, variogram_model_parameters) (PR Refactor statistics calculations #51 )
  • in core.py similarly merge find_statistics* (PR Refactor statistics calculations #51)
  • create a new ND kriging class for ordinary kriging that follows the scikit-learn compatible API. We need to decide on a name (OrdinaryKriging would be great, but it's already taken). Maybe OrdinaryNDKriging?
  • rewrite the internals of OrdinaryKriging and OrdinaryKriging3D to use OrdinaryNDKriging internally.
  • rewrite specific tests for OrdinaryNDKriging and add deprecation warnings on OrdinaryKriging, OrdinaryKriging3D
  • do the same thing for other kriging methods
  • after a few versions remove the old style kriging interface.
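
To make the first step above more concrete, here is a minimal 2D-only sketch of what the merged function could look like (an illustration only, not the actual PyKrige implementation; the exact rotation/scaling conventions would be taken from the existing adjust_for_anisotropy code):

import numpy as np

def _adjust_for_anisotropy(X, center, scaling, angle):
    """Illustrative 2D-only sketch (not the actual implementation): translate X
    to `center`, rotate by `angle[0]` degrees, and stretch the rotated second
    axis by `scaling[0]` so that an isotropic variogram can be applied."""
    X = np.asarray(X, dtype=float) - np.asarray(center, dtype=float)
    theta = np.radians(angle[0])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    X_rot = X @ rot.T           # rotate into the anisotropy frame
    X_rot[:, 1] *= scaling[0]   # stretch the second axis
    return X_rot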

What do you think?

@rth rth added this to the PyKrige v2 milestone Dec 20, 2016
@basaks
Collaborator

basaks commented Dec 21, 2016

@rth This is a good idea, and I did in fact think about it, as this is essentially what the sklearn_cv.Krige class does; and yes, we can pull the 2D/3D cases together using it as well. I was not sure how receptive PyKrige users would be to something like this.

The question is who is going to bell the cat?

@kvanlombeek
Contributor

I am also very interested in this generalisation.

@bsmurphy
Contributor

This is a supremely excellent suggestion, especially as it would mean that future enhancements wouldn't have to be redundantly added to the various kriging classes. As long as everything we do is well-documented with plenty of examples, I think the existing user base would probably be ok with this change as well. (The documentation is one of the really useful parts according to users I've talked to.) Maybe we can keep the existing classes as-is just for backwards compatibility.

The first big question I'd have about this is how to generalize the array input and handling to an arbitrary number of dimensions. Maybe we can just have one input array that would be M x (N+1), with M as the number of measurements and N as the dimension of the space in which to krige. Then the last column could be the actual measurement values to krige. @rth, what do you think? Then it would probably be simple to use the scipy distance routines to calculate Euclidean (or even some other metric?) distances between measurement points, and populating a kriging matrix would then be simple I think...
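
To sketch that idea (one common ordinary kriging formulation; the variogram callable here is an assumption, not an existing PyKrige function):

import numpy as np
from scipy.spatial.distance import cdist

def build_ok_system(X, x0, variogram):
    """Illustrative sketch: X is the [M, N] coordinate block of the proposed
    M x (N+1) input array, x0 a single [N] prediction point, and `variogram`
    a callable returning semivariances for an array of distances."""
    M = X.shape[0]
    d = cdist(X, X)                  # pairwise Euclidean distances, [M, M]
    A = np.zeros((M + 1, M + 1))
    A[:M, :M] = variogram(d)         # semivariance block
    A[:M, M] = 1.0                   # unbiasedness constraint
    A[M, :M] = 1.0                   # (Lagrange multiplier row/column)
    b = np.ones(M + 1)
    b[:M] = variogram(cdist(X, x0[None, :]).ravel())
    return A, b                      # solve A w = b for the weights + multiplier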

@rth
Contributor Author

rth commented Jan 8, 2017

Thanks for the feedback @bsmurphy @basaks @kvanlombeek and sorry for the late response.

The first big question I'd have about this is how to generalize the array input and handling to an arbitrary number of dimensions. Maybe we can just have one input array that would be M x (N+1), with M as the number of measurements and N as the dimension of the space in which to krige. Then the last column could be the actual measurement values to krige. @rth, what do you think?

So currently PyKrige takes as input some x, y, z arrays that are 1D, with the last one holding the values to krige. If we translate this to the scikit-learn API, we would just concatenate all the arrays apart from the last one to produce the X array [n_points, n_dim], and keep the last array as it is and call it y (it's the target variable). Cf. my original post above as to how these X, y arrays would be provided to the kriging class.
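
For concreteness, a tiny illustrative example of that translation (variable names are just examples):

import numpy as np

# Old-style 1D inputs: coordinates x, y and measured values z
x, y_coord, z = np.random.rand(3, 50)
X = np.column_stack([x, y_coord])   # [n_points, n_dim]
y = z                               # the target variable to krige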

Then it would probably be simple to use the scipy distance routines to calculate Euclidean (or even some other metric?) distances between measurement points, and populating a kriging matrix would then be simple I think...

Yes, definitely, that would also simplify the calls to ND distance routines from scipy.

Maybe we can keep the existing classes as-is just for backwards compatibility.

Well, we could keep the interface of the existing classes as it is and progressively migrate the internals, until we transition from "separate krige methods + single scikit-learn compatible wrapper" to "single krige implementation + multiple krige wrappers that support the old interface". This would also have the advantage that the new method passes the old unit tests (so that we are confident of not breaking anything during the refactoring). In practice this refactoring could take the steps described in the updated first post above.

What do you think?

@basaks
Collaborator

basaks commented Jan 9, 2017

@rth Thank you for putting this list together.
I will go through your list in detail tonight. A bit under the pump today.

@bsmurphy
Contributor

bsmurphy commented Jan 10, 2017

This looks great @rth, thanks for getting going on this. I'll try to work on merging the statistics and variogram estimation business all together, as there were some other changes in those portions of the code that I wanted to make anyways... (as I said in another thread, the statistics calculations are wrong, need to fix that)

BTW, @rth (and anyone else who thinks about matrices, I don't think about them that much admittedly), I've been wondering if we could use LU decomposition on the kriging matrix to speed up the solution. Any thoughts? I haven't thought about this that much, but since the LHS matrix is the same (just the RHS vector that's changing), we might be able to leverage this for the looping backend...
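
For what it's worth, a minimal sketch of that idea with SciPy (the kriging matrix A and the per-point right-hand sides rhs_vectors are assumed to already exist):

from scipy.linalg import lu_factor, lu_solve

# Factor the fixed LHS kriging matrix once...
lu, piv = lu_factor(A)
# ...then reuse the factorization for every prediction point's RHS.
solutions = [lu_solve((lu, piv), b) for b in rhs_vectors]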

@basaks
Collaborator

basaks commented Jan 14, 2017

@rth
Your list looks good to me. I also reviewed your most recent PR and cross-checked both of the functions that you have managed to pull together into one.

@bsmurphy
Contributor

bsmurphy commented Feb 3, 2018

OK, so I'm finally getting the chance to set aside some solid time to work on this. I ultimately want to implement complex kriging, which would allow for kriging of vector field data. (So you could combine X and Y components of the vector field into a single complex number, then krige the complex number field. Since I work with geophysical EM field data, this would be really useful to me.) In terms of the refactor, the main differences with complex kriging come in the variogram estimation routine, so it would be beneficial to make things much more modular, as proposed in #56. All this is to say, here's what I'm thinking...

  • A lot of the core functions (e.g., variogram estimation) have been refactored to ND, but ideally they'd totally be pulled out into individual pieces that would be strung together by the user. So we would have (1) a coordinate system transformer, which would handle the geographic to projected coordinate system transformation; (2) an anisotropy transformer, which would of course handle all the anisotropy stuff; (3) a variogram model estimator, which would potentially automatically fit the variogram to data and which would yield some variogram object that could be passed into kriging; and (4) a kriging class that would pull the output from those modules together to perform the estimation. A more sophisticated user might want to iterate between (2) and (3) to do the best job they can with their data, so ideally those modules would be generic and flexible. Oh, we would also need (5) a separate variogram visualization module, as proposed in Refactoring of code: separation of concerns for ploting variograms #72 (would also address Make matplotlib an optional dependency #63).

  • Since Ordinary Kriging is just a special case of Universal Kriging, the ND versions of both of those should be merged into a single class, probably just called something like KrigingEstimator. This class would follow the scikit-learn API, like we discussed before.

  • For users who don't want to think so much about the full details of the analysis pipeline, it would be good to maintain the OrdinaryKriging, UniversalKriging, OrdinaryKriging3D, etc. classes as wrappers that combine variogram estimation, anisotropy stuff, etc. with the KrigingEstimator class. These would ideally still use the same API as they currently do for backwards compatibility...

I can start working right away on refactoring the variogram stuff into a separate module and putting together a core KrigingEstimator class. Although maybe it would be best to start drafting out what this would look like in its entirety with pseudocode first, that way we can figure out how the various modules will talk to each other...
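
A very rough pseudocode sketch of the modular layout described above (all class names are hypothetical placeholders):

# Hypothetical module layout, scikit-learn-flavoured (names are placeholders only):
class CoordinateTransformer:        # (1) geographic -> projected coordinates
    def transform(self, X): ...

class AnisotropyTransformer:        # (2) scaling / rotation of the coordinates
    def transform(self, X): ...

class VariogramEstimator:           # (3) fit a variogram model to the data
    def fit(self, X, y): ...        # yields a variogram object

class KrigingEstimator:             # (4) build and solve the kriging system
    def __init__(self, variogram): ...
    def fit(self, X, y): ...
    def predict(self, X): ...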

@bsmurphy
Contributor

bsmurphy commented Feb 5, 2018

Ok, here are the various building blocks that I'd envision.

(A) A generic coordinates module object that would take the data coordinates and scale/rotate/project them into the appropriate coordinate system to calculate the distance between points. So this would include all the anisotropy stuff as well as the geographic coordinates capabilities. The raw data coordinates would go in, and a list of inter-point distances would come out. (In reality, that "list" might be a callable object, like at its core a wrapper around scipy's KDtree...)

(B) A generic variogram model module that would take as input the inter-point distances and evaluate the specified variogram function on those distances. If users want more control over the variogram estimation they can iterate between A and B to adjust anisotropy parameters, etc. So, distances (probably in the form of object A) would go in, and the evaluated variogram function would come out.

(C) A variogram plotting module that would sit on top of B, as proposed in #72 (and would also solve #63).

(D) A model parameter estimation module that would also sit on top of B and provide the automatic variogram estimation stuff. Keeping this stuff in a separate object would make it easier to implement different variogram search routines and features (e.g., #54, #57, #29).

(E) A model validation module that would also sit on top of B to calculate, e.g., the statistics (would address #52).

(F) A drift terms object, which would be a user specified collection of drift terms to add into the estimation process.

(G) A KrigingEstimator object that would build and solve the kriging matrix. It would take as input the variogram object B and the ordered values corresponding to the order of points in objects A and B (placing some trust in the user here to keep the dataset ordered). It could also take as input object F (include it you get Universal Kriging, exclude it you get Ordinary Kriging). It would kick out the kriging estimates and estimate error, and it would also allow for access to, e.g., the kriging weights (per #82). This estimator object would have the scikit-learn style API and it would also provide the option for continuous part kriging, which would solve #30 in the correct way.

In principle only object A would need to know about exact coordinate locations, and everything else would just need to know inter-point distances. So all objects besides A would be intrinsically ND (which is what this issue was originally started to discuss...). Although for the local variogram stuff (#41) we might need to keep memory of point locations throughout the entire pipeline. Although, maybe in object A there could be some way of grouping points into multiple categories that would somehow each get their own object B, and then object G would be smart enough to use different object Bs in different parts of the kriging domain...

The existing API would be refactored to be wrappers around these various objects.
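
A rough sketch of how these building blocks might be wired together (all names are hypothetical; the exact interfaces are precisely what needs to be designed):

# Hypothetical wiring of objects A, B, F and G (names are placeholders only):
coords = Coordinates(points, projection=None, anisotropy=None)   # object A: distances
vgm = VariogramModel("spherical").fit(coords, values)            # object B (fit via D)
drift = DriftTerms([])                                           # object F, optional
krig = KrigingEstimator(vgm, values, drift=drift)                # object G
estimates, sigma = krig.predict(query_points)                    # estimates + error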

Thoughts? I'm trying to think of how to best arrange things to allow further extensions and improvements (e.g., #59 and the complex kriging stuff I mentioned...)

@rth
Contributor Author

rth commented Feb 5, 2018

A lot of the core functions (e.g., variogram estimation) have been refactored to ND, but ideally they'd totally be pulled out into individual pieces that would be strung together by the user

Yes, if we implement a compatible transform for the different components, I think it might be worth linking them together with a scikit-learn pipeline. We would save time by not re-inventing something similar.
It would also work for hyper-parameter estimation by cross-validation out of the box.
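
For illustration, a hedged sketch of what that could buy us, assuming the components expose the usual scikit-learn transformer/estimator interface (the component classes and their parameters are hypothetical placeholders):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Hypothetical pipeline; AnisotropyTransformer / KrigingEstimator are placeholders.
pipe = Pipeline([
    ("anisotropy", AnisotropyTransformer()),
    ("krige", KrigingEstimator()),
])
search = GridSearchCV(
    pipe,
    param_grid={"krige__variogram_model": ["linear", "spherical", "exponential"]},
    cv=5,
)
search.fit(X, y)   # hyper-parameter estimation by cross-validation out of the box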

Since Ordinary Kriging is just a special case of Universal Kriging, the ND versions of both of those should be merged into a single class, probably just called something like KrigingEstimator

+1

For users who don't want to think so much about the full details of the analysis pipeline, it would be good to maintain the OrdinaryKriging, UniversalKriging, OrdinaryKriging3D

I'm just a bit concerned about the amount of code to maintain, and the overall code complexity in this project. If we manage to express these classes as functions of the new processing pipeline, why not; but if we keep both old and new implementations side by side, I'm worried that it would be hard to manage.

Will respond to your other message later today.

@MuellerSeb
Member

Related: #133

@MuellerSeb MuellerSeb pinned this issue Jan 28, 2020
@MuellerSeb MuellerSeb changed the title [RFC] Generalization to N-dimensional Kriging [Refactoring] N-dimensional Kriging Jan 28, 2020
@MuellerSeb MuellerSeb unpinned this issue Mar 26, 2020
@MuellerSeb
Member

We will discuss this here: #138 and #143. Closing for now. Feel free to re-open or (better) discuss in the linked issues.
