
option to downsample the data to calculate the variogram #29

Closed
kvanlombeek opened this issue Dec 20, 2016 · 16 comments

@kvanlombeek
Contributor

When you krige with large datasets (100k data points, for example), the number of pairwise distances blows up. It would be nice to have a parameter in the function, such as max_n_for_variogram, that limits the number of data points used.

I have already implemented this myself in the ordinary kriging function, as our dataset was too large.

@rth
Contributor

rth commented Dec 20, 2016

@kvanlombeek Would this be different functionality from the n_closest_points parameter in ordinary kriging? That was implemented in PR #5, also to reduce the number of calculated pairwise distances (using cKDTree for nearest-neighbor calculations)...

Edit: OK, so the goal is to use only a subset of the data for the variogram; I missed the description in the issue title.
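
For reference, a minimal sketch of how that existing option is used (synthetic data and grid; whether n_closest_points requires the loop backend is version-dependent):

# Moving-window kriging via n_closest_points (PR #5); placeholders only.
import numpy as np
from pykrige.ok import OrdinaryKriging

rng = np.random.default_rng(0)
x, y, z = rng.uniform(0.0, 100.0, (3, 2000))

ok = OrdinaryKriging(x, y, z, variogram_model="spherical")
gridx = np.linspace(0.0, 100.0, 50)
gridy = np.linspace(0.0, 100.0, 50)
# each grid node is kriged from only its 10 nearest observations
zhat, ss = ok.execute("grid", gridx, gridy, backend="loop", n_closest_points=10)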

@basaks
Collaborator

basaks commented Dec 21, 2016

I am keen to understand the kriging community's opinion on this topic.
As @rth identified, we can already use n_closest_points with OrdinaryKriging. Among other things, it makes prediction orders of magnitude faster for a large number of target points, and I think it is also more accurate, since we essentially want to capture local effects when kriging over a very large area (over many targets).

Downsampling, on the other hand, I would think will diminish the krige prediction confidence locally (the variance will increase). I cannot immediately see why you would want to do this compared to the n_closest_points approach.

@kvanlombeek
Contributor Author

kvanlombeek commented Dec 21, 2016

I believe these are two different things.

I am talking about downsampling the data just to calculate the variogram, not to actually krige.

Calculating the variogram is a costly operation because you have to compute all the pairwise distances. You don't lose much information if you fit the variogram on only 1000 points, for example (fitting the variogram only involves estimating three parameters: sill, nugget and range).

The n_closest_points option is useful when you do the actual interpolation: it limits the number of surrounding points taken into account when you actually krige. By that time the variogram has already been calculated.

So I have no intention of dropping values; on the contrary, the updates I have made locally are to make sure the kriging algorithm works with a large dataset, so that I can extract all the local effects. (On www.rockestate.be you can see our kriged map.)
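
To illustrate the cost argument, a rough sketch in plain NumPy/SciPy, independent of PyKrige (the binning choices are arbitrary):

# Empirical variogram from a random subsample: fitting only has to
# constrain three parameters, so ~1000 points (~500k pairs) are usually
# enough where 100k points (~5e9 pairs) would not fit in memory.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 100_000
x, y = rng.uniform(0.0, 100.0, (2, n))
z = np.sin(x / 10) + np.cos(y / 10) + rng.normal(0.0, 0.1, n)

idx = rng.choice(n, size=1000, replace=False)
d = pdist(np.column_stack([x[idx], y[idx]]))     # pairwise distances
g = 0.5 * pdist(z[idx, None], "sqeuclidean")     # semivariances 0.5*(zi-zj)^2
bins = np.linspace(0.0, d.max() / 2, 21)
lags = 0.5 * (bins[:-1] + bins[1:])
# mean semivariance per lag bin (empty bins yield NaN; fine for a sketch)
gamma = np.array([g[(d >= lo) & (d < hi)].mean()
                  for lo, hi in zip(bins[:-1], bins[1:])])
# (lags, gamma) can now be fitted with a spherical/exponential model.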

@basaks
Collaborator

basaks commented Dec 21, 2016

OK @kvanlombeek
It makes a bit more sense now. Just looking for some confirmation here.
You are saying that this function, which is used to calculate the variogram model, will be cheaper to compute if we sample the observations. Specifically, you want to reduce the size of these arrays built in core.py, lines 124-126:

x1, x2 = np.meshgrid(x, x)
y1, y2 = np.meshgrid(y, y)
z1, z2 = np.meshgrid(z, z)

Sure, that will be fine. In that case, sampling (x, y, z) is equivalent to a stochastic approach in a large learning problem.

Did I get that right?
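
For scale, a back-of-envelope on why those meshgrid arrays are the bottleneck:

# Each n x n meshgrid output holds n**2 float64 values (8 bytes each).
n = 100_000
per_array_gb = n**2 * 8 / 1e9
print(per_array_gb)  # 80.0 GB per array; three meshgrid calls -> six arrays
# With a 1000-point subsample, each array shrinks to ~8 MB.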

@bsmurphy
Contributor

bsmurphy commented Dec 21, 2016

Yes, I see how this can be a problem. Down-sampling would certainly be useful, but I would worry that a (random) sample might not be representative of the entire dataset (assuming stationarity of the entire dataset). Maybe we could randomly sub-sample the data multiple times, estimate the variogram each time, and then compare the resulting variograms to find the most representative model (somewhat like the RANSAC algorithm). Maybe there's an approach from the machine learning literature that could be implemented here to accomplish this and make the computations efficient. @basaks, you seem familiar with this kind of machine learning analysis; what do you think?
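
A rough sketch of that repeated-subsampling idea (the helper name and the median aggregation are illustrative, not PyKrige API; x, y, z are assumed to be NumPy arrays):

# Fit the variogram on several random subsets and keep the median of the
# fitted parameters as a robust "consensus" model, loosely RANSAC-like.
import numpy as np
from pykrige.ok import OrdinaryKriging

def consensus_variogram(x, y, z, n_sub=1000, n_trials=10, seed=0):
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_trials):
        idx = rng.choice(len(x), size=min(n_sub, len(x)), replace=False)
        ok = OrdinaryKriging(x[idx], y[idx], z[idx], variogram_model="spherical")
        # parameter ordering/convention varies across PyKrige versions
        fits.append(ok.variogram_model_parameters)
    return np.median(fits, axis=0)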

@basaks
Collaborator

basaks commented Dec 22, 2016

@bsmurphy @kvanlombeek
When there are a lot of observations, downsampling may provide a very useful performance boost without losing much accuracy. However, I don't think it needs to be built into this package.

One can manually downsample outside PyKrige and then use PyKrige's functionality. I use shapely, pandas, etc. for sampling shapefiles quite often :)

Regarding your concerns about downsampling: one can bin the observations (both spatially and by magnitude) and then sample from each bin to get a representative set for kriging; a sketch follows below. I do very similar operations myself. I think this is very problem-specific and should be left to the user.
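
A pandas-based sketch of that bin-and-sample idea (bin counts are arbitrary placeholders; df is assumed to hold x, y, z columns):

# Stratified downsampling: bin observations by location (x, y) and by
# magnitude (z), then draw a few points from every non-empty bin.
import pandas as pd

def stratified_sample(df, n_cells=20, n_zbins=5, per_bin=10, seed=0):
    binned = df.assign(
        gx=pd.cut(df["x"], n_cells, labels=False),
        gy=pd.cut(df["y"], n_cells, labels=False),
        gz=pd.cut(df["z"], n_zbins, labels=False),
    )
    return (
        binned.groupby(["gx", "gy", "gz"], group_keys=False)
              .apply(lambda g: g.sample(min(per_bin, len(g)), random_state=seed))
              .drop(columns=["gx", "gy", "gz"])
    )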

One can also go the other way, as is done in machine learning, i.e., create artificial observations when there are not enough.

@kvanlombeek, what do you think? I would, however, be very interested to know how you downsample your observations.

@kvanlombeek
Contributor Author

I think we are talking past each other a bit, but your post from yesterday, @basaks, was exactly what I wanted to do.

May I ask all of you how much data you have fitted this algorithm on: 100, 1,000, 100,000 or 1,000,000 points? Not to show off or anything, just to see if it worked. I fit the algorithm on around 50k to 200k data points, and that is where I needed these little tweaks.

I propose that I give it a shot, open a PR, and we continue the discussion based on that.

@basaks
Collaborator

basaks commented Dec 22, 2016

hi @kvanlombeek, I have one dataset with ~2.2 million observations across all of Australia. For many machine learning models I use ~50k, sometimes up to 200k observations.

@rth
Contributor

rth commented Dec 22, 2016

@kvanlombeek It's great that you are using PyKrige with large datasets and can help with the scaling issues. However, I tend to agree with @basaks and @bsmurphy on this one; there are potentially many ways this could be done, and it can also be problem-specific. Could you not downsample the input X, Y, Z arrays used to fit the variogram in whatever way you need before providing them to OrdinaryKriging?

I'm +1 on adding some additional function for downsampling that would make sense in the kriging context (if such a thing exists), and -1 on adding it inside the OrdinaryKriging classes, which already have quite a few parameters, IMO... unless I'm misunderstanding the issue.

@basaks
Collaborator

basaks commented Jan 9, 2017

Instead of downsampling, I think what would be more useful, and seems to be an established technique in the kriging community, is the local variogram model. Could someone work on this instead? It would speed up computation manyfold for large models.

VESPER is a PC-Windows program developed by the Australian Centre for Precision Agriculture (ACPA) for spatial prediction that is capable of performing kriging with local variograms (Haas, 1990). Kriging with local variograms involves searching for the closest neighbourhood for each prediction site, estimating the variogram from the neighbourhood, fitting a variogram model to the data and predicting the value and its uncertainty. The local variogram is modelled in the program by fitting a variogram model automatically through the nonlinear least-squares method. Several variogram models are available, namely spherical, exponential, Gaussian and linear with sill. Punctual and block kriging is available as interpolation options. This program adapts itself spatially in the presence of distinct differences in local structure over the whole field.
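
A hypothetical sketch of that local-variogram workflow (the function name and neighbourhood size are illustrative, k <= len(x) is assumed, and this is not how VESPER itself is implemented):

# Kriging with local variograms: for each prediction site, refit the
# variogram on its k nearest observations only, then krige that site.
import numpy as np
from scipy.spatial import cKDTree
from pykrige.ok import OrdinaryKriging

def krige_local(x, y, z, xi, yi, k=200, model="spherical"):
    tree = cKDTree(np.column_stack([x, y]))
    zhat = np.empty(len(xi))
    svar = np.empty(len(xi))
    for j, (px, py) in enumerate(zip(xi, yi)):
        _, idx = tree.query([px, py], k=k)   # local neighbourhood
        ok = OrdinaryKriging(x[idx], y[idx], z[idx], variogram_model=model)
        zj, sj = ok.execute("points", np.array([px]), np.array([py]))
        zhat[j], svar[j] = zj[0], sj[0]
    return zhat, svar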

@bsmurphy
Contributor

@rth added the moving-window function, which is similar to what you're suggesting, @basaks, except that it assumes a stationary variogram. Adding the extra layer of re-estimating the variogram for local neighborhoods could certainly be done at some level, but it would require a variogram fit for each moving-window location, so I'm not sure how much it would really boost speed... Anyway, both of these ideas would be nice to implement at some point: local variogram estimation for maximum flexibility, and downsampling in global variogram estimation as a useful tool (which could go in the kriging tools module)...

@rth
Contributor

rth commented Jan 14, 2017

@basaks Feel free to open a new issue with the local variogram model..

@rth
Contributor

rth commented Oct 8, 2017

> When there are a lot of observations, downsampling may provide a very useful performance boost without losing much accuracy. However, I don't think it needs to be built into this package.

Actually, looking into this anew: as I understand it, the issue is that the current API does not allow calculating the variogram on a random subset of the data and then running all the remaining calculations on the full dataset. I don't think this would be too difficult to implement...

@bsmurphy
Contributor

bsmurphy commented Oct 9, 2017

True, it shouldn't be difficult at all to do something really simple like taking a random subset (maybe a couple of times) and computing the variogram estimate. This could be a useful addition when refactoring everything into distinct building blocks as per #56...

@MuellerSeb
Member

In GSTools we provide a sample size for variogram estimation. Maybe we could outsource the variogram estimation.
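
A minimal sketch of that GSTools route (exact function and argument names depend on the GSTools version; treat as illustrative):

# Estimate the empirical variogram on a random subsample via
# sampling_size, then fit a covariance model to it.
import numpy as np
import gstools as gs

rng = np.random.default_rng(0)
x, y = rng.uniform(0.0, 100.0, (2, 100_000))
z = np.sin(x / 10) + np.cos(y / 10) + rng.normal(0.0, 0.1, 100_000)

bins = np.linspace(0.0, 40.0, 21)
bin_center, gamma = gs.vario_estimate(
    (x, y), z, bins,
    sampling_size=2000,   # only 2000 random points enter the estimate
    sampling_seed=42,
)
model = gs.Spherical(dim=2)
model.fit_variogram(bin_center, gamma)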

MuellerSeb added this to the v2.0 milestone Mar 26, 2020
@MuellerSeb
Member

Since we will use the variogram estimation routines of GSTools in the future, we will discuss things like this here: #136
Closing for now. Feel free to re-open or (better) discuss in the linked issue.
