MemoryError: Unable to allocate array with shape (114671, 114671) and data type float64 #166

Closed
robo-warrior opened this issue Sep 11, 2020 · 9 comments


@robo-warrior

I get the following error: MemoryError: Unable to allocate array with shape (114671, 114671) and data type float64

Defining Ordinary Kriging as:

import numpy as np
from pykrige.ok import OrdinaryKriging

gridx = np.arange(min_x, max_x, 1)
gridy = np.arange(min_y, max_y, 1)

# Ordinary Kriging
OK = OrdinaryKriging(x, y, z, variogram_model='exponential',
                     verbose=False, enable_plotting=True,
                     coordinates_type='geographic')

z1, ss = OK.execute('grid', gridx, gridy)

Where,
min_x = 8084396
min_y = 12073405
max_x = 8084864
max_y = 12073894

I understand that the gridx and gridy arrays are too big. What can I do in this case to make this work?

@hoax-killer

I have also faced this issue in the past. Is this a limitation of OK/PyKrige?

@MuellerSeb
Member

I guess your input arrays x, y, z have the size (114671,), correct?
This is much too big for most RAM, so you get a MemoryError.

What could help in your case is to sample from this big amount of data and reduce it to about 10,000–20,000 data points:

sample_size = 10000
# Sample without replacement: duplicated points would produce identical
# rows in the kriging matrix and make it singular
choice = np.random.choice(np.arange(x.size), sample_size, replace=False)
x_smpl = x[choice]
y_smpl = y[choice]
z_smpl = z[choice]

Now you can use x_smpl, y_smpl and z_smpl instead of x, y and z for the OrdinaryKriging class.

@hoax-killer

Thanks @MuellerSeb for getting back.

> This is much too big for most RAM, so you get a MemoryError.

At least in my case it's not a RAM problem. The system has about 396 GB of RAM. Moreover, I can read the entire file into RAM, and the file isn't big either. numpy can also load the data into memory, and I am able to run almost all other methods (e.g. KNN, RF, SVC). The issue only arose when running OrdinaryKriging().

@MuellerSeb
Member

A float64 array of shape (114671, 114671) (the cdist matrix) takes 114671² × 8 bytes ≈ 105 GB, which is the same order of magnitude as your available RAM. With numpy's overhead and a few more arrays of this kind, RAM can still be a problem.
And you have written that you get a MemoryError, which means you are running out of RAM. You could try setting n_closest_points in the execute call, as sketched below, so the full cdist matrix is not created.
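
A minimal sketch of that call, assuming the OK instance defined above (the value 50 is purely illustrative; the moving window requires the loop backend):

z1, ss = OK.execute(
    'grid', gridx, gridy,
    backend='loop',        # the vectorized backend would build the full distance matrix
    n_closest_points=50,   # only the 50 nearest data points are used per grid node
)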

@hoax-killer

hoax-killer commented Sep 17, 2020

More specifically, the memory error is raised on this line:

dlon = (lon1 - lon2) * np.pi / 180.0

and in particular while executing lon1 - lon2.

I am not aware of the specifics of this operation or why we calculate the difference between lon1 and lon2.
The shapes of the two numpy arrays are:

lon1.shape
Out[2]: (1, 114671)
lon2.shape
Out[3]: (114671, 1)
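
For reference, those shapes mean lon1 - lon2 broadcasts to a full (n, n) array, which already accounts for the allocation in the error message:

import numpy as np

n = 114671
# (1, n) minus (n, 1) broadcasts to an (n, n) float64 temporary:
print(n ** 2 * 8 / 1e9)  # ~105 GB for a single such array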

@mjziebarth
Contributor

To chime in, having written that part of the code:

That part of the code is a fairly simple implementation of the third equation of the section Computational formulas of the Wikipedia article on great-circle distance. It was written as a straightforward vectorized version of the equation, which creates a number of temporary arrays corresponding to the terms of the rather large equation.

If you are working that close to your RAM limit, these temporary arrays could be the icing on the cake. Apart from random subsampling, you could try to work in Euclidean space (see #149) if you don't explicitly need the great-circle distance at large distances. Specifically, this would mean computing Euclidean coordinates x, y, z from your latitudes and longitudes and then kriging without the coordinates_type='geographic' option, as sketched below. Maybe that saves just enough temporary arrays to fit into your RAM.
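
A minimal sketch of that conversion, assuming lat and lon are in degrees and values holds the observations (the names are placeholders); kriging the resulting 3D coordinates needs the OrdinaryKriging3D class:

import numpy as np
from pykrige.ok3d import OrdinaryKriging3D

R = 6371000.0  # Earth's mean radius in metres; the scale only sets the units
lat_r, lon_r = np.radians(lat), np.radians(lon)
px = R * np.cos(lat_r) * np.cos(lon_r)
py = R * np.cos(lat_r) * np.sin(lon_r)
pz = R * np.sin(lat_r)

# Chord distances between these 3D points approximate great-circle
# distances at short range, so the variogram fit stays meaningful
ok3d = OrdinaryKriging3D(px, py, pz, values, variogram_model='exponential')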

Hope that helps!

@hoax-killer

@mjziebarth Thanks for getting back.

> If you are working that close to your RAM limit

I have about 396 GB of RAM, far more than usual. Hence it was a bit surprising to hit the limit, which we rarely do even with much larger datasets.

> It was written as a straightforward vectorized version of the equation, which creates a number of temporary arrays corresponding to the terms of the rather large equation.

To be honest, I don't see anything wrong with your code; it's just a bit surprising that it hits the memory limits.

Since we are benchmarking, I was hesitant to change the parameters, but in the worst case we can switch to Euclidean space for all the benchmarking tests. We are using Google's pixel coordinate system anyway, so Euclidean coordinates might make more sense.
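
For reference, that would just mean dropping the geographic option, since 'euclidean' is the default coordinates_type (assuming x and y are the pixel coordinates):

OK = OrdinaryKriging(x, y, z, variogram_model='exponential',
                     verbose=False, enable_plotting=True)
# coordinates_type defaults to 'euclidean'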

@robo-warrior
Author

Hi, using Euclidean coordinates seems to have worked for us. Thank you all for the super quick responses!

@MuellerSeb
Member

That is really interesting! Thanks for sharing.
