Decreasing autograd memory usage #219

Closed · alexbw opened this issue May 3, 2017 · 21 comments

@alexbw
Contributor

alexbw commented May 3, 2017

I don't mean "memory leak" in the sense of memory that remains unreachable after the Python process exits; I mean memory that keeps being allocated during the backward pass when it should be freed. I mentioned this problem in #199, but I think it deserves its own issue.

For a simple function

import autograd.numpy as np
from autograd import grad

def F(x,z):
    for i in range(100):
        z = np.dot(x,z)
    return np.sum(z)
dF = grad(F)

and a procedure to measure memory usage

from memory_profiler import memory_usage
def make_data():
    np.random.seed(0)
    D = 1000
    x = np.random.randn(D,D)
    x = np.dot(x,x.T)
    z = np.random.randn(D,D)
    return x,z

def m():
    from time import sleep
    x,z = make_data()
    gx = dF(x,z)
    sleep(0.1)
    return gx

mem_usage = np.array(memory_usage(m,interval=0.01))
mem_usage -= mem_usage[0]

and a manual gradient of the same function

# minimal stand-in for autograd's internal repeat_to_match_shape helper,
# used below to broadcast the scalar gradient of np.sum back to z's shape
def repeat_to_match_shape(g, x):
    return np.ones(np.shape(x)) * g

def grad_dot_A(g,A,B):
    ga = np.dot(g,B.T)
    ga = np.reshape(ga,np.shape(A))
    return ga

def grad_dot_B(g,A,B):
    gb = np.dot(A.T,g)
    gb = np.reshape(gb, np.shape(B))
    return gb

def dF(x, z):
    z_stack = []
    for i in list(range(100)):
        z_stack.append(z)
        z = np.dot(x, z)
    retval = np.sum(z)

    # Begin Backward Pass
    g_retval = 1
    g_x = 0

    # Reverse of: retval = np.sum(z)
    g_z = repeat_to_match_shape(g_retval, z)
    for i in reversed(list(range(100))):

        # Reverse of: z = np.dot(x, z)
        z = z_stack.pop()
        tmp_g0 = grad_dot_A(g_z, x, z)
        tmp_g1 = grad_dot_B(g_z, x, z)
        g_z = 0
        g_x += tmp_g0
        g_z += tmp_g1
    return g_x

I get the following memory usage profile:

[memory usage plot]

If I replace autograd's dot gradient with the ones used in the manual code, I get the same memory profile; nothing improves.

If I replace the dot product with element-wise multiply, I get a different memory profile, but still not what I would expect:

[memory usage plot]
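(For reference, the element-wise variant referred to above is presumably along these lines; F_mul is just an illustrative name.)

def F_mul(x, z):
    # same loop as F above, but with element-wise multiply instead of np.dot
    for i in range(100):
        z = x * z
    return np.sum(z)
dF_mul = grad(F_mul)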

I would love to help figure this out, but I'm not sure where to start. First thing is of course to document the problem.

@alexbw changed the title from "Memory leak in dot gradient" to "Memory leak in autograd" on May 3, 2017
@mattjj
Contributor

mattjj commented May 4, 2017

Are you sure there's a leak here? Neither version seems to end at exactly zero net memory usage in the plots (and indeed the manual ones go negative at first), and we would expect autograd to require more memory because it has to trace the forward computation (not just store the values, but box each of them, etc.). This doesn't seem out of line with my expectations, and Python memory deallocation and measurement can be pretty complicated [1] [2].

@mattjj
Contributor

mattjj commented May 4, 2017

Sorry, I don't understand. Doesn't autograd end with less net memory usage than pytorch in that plot? Where is the memory leak shown?

@alexbw changed the title from "Memory leak in autograd" to "Decreasing autograd memory usage" on May 4, 2017
@alexbw
Contributor Author

alexbw commented May 4, 2017

[memory usage plot]

Updated image of memory usages.

@alexbw
Contributor Author

alexbw commented May 4, 2017

Here's a gist with a more "real-world" benchmark: taking the gradient of an RNN.

https://gist.github.com/alexbw/7b2a0682f65dd1bcb7120ca2d47a2823

Here's the memory usage:

[memory usage plot]

@alexbw
Contributor Author

alexbw commented May 4, 2017

For the RNN benchmark, here's the forward and backward pass of autograd run separately, versus just calling grad(f)(*args).

[memory usage plot]

(Apologies, the benchmark code is getting a little hairy.)
https://gist.github.com/alexbw/5730d3a7f61d36ff482e17b1024ec105

@mattjj
Contributor

mattjj commented May 4, 2017

It's probably from the outgrads being accumulated in a list and summed all at once rather than incrementally summed.
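Roughly, the difference between the two strategies is as follows (a schematic sketch, not autograd's actual internals; the function names are made up for illustration):

import numpy as np

def accumulate_by_list(outgrad_list):
    # list accumulation: every incoming gradient stays alive until the final sum
    return sum(outgrad_list)

def accumulate_in_place(outgrad_list):
    # incremental accumulation: only one running buffer is kept alive
    total = None
    for g in outgrad_list:
        if total is None:
            total = np.array(g)   # copy so the in-place add can't clobber a live array
        else:
            total += g
    return total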

@mattjj
Contributor

mattjj commented May 4, 2017

I just pushed a commit that does in-place gradient accumulation; maybe check how the memory looks with that (though I might not have done it right).

@alexbw
Contributor Author

alexbw commented May 4, 2017

Much better with inplace-grad-accumulation; memory usage is now within the same order of magnitude.

[memory usage plot]

@mattjj
Contributor

mattjj commented May 4, 2017

Btw, thanks for these benchmarks! The latter ones are quite stark. (At first I was just focusing on the 'leak' idea, but not enough on the total memory usage issue.)

@alexbw
Contributor Author

alexbw commented May 4, 2017

For clarity, here's just PyTorch vs. autograd, so that we don't have to think about anything other than operator-overloading (OO) AD systems.

PyTorch's line being shorter, by the way, means that its program runs faster (about 2x here).

[memory usage plot]

@mattjj
Contributor

mattjj commented May 4, 2017

Is there a potential factor of 2 because of float32 vs float64, or are all these propagating the same dtypes throughout?

I think in-place accumulation makes a lot of sense for computation graphs with high fan-out of some nodes (like an RNN that reuses the same parameters many times; great test!). Fortunately @dougalm's vector space type system made the initial implementation a breeze!
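To make the fan-out point concrete, here's a toy example (hypothetical, not the benchmark code): a weight matrix reused at every timestep contributes one outgrad per use, so with list accumulation its outgrads list holds one full-size array per step before the final sum, whereas in-place accumulation only ever holds one.

import autograd.numpy as np
from autograd import grad

def rnn_loss(W, h):
    # W is reused at every step, so its node in the trace has high fan-out:
    # the backward pass receives one gradient contribution for W per step
    for t in range(100):
        h = np.tanh(np.dot(W, h))
    return np.sum(h)

dW = grad(rnn_loss)  # gradient with respect to the shared weight matrix W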

@alexbw
Contributor Author

alexbw commented May 4, 2017

If I switch everything to float64, I don't see anything appreciably different.

[memory usage plot]

All I did was change the input and parameter types to float64; that might not be sufficient, though.

@alexbw
Contributor Author

alexbw commented May 5, 2017

FYI (just updating the thread with a conversation Matt and I had), PyTorch does NOT do anything special with gradients of indexing operations. They materialize the full-sized mostly-zero gradient, and add that in directly to their equivalent of outgrad.

It seems this is the common case for their users (and perhaps autograd's users!), and hasn't given rise to any serious performance complaints. They're aware it's suboptimal, but it doesn't seem that it's anywhere near the top of their todo list.
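For context, here is a sketch of the kind of dense indexing gradient being described (the helper name and shapes are illustrative, not PyTorch's or autograd's actual code):

import numpy as np

def index_vjp(g, x_shape, idx):
    # materialize a full-size, mostly-zero gradient for x[idx],
    # then scatter the incoming gradient into the indexed positions
    out = np.zeros(x_shape)
    np.add.at(out, idx, g)   # np.add.at handles repeated indices correctly
    return out

x = np.random.randn(1000, 1000)
g_row = np.ones(1000)                  # upstream gradient for the slice x[0]
g_x = index_vjp(g_row, x.shape, 0)     # a full 1000x1000 array, mostly zeros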

@alexbw
Contributor Author

alexbw commented May 6, 2017

With Matt's inplace-grad-accumulation branch:

[memory usage plot]

With current master (3d018e8):

[memory usage plot]

I added the benchmark file I'm using as a branch off master, benchmark-rnn.

@Etragas
Contributor

Etragas commented May 8, 2017

I'm fairly confident I've figured out the memory issue.
When autograd does a backward pass, it toposorts the nodes and then, for each node, passes its gradient on to its parents.
As far as I can tell, once that's done the node's accumulated gradient is no longer needed (think of conditional independence in graphical models).
But autograd doesn't clear that dictionary entry, so over time the backward pass holds on to a gigantic collection of gradients that will never be used again.
I'm submitting a pull request with the change, which is just "outgrads[node] = None" on line 42 of core.py (see the sketch below).
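A schematic of what the fix does (paraphrased; not the exact autograd source, and node.parents_and_vjps is a made-up attribute): once a node's accumulated gradient has been pushed to its parents, drop the entry so the array can be garbage collected.

def backward_pass_sketch(g_end, end_node, toposorted_nodes):
    # toposorted_nodes runs from the output back toward the inputs
    outgrads = {end_node: g_end}
    for node in toposorted_nodes:
        g = outgrads.pop(node)   # same effect as outgrads[node] = None:
                                 # the finished gradient becomes collectable
        for parent, vjp in node.parents_and_vjps:
            outgrads[parent] = outgrads.get(parent, 0) + vjp(g)
    return g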

Comparison of memory usage:
[memory usage comparison plot]

I checked the output of backward_pass from master against backward_pass with this addition, and the results were identical in my use case.

#221

@alexbw
Contributor Author

alexbw commented May 8, 2017

I don't see any change in peak memory usage in my RNN benchmark.

To replicate, you can do one of two things:

  1. Merge the "rnn-benchmark" branch and run python tests/rnn_benchmark.py to make a plot like the one above. Or
  2. Make a branch from your PR, merge in the asv branch, pip install asv, and then run asv run 3d018e8..HEAD to see how performance evolves per commit.

Could you post a version of your benchmark you're comfortable sharing?

@mattjj
Contributor

mattjj commented May 8, 2017

These are two separate issues. @Etragas's change in #221 ensures that the values in an outgrads list associated with a node are garbage collected after they're finished being used. The issue that @alexbw is raising is that it's expensive to accumulate the outgrads in a list to begin with (in common use cases with high fan-out). That is, @Etragas saves memory across multiple nodes' outgrads lists, whereas @alexbw is concerned about memory usage for a single node's outgrad list.

I believe @dougalm is implementing a different accumulation strategy for the outgrads that will primarily address the issue @alexbw raised, though the change @Etragas suggested (explicitly deleting the entry from the outgrads dict) should also be included, since we don't need to keep those around.

@alexbw
Contributor Author

alexbw commented May 9, 2017

Old:

[memory usage plot]

New (bb8c0bc):

[memory usage plot]

@Etragas
Contributor

Etragas commented May 9, 2017

As far as I can tell, the memory consumption during the backward pass is now essentially flat.
It's almost definitely the case that the remaining difference is a byproduct of the forward pass.
I'm not sure how to optimize this, since we need the graph in memory in order to traverse it.
Perhaps we could discard nodes in the forward graph as we traverse them, but that would only improve the average memory consumption; the peak would still be the same.

Thoughts?

@mattjj
Contributor

mattjj commented May 9, 2017

I think the remaining difference in memory in these plots is from float64 vs float32. @alexbw showed me some other plots in which he forced PyTorch to use float64, and in those plots autograd and PyTorch were essentially identical in terms of total memory usage.

To make autograd maintain float32s in computations (given float32 inputs), we need to go through all our VJPs systematically and avoid any up-casting.
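As a rough illustration of what avoiding up-casting could look like (a hypothetical pattern, not autograd's actual VJP code), each VJP could cast its result back to its input's dtype:

import numpy as np

def cast_like(g, x):
    # hypothetical helper: force a gradient back to its input's dtype,
    # undoing any silent float32 -> float64 promotion inside the VJP
    return np.asarray(g, dtype=np.asarray(x).dtype)

def square_vjp(g, x):
    raw = g * 2.0 * x          # the Python float 2.0 can promote 0-d float32 inputs
    return cast_like(raw, x)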

However, I think this fan-out memory usage issue is now solved with bb8c0bc, so I'm going to close this issue.

@mattjj closed this as completed on May 9, 2017
@j-towns
Collaborator

j-towns commented May 9, 2017

On the issue of float32 (/float16) support, I believe that arrays are generally already handled correctly by most of the VJPs; it's only when primitives are applied to scalar values that things get messy.

This is a reflection of numpy's type system, which respects the dtype of arrays and seems to largely ignore the dtype of scalar values (even if they are wrapped as an ndarray with shape ()).

I think this needs to be thoroughly tested, but assuming the above is correct, we could say in the docs that float32 is supported, but only for arrays.

In [1]: from autograd import grad

In [2]: import autograd.numpy as np

In [3]: def f(x):
   ...:     return x**2
   ...:

In [4]: grad(f)(np.array([3], dtype=np.float32))
Out[4]: array([ 6.], dtype=float32)

In [5]: grad(f)(np.array(3, dtype=np.float32))
AssertionError:
Grad of power returned unexpected vector space
Vector space is ArrayVSpace_{'shape': (), 'size': 1, 'dtype': dtype('float64'), 'scalartype': <class 'float'>}
Expected        ArrayVSpace_{'shape': (), 'size': 1, 'dtype': dtype('float32'), 'scalartype': <class 'float'>}
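A quick illustration of the scalar-vs-array behavior described above (this assumes NumPy's value-based promotion rules, which were in effect at the time of this thread; NEP 50 in newer NumPy changes the 0-d case):

import numpy as np

a = np.array([3.0], dtype=np.float32)   # shape (1,) array
s = np.array(3.0, dtype=np.float32)     # shape () "scalar" array

print((2.0 * a).dtype)   # float32 -- the array's dtype wins against the Python float
print((2.0 * s).dtype)   # float64 -- the 0-d array is treated like a scalar and promoted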
