Decreasing autograd memory usage #219

Closed · alexbw opened this issue May 3, 2017 · 21 comments

@alexbw
Contributor

alexbw commented May 3, 2017

I don't mean "memory leak" in the sense of memory that remains unreachable after the Python process exits; I mean memory that keeps being allocated during the backward pass when it should be freed. I mentioned this problem in #199, but I think it deserves its own issue.

For a simple function

import autograd.numpy as np
from autograd import grad

def F(x,z):
    for i in range(100):
        z = np.dot(x,z)
    return np.sum(z)
dF = grad(F)

and a procedure to measure memory usage

from memory_profiler import memory_usage
def make_data():
    np.random.seed(0)
    D = 1000
    x = np.random.randn(D,D)
    x = np.dot(x,x.T)
    z = np.random.randn(D,D)
    return x,z

def m():
    from time import sleep
    x,z = make_data()
    gx = dF(x,z)
    sleep(0.1)
    return gx

mem_usage = np.array(memory_usage(m,interval=0.01))
mem_usage -= mem_usage[0]

and a manual gradient of the same function

# minimal stand-in for autograd's internal repeat_to_match_shape helper,
# used below to broadcast the scalar gradient of np.sum back to z's shape
def repeat_to_match_shape(g, x):
    return np.ones(np.shape(x)) * g

def grad_dot_A(g,A,B):
    ga = np.dot(g,B.T)
    ga = np.reshape(ga,np.shape(A))
    return ga

def grad_dot_B(g,A,B):
    gb = np.dot(A.T,g)
    gb = np.reshape(gb, np.shape(B))
    return gb

def dF(x, z):
    z_stack = []
    for i in list(range(100)):
        z_stack.append(z)
        z = np.dot(x, z)
    retval = np.sum(z)

    # Begin Backward Pass
    g_retval = 1
    g_x = 0

    # Reverse of: retval = np.sum(z)
    g_z = repeat_to_match_shape(g_retval, z)
    for i in reversed(list(range(100))):

        # Reverse of: z = np.dot(x, z)
        z = z_stack.pop()
        tmp_g0 = grad_dot_A(g_z, x, z)
        tmp_g1 = grad_dot_B(g_z, x, z)
        g_z = 0
        g_x += tmp_g0
        g_z += tmp_g1
    return g_x

I get the following memory usage profile:

[memory usage plot]

If I replace autograd's dot gradient with the ones used in the manual code, I get the same memory profile; nothing improves.

If I replace the dot product with element-wise multiply, I get a different memory profile, but still not what I would expect:

[memory usage plot]
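(For reference, the element-wise variant referred to above is presumably along these lines; F_mul is just an illustrative name.)

def F_mul(x, z):
    # same loop as F above, but with element-wise multiply instead of np.dot
    for i in range(100):
        z = x * z
    return np.sum(z)
dF_mul = grad(F_mul)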

I would love to help figure this out, but I'm not sure where to start. First thing is of course to document the problem.

@alexbw changed the title from "Memory leak in dot gradient" to "Memory leak in autograd" on May 3, 2017
@mattjj
Contributor

mattjj commented May 4, 2017

Are you sure there's a leak here? Neither version seems to end at exactly zero net memory usage in the plots (and indeed the manual ones go negative at first), and we would expect autograd to require more memory because it has to trace the forward computation (not just store the values, but box each of them, etc.). This doesn't seem out of line with my expectations, and Python memory deallocation and measurement can be pretty complicated [1] [2].

@mattjj
Contributor

mattjj commented May 4, 2017

Sorry, I don't understand. Doesn't autograd end with less net memory usage than pytorch in that plot? Where is the memory leak shown?

@alexbw changed the title from "Memory leak in autograd" to "Decreasing autograd memory usage" on May 4, 2017
@alexbw
Contributor Author

alexbw commented May 4, 2017

[memory usage plot]

Updated image of memory usages.

@alexbw
Contributor Author

alexbw commented May 4, 2017

Here's a gist with a more "real-world" benchmark: taking the gradient of an RNN.

https://gist.github.com/alexbw/7b2a0682f65dd1bcb7120ca2d47a2823

Here's the memory usage:

[memory usage plot]

@alexbw
Contributor Author

alexbw commented May 4, 2017

For the RNN benchmark, here's the forward and backward pass of autograd run separately, versus just calling grad(f)(*args).

[memory usage plot]

(Apologies, the benchmark code is getting a little hairy.)
https://gist.github.com/alexbw/5730d3a7f61d36ff482e17b1024ec105

@mattjj
Contributor

mattjj commented May 4, 2017

It's probably from the outgrads being accumulated in a list and summed all at once rather than incrementally summed.
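Roughly, the difference between the two strategies is as follows (a schematic sketch, not autograd's actual internals; the function names are made up for illustration):

import numpy as np

def accumulate_by_list(outgrad_list):
    # list accumulation: every incoming gradient stays alive until the final sum
    return sum(outgrad_list)

def accumulate_in_place(outgrad_list):
    # incremental accumulation: only one running buffer is kept alive
    total = None
    for g in outgrad_list:
        if total is None:
            total = np.array(g)   # copy so the in-place add can't clobber a live array
        else:
            total += g
    return total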

@mattjj
Contributor

mattjj commented May 4, 2017

I just pushed a commit that does in-place gradient accumulation; maybe check how the memory looks with that (though I might not have done it right).

@alexbw
Contributor Author

alexbw commented May 4, 2017

Much better with inplace-grad-accumulation; memory usage is now within the same order of magnitude.

[memory usage plot]

@mattjj
Contributor

mattjj commented May 4, 2017

Btw, thanks for these benchmarks! The latter ones are quite stark. (At first I was just focusing on the 'leak' idea, but not enough on the total memory usage issue.)

@alexbw
Contributor Author

alexbw commented May 4, 2017

For clarity, here's just PyTorch vs. autograd, so that we don't have to think about anything other than operator-overloading (OO) AD systems.

PyTorch's line being shorter, by the way, means that its program runs faster (about 2x here).

[memory usage plot]

@mattjj
Contributor

mattjj commented May 4, 2017

Is there a potential factor of 2 because of float32 vs float64, or are all these propagating the same dtypes throughout?

I think in-place accumulation makes a lot of sense for computation graphs with high fan-out of some nodes (like an RNN that reuses the same parameters many times; great test!). Fortunately @dougalm's vector space type system made the initial implementation a breeze!
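To make the fan-out point concrete, here's a toy example (hypothetical, not the benchmark code): a weight matrix reused at every timestep contributes one outgrad per use, so with list accumulation its outgrads list holds one full-size array per step before the final sum, whereas in-place accumulation only ever holds one.

import autograd.numpy as np
from autograd import grad

def rnn_loss(W, h):
    # W is reused at every step, so its node in the trace has high fan-out:
    # the backward pass receives one gradient contribution for W per step
    for t in range(100):
        h = np.tanh(np.dot(W, h))
    return np.sum(h)

dW = grad(rnn_loss)  # gradient with respect to the shared weight matrix W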

@alexbw
Contributor Author

alexbw commented May 4, 2017

If I switch everything to float64, I don't see anything appreciably different.

[memory usage plot]

All I did was change the input and parameter types to float64; that might not be sufficient, though.

@alexbw
Contributor Author

alexbw commented May 5, 2017

FYI (just updating the thread with a conversation Matt and I had), PyTorch does NOT do anything special with gradients of indexing operations. They materialize the full-sized mostly-zero gradient, and add that in directly to their equivalent of outgrad.

It seems this is the common case for their users (and perhaps autograd's users!), and hasn't given rise to any serious performance complaints. They're aware it's suboptimal, but it doesn't seem that it's anywhere near the top of their todo list.
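For context, here is a sketch of the kind of dense indexing gradient being described (the helper name and shapes are illustrative, not PyTorch's or autograd's actual code):

import numpy as np

def index_vjp(g, x_shape, idx):
    # materialize a full-size, mostly-zero gradient for x[idx],
    # then scatter the incoming gradient into the indexed positions
    out = np.zeros(x_shape)
    np.add.at(out, idx, g)   # np.add.at handles repeated indices correctly
    return out

x = np.random.randn(1000, 1000)
g_row = np.ones(1000)                  # upstream gradient for the slice x[0]
g_x = index_vjp(g_row, x.shape, 0)     # a full 1000x1000 array, mostly zeros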

@alexbw
Contributor Author

alexbw commented May 6, 2017

With Matt's inplace-grad-accumulation branch:

[memory usage plot]

With current master (3d018e8):

[memory usage plot]

I added the benchmark file I'm using as a branch off master, benchmark-rnn.

@Etragas
Contributor

Etragas commented May 8, 2017

I'm fairly confident I've figured out the memory issue.
When autograd does a backward pass, it toposorts the nodes and then, for each node, passes its gradient on to its parents.
As far as I can tell, once that's done the node's accumulated gradient is no longer needed (think of conditional independence in graphical models).
But autograd doesn't clear that dictionary entry, so over time the backward pass holds on to a gigantic collection of gradients that will never be used again.
I'm submitting a pull request with the change, which is just "outgrads[node] = None" on line 42 of core.py (see the sketch below).
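A schematic of what the fix does (paraphrased; not the exact autograd source, and node.parents_and_vjps is a made-up attribute): once a node's accumulated gradient has been pushed to its parents, drop the entry so the array can be garbage collected.

def backward_pass_sketch(g_end, end_node, toposorted_nodes):
    # toposorted_nodes runs from the output back toward the inputs
    outgrads = {end_node: g_end}
    for node in toposorted_nodes:
        g = outgrads.pop(node)   # same effect as outgrads[node] = None:
                                 # the finished gradient becomes collectable
        for parent, vjp in node.parents_and_vjps:
            outgrads[parent] = outgrads.get(parent, 0) + vjp(g)
    return g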

Comparison of memory usage:
[memory usage comparison plot]

I checked the output of backward_pass from master against backward_pass with this addition, and the results were identical in my use case.

#221

@alexbw
Contributor Author

alexbw commented May 8, 2017

I don't see any change in peak memory usage in my RNN benchmark.

To replicate, you can do one of two things:

  1. Merge the "rnn-benchmark" branch and run python tests/rnn_benchmark.py to make a plot like the one above. Or
  2. Make a branch from your PR, merge in the asv branch, pip install asv, and then run asv run 3d018e8..HEAD to see how performance evolves per commit.

Could you post a version of your benchmark you're comfortable sharing?

@mattjj
Contributor

mattjj commented May 8, 2017

These are two separate issues. @Etragas's change in #221 ensures that the values in an outgrads list associated with a node are garbage collected after they're finished being used. The issue that @alexbw is raising is that it's expensive to accumulate the outgrads in a list to begin with (in common use cases with high fan-out). That is, @Etragas saves memory across multiple nodes' outgrads lists, whereas @alexbw is concerned about memory usage for a single node's outgrad list.

I believe @dougalm is implementing a different accumulation strategy for the outgrads that will primarily address the issue @alexbw raised, though the change @Etragas suggested (explicitly deleting the entry from the outgrads dict) should also be included, since we don't need to keep those around.

@alexbw
Contributor Author

alexbw commented May 9, 2017

Old:

[memory usage plot]

New (bb8c0bc):

[memory usage plot]

@Etragas
Contributor

Etragas commented May 9, 2017

As far as I can tell, the memory consumption during the backward pass is now essentially flat.
It's almost definitely the case that the remaining difference is a byproduct of the forward pass.
I'm not sure how to optimize this, since we need the graph in memory in order to traverse it.
Perhaps we could discard nodes in the forward graph as we traverse them, but that would only improve the average memory consumption; the peak would still be the same.

Thoughts?

@mattjj
Contributor

mattjj commented May 9, 2017

I think the remaining difference in memory in these plots is from float64 vs float32. @alexbw showed me some other plots in which he forced PyTorch to use float64, and in those plots autograd and PyTorch were essentially identical in terms of total memory usage.

To make autograd maintain float32s in computations (given float32 inputs), we need to go through all our VJPs systematically and avoid any up-casting.
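As a rough illustration of what avoiding up-casting could look like (a hypothetical pattern, not autograd's actual VJP code), each VJP could cast its result back to its input's dtype:

import numpy as np

def cast_like(g, x):
    # hypothetical helper: force a gradient back to its input's dtype,
    # undoing any silent float32 -> float64 promotion inside the VJP
    return np.asarray(g, dtype=np.asarray(x).dtype)

def square_vjp(g, x):
    raw = g * 2.0 * x          # the Python float 2.0 can promote 0-d float32 inputs
    return cast_like(raw, x)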

However, I think this fan-out memory usage issue is now solved with bb8c0bc, so I'm going to close this issue.

@mattjj closed this as completed on May 9, 2017
@j-towns
Collaborator

j-towns commented May 9, 2017

On the issue of float32 (/float16) support, I believe that arrays are generally already handled correctly by most of the VJPs; it's only when primitives are applied to scalar values that things get messy.

This is a reflection of numpy's type system, which respects the dtype of arrays and seems to largely ignore the dtype of scalar values (even if they are wrapped as an ndarray with shape ()).

I think this needs to be thoroughly tested, but assuming the above is correct, we could say in the docs that float32 is supported, but only for arrays.

In [1]: from autograd import grad

In [2]: import autograd.numpy as np

In [3]: def f(x):
   ...:     return x**2
   ...:

In [4]: grad(f)(np.array([3], dtype=np.float32))
Out[4]: array([ 6.], dtype=float32)

In [5]: grad(f)(np.array(3, dtype=np.float32))
AssertionError:
Grad of power returned unexpected vector space
Vector space is ArrayVSpace_{'shape': (), 'size': 1, 'dtype': dtype('float64'), 'scalartype': <class 'float'>}
Expected        ArrayVSpace_{'shape': (), 'size': 1, 'dtype': dtype('float32'), 'scalartype': <class 'float'>}
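A quick illustration of the scalar-vs-array behavior described above (this assumes NumPy's value-based promotion rules, which were in effect at the time of this thread; NEP 50 in newer NumPy changes the 0-d case):

import numpy as np

a = np.array([3.0], dtype=np.float32)   # shape (1,) array
s = np.array(3.0, dtype=np.float32)     # shape () "scalar" array

print((2.0 * a).dtype)   # float32 -- the array's dtype wins against the Python float
print((2.0 * s).dtype)   # float64 -- the 0-d array is treated like a scalar and promoted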
