
How does DEQ use less memory compared to explicit network? #33

Open
SeminKim opened this issue Oct 4, 2024 · 0 comments
Comments


SeminKim commented Oct 4, 2024

Hi, I have a question about the memory footprint of DEQ.
As far as I understand, DEQ does not need to store intermediate activations, and is therefore able to approximate an infinite-layer model at the memory cost of only one layer (so training with NFE=30 iterations should cost about the same as a single iteration).
However, in Table 3 of the first DEQ paper, the explicit 16-layer Transformer-XL consumes much more VRAM than DEQ-Transformer (medium).
They appear to have nearly the same architecture and nearly the same number of parameters. In this setting, as far as I understand, DEQ should perform better because it effectively models a much deeper network than its explicit counterpart while consuming the same memory. Why does DEQ consume less VRAM? Shouldn't it be the same?
(I also found that the forward function of the DEQ transformer contains one regular, explicit forward pass that tracks gradients: `z1s.requires_grad_()`)
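
For context, here is my rough understanding of the constant-memory pattern as a minimal PyTorch sketch. The toy layer `DEQLayer.f`, the plain fixed-point solver, and the shapes are my own illustrative assumptions, not the repository's actual implementation:

```python
# Minimal sketch of a DEQ layer whose memory cost does not grow with the
# number of solver iterations (NFE). Toy layer and naive solver are assumed
# for illustration only; the real repo uses a more sophisticated solver.
import torch
import torch.nn as nn


def fixed_point_solve(g, z0, max_iter=30, tol=1e-4):
    """Naive fixed-point iteration z <- g(z); stands in for a real solver."""
    z = z0
    for _ in range(max_iter):
        z_new = g(z)
        if (z_new - z).norm() < tol * (z.norm() + 1e-8):
            return z_new
        z = z_new
    return z


class DEQLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # One weight-tied "layer": f(z, x) = tanh(W [z; x])
        self.lin = nn.Linear(2 * dim, dim)

    def f(self, z, x):
        return torch.tanh(self.lin(torch.cat([z, x], dim=-1)))

    def forward(self, x):
        # (1) Solve z* = f(z*, x) with NO autograd tape: none of these
        #     iterations store activations, so their count is free memory-wise.
        with torch.no_grad():
            z_star = fixed_point_solve(lambda z: self.f(z, x), torch.zeros_like(x))

        # (2) Re-engage autograd with ONE tracked application of f at z*.
        #     This seems to be the role of `z1s.requires_grad_()` in the repo:
        #     only this single layer's activations are kept for backward.
        z_star = self.f(z_star, x)

        # (3) Backward: solve the adjoint fixed-point equation
        #     g = vjp(f, z*)(g) + grad by iteration, again in constant memory,
        #     using a detached re-application of f for the vector-Jacobian products.
        z0 = z_star.clone().detach().requires_grad_()
        f0 = self.f(z0, x)

        def backward_hook(grad):
            return fixed_point_solve(
                lambda g: torch.autograd.grad(f0, z0, g, retain_graph=True)[0] + grad,
                grad,
            )

        z_star.register_hook(backward_hook)
        return z_star


# Peak memory of backward() stays at roughly one f call, regardless of NFE.
layer = DEQLayer(dim=8)
x = torch.randn(4, 8)
layer(x).pow(2).mean().backward()
```

In this sketch only step (2) keeps activations for autograd, so peak memory does not depend on the number of solver iterations, which is how I currently read the DEQ memory argument.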
