Sequence Buffer Sampling Performance #180

Open · AlexanderKoch-Koch opened this issue Jul 30, 2020 · 3 comments
@AlexanderKoch-Koch (Contributor)

I have some performance issues with the sequence buffers. I have traced them to the extract_sequence function in rlpyt/utils/misc.py, which is implemented with a loop over all batch elements. This seems to be quite slow, but I wasn't able to find torch functions that could replace the Python loop.
When I run my RL algorithm on a V100, the optimization loop spends about 50% of its time in the extract_batch() function.

Has anyone else encountered this problem before and has a solution?
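For reference, the loop-based access pattern in question looks roughly like the sketch below. This is not the rlpyt implementation (the names are hypothetical and buffer wraparound is ignored); it just shows why the copy cost is paid once per sampled sequence:

```python
import torch

def extract_sequences_loop(buffer, t_idxs, b_idxs, seq_len):
    """Gather one length-seq_len sequence per sampled (time, env) index pair.

    buffer: [T_total, B, ...] replay buffer tensor (hypothetical layout).
    t_idxs, b_idxs: sequences of start-time and env indices, one per sample.
    Each loop iteration is a separate slice + copy, so the cost scales with
    the number of sampled sequences rather than with the bytes moved.
    """
    seqs = [buffer[t:t + seq_len, b] for t, b in zip(t_idxs, b_idxs)]
    return torch.stack(seqs, dim=1)  # -> [seq_len, batch, ...]
```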

@astooke (Owner) commented Aug 5, 2020

Oooh that's not ideal. Let's see if we can speed this up.

Did you profile with CUDA_LAUNCH_BLOCKING=1? Without it, GPU time is not always attributed correctly because kernels run asynchronously, and other parts of the program can appear more time-consuming than they really are.
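For anyone following along: that variable forces each kernel launch to wait for the kernel to finish, so the time shows up on the line that actually does the work. A minimal sketch of timing a GPU op this way (assumes a CUDA device is available; synchronizing manually gives the same effect):

```python
import os
# Must be set before CUDA is initialized, so the command-line form
# `CUDA_LAUNCH_BLOCKING=1 python train.py` is the safest way to use it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import time
import torch

x = torch.randn(2048, 2048, device="cuda")
start = time.time()
y = x @ x
torch.cuda.synchronize()  # explicit sync, so the timing is correct either way
print(f"matmul: {time.time() - start:.4f}s")
```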

Is it mostly extracting the observations that is slow?

I have also noticed that this function can be time consuming; it's just a lot of data to copy if you're dealing with images. In that case, the really advanced thing to do would be to make a parallel, pipelined data loader, like people use in supervised learning. It would need the read-write lock as in the asynchronous mode. Might get complicated! A rough sketch of the idea is below.
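As a rough illustration of that idea (not rlpyt code; names are hypothetical and the buffer locking is omitted), a background thread could sample and copy the next batch to the GPU while the optimizer consumes the current one:

```python
import queue
import threading

class PrefetchLoader:
    """Hypothetical sketch: overlap batch extraction/copy with optimization.

    sample_fn() should return a CPU batch tensor (e.g. what extract_sequence
    produces); the worker moves it to the GPU so the copy happens while the
    optimizer is busy with the previous batch.
    """

    def __init__(self, sample_fn, device="cuda", depth=2):
        self.sample_fn = sample_fn
        self.device = device
        self.queue = queue.Queue(maxsize=depth)
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            batch = self.sample_fn()
            # non_blocking only overlaps if the CPU tensor is in pinned memory
            batch = batch.pin_memory().to(self.device, non_blocking=True)
            self.queue.put(batch)

    def next_batch(self):
        return self.queue.get()
```

A real version would also need a shutdown signal and, in rlpyt's async mode, the read-write lock mentioned above, since the sampler keeps writing into the same buffer.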

@AlexanderKoch-Koch (Contributor, Author)

CUDA_LAUNCH_BLOCKING doesn't make a difference.

I have tested it by just reading one large contiguous part of the buffer and reshaping it. This is much faster (about 20x). So I think the problem is not the amount of data but the scattered read positions.

I have also tried creating a list of all the indices that should be read and then using a single torch call to read all of them. But this is just as slow as your implementation with the loop.
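For concreteness, the two access patterns being compared look roughly like this (hypothetical shapes and a toy timing harness; actual numbers depend on hardware):

```python
import time
import torch

# Hypothetical replay buffer of uint8 frames: [T_total, B, H, W]
buffer = torch.zeros(5_000, 8, 84, 84, dtype=torch.uint8)
T, n_samples = 40, 256                               # sequence length, batch size
t_idxs = torch.randint(0, 5_000 - T, (n_samples,))   # random sequence start times
b_idxs = torch.randint(0, 8, (n_samples,))           # random env indices

def timed(fn):
    start = time.time()
    fn()
    return time.time() - start

# (a) one contiguous block of the same total size, then reshape: a single bulk copy
t_contig = timed(lambda: buffer[: T * n_samples // 8].clone().reshape(T, n_samples, 84, 84))

# (b) gather the scattered sequences in one torch call: advanced indexing still
#     performs many small, non-contiguous copies under the hood
rows = t_idxs[:, None] + torch.arange(T)[None, :]    # [n_samples, T] time indices
t_gather = timed(lambda: buffer[rows, b_idxs[:, None]])

print(f"contiguous read: {t_contig:.4f}s   gathered read: {t_gather:.4f}s")
```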

A parallel data loader would probably be the most elegant solution, but I have given up on it because it got too complicated.

@astooke (Owner) commented Aug 7, 2020

Dang, bummer to hear :(

An intermediate solution could be to use the asynchronous runner, so that sampling runs continuously in one process while optimization runs continuously in another. If sampling is the slower part anyway, this would hide the memory-copy time.

Does this prevent you from running the experiment at all?
