
DownloadBuffer: Improve usability and documentation #6426

Open
9SMTM6 opened this issue Oct 19, 2024 · 15 comments
Labels
area: documentation Documentation for crate items, public or private

Comments

@9SMTM6
Contributor

9SMTM6 commented Oct 19, 2024

Is your feature request related to a problem? Please describe.
I recently tried to use DownloadBuffer, but stumbled upon some issues with the callback not actually being executed.

After looking at the source code, specifically the usage of DownloadBuffer::map_async and copy_buffer_to_buffer and the documentation of map_async, things became clearer to me:

For the callback to complete, either queue.submit(..), instance.poll_all(..), or device.poll(..) must be called elsewhere in the runtime, possibly integrated into an event loop or run on a separate thread.

The callback will be called on the thread that first calls the above functions after the gpu work has completed. There are no restrictions on the code you can run in the callback, however on native the call to the function will not complete until the callback returns, so prefer keeping callbacks short and used to set flags, send messages, etc.

However, I still don't have a solution that's entirely satisfying. Regardless of that, the documentation of DownloadBuffer and/or DownloadBuffer::read_buffer really should mention or reference these considerations too. It's most likely to be used by beginners, who are also the most likely to stumble over this behavior.

Now regarding a satisfying solution. I think ideally we should be able to 'poll until the callback was executed (and no further)'. But that's not possible right now. We have no way of getting the SubmissionIndex returned by the call to copy_buffer_to_buffer, and I'm not even sure whether doing device_arc.poll(wgpu::Maintain::WaitForSubmissionIndex(subm_idx)) after the call to DownloadBuffer::read_buffer would be sufficient (?).

Describe the solution you'd like
At the very least the documentation of DownloadBuffer::read_buffer should contain some hint at the issue.

Ideally we would also be provided a way of polling until the callback is executed, e.g. by returning the SubmissionIndex created in the call to copy_buffer_to_buffer from read_buffer, if that is sufficient. If not, some other way would be appreciated.
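To make that concrete, here is a rough sketch of the API shape I have in mind. This is hypothetical: read_buffer currently returns nothing, and the submission_index binding here is made up.

```rust
// Hypothetical: suppose read_buffer returned the SubmissionIndex of the
// copy_buffer_to_buffer submission it performs internally.
let submission_index = DownloadBuffer::read_buffer(
    &device,
    &queue,
    &buffer.slice(..),
    move |result| {
        // Keep the callback short: just forward the result.
        let _ = sender.send(result);
    },
);

// On native, poll exactly until that submission (and thus the callback)
// has completed, without waiting on unrelated work:
device.poll(wgpu::Maintain::WaitForSubmissionIndex(submission_index));
```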

Describe alternatives you've considered

I cannot create my own variant of DownloadBuffer::read_buffer without patching wgpu, or making considerable changes to its logic, as that method uses a lot of private fields, methods, etc. I'm perfectly willing to patch wgpu for experimentation, but that's not a permanent solution.

Meanwhile, using device_arc.poll(wgpu::Maintain::Wait) might, from my understanding, wait a lot longer than required if another thread happened to submit a long job in the meantime.

Additional context

Note that this behavior can only be reproduced as is on native, on the web the browser polls webgpu in its event loop.

Also, I noticed that the documentation of Device::poll is outdated. It doesn't seem problematic to me, but I thought I'd let you know. The documentation states that this method returns a boolean, when it in fact returns an enum. It also doesn't mention Maintain::WaitForSubmissionIndex at all. And the behavior on the web could be clarified: I assume the text means that polling will never block, but if the sentence uses terms loosely (concretely, regarding what "noop" means), it could also be interpreted as blocking until completion while, in contrast to native, not actively driving the work to completion.

@Wumpf
Member

Wumpf commented Oct 19, 2024

Huh, yeah, the doc for Device::poll being outdated is a straight-up bug. Not sure how it can be much clearer than "no-op": the call does literally nothing when targeting WebGPU. Maybe it should spell it out like that, with less jargon :)

Ideally we would also be provided a way of polling until the callback is executed, e.g. by returning the SubmissionIndex created in the call to copy_buffer_to_buffer from read_buffer, if that is sufficient.

It's likely possible to do this on native, but afaik + from a brief scroll over the relevant spec parts there's no way to do this on WebGPU. There is no guarantee of when the promise will be resolved (== the callback will be called).

That said, "polling until the callback is executed" is fairly easy, isn't it? Just use a channel in the callback to signify that enough polling has been done.
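For illustration, a minimal sketch of that channel pattern. The wgpu parts are replaced by a stand-in (fake_map_async is a made-up function simulating wgpu firing the map_async callback from another thread that polls); only the channel choreography is the point.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for Buffer::map_async: in real wgpu code the callback fires from
// whichever thread's call to device.poll()/queue.submit() completes the map.
fn fake_map_async(callback: impl FnOnce(Result<(), ()>) + Send + 'static) {
    thread::spawn(move || callback(Ok(())));
}

fn wait_for_map() -> bool {
    let (tx, rx) = mpsc::channel();
    fake_map_async(move |result| {
        // Keep the callback short: just signal that mapping finished.
        let _ = tx.send(result);
    });
    // In real code, loop on device.poll(wgpu::Maintain::Poll) (or call
    // device.poll(wgpu::Maintain::Wait) once) until this recv succeeds.
    rx.recv().expect("callback ran").is_ok()
}

fn main() {
    assert!(wait_for_map());
}
```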

@Wumpf added the area: documentation (Documentation for crate items, public or private) label on Oct 19, 2024
@9SMTM6
Contributor Author

9SMTM6 commented Oct 19, 2024

Not sure how to be much clearer than "no-op" it can be: The call does literally nothing when targeting WebGPU, maybe it should spell it out like that with less jargon

I mean, people do all kinds of crimes with all kinds of terms. IIRC from a fasterthanlime video, you can use the nop instruction in x86 with register operands, and that has some special meaning when your CPU is new enough. That was just something I had in the back of my mind when I read that; having it stated more clearly would've cleared me of that nagging doubt. Something like "on the web, polling is already done by the browser, and polling with wait does not block".

That said, "polling until the callback is executed" is fairly easy, isn't it? Just use a channel in the callback to signify that enough polling has been done.

Hmm. I guess that's possible. But considering that the Maintain::WaitForSubmissionIndex mechanism exists, this seems like a bandaid, caused by the SubmissionIndex from copy_buffer_to_buffer never being exposed. I repeat my assumption:

getting the SubmissionIndex returned by the call to copy_buffer_to_buffer, [...] device_arc.poll(wgpu::Maintain::WaitForSubmissionIndex(subm_idx)) after the call to DownloadBuffer::read_buffer [will probably wait until the callback is executed?].

I don't really know how this is supposed to work, but this being the way it works is one of very few possible explanations for everything I've seen. This is also regarding:

It's likely possible to do this on native.

So, the docs for map_async state:

The callback will be called on the thread that first calls the above functions

So unless map_async creates some kind of hidden SubmissionIndex, or integrates otherwise into wgpu, doing a device_arc.poll(wgpu::Maintain::WaitForSubmissionIndex(subm_idx)) with the SubmissionIndex of the copy_buffer_to_buffer should likely get me to the target?

but afaik + from a brief scroll over the relevant spec parts there's no way to do this on WebGPU

I mean, yeahish, but on webgpu I don't have to poll at all. With that I already have a solution, something that I do on both native and web, and that is a channel, similar to what you described.

In general, it'd be nice to be able to somehow integrate with a Rust async runtime on native, btw.
At the moment this requires a bunch of extra work (both for the programmer and the PC). One issue is that the polling is blocking, not async. My current solution is to put the polling onto a separate thread with tokio's spawn_blocking on native, do nothing on the web, and then use the above-mentioned channel with async. But ideally I'd be able to await the GPU being finished, doing the polling of the GPU on that very same task, with no need for a callback.

But that's admittedly a big ask.
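A sketch of the workaround described above, using plain std threads in place of tokio's spawn_blocking so it stays self-contained; fake_blocking_poll is a made-up stand-in for the blocking device.poll(wgpu::Maintain::Wait) plus the map callback.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the blocking poll: in the real app this body would be
// device.poll(wgpu::Maintain::Wait), and the map_async callback (which
// runs on this polling thread) would send the downloaded bytes.
fn fake_blocking_poll(done: mpsc::Sender<Vec<u8>>) {
    let _ = done.send(vec![1, 2, 3]);
}

fn download() -> Vec<u8> {
    let (tx, rx) = mpsc::channel();
    // Polling blocks, so it gets its own thread (tokio::task::spawn_blocking
    // in my app); the async task awaits the channel instead of blocking.
    let poller = thread::spawn(move || fake_blocking_poll(tx));
    let bytes = rx.recv().expect("poller sent data");
    poller.join().unwrap();
    bytes
}

fn main() {
    assert_eq!(download(), vec![1, 2, 3]);
}
```

Note that the data crossing the channel must be Send, which is part of the synchronization overhead mentioned below.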

Also, more notes on the documentation: the documentation of map_async should note that this call isn't required on the web, and maybe clarify that not just any call to Device::poll suffices: if you wait for a SubmissionIndex before the callback could possibly run, this does not work (at least it did not for me). Which makes perfect sense, still.

@Wumpf
Member

Wumpf commented Oct 20, 2024

#6360 is actually adding a submission index to map_async. There are still some open questions there, though; I haven't looked deeply into it yet.

@Wumpf
Member

Wumpf commented Oct 20, 2024

I mean, yeahish, but on webgpu I don't have to poll at all. With that I already have a solution, something that I do on both native and web, and that is a channel, similar to what you described.

Something I'd like to better understand is what a good cross platform application would look like. Let's say wgpu were to expose submission index on native and just a useless placeholder on webgpu, how would you use that in an application that has to support both? In the end you'll always end up with continuous polling until the callback fires either way. In other words, I can't entirely get my head around yet how the submission index would help 🤔. For a pure native app and/or compute workloads it might be nice-to-have I guess.
Btw., if you have a frame pump that submits each frame, you don't even have to poll, because submit polls for you... I guess that should also go into the docs!

@eliemichel
Contributor

eliemichel commented Oct 22, 2024

how would you use that in an application that has to support both?

By WebGPU spec, for any operation that in the JavaScript API returns a Promise, the C API uses a WGPUFuture object (which just contains an opaque u64 in practice). Details can be found in the Asynchronous Operations article, from the (WIP) official documentation of WebGPU C API.

Polling until a particular async operation has been completed is then supposed to be done using wgpuInstanceWaitAny.

When cross-compiling to the Web, wgpuInstanceWaitAny corresponds to yielding back to the browser until the Promise that corresponds to the WGPUFuture gets resolved (conceptually, it is just replaced by await promise;).

Currently, wgpu-native does not support this WGPUFuture mechanism. My feeling is that it could be implemented by using submission indices as the WGPUFuture's u64 payload, and implementing wgpuInstanceWaitAny as calling device.poll with WaitForSubmissionIndex(subm_idx), as suggested above.

Hence the need to have a submission index wherever the WebGPU JavaScript API returns a Promise, e.g. DownloadBuffer::map_async. Now, this might be a wrong approach, in which case I will be happy to consider alternatives!

@Wumpf
Member

Wumpf commented Oct 22, 2024

I see how this makes sense for implementing wgpu-native, so possibly exposing this on wgpu-core would make things a lot easier.
But the original issue here is about the wgpu Rust API, which doesn't have wgpuInstanceWaitAny and futures to begin with (yielding back to the browser typically means exiting an async update method; tbh, I'm not even aware of a different way). Currently, at least, everything is solved with a callback mechanism rather than futures, since this is easier to work with in the absence of an async runtime (we wouldn't want to force tokio or similar on anyone using wgpu).
DownloadBuffer in particular is a utility on top of the wgpu Rust API.

All somewhat related to this older issue

@eliemichel
Contributor

eliemichel commented Oct 22, 2024

everything is solved with a callback mechanism rather than futures, since this is easier to work with in the absence of an async runtime

The C API also relies on callbacks (clearly, there is no async runtime in C), the term "future" in WGPUFuture is just a fancy way to call what is conceptually just a submission ID (native) or a Promise (JavaScript API).

So, back to the original question:

I think ideally we should be able to 'poll until callback was executed (and no further)'. But that's not possible right now. We have no way of getting the SubmissionIndex returned by the call to copy_buffer_to_buffer

What OP wants is that copy_buffer_to_buffer (or rather map_read, I believe) returns a thing with which we can poll the device until completion of the operation. In the case of wgpu running on wgpu-core, that "thing" can be the submission ID. When wgpu runs on an arbitrary WebGPU implementation through the JavaScript API, this "thing" is the Promise returned by buffer.map_read. And when wgpu runs on an arbitrary WebGPU implementation through the C API (maybe not possible for now, but a future prospect), the "thing" is a WGPUFuture (a.k.a. a u64, exactly like the submission ID).

I believe that the Rust wgpu API should expose an opaque Thing type that is either a Promise or a u64, which operations like map_read return and with which we can poll the device. The Thing could be called Future to match WebGPU C API wording, or something else if that really feels too confusing for Rust users.

edit: This is somewhat what @cwfitzgerald mentioned in the last message of the discussion you linked!

@9SMTM6
Contributor Author

9SMTM6 commented Oct 25, 2024

Sorry for taking a while to come back to this; a lot of context was added that I wanted to take some time to read first.

Something I'd like to better understand is what a good cross platform application would look like.

To be honest, IDK. I don't have a terrible lot of experience with web+native cross platform, and in addition my use case might not reflect other users.

In other words, I can't entirely get my head around yet how the submission index would help 🤔. For a pure native app and/or compute workloads it might be nice-to-have I guess.

The reason I stumbled upon DownloadBuffer and the connected bits was for a compute workload, thus you can guess why. If I should at some point submit multiple workloads at the same time, polling with wgpu::Maintain::Wait might increase latency in some difficult to debug situations.
Though, in all honesty, at the current complexity of the application, that's not going to happen.

Btw. if you have a frame pump that does submit each frame you don't even have to because submit polls for you...

Since there is currently a single queue per (wgpu) device, egui/eframe in my app also runs on wgpu, and I don't want the added difficulty of slicing my compute into small bits, I currently need to create a separate device anyway. Otherwise the UI becomes entirely unusable whenever I compute something. So there is nothing else polling that compute device.

I believe that the rust wgpu API should expose an opaque Thing type that is either a Promise or a u64, which operations like map_read return and with which we can poll the device

Yeah, that would at least be my expectation if I could await this somehow. So in whatever way I await this, on native it will actually drive progress on the GPU, and thus indeed complete when the work on it is done, while on the web it's backed by a promise, since the browser drives the progress on its own.

This is another difference between native and web, but as it stands, there are already plenty of differences, and they do actually cost performance in the browser. But I'll just have to live with that.

What OP wants is that copy_buffer_to_buffer (or rather map_read I believe) returns a thing, with which we can poll the device until completion of the operation.

I was under some misconception of what's happening there when I wrote this. I would have expected map_async and similar ops to return a submission index, if they need to be polled, and then to have some way of accessing that in DownloadBuffer::read_buffer.

Although it should be noted, that the current API of map_async - and by extension DownloadBuffer::read_buffer - really doesn't lend itself to being executed on Rust async runtimes (I'll get to that).

And just because you expose some async APIs doesn't mean you will be bound to any particular async runtime. If you don't spawn background tasks on your own or communicate with system APIs (filesystems etc.) in these APIs, you won't need to do anything specific.
IDK how talking to the GPU fits in there; frankly, I've never actively used anything but OpenCL and now wgpu/webgpu, and passively looked at a bit of various flavors of OpenGL, which convinced me not to touch OpenGL if I can avoid it.

Now, why the API is problematic: you're not supposed to block in async tasks, thus calling device.poll and its brethren in there isn't acceptable. You'll have to use something like tokio's spawn_blocking.
But then, if what the (outdated) documentation of device.poll promises still holds, the callback will be invoked on that separate thread.
This means that you require some synchronization, with its associated overhead and annoyances, like the result having to be Send for channels, or Sync for most other synchronization primitives, etc.

Admittedly, I can't come up with a reason why the result of the callbacks for these particular APIs would have to be non-Send, but IDK.

If you've already got an async task for the GPU, like I do in my compute code, then ideally I'd like to have the ability to poll the GPU on that task. In my case I want to have backpressure there anyways.

Issue #1871 would be a nice option to have more parity between native and web, if the potential drawbacks are worth it to you. However, as I said before, there are currently already plenty of other differences, and since they have a reason, in giving you more control over what happens when, I don't see them all going away.

Sorry if this is a bit all over the place. I tried to address a few things that came up, and, again, I don't feel like I'm necessarily representative for every user. I'll leave it at this, I hope I was able to bring a bit of my views across.

@eliemichel
Contributor

And, just because you expose some async APIs, doesn't mean you will be bound to any async runtime.

I barely know about Rust's async (coming from the C side of the project), but I tend to think that, whatever the language, async operations need some sort of runtime, either implicit or explicitly managed by having to "poll" somewhere.

You're not supposed to block in async tasks, thus calling device.poll and its bretheren in there isn't acceptable.

Unless device.poll is implemented in a way that yields back to the browser and basically says "wake me up when something happens".

the callback will be invoked on the separate thread

The callbacks are invoked in the thread that calls device.poll, so you can make it run on the same thread.

If you've already got an async task for the GPU, like I do in my compute code, then ideally I'd like to have the ability to poll the GPU on that task.

That's the point of the WGPUFuture (not to be confused with Rust's Future) that the latest versions of the WebGPU standard use for the native API. My whole presence in this thread is to say "I wish wgpu would comply with this part of the standard", and I think it would solve all your issues as well :) For instance:

// In WebGPU standard native API, on_submitted_work_done returns a WGPUFuture
// object that we can internally implement as a submission index (native) or a Promise (web).
let done: WgpuFuture = queue.on_submitted_work_done(|| { do_something(); });

// In native, this is blocking until the submission index in `done` is processed.
// In web, this yields back to the browser until the Promise in `done` is resolved.
instance.wait_any([ done ]);
// ^^^^^^ The do_something() callback above is invoked within this call, right
// after the submission index/Promise is resolved, in the same thread.

do_something_else();

@9SMTM6
Contributor Author

9SMTM6 commented Oct 25, 2024

@eliemichel I put a bit more thought in my comments than you give me credit for. No hard feelings, just wanted to clarify that. Perhaps I formulated stuff in a way that it was hard to understand what I meant.

I tend to think that whatever the language async operations need some sort of runtime, either implicit or explicitly managed by having to "poll" somewhere.

Oh yeah, sure. For something more complicated than running a single task, you likely want a full-blown executor or even a runtime to run that program.

I was referring to the fact that wgpu, as in the library, maybe doesn't need to bind to any specific runtime, and that single functions from wgpu's side can be run with minimal executors such as pollster, or the futures crate's block_on.

Async Rust is made that you can write some parts independent of runtimes.

But yeah. Thinking about it now, it's likely going to have to integrate with a waker, and how those are driven is runtime-specific in Rust.

Unless device.poll is implemented in a way that yields back to the browser and tells basically "wake me up when something happens".

That's not gonna work. No async runtime in Rust plugs into the browser runtime in a way that works, so you're going to end up mixing two runtimes (browser and the Rust runtime), and that's going to mess with the Rust runtime. If you mix runtimes, any yield in one runtime is going to look exactly like blocking to the other. The way I look at it, when you have blocking system APIs, that essentially just means that the OS is your runtime. But that comes with a lack of control and, usually as a result, of performance.

The callbacks are invoked in the thread that calls device.poll, so you can make it run on the same thread.

As I stated, the problem is that I can't call device.poll on the thread the async task runs on. So different thread.

not to be confused with rust's Future

Yeah, sure, I suspected that from the beginning, and when I read the threads that got linked to, that suspicion was confirmed.

"I wish wgpu would comply with this part of the standard" and I think it would solve all your issues as well :)

I agree that the potential is there. Not in the way you demonstrated, because of the issues mentioned above. But if there were a way to integrate this future like object with an actual rust future, without requiring explicit bindings to any specific runtime, then my wish would be granted. If that's not possible, in the end that's fine for me, I'll just keep the current way I do things, I don't think that's gonna cost me any significant performance in my specific use case.

I just wanted to point out the issues that the current design could have if that API is used with high-frequency and/or low-latency requirements. And that for all programmers it's going to be at least a bit awkward to deal with, having to go through the boilerplate of spawning a thread/calling block_on and then putting the data through a channel, etc.

@eliemichel
Contributor

@eliemichel I put a bit more thought in my comments than you give me credit for.

Ow, sorry for this feeling, I replied along the way as I was reading without a proper introduction to my message, my bad and apologies!

Async Rust is made that you can write some parts independent of runtimes.

I may have incorrectly interpreted @Wumpf 's answer above, but I was under the impression that they want to avoid having the wgpu user even need to decide on a runtime at all, by having the API not expose anything async.

But again, I'm not used to the Rust API and its various future runtimes (aren't they built on top of the browser's runtime when cross-compiling to WebAssembly?). My mental model is based on how I see WebGPU implemented in C/emscripten (e.g., an explicit call to emscripten_sleep() yields back to the browser), and my concerns are (a) to figure out whether there are issues with the current state of the WebGPU specification that prevent wgpu from implementing it as expected (issues which we should then escalate to the spec) and (b) if there is no problem, to motivate wgpu to actually implement the spec (I'm trying myself, but am no expert in this implementation).

@grovesNL
Collaborator

@eliemichel

e.g., an explicit call emscripten_sleep() yields back to the browser

For what it's worth, Rust WebAssembly projects generally don't yield like this, because it usually means doing something like pausing the entire WebAssembly module while waiting for a promise to resolve (e.g., Emscripten's asyncify transform or JSPI, https://emscripten.org/docs/porting/asyncify.html). Even in Emscripten, people should generally prefer to avoid doing that when possible, because of how it blocks otherwise (e.g., new projects can await promises the usual way).

@9SMTM6
Contributor Author

9SMTM6 commented Oct 25, 2024

Ow, sorry for this feeling, I replied along the way as I was reading without a proper introduction to my message, my bad and apologies!

Nah, honestly I should say sorry. I misinterpreted your lack of knowledge about Rust's async ecosystem as you explaining things to me, thinking I didn't understand them. Sorry.

I may have incorrectly interpreted @Wumpf 's answer above, but I was under the impression that they want to avoid having wgpu user even need to decide on a runtime at all, by having the API not expose any async thing.

Hmm, I'm not sure, you might be right though. However, offering an async API doesn't mean that you can't offer synchronous (e.g. the current callback + poll on native, or maybe it blocks and returns a RAII guard for access instead) alternatives.

And there are simple (low-dependency etc) solutions to executing a simple async task without requiring a full executor. The previously mentioned pollster, on the web there's wasm_bindgen_futures.

My interpretation was that wgpu would be bound to a certain runtime; this is something of an issue with Rust async. Rust's async abstracts over awaiting, rewriting that into a state machine, but other than generating state machines, it doesn't offer any other abstractions, like spawning other tasks, or combining different Futures in different ways (first one wins, merge, etc.). This means that many libraries, when they want to offer any async functions, need to decide on one of the runtimes. Generally tokio is supported, but tokio doesn't work for all use cases; that was the reason async Rust doesn't bring its own standard runtime, so that's... annoying.

But again, not used to the rust API and its various future runtimes (aren't they built on top of the browser's runtime when cross-compiling to WebAssembly?)

I should state that there are quite a few crates that promise to plug into browser APIs. But most of them are abandoned and were web-specific. There are just subtle difficulties in mapping things to work equivalently in the browser and natively. Like that you can't block the main thread in the browser, that timers tend to work differently than system timers (and may work differently depending on context, with different precisions), etc.
And then you can't even really rely on browser APIs, because some code may run in WASI contexts instead, or in an entirely sandboxed environment with only manual bindings. The only non-abandoned crate is tokio_with_wasm, but IMO it's not well documented (regarding limitations), and it's still very limited.

The prominent runtimes just end up with very limited support on WASM. tokio doesn't support sleeping, and doesn't support the actual runtimes. You can use the simpler APIs, like join and merge, async channels, and other synchronization primitives (though using them improperly on the browser main thread will panic).

Even with runtimes that 'use the browser runtime as their runtime', they probably still would need to be explicitly called; just binding to a browser promise directly will likely still break the runtime. And even so: assuming there were a runtime that somehow managed to work well with code using the browser API directly, if wgpu then used this approach, it would only work with that specific runtime.

@eliemichel
Contributor

Thanks for the details 🙏

I still need to read more about Rust async, but meanwhile this makes me keep on thinking that my (possibly wrong) interpretation of @Wumpf 's answer is the safest way to go, namely not exposing any async and letting third parties connect the dots for their preferred async runtime (if that is possible to do?).

Is it possible, when a Rust codebase has explicit continuation state management and no async at all, to somehow "hand" the state management to an async runtime without requiring changes to the APIs that are used? If so, that would be a good sign for keeping the wgpu API async-less. The way I tend to use the WebGPU API in C is to place the whole application's logic in a requestAnimationFrame loop, where I indeed end up manually managing the async state machine (that avoids using Emscripten's "asyncify" mechanism, which, as @grovesNL recalled, is usually discouraged, at least because it makes builds heavier).

(Okay, now I want to try giving a shot at running the new C++ coroutines in a Web-ready way.)

@9SMTM6
Contributor Author

9SMTM6 commented Oct 27, 2024

I seem to have forgotten that wgpu already has APIs that return (Rust) futures which are hard to avoid (request_adapter and request_device).

So honestly, unless those get changed to work differently, if it were (!) possible to have map_async return a future without binding to a specific runtime, that'd be what I think should be done.

letting third parties connect the dots for their preferred async runtime (if that is possible to do?).

The core of my issue wasn't that it's a bit more effort for the programmer to do things, but that doing things in a performant way, at least without unsafe or potentially messing up the runtime, isn't possible, as far as I can see.

Is it possible, when a Rust codebase has explicit continuation state management and no async at all, to somehow "hand" the state management to an async runtime without requiring changes to the APIs that are used?

Unless they interface with the lower-level Future APIs, no, AFAIK.

Development

No branches or pull requests

4 participants