Support zero-copy send #123
Conversation
Now, instead of completing immediately, the driver first checks whether more CQEs are required. This changes the semantic meaning of Lifecycle complete from "the operation has completed" to "the operation has a CQE". Also updates Drop for Op<T> to take multi-CQE operations into account.
Operations with multiple CQEs may post several completions before a waker has been registered with the Lifecycle. This hit the "entered unreachable code" state in Lifecycle::complete(). To fix this, this commit changes completion handling to hold completions in a smallvec, which captures and records all completions that occur before the first poll. This is safe, as the smallvec will grow if needed, but shouldn't introduce any performance hit for single-CQE operations.
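The buffering described above can be sketched as follows. This is a simplified, hypothetical stand-in for tokio-uring's internal `Lifecycle`: the variant names mirror the PR's description, but the exact fields are assumptions, and a plain `Vec` stands in for the `SmallVec` the commit uses.

```rust
use std::task::Waker;

// Simplified stand-in for a reaped completion-queue entry.
#[derive(Debug, Clone)]
pub struct CqeResult {
    pub result: i32,
    pub more: bool, // IORING_CQE_F_MORE: further CQEs will follow
}

pub enum Lifecycle {
    // Submitted but never polled.
    Submitted,
    // Polled at least once; wake this task when a CQE arrives.
    Waiting(Waker),
    // One or more CQEs have arrived and are buffered in arrival order.
    // The real change uses a SmallVec to avoid heap allocation in the
    // common single-CQE case.
    CompletionList(Vec<CqeResult>),
}

impl Lifecycle {
    // Driver-side completion: instead of asserting that at most one CQE
    // can ever arrive (the old "unreachable" branch), buffer extras.
    pub fn complete(&mut self, cqe: CqeResult) {
        match std::mem::replace(self, Lifecycle::Submitted) {
            Lifecycle::Submitted => {
                *self = Lifecycle::CompletionList(vec![cqe]);
            }
            Lifecycle::Waiting(waker) => {
                *self = Lifecycle::CompletionList(vec![cqe]);
                waker.wake();
            }
            Lifecycle::CompletionList(mut list) => {
                list.push(cqe);
                *self = Lifecycle::CompletionList(list);
            }
        }
    }
}
```

With this shape, two CQEs posted before the first poll simply accumulate in the list rather than hitting unreachable code.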
@FrankReh I'd appreciate your thoughts on the driver changes here, w.r.t. support for #112, which has the same 1:many SQE/CQE behaviour. In the UDP send_zc case, I believe the number of completions which can occur is limited to 64, which is the GSO segment limit. In the case of multishot recv, the limit is the number of buffers pre-registered and available to the ring. Would that be deemed acceptable? |
@Noah-Kennedy , do you have any thoughts on this? |
@ollie-etl I have a fork that implements a POC for streaming the results of multi operations. I'll try to get my forks of the two relevant crates committed tomorrow so you and Noah can have a look. Do you have an example that uses this new operation yet? I could try it on my fork. |
@FrankReh It's now in use on our main dev branch, shortly to go into production. I could probably carve off a small benchmark or example and commit it, but I won't have a chance until early next week. We see a ~0.9x–3x performance gain. Critically, this now exceeds the performance we managed to get with sendmmsg with zero copy + GSO. That's not to say a better benchmark implementation couldn't do better, of course. I've literally no recent experience using TCP, so I can't comment there. Maybe it's even better, if larger segments can be offloaded? There's a reason I didn't add it to the TCP socket implementation here: I don't know the gotchas as well. |
Nice, you got the changes merged into io-uring. I'm looking at those diffs now. That does include a test of the new op, so that's a big help. First question: why didn't you want to expose the CQE_F_NOTIF bit flag the way you did the more flag? I think the tokio-uring crate should use the CQE_F_NOTIF bit flag the way the liburing tests for send_zc do.
Did you mean issue #104, multishot operations? I was confused for a bit about how these changes are related to buffering. Anyway, I found my #112 entry had a typo in the first comment which might lead one to think it was an issue for all unimplemented operands.

I've cloned your branch and squashed the six commits. It was hard to see what was going on because the 'fix for a race' came in later. But I have to admit I don't understand the Lifecycle changes. A few comments in the poll area vs the complete area could go a long way. For example, I don't understand how the complete Ignored(op) match is the only one that looks at the More flag, but probably that's because my own handling of More flags was being done differently, in a branch I hadn't shared with anyone yet.

Other thoughts regarding naming and comments (these aren't very important, as Noah will be the final arbiter): why go with the name 'update/Update' rather than 'more/More'? And why does the Completable trait get a default update that creates an Update::Finished? Probably obvious to someone writing it, but a comment about why it is the right default, and what the user of the trait might want to do when providing their own update method, would be very helpful.

What is the exposure of tokio to the smallvec crate, and is the default size of four that important? Most uses of tokio-uring operations will be things other than zero-copy send, so maybe there is some other way of burdening only the operations that need the extra flexibility? I've created a mirror type to Op, named StrOp, for those operations that will result in streaming responses, but Noah has yet to sign off on that too, as I haven't pushed it up to my GitHub repo for him to see.

And about the use of fold on the iterator: it's used twice, once in the poll path, once in the CQE complete path, and it looks like the logic is designed to iterate through a growing smallvec until the Finished state is reached.
Is this because you didn't want to modify the Lifecycle enum states too much? I went a different route for handling 'more' with streams, so no growing loops like that were necessary. I did track a vec of more results along the way, to deal with something of the same issue you are tackling here.

Generally, I would say send_zc and sendmsg_zc are very unusual operations for the io_uring driver at this point, and I wouldn't mind a separate type of Lifecycle for such operations, defined in their own module. I've named my own new module strop, for streaming Op, and could leave the original Op basically unchanged. There is some logic duplication between op.rs and strop.rs, but as the nature of the streaming operations is so different from the single-shot operations, I favored creating a new Lifecycle implementation. It does mean the entry in the slab has an extra-level enum to split between the original Op and the new StrOp. Noah and I will have some work to do to get on the same page, as that decision comes with its own long-term maintenance questions. But if that design is generally accepted, perhaps a NotifierOp or a ZcOp would allow special tailoring for special operations like this one.

Again generally, I see the Linux io_uring driver getting more and more specialty features over the years, and it will be harder and harder to keep a single Op implementation in tokio-uring that handles all the cases with a general solution. It seems more likely to me that "forking" versions of Op for the new flavors of io_uring feature will let us present the various semantic differences of the new operations to the end user more easily, while not having to change earlier versions of Op. But that is very debatable, and there is probably no one right answer. So perhaps you can wait to see how stream support is added to tokio-uring, although it certainly helps all of us to already see there is this zero-copy flavor of operations to support as well.

This ended up being very long, and thinking about send_zc and sendmsg_zc was completely new to me yesterday, so please take everything with a grain of salt. And it could very well be that Noah merges your work in before any of my stream work. |
Still me, not on a work account. @FrankReh thank you for your considered reply
Oversight, really. It's not strictly needed either: you either get CQE_F_NOTIF or CQE_F_MORE, so testing for one is sufficient.
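The either/or relationship above can be sketched with the kernel's flag values. This is a minimal illustration, not tokio-uring code; the constants match the values in `<linux/io_uring.h>`, and the helper name is hypothetical.

```rust
// CQE flag bits from <linux/io_uring.h>.
const IORING_CQE_F_MORE: u32 = 1 << 1;  // more completions will follow
const IORING_CQE_F_NOTIF: u32 = 1 << 3; // trailing zero-copy notification

// For send_zc, a CQE either carries MORE (keep waiting) or NOTIF (this
// is the final notification), so testing MORE alone is sufficient.
fn more_cqes_expected(flags: u32) -> bool {
    flags & IORING_CQE_F_MORE != 0
}
```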
I misunderstood the scope of the issue.
Lifecycle doesn't know what to do with a completion. So all it does when called by the completion-handling logic is store them in a smallvec and wait for poll to deal with them. smallvec was chosen to avoid allocating for the default case (a single completion), and once you're actively polling you're less likely to have many unacknowledged completions. The value of 4 was chosen arbitrarily to be small, so the completion state should still fit on a cache line.
I added update to avoid breaking API changes on existing single-completion types. My rationale is that update consumes the completion, updating state, and returns whether it needs more or has finished. For single-completion event types, this is just complete.
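The default described above might look something like this. It is a hedged sketch: the names `Completable`, `update`, and `complete` follow the discussion, but the exact signatures in tokio-uring may differ, and `CqeResult`, `Updated`, and `SendOp` are simplified stand-ins.

```rust
// Simplified stand-in for a completion-queue entry result.
pub struct CqeResult {
    pub result: i32,
    pub more: bool,
}

// Either the operation is done, or it hands its state back and waits
// for further CQEs.
pub enum Updated<C: Completable> {
    Finished(C::Output),
    More(C),
}

pub trait Completable: Sized {
    type Output;

    // Existing single-CQE entry point.
    fn complete(self, cqe: CqeResult) -> Self::Output;

    // Default: one completion finishes the operation, so existing
    // single-completion implementations need no changes.
    fn update(self, cqe: CqeResult) -> Updated<Self> {
        Updated::Finished(self.complete(cqe))
    }
}

// Example single-CQE op relying on the default `update`.
pub struct SendOp;

impl Completable for SendOp {
    type Output = i32;
    fn complete(self, cqe: CqeResult) -> i32 {
        cqe.result // bytes sent
    }
}
```

A multi-CQE op like send_zc would override `update` to return `Updated::More(self)` until the final (non-MORE) CQE arrives.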
I could create a MultiCompleted enum variant. I don't believe smallvec will incur significant overhead, but I'm more than happy to put it in its own variant. Vec would be a bad idea, I believe: it would incur significant numbers of allocations, which would wipe out the gains seen by zero copy. Microseconds matter here, for me anyway.
I'm just consuming all completions currently received, in order, and checking if we still need more or are finished. I've no knowledge of how many mores I'll get before the op completes. A for loop would probably be better, but I'm a recovering Haskell developer.
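The drain-in-order logic described above, written with the plain for loop mentioned, might look like this. It is a sketch under assumptions: `CqeResult` is a simplified stand-in, and summing `result` across CQEs matches send_zc's shape, where the data CQE carries the byte count and the trailing notification carries zero.

```rust
#[derive(Debug, Clone)]
pub struct CqeResult {
    pub result: i32,
    pub more: bool, // IORING_CQE_F_MORE set on this CQE
}

// Consume buffered CQEs in arrival order. Returns Some(total bytes)
// once a CQE without the MORE flag is seen, or None if every buffered
// CQE still promised more, meaning the op must keep waiting.
pub fn drain(buffered: Vec<CqeResult>) -> Option<i32> {
    let mut total = 0;
    for cqe in buffered {
        total += cqe.result.max(0);
        if !cqe.more {
            return Some(total);
        }
    }
    None
}
```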
I'm very keen to see how you avoided growing a collection of results in the presence of the race between the ring and Poll. That sounds better. I also think send_zc could have just been a change to send. It is after all just a socket option on the normal syscall. I decided not to do that due to the scope of that change. |
I think we should split this into two PRs. One for the multishot support in the driver, and another for the ZC send ops. |
Ok, happy to do so. W.r.t the driver bit, do you hold any opinions on the areas discussed above? |
This is used to parameterize Op<T, CqeType>. It is now possible to specialize blanket implementations, such as Future, for classes of Ops.
Adds infrastructure to store Completions in a slab-backed indexed linked list.
Provides the infrastructure for receiving multiple CQEs for a single Op's lifecycle.
Changes made to match new lifecycle infrastructure
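The marker-type parameterisation described in the commits above can be sketched as follows. This is illustrative, not the crate's actual code: the marker names echo the PR, while `CompletionStyle` is a hypothetical trait standing in for the specialized Future impls.

```rust
use std::marker::PhantomData;

// Zero-sized marker types describing how many CQEs an op produces.
pub struct SingleCQE;      // ordinary one-SQE/one-CQE operations
pub struct MultiCQEFuture; // e.g. send_zc, which posts several CQEs

// The marker defaults to SingleCQE so existing call sites are unaffected.
pub struct Op<T, CqeType = SingleCQE> {
    pub data: Option<T>,
    _cqe: PhantomData<CqeType>,
}

impl<T, CqeType> Op<T, CqeType> {
    pub fn new(data: T) -> Self {
        Op { data: Some(data), _cqe: PhantomData }
    }
}

// Blanket impls (here a stand-in for the Future impls) can now be
// written per class of op without overlapping.
pub trait CompletionStyle {
    fn style() -> &'static str;
}

impl<T> CompletionStyle for Op<T, SingleCQE> {
    fn style() -> &'static str { "single CQE" }
}

impl<T> CompletionStyle for Op<T, MultiCQEFuture> {
    fn style() -> &'static str { "multiple CQEs" }
}
```

Because the two impls name different concrete marker types, the compiler treats them as non-overlapping, which is what makes the per-class Future specialization possible.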
@FrankReh @Noah-Kennedy I've updated this based on #130. I think it's cleaner, and shows off the Op specializations. |
@ollie-etl I haven't forgotten to look at this. Just want to get a little closure on my buf_ring work first before losing focus. |
I have been pretty busy all week, but I should be able to get to this over the weekend. |
@FrankReh I think the comments are addressed? |
@FrankReh thanks for the quick review. I think those are done now |
Just adding a link to #54 to mark as relevant for when that merges. |
Nice. Thanks for your patience. Nice addition.
Let's hold off on merge for a little bit. We might need to add some more code around multishot ops to make shutdown work properly. |
@ollie-etl Care to offer a title and short paragraph here? |
@Noah-Kennedy : this isn't like a multi-shot op in that sense. There is no uncertainty about the state like an |
@FrankReh How about: "Add a UDP socket zero-copy send variant. Uses the IORING_OP_SEND_ZC opcode introduced in Linux 6.0." |
@ollie-etl @FrankReh due to shutdown changes that had to be made, we can't merge this right now. We need a way to tell if a particular multishot completion is finished. I'll open up a PR tonight with this.
@ollie-etl FYI
@Noah-Kennedy I don't think there is anything blocking now |
You're right, thanks for reminding me that this isn't merged! |
Somewhere, in the myriad of refactorings that occurred in #123, I dropped the marker type. This leads to wrongly associating the Future for handling SingleCQE types with SendZc, which is a multi completion type. Co-authored-by: ollie-etl <Oliver [email protected]>
Linux 6.0 introduced IORING_OP_SEND_ZC. This PR introduces the function send_zc on UDP (todo: TCP) sockets, which takes advantage of it.

This was blocked on a PR on io-uring, now merged.

As IORING_OP_SEND_ZC can return multiple CQEs per SQE, this also extends the Lifecycle handling in impl Future for Op<T> to check for these occurrences. It does this using a new method, update, in Completable, which I added to minimise the scope of the changes; it just falls back to the complete method for implementations which are single-entry.

This approach could probably be used as the basis of streaming interfaces for some other submission types.
Since being split, this is now blocked on #130.