Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support zero-copy send #123

Merged
merged 40 commits into from
Nov 7, 2022
Merged

Support zero-copy send #123

merged 40 commits into from
Nov 7, 2022

Conversation

ollie-etl
Copy link
Contributor

@ollie-etl ollie-etl commented Oct 4, 2022

Linux 6.0 introduced IORING_OP_SEND_ZC. This PR introduces the function send_zc on udp (todo: tcp) sockets which takes advantage of it.

This is blocked on this PR on io-uring Merged

As IORING_OP_SEND_ZC can return multiple CQE's per SQE, this also extends the Lifecycle handling in impl Fut for Op<T> to check for these occurences. If does this using a new method update in Completable, which I added to minimise the scope of the changes - it just falls back to the complete method for implementations which are just single entry.

This approach could probably be used as a the basis of streaming interfaces for some other submission types.

Since being split, is now blocked on #130

Now instead of completing, first checks if more CQE's are required.
This changes the semantic meaning of Lifecycle complete from

"Operation has completed" -> "Operation has a CQE"

Also update Drop for Op<T> to take into account multi CQEs
Operations with multi CQE's may post multiple completions
before a waker has been registered with the Lifecycle.

This hit the "entered unreachable code" state in Lifecycle::complete().

To fix this, this commit changes completion to hold completions in a
smallvec, which captures and records all completions that occur before
the first poll occurs. This is safe, as smallvec will grow if needed,
but shouldn't introduce any performance hit for single CQE operations.
@ollie-etl
Copy link
Contributor Author

ollie-etl commented Oct 4, 2022

@FrankReh I'd appreciate your thought on the driver changes here, w.r.t support for #112, which has the same 1 : many SQE/CQE behaviour.

In the udp send_zc case, I believe the number of completions which can occur is limited to 64, which is the GSO segment limit. In the case of multishot recv, the limit is the number of buffers pre-registered and available to the ring. would that be deemed acceptable?

src/driver/op.rs Outdated Show resolved Hide resolved
@ollie-etl
Copy link
Contributor Author

@Noah-Kennedy , do you have any thoughts on this?

@FrankReh
Copy link
Collaborator

FrankReh commented Oct 5, 2022

@ollie-etl I have a fork that implements a POC for streaming the results of multi operations. I'll try to get my forks of the two relevant crates committed tomorrow so you and Noah can have a look. Do you have an example that uses this new operation yet? I could try it on my fork.

@ollie-etl
Copy link
Contributor Author

@FrankReh Its now in use on our main dev branch, shortly to go into production.

I could probably carve off a small benchmark or example and commit, but i won't have a chance until early next week.

We see a ~0.9->3x performance gain over write. This will depend hugely on your hardware setup obviously, and the packets you shift. The best comes, as you would expect, when doing large buffers (64 kB) with hardware GSO enabled. Note that local loopback is not a good way to test the performance of this - it won't do zero copy (and will fall back to copying) - I think you actually need a NIC with scatter gather support.

Critically, this now exceeds the performance we managed to get with sendmmsg with zero copy + GSO. Thats not to say a better benchmark implementation couldn't of course.

I've literally no recent experience using TCP, so I can't comment there. Maybe it's even better, if larger segments can be offloaded? There is a reason I didn't add it to the tcp socket implementation here, I don't know the gotchas as well.

@FrankReh
Copy link
Collaborator

FrankReh commented Oct 6, 2022

Nice you got changes merged to io-uring. I'm looking at those diffs now. That does include a test of the new op so that's a big help.

First question, why didn't you want to expose the CQE_F_NOTIF bit flag the way you did the more flag? I think the tokio-uring crate should use the CQE_F_NOTIF bit flag the way the liburing tests for send_zc do.

I'd appreciate your thought on the driver changes here, w.r.t support for #112, which has the same 1 : many SQE/CQE behaviour.

Did you mean issue #104, Multishot operations? I was confused for a bit how these changes are related to buffering. Anyway, I found my #112 entry had a typo in the first comment which might lead one to think it was an issue for all unimplemented operands.

I've cloned your branch and squashed the six commits. It was hard to see what was going on because the 'fix for a race' came in later. But I have to admit I don't understand the Lifecycle changes. A few comments in the poll area vs the complete area could go a long way.

For example, I don't understand how the complete Ignored(op) match is the only one that looks at the More flag, but probably that's because my own handling of More flags was being done differently - in a branch I hadn't shared with anyone yet.

Other thoughts regarding naming and comments, so they aren't very important as Noah will be the final arbiter, are why go with the name 'update/Update' rather than more and More. And why does the Completable trait get a default update that creates an Update::Finished. Probably obvious to someone writing it, but a comment about why it is the right default and what the user of the trait might want to do when providing their own update method would be very helpful.

What is the exposure of tokio to the smallvec crate and is the default size of four that important? Most uses of tokio-uring operations will be things other than zero copy send so maybe the some other way of burdening only the operations that need the extra flexibility? I've created a mirror type to Op, named StrOp, for those operations that will result in streaming responses - but Noah has yet to sign off on that yet too, as I haven't pushed that up to my GitHub repo for him to see.

And about the use of fold on the iterator: its used twice, once in the poll path, once in the cqe complete path, it looks like the logic is designed to iterate through a growing smallvec until the Finished state is reached. Is this because you didn't want to modify the Lifecycle enum states too much? I went a different route for handling 'more' with streams so no growing loops like that were necessary. I did track a vec of more results along the way, to deal with something of the same issue you are tackling here.

Generally, I would say the send_zc and sendmsg_zc are very unusual operations for the io_uring driver at this point and I wouldn't mind a separate type of Lifecycle for such operations, defined in their own module. I've named my own new module strop, for streaming Op, and could leave the original Op basically unchanged. There is some logic duplication between op.rs and strop.rs, but as the nature of the streaming operations are so different from the single shot operations, I favored creating a new Lifecycle implementation. It does mean the entry in the slab has an extra level enum to split between the original Op and the new StrOp. Noah and I will have some work to do to get on the same page as that decision comes with its own long term maintenance questions. But if that design is generally accepted, perhaps a NotiferOp or a ZcOp would allow special tailoring for special operations like this one.

Again generally, I see the Linux io_uring driver getting more and more specialty features over the years and it will be harder and harder to keep a single Op implementation in tokio-uring that handles all the cases with a general solution, while it seems more likely to me that "forking" versions of Op for the new flavors of io_uring feature will allow us to present the various semantic differences of the new operations to the end user in the user land app more easily while not having to change earlier versions of Op. But that is very debatable and probably no one right answer. So perhaps you can wait to see how stream support is added to tokio-uring although it certainly helps all of us to already see there is this zerocopy flavor of operations to support as well.

--

This ended up being very long and thinking about the send_zc and sendmsg_zc was completely new to me yesterday. So please take everything with a grain of salt. And it could very well be that Noah merges your work in before any of my stream work.

@oliverbunting
Copy link

oliverbunting commented Oct 6, 2022

Still me, not on a work account. @FrankReh thank you for your considered reply

First question, why didn't you want to expose the CQE_F_NOTIF bit flag the way you did the more flag? I think the tokio-uring crate should use the CQE_F_NOTIF bit flag the way the liburing tests for send_zc do.

Oversight really. It's not strictly needed either. you either get CQE_F_NOTIF or CQR_F_MORE, so testing for one is sufficient.

Did you mean issue #104, Multishot operations? I was confused for a bit how these changes are related to buffering.

I misunderstood the scope of the issue.

I've cloned your branch and squashed the six commits. It was hard to see what was going on because the 'fix for a race' came in later. But I have to admit I don't understand the Lifecycle changes. A few comments in the poll area vs the complete area could go a long way.

For example, I don't understand how the complete Ignored(op) match is the only one that looks at the More flag, but probably that's because my own handling of More flags was being done differently - in a branch I hadn't shared with anyone yet.

Lifetime doesn't know what to do with a completion. So all it does when called by the completion handling logic is store them in a smallvec, and await for Poll to deal with them.

smallvec was chosen to avoid allocating for the default case (single completion) and when you're actively polling, when you're less likely to have many unacknowledged completions.

The value of 4 was chosen arbitrarily to be small, so completion state should still fit on a cache line.

Other thoughts regarding naming and comments, so they aren't very important as Noah will be the final arbiter, are why go with the name 'update/Update' rather than more and More. And why does the Completable trait get a default update that creates an Update::Finished. Probably obvious to someone writing it, but a comment about why it is the right default and what the user of the trait might want to do when providing their own update method would be very helpful.

I added update to avoiding breaking API changes on existing single completion types.

My rational is that update consumes the completion, updating state, and returns if it needs more or has finished. For the case for single completion event types, this is just complete.

What is the exposure of tokio to the smallvec crate and is the default size of four that important? Most uses of tokio-uring operations will be things other than zero copy send so maybe the some other way of burdening only the operations that need the extra flexibility? I've created a mirror type to Op, named StrOp, for those operations that will result in streaming responses - but Noah has yet to sign off on that yet too, as I haven't pushed that up to my GitHub repo for him to see.

I could create a MultiCompleted enum varient. I don't believe smallvec will incur significant overhead, but I'm more than happy to put it in its own variant.

Vec would be a bad idea I believe. That would incur significant numbers of allocations, which would wipe out gains seen by zero copy. Micro seconds matter here, for me anyway.

And about the use of fold on the iterator: its used twice, once in the poll path, once in the cqe complete path, it looks like the logic is designed to iterate through a growing smallvec until the Finished state is reached. Is this because you didn't want to modify the Lifecycle enum states too much? I went a different route for handling 'more' with streams so no growing loops like that were necessary. I did track a vec of more results along the way, to deal with something of the same issue you are tackling here.

I'm just consuming all completions currently recieved, in order, and checking if we still need more or are finished. I've no knowledge about how many mores I'll get before the op completes. A for loop would probably be better, but I'm a recovering Haskell developer.

Generally, I would say the send_zc and sendmsg_zc are very unusual operations for the io_uring driver at this point and I wouldn't mind a separate type of Lifecycle for such operations, defined in their own module. I've named my own new module strop, for streaming Op, and could leave the original Op basically unchanged. There is some logic duplication between op.rs and strop.rs, but as the nature of the streaming operations are so different from the single shot operations, I favored creating a new Lifecycle implementation. It does mean the entry in the slab has an extra level enum to split between the original Op and the new StrOp. Noah and I will have some work to do to get on the same page as that decision comes with its own long term maintenance questions. But if that design is generally accepted, perhaps a NotiferOp or a ZcOp would allow special tailoring for special operations like this one.

I'm very keen to see how you avoided growing a collection of results in the presence of the race between the ring and Poll. That sounds better.

I also think send_zc could have just been a change to send. It is after all just a socket option on the normal syscall. I decided not to do that due to the scope of that change.

@Noah-Kennedy
Copy link
Contributor

I think we should split this into two PRs. One for the multishot support in the driver, and another for the ZC send ops.

@oliverbunting
Copy link

I think we should split this into two PRs. One for the multishot support in the driver, and another for the ZC send ops.

Ok, happy to do so.

W.r.t the driver bit, do you hold any opinions on the areas discussed above?

ollie-etl and others added 6 commits October 10, 2022 10:24
This is used to parameterize Op<T, CqeType>. It is now possible to
specialize blanket implmentations, such as Future, to classes of Ops
Adds infrastructure to store Completions a slab backed
indexed linked list
Provides the infrastructure for recieving multiple CQE's for a single
Op's lifecycle
@ollie-etl
Copy link
Contributor Author

@FrankReh @Noah-Kennedy I've updated this based on #130. I think its cleaner, and shows off the Op specializations .

@FrankReh
Copy link
Collaborator

@ollie-etl I haven't forgotten to look at this. Just want to get a little closure on my buf_ring work first before loosing focus.

@Noah-Kennedy
Copy link
Contributor

I have been pretty busy all week, but I should be able to get to this over the weekend.

src/driver/send_zc.rs Outdated Show resolved Hide resolved
src/driver/op/slab_list.rs Show resolved Hide resolved
src/driver/op.rs Outdated Show resolved Hide resolved
src/driver/op.rs Show resolved Hide resolved
src/driver/op.rs Outdated Show resolved Hide resolved
src/driver/op.rs Outdated Show resolved Hide resolved
src/driver/op.rs Show resolved Hide resolved
src/driver/send_zc.rs Show resolved Hide resolved
src/driver/socket.rs Outdated Show resolved Hide resolved
src/net/udp.rs Show resolved Hide resolved
@ollie-etl
Copy link
Contributor Author

@FrankReh I think the comments are addressed?

src/driver/op.rs Outdated Show resolved Hide resolved
src/driver/op.rs Show resolved Hide resolved
src/net/udp.rs Show resolved Hide resolved
src/driver/op.rs Show resolved Hide resolved
@ollie-etl
Copy link
Contributor Author

@FrankReh thanks for the quick review. I think those are done now

@ollie-etl
Copy link
Contributor Author

Just adding a link to #54 to mark as relevant for when that merges.

Copy link
Collaborator

@FrankReh FrankReh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks for your patience. Nice addition.

@Noah-Kennedy
Copy link
Contributor

Let's hold off on merge for a little bit. We might need to add some more code around multishot ops to make shutdown work properly.

@FrankReh
Copy link
Collaborator

@ollie-etl
Now just wondering what the squashed commit message should say.

Care to offer a title and short paragraph here?

@ollie-etl
Copy link
Contributor Author

@Noah-Kennedy : this isn't like a multi-shot op in that sense. There is no uncertainty about the state like an accept call - if you've submitted a send_zc, then in fairly short order you'll get some cqe's

@ollie-etl
Copy link
Contributor Author

@FrankReh How about

Add a udp socket zero copy send variant

Uses the IORING_OP_SEND_ZC flag introduced in linux 6.0

Copy link
Contributor

@Noah-Kennedy Noah-Kennedy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ollie-etl @FrankReh due to shutdown changes that had to be made, we can't merge this right now. We need a way to tell if a particular multishot completion is finished. I'll open up a PR tonight with this.

@FrankReh
Copy link
Collaborator

FrankReh commented Nov 7, 2022

@ollie-etl FYI
Ran across this kernel submission. It says zero copy is supported for UDP and TCP, you knew that. And it says when it is not supported, it will no longer fall back to a copy but will return an error. Don't know if you knew that yet.

commit 942e01ab90151a16b79b5c0cb8e77530d1ee3dbb
Merge: d47136c28015 cc767e7c6913
Author: Linus Torvalds <[email protected]>
Date:   Sun Oct 23 09:55:50 2022 -0700

    Merge tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux
    
    Pull io_uring follow-up from Jens Axboe:
     "Currently the zero-copy has automatic fallback to normal transmit, and
      it was decided that it'd be cleaner to return an error instead if the
      socket type doesn't support it.
    
      Zero-copy does work with UDP and TCP, it's more of a future proofing
      kind of thing (eg for samba)"
    
    * tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux:
      io_uring/net: fail zc sendmsg when unsupported by socket
      io_uring/net: fail zc send when unsupported by socket
      net: flag sockets supporting msghdr originated zerocopy

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Nov 7, 2022

@Noah-Kennedy I don't think there is anything blocking now

@Noah-Kennedy
Copy link
Contributor

You're right, thanks for reminding me that this isn't merged!

@Noah-Kennedy Noah-Kennedy enabled auto-merge (squash) November 7, 2022 21:16
@Noah-Kennedy Noah-Kennedy enabled auto-merge (squash) November 7, 2022 21:16
@Noah-Kennedy Noah-Kennedy merged commit 35c9495 into tokio-rs:master Nov 7, 2022
FrankReh pushed a commit that referenced this pull request Nov 17, 2022
Somewhere, in the myriad of refactorings that occurred in #123, I
dropped the marker type. This leads to wrongly associating the Future
for handling SingleCQE types with SendZc, which is a multi completion
type.

Co-authored-by: ollie-etl <Oliver [email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants