be more aggressive about closing TCP connections? #9

Open
warner opened this issue Aug 29, 2019 · 0 comments

warner (Collaborator) commented Aug 29, 2019

When looking at the server, I see lingering connections all the time. These are TCP connections that have been around for days, but are not moving any traffic. It's not really a problem, but it's weird, and I'd like to understand what's going on (and if it reflects some sort of bug in the client).

The relay will close one side as soon as it receives a close from the other side, so the only way to remain in this state is for our kernel to think that both sides are still connected. TCP has a notoriously long no-new-traffic timeout (at least hours, but maybe effectively infinite), and we don't send any sort of keepalives on this channel. We cannot: all the bytes are reserved for the two ends of the connection, and wormhole's Transit protocol doesn't do any keepalives (although the new Dilation protocol does). So if both sides got partitioned (they closed their laptops before the wormhole process exited), the server might not see the connections close for a very long time.

If the sender closes their laptop before the transfer completes, the receiver will be left hanging (the progress bar showing only a partial transfer). If/when the receiver kills the program, the receiver's socket will be closed, the server will get a FIN, and the server will shut down the sender's socket too. If the receiver closes their laptop before killing the program, the server will see the sockets left open for a long time.

If the receiver closes their laptop first (before the transfer completes), backpressure works its way back to the sender: the server's outgoing kernel TCP buffer will fill with unacked data, so the server will pause; then the server's incoming kernel buffer will fill, and the server's TCP stack will stop accepting inbound data (advertising a zero receive window); then the sender's outgoing buffer will fill, and the sender will pause. The sender will see a partial progress bar, and no further progress being made. Looking at the server, I'd see non-empty kernel buffers for the connections (which I don't think I've ever seen, at least for connections that aren't making any progress at all). The server's kernel will retry the unacked outgoing TCP, and when those retries are exhausted (which I think tends to take 5-10 minutes, maybe 15, but way shorter than the no-data-to-send case), the server will see a dropped connection, and will drop everything.
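
The retry-timeout guess above can be sanity-checked with a little arithmetic. This is only a sketch, assuming a Linux server with default settings (`net.ipv4.tcp_retries2 = 15`, a minimum retransmission timeout of 200 ms, and a 120 s cap); the function name and parameters are made up for illustration. The real RTO is derived from the measured round-trip time, so the figure is a lower bound on what the kernel actually does, not a prediction.

```python
def retransmit_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Rough total seconds before the kernel gives up on unacked data.

    The first transmission and each of the `retries` retransmissions wait
    one RTO; the RTO doubles after every attempt until it hits rto_max.
    All defaults are assumed Linux values, not measurements.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


total = retransmit_timeout()
print(f"{total:.1f} s, about {total / 60:.1f} minutes")
# prints: 924.6 s, about 15.4 minutes
```

Under those assumptions the "maybe 15" end of the estimate looks closest, and it is still far shorter than the idle (no-data-to-send) case.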

So I think the lingering connections I've observed must be from quiescent transfers, with no data being transferred at the time the partition happens. Or both sides are quiescent (transfer has finished) but they just forget to disconnect somehow.

Possible actions:

  • enable some TCP keepalive option, and hope it actually does something useful
  • have the server record how much data has been transferred (in each direction) on all sockets, at least temporarily, and find a way to correlate this with existing open sockets. So when I look at the server and see a lingering socket, I can find out whether the transfer hasn't started yet (zero bytes in both directions), has started/maybe-completed but is unacked (lots of bytes in one direction, zero bytes in the other), or has completed (lots of bytes in one direction, a small number for the ack in the other).
  • see if lingering relay connections are correlated with lingering mailbox connections, which might indicate clients that just forget to exit. The mailbox connections use websockets, which have their own ping/pong keepalive timeouts, so they'll tend to be closed more quickly in the event of partition
  • double-check that the new Dilation protocol sends periodic keepalives, and that it discovers partitions/laptop-closed in a timely fashion
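
The byte-counting idea in the second item could be reduced to a small classifier once the server records per-direction totals. A minimal sketch follows; the function name, the state labels, and the 1024-byte cutoff for "a small number for the ack" are all assumptions for illustration, not anything the relay currently implements.

```python
def classify_lingering(a_to_b, b_to_a, ack_max=1024):
    """Guess the state of a lingering relay connection from byte counts.

    a_to_b / b_to_a are bytes relayed in each direction (hypothetical
    counters). ack_max is an assumed upper bound on the size of the
    final ack; tune it against real transfers.
    """
    big, small = max(a_to_b, b_to_a), min(a_to_b, b_to_a)
    if big == 0:
        return "not-started"       # zero bytes in both directions
    if small == 0:
        return "started-unacked"   # data one way, nothing back yet
    if small <= ack_max < big:
        return "completed"         # bulk data one way, small ack back
    return "unclear"               # does not match any expected pattern


print(classify_lingering(5_000_000, 80))
```

The "unclear" bucket is deliberate: anything that does not fit the three expected patterns is probably worth looking at by hand.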
warner added a commit that referenced this issue Sep 11, 2019
This timeout is notoriously long (about two hours), but it might eventually
prune stuck connections.

refs #9
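
For reference, here is a sketch of what enabling keepalive with shorter timers might look like at the socket level. It is not the code from the commit above: the helper name and the timer values are assumptions, and the `TCP_KEEPIDLE` / `TCP_KEEPINTVL` / `TCP_KEEPCNT` options are platform-specific (available on Linux), so the sketch only sets them when Python's `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on TCP keepalive and, where supported, shorten its timers.

    idle:     seconds of silence before the first probe (assumed value)
    interval: seconds between probes (assumed value)
    count:    unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket knobs are not available on every platform, so
    # skip any that this build of Python does not define.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
s.close()
```

With these example values a dead peer would be noticed after roughly `idle + interval * count` = 900 seconds, instead of the two-hour default idle time plus probes.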