be more aggressive about closing TCP connections? #9

warner · 2019-08-29T18:48:21Z

When looking at the server, I see lingering connections all the time. These are TCP connections that have been around for days, but are not moving any traffic. It's not really a problem, but it's weird, and I'd like to understand what's going on (and if it reflects some sort of bug in the client).

The relay will close one side as soon as it receives a close from the other side. So the only way to remain in this state is for our kernel to think that both sides are still connected. TCP has a notoriously long no-new-traffic timeout (at least hours, but maybe effectively infinite), and we don't send any sort of keepalives on this channel. The relay cannot send them, since all the bytes are reserved for the two ends of the connection, and wormhole's Transit protocol doesn't do any keepalives itself (although the new Dilation protocol does). So if both sides got partitioned (e.g. they closed their laptops before the wormhole process exited), the server might not see the connections close for a very long time.

If the sender closes their laptop before the transfer completes, the receiver will be left hanging (the progress bar showing only a partial transfer). If/when the receiver kills the program, the receiver's socket will be closed, the server will get a FIN, and the server will shut down the sender's socket too. If the receiver closes their laptop before killing the program, the server will see both sockets left open for a long time.

If the receiver closes their laptop first (before the transfer completes), backpressure works its way back to the sender: the server's outgoing kernel TCP buffer will fill with unacked data, so the server will pause reading, so the server's incoming kernel buffer will fill, then the server's TCP stack will stop accepting inbound data (it advertises a zero receive window), then the sender's outgoing buffer will fill, then the sender will pause. The sender will see a partial progress bar, and no further progress being made. Looking at the server, I'd see non-empty kernel buffers for the connections (which I don't think I've ever seen, at least for connections that aren't making any progress at all). The server's kernel will retransmit the unacked outgoing data, and when those retransmissions are exhausted (which I think tends to take 5-10 minutes, maybe 15, but way shorter than the no-data-to-send case), the server will see a dropped connection, and will drop everything.
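To put a rough number on the retransmission timeout described above, here is a back-of-the-envelope sketch. It assumes Linux defaults as I understand them (`net.ipv4.tcp_retries2 = 15`, a minimum retransmission timer of 200 ms, a maximum of 120 s) and plain exponential backoff. Real timers depend on the measured round-trip time, so treat the result as an estimate, not a measurement of the relay host. The function name and parameters are made up for illustration.

```python
def retransmit_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long the kernel keeps retransmitting before giving up.

    Assumptions (not measured on the relay): the first retransmission timer
    is rto_min, each later timer doubles, and no timer exceeds rto_max.
    The connection is dropped when the timer after the last retry expires,
    so retries + 1 timers are summed.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    seconds = retransmit_give_up_seconds()
    print(f"{seconds:.1f} s, about {seconds / 60:.1f} minutes")
```

Under those assumptions the total comes out at roughly 15 minutes, which is the upper end of the range guessed above. A host with a lower retry count would give up sooner.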

So I think the lingering connections I've observed must be from quiescent transfers, with no data being transferred at the time the partition happens. Or the transfer has finished and both sides are quiescent, but they somehow just forget to disconnect.

Possible actions:

- enable some TCP keepalive option, and hope it actually does something useful
- have the server record how much data has been transferred (in each direction) on all sockets, at least temporarily, and find a way to correlate this with existing open sockets. So when I look at the server and see a lingering socket, I can find out whether the transfer hasn't started yet (zero bytes in both directions), has started/maybe-completed but is unacked (lots of bytes in one direction, zero bytes in the other), or has completed (lots of bytes in one direction, a small number for the ack in the other).
- see if lingering relay connections are correlated with lingering mailbox connections, which might indicate clients that just forget to exit. The mailbox connections use websockets, which have their own ping/pong keepalive timeouts, so they'll tend to be closed more quickly in the event of partition
- double-check that the new Dilation protocol sends periodic keepalives, and that it discovers partitions/laptop-closed in a timely fashion
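The byte-counting idea in the list above could look something like the following sketch. Everything here is hypothetical: `PairStats`, `classify`, and the `SMALL_BYTES` threshold are illustrative names and values, not part of the relay's code, and the relay would still need some way to map these counters onto the sockets it has open.

```python
from dataclasses import dataclass

# Hypothetical threshold: at or below this many bytes counts as "a small
# number" (e.g. a final ack), not bulk transfer data.
SMALL_BYTES = 1024


@dataclass
class PairStats:
    """Byte counters for one relayed pair of connections (illustrative)."""
    a_to_b: int = 0
    b_to_a: int = 0

    def record(self, from_a: bool, nbytes: int) -> None:
        # Called whenever the relay forwards nbytes in one direction.
        if from_a:
            self.a_to_b += nbytes
        else:
            self.b_to_a += nbytes


def classify(stats: PairStats) -> str:
    """Map the counters onto the three states described in the list."""
    big = max(stats.a_to_b, stats.b_to_a)
    small = min(stats.a_to_b, stats.b_to_a)
    if big == 0:
        return "not started"
    if big > SMALL_BYTES and small == 0:
        return "started or maybe completed, unacked"
    if big > SMALL_BYTES and small <= SMALL_BYTES:
        return "completed"
    return "unclear"


if __name__ == "__main__":
    stats = PairStats()
    stats.record(from_a=True, nbytes=5_000_000)
    print(classify(stats))
    stats.record(from_a=False, nbytes=64)
    print(classify(stats))
```

Anything that does not match one of the three patterns falls through to "unclear", which seems safer than guessing when looking at a lingering socket.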

warner added a commit that referenced this issue Sep 11, 2019

enable SO_KEEPALIVE on all connections

42a2932

This timeout is notoriously long (about two hours), but it might eventually prune stuck connections. refs #9
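For reference, a minimal sketch of what enabling keepalive means at the plain-socket level, and how the two-hour default could be shortened per connection. This is not the code from the commit above. `enable_keepalive` is a hypothetical helper, the 600/60/5 values are illustrative, and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are platform-specific (available on Linux), so the sketch skips any that the local `socket` module does not expose.

```python
import socket


def enable_keepalive(sock, idle=None, interval=None, count=None):
    """Turn on SO_KEEPALIVE and, where supported, shorten its timers.

    idle:     seconds of silence before the first probe
    interval: seconds between probes
    count:    unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # absent on some platforms
        if option is not None and value is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Illustrative values: probe after 10 idle minutes, then once a minute,
    # and give up after 5 unanswered probes.
    enable_keepalive(s, idle=600, interval=60, count=5)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With only `SO_KEEPALIVE` set, the kernel's system-wide defaults apply, which is where the roughly two-hour figure comes from. The per-socket options would let the relay prune stuck connections sooner without changing host-wide settings.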

ewanas mentioned this issue Jun 2, 2022

Improve handling of aborted connections #26

Closed
