Fix timeout escapes in runner #40

Closed
mhofman opened this issue Dec 7, 2021 · 1 comment

mhofman commented Dec 7, 2021

There are a few places in the runner where sub-tasks are awaited without a timeout. If those operations never settle, the runner keeps running forever without making progress.

In particular, any task's .ready is usually awaited as-is, with no timeout.

One recent example: it seems the chain can get into a state where it no longer makes progress on pending loadgen tasks, without the loadgen task failing (Agoric/agoric-sdk#4155). This first happens in the middle of a stage, which still ends after its allotted time (once the wind-down times out, which is currently non-fatal). At restart, however, it gets stuck on await orInterrupt(runLoadgenResult.ready).

The short-term solution is to add explicit timeouts to all such sites. Long term, we could modify the async task helpers to thread a stop promise to downstream tasks, signaling that they should exit immediately, and add a top-level timeout to the stage to make sure it never runs longer than anticipated. Threading an abort mechanism through async tasks is fairly complex, especially when it should accommodate a somewhat clean shutdown (finalization steps for each task).
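To make both ideas concrete, here is a rough sketch, not the runner's actual code. `withTimeout`, `runTask`, and `runStage` are hypothetical names invented for illustration. The first helper bounds a single await. The other two show one possible shape for threading a stop promise to tasks under a top-level stage deadline, assuming every task cooperates by checking the stop signal.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms` milliseconds.
const withTimeout = (promise, ms, label = 'task') => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// Hypothetical task: works until told to stop, then runs its finalization step.
const runTask = async (name, stop, log) => {
  let stopped = false;
  stop.then(() => {
    stopped = true;
  });
  try {
    while (!stopped) {
      // Placeholder for one unit of real work.
      await new Promise((resolve) => setTimeout(resolve, 5));
    }
  } finally {
    // Finalization runs whether the loop exited normally or threw.
    log.push(`${name} finalized`);
  }
};

// Hypothetical stage: threads a single stop promise to every task and
// resolves it when the top-level deadline expires.
const runStage = async (taskNames, stageTimeoutMs) => {
  const log = [];
  let triggerStop;
  const stop = new Promise((resolve) => {
    triggerStop = resolve;
  });
  const timer = setTimeout(triggerStop, stageTimeoutMs);
  try {
    await Promise.all(taskNames.map((name) => runTask(name, stop, log)));
  } finally {
    clearTimeout(timer);
  }
  return log;
};
```

With a helper like this, the stuck call site could become something like `await withTimeout(orInterrupt(runLoadgenResult.ready), 60_000, 'loadgen ready')`, where the 60 second value is purely illustrative. Note that the stop promise only helps tasks that check it; a task blocked on an await that never settles would still need its own timeout, which is part of why the long-term approach is complex.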

mhofman added a commit that referenced this issue Jan 17, 2022
- Update to node 16 and Debian bullseye (with fallback to node 14 for older incompatible SDKs)
- Handle some older SDK versions which output lockdown sniffing to stdout instead of stderr
- Rewrote the config argv parsing logic, to make it behave slightly more sanely
- Fixed some deadlock issues, e.g. adding some timeouts on task ready (see #40) or slog streams not closing properly
- Capture client storage and slog file (should help track some transient seg faults in the solo)
- Automatically capture the state of the client and chain if an error occurs (see #39)
- Background the compression of the state directories after snapshotting them (overlayfs supports CoW). Closes #39 
- Avoid resetting the whole `agoric-servers` project in `local-chain` tests. Removes loadgen project `git` dependency.
- Elide long lines from the chain or solo output (improves github actions perf, see Agoric/agoric-sdk#4113)

mhofman commented Feb 7, 2022

I believe I've fixed all the simple cases in #44, but we should take a holistic approach to this. Closing in favor of #60 for the long-term approach.
