There are a few places in the runner where sub-tasks are awaited without a timeout. If those operations never resolve, the runner hangs forever without making progress.
In particular, a task's `.ready` promise is usually awaited as-is.
One recent example: it seems the chain can get into a state where it no longer makes progress on pending loadgen tasks, without the loadgen task failing (Agoric/agoric-sdk#4155). This first happens in the middle of a stage, which still ends after its allotted time (once the wind-down timeout expires, which is currently non-fatal). At restart, however, the runner gets stuck on `await orInterrupt(runLoadgenResult.ready)`.
The short-term solution is to add explicit timeouts at all such sites. Long term, we could modify the async task helpers to thread a `stop` promise to downstream tasks, signaling that they should exit immediately, and add a top-level timeout to the stage to make sure it never runs longer than anticipated. Threading an abort mechanism through async tasks is fairly complex, especially when it should accommodate a reasonably clean shutdown (finalization steps for each task).
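As a rough illustration of the short-term fix, a bare await could be wrapped in a helper that races the promise against a timer. This is only a sketch: `withTimeout`, its parameters, and the error message are hypothetical names chosen here, not existing runner helpers.

```javascript
// Hypothetical helper (not part of the runner's existing API): rejects if
// `promise` has not settled within `ms` milliseconds.
const withTimeout = (promise, ms, label = 'operation') => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  // Whichever settles first wins. The timer is always cleared afterwards so
  // that a pending timeout cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// Possible use at a call site like the one above (the 2 minute value is an
// arbitrary assumption):
//   await orInterrupt(
//     withTimeout(runLoadgenResult.ready, 2 * 60 * 1000, 'loadgen ready'),
//   );
```

Note that a timeout like this only unblocks the awaiting code. It does not stop the underlying task, which is why the longer-term `stop` promise would still be needed for a clean shutdown.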
- Update to node 16 and Debian bullseye (with fallback to node 14 for older incompatible SDKs)
- Handle some older SDK versions which output lockdown sniffing to stdout instead of stderr
- Rewrite the config argv parsing logic so it behaves slightly more sanely
- Fix some deadlock issues, e.g. by adding timeouts on task ready (see #40) and by closing slog streams that were not closing properly
- Capture client storage and slog file (should help track some transient seg faults in the solo)
- Automatically capture the state of the client and chain if an error occurs (see #39)
- Background the compression of the state directories after snapshotting them (overlayfs supports CoW). Closes #39
- Avoid resetting the whole `agoric-servers` project in `local-chain` tests. Removes loadgen project `git` dependency.
- Elide long lines from the chain or solo output (improves github actions perf, see Agoric/agoric-sdk#4113)