actions-runner-controller v0.27.0 has been released! #2156
This release features several reliability and observability enhancements across the controller and the runner, along with new Ubuntu 22.04-based runner images.
All planned changes in this release can be found in the milestone https://github.com/actions-runner-controller/actions-runner-controller/milestone/10.
Also see v0.26.0...v0.27.0 for the full changelog. This post documents breaking changes and major enhancements.
Upgrading
If you're using our Helm chart to deploy ARC, use chart version 0.21.0 or greater. As usual, don't miss upgrading the CRDs yourself: Helm doesn't upgrade CRDs.
BREAKING CHANGE: `workflow_job` became ARC's only supported webhook event as the scale trigger

In this release, we've removed support for the legacy `check_run`, `push`, and `pull_request` webhook events in favor of `workflow_job`, which was released a year ago. Since then, it has served all the use-cases formerly and partially supported by the legacy events, so we should be ready to fully migrate to `workflow_job`.

Anyone who's still using the legacy webhook events should have `HorizontalRunnerAutoscaler` specs that look similar to the legacy example sketched below. You need to update the spec to look like the `workflow_job`-based example that follows it, along with enabling the `Workflow Job` event (and disabling the unneeded `Push`, `Check Run`, and `Pull Request` events) on your webhook settings page on GitHub.

Relevant PR(s): #2001
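For illustration, here are hedged before-and-after sketches of such a spec. The resource names, amounts, and durations are placeholders, and the legacy example uses a `checkRun` trigger; `push` and `pullRequest` triggers migrate the same way.

```yaml
# Before: a legacy scale trigger based on the check_run webhook event.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-hra
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"
---
# After: the workflow_job-based trigger that v0.27.0 requires.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-hra
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "5m"
```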
Fix: Runner pods should work more reliably with cluster-autoscaler
We've fixed many edge cases in the runner pod termination process that seem to have resulted in various issues: pods stuck in Terminating, workflow jobs hanging for 10 minutes or so when an external controller like cluster-autoscaler tried to terminate a runner pod that was still running a workflow job, workflow jobs failing because a job container step was unable to access the Docker daemon, and so on.
Do note that you need to set an appropriate `RUNNER_GRACEFUL_STOP_TIMEOUT` on both the `docker` sidecar container and the `runner` container specs so that the stop process waits long enough for your use-case; a minimal configuration is sketched at the end of this section. `RUNNER_GRACEFUL_STOP_TIMEOUT` is basically the longest time the runner stop process will wait for the runner agent to stop gracefully.

It's set to `RUNNER_GRACEFUL_STOP_TIMEOUT=15` by default, which might be too short for many use-cases. For example, if you're using AWS Spot Instances to power the nodes for your runner pods, you get 2 minutes at the longest. You'd want to set the graceful stop timeout slightly shorter than those 2 minutes, like `110` or `100` seconds, depending on how much CPU, memory, and storage your runner pod is given. With rich CPU/memory/storage/network resources, the runner agent could stop gracefully well within 10 seconds, making `110` the right setting. With fewer resources, the runner agent could take more than 10 seconds to stop gracefully; if you think it would take 20 seconds in your environment, `100` would be the right setting.

`RUNNER_GRACEFUL_STOP_TIMEOUT` is designed to let the runner stop process wait as long as possible, so the workflow job isn't cancelled in the middle of processing, while still preventing the workflow job from getting stuck for 10 minutes because the node disappeared before the runner agent could cancel the job.

Under the hood, `RUNNER_GRACEFUL_STOP_TIMEOUT` works by instructing the runner's signal handler to delay forwarding the `SIGTERM` sent by Kubernetes on pod termination down to the runner agent. The runner agent is supposed to cancel the workflow job only on `SIGTERM`, so making this delay longer lets you delay cancelling the workflow job, which results in a longer grace period for stopping the runner. Practically, the runner pod stops gracefully only when the workflow job running within it completes before the runner graceful stop timeout elapses. The timeout can't be forever in practice, although that might theoretically be possible depending on your cluster environment; AWS Spot Instances, again for example, give you 2 minutes to gracefully stop the whole node, so `RUNNER_GRACEFUL_STOP_TIMEOUT` can't be longer than that.

If you have success stories with the new `RUNNER_GRACEFUL_STOP_TIMEOUT`, please don't hesitate to create a `Show and Tell` discussion in our GitHub Discussions to share what configuration worked in which environment, including the name of your cloud provider, the name of the managed Kubernetes service, the graceful stop timeout for nodes (defined and provided by the provider or the service), and the one for runner pods (`RUNNER_GRACEFUL_STOP_TIMEOUT`).

Relevant PR(s): #1759, #1851, #1855
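As a rough illustration, the sketch below sets the timeout on a `RunnerDeployment`. This is a minimal sketch, not the canonical configuration: the resource names and the `110`/`120` values are placeholders, and it assumes the `env`, `dockerEnv`, and `terminationGracePeriodSeconds` fields available on recent runner specs.

```yaml
# Minimal sketch; tune the values for your environment.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example/repo
      # Pod-level grace period, slightly longer than the runner's graceful stop timeout below.
      terminationGracePeriodSeconds: 120
      # Environment for the runner container.
      env:
      - name: RUNNER_GRACEFUL_STOP_TIMEOUT
        value: "110"
      # Environment for the docker sidecar container.
      dockerEnv:
      - name: RUNNER_GRACEFUL_STOP_TIMEOUT
        value: "110"
```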
ENHANCEMENT: More reliable and customizable "wait-for-docker" feature
You can now add a `WAIT_FOR_DOCKER_SECONDS` environment variable to the `runner` container of the runner pod spec to customize how long the runner startup script waits for the Docker daemon to get up and running. Previously this was hard-coded to 120 seconds, which wasn't sufficient in some environments.

Along with the enhancement, we also fixed a bug in the runner startup script where it didn't exit immediately on the Docker startup timeout.
The bug meant you could see a job container step failing due to a missing Docker socket. Ideally, the whole runner pod should have kept auto-restarting until you got a fully working runner pod with a working runner agent plus a Docker daemon that started within the timeout, so you should never have seen a job step fail due to a Docker issue.
We fixed it so it should work as intended now.
Relevant PR(s): #1999
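For example, setting it on a `RunnerDeployment` might look like the minimal sketch below; the resource name and the 300-second value are placeholders.

```yaml
# Minimal sketch; wait up to 5 minutes (instead of the previous hard-coded 120s)
# for dockerd to become available.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example/repo
      env:
      - name: WAIT_FOR_DOCKER_SECONDS
        value: "300"
```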
ENHANCEMENT: New webhook and metrics server for monitoring workflow jobs
**This feature is 99% authored and contributed by @ColinHeathman. Big kudos to Colin for his awesome work!**
You can now use the new `actions-metrics-server` to expose an additional GitHub webhook endpoint for receiving `workflow_job` events and calculating and collecting various metrics related to the jobs. Please see the updated chart documentation for how to enable it.

We made it a separate component instead of adding the new metrics collector to our existing `github-webhook-server` so that we retain the ability to scale the `github-webhook-server` to two or more replicas for availability and scalability.

Also note that `actions-metrics-server` cannot be scaled to 2 or more replicas today. That's because it needs to store its state somewhere to retain a `workflow_job` webhook event until it receives the corresponding event that lets it finally calculate the metric value, and the only supported state store is in-memory as of today. For example, it needs to save a `workflow_job` event with `status=queued` until it receives the corresponding `workflow_job` event with `status=in_progress` to finally calculate the queue-duration metric value.

We may add another state store backed by e.g. Memcached or Redis if there's enough demand, but we opted not to complicate ARC for now. You can follow the relevant discussion in this thread.

Relevant PR(s): #1814, #2057
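If you deploy via the Helm chart, enabling it might look roughly like the values sketch below. The key name is an assumption modeled on the existing `github-webhook-server` settings, so please treat the updated chart documentation as the source of truth.

```yaml
# values.yaml sketch; the key below is an assumption, consult the chart docs.
actionsMetricsServer:
  enabled: true
```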
New runner images based on Ubuntu 22.04
We started publishing new runner images based on Ubuntu 22.04 with the following tags:
The `latest` tags for the runner images will stick with Ubuntu 20.04 for a while. We'll try to submit an issue or a discussion as notice before switching `latest` to 22.04. See this thread for more context.

Note that we took this chance to slim down the runner images for more security, maintainability, and extensibility. That said, some packages that are present by default in hosted runners but can easily be installed using `setup-` actions (like `python` using the `setup-python` action), and other convenient but not strictly necessary packages like `ftp`, `telnet`, `upx`, and so on, are no longer installed onto our 22.04-based runners. Consult the Dockerfile parts below and add some `setup-` actions to your workflows (see the sketch after the list), or build your own custom runner image(s) based on our new 22.04 images, in case you relied on packages present in our 20.04 images but not in our 22.04 images:

20.04 runner
22.04 runner
20.04 dind-runner
22.04 dind-runner
20.04 rootless-dind-runner
22.04 rootless-dind-runner
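For instance, if you previously relied on `python` being preinstalled, a `setup-` action in the workflow covers it. This is a minimal sketch; the `self-hosted` label is a placeholder for whatever labels your runners actually use.

```yaml
# Minimal workflow sketch: install Python via setup-python instead of relying
# on it being preinstalled in the runner image.
name: example
on: push
jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - run: python --version
```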
These images are not strictly tied to the v0.27.0 release. You can freely try the new images with ARC v0.26.0, or use both 20.04 and 22.04 based images with ARC v0.27.0.
Relevant PR(s): #1924, #2030, #2033, #2036, #2050, #2078, #2079, #2080, #2098