Intermittent "Operation cancelled by user" error #3719
CarolMebiom asked this question in Questions
Hi, everyone.
My company has an on-prem Kubernetes cluster, set up using k3s, in which we deploy ephemeral self-hosted runners with GPU enabled. From time to time we get the error "Operation cancelled by user" without any intervention on our side. Since we were aware that the cause could be resource constraints, we increased the CPU and memory requests and limits.
We are using Kubernetes v1.26 on Ubuntu machines, and the gha runner scale set is version 0.9.3.
This is the workflow file:
Here are the logs of the workflow when it does not finish running because of the "Operation cancelled by user" error:
For some reason, the runner receives a shutdown signal. However, the issue is intermittent: sometimes the jobs run to completion, other times they do not. I have looked into resource usage and it does not seem to be the cause: I do not get OOM errors, and there are no other pods on the same node requesting the GPU.
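A check along the following lines could confirm that no container was OOM-killed. This is a minimal sketch, not necessarily the procedure used above. The helper names `termination_reasons` and `fetch_pod` are hypothetical, and the pod name and namespace passed in are placeholders. It assumes `kubectl` is on the PATH and reads the standard `status.containerStatuses` fields of the pod object. Because the runner pods are ephemeral, the pod JSON would have to be captured before the pod is deleted.

```python
import json
import subprocess


def termination_reasons(pod: dict) -> list[tuple[str, str, int]]:
    """Return (container, reason, exit_code) for each terminated container state.

    Looks at both the current state and the last state of every container,
    so a reason such as "OOMKilled" shows up even after a restart.
    """
    results = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated:
                results.append(
                    (
                        status.get("name", ""),
                        terminated.get("reason", ""),
                        terminated.get("exitCode", -1),
                    )
                )
    return results


def fetch_pod(name: str, namespace: str) -> dict:
    """Fetch a pod object as JSON via kubectl (name and namespace are placeholders)."""
    completed = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

An empty result from `termination_reasons(fetch_pod(...))` would mean no container in that pod has a recorded termination, while an entry with reason `OOMKilled` would point back at memory limits.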
Besides that, sometimes one of the jobs does not even get a pod scheduled, even though the GitHub self-hosted runner listener shows these logs:
And the controller shows these:
I really do not understand what is going on, and any help you can offer is greatly appreciated. If you need any other information to assist, please let me know and I will gladly provide it.