`gha_job_execution_duration_seconds_sum` reports wrong value in some cases #3731

hpedrorodrigues · 2024-09-05T17:32:30Z

Checks

I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
I am using charts that are officially provided

Controller Version

0.9.3

Deployment Method

Helm

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions).
I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install `gha-runner-scale-set-controller` using the Helm chart via FluxCD
2. Install a few `gha-runner-scale-set`s using the Helm chart via FluxCD
3. Run a few workflows on these runner sets, and cancel some of them, either manually or via `concurrency.group`

Describe the bug

In some cases (I don't know the exact cause yet), the listener reports the metric `gha_job_execution_duration_seconds_sum` with an implausibly large value.

Example:

gha_job_execution_duration_seconds_sum{enterprise="",event_name="repository_dispatch",job_name="create-gh-deployment",job_result="canceled",job_workflow_ref="[redacted]/.github/workflows/gh-deployment.yml@refs/heads/master",organization="[redacted]",repository="[redacted]",runner_id="0",runner_name=""} 1.27722295721e+11

Looking at the repository, all runs that actually execute finish in less than 60 seconds. The others are canceled before they start because a new commit was pushed to the branch.
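As a rough sanity check on the magnitude, here is a small sketch. It rests on two assumptions that this report does not confirm: that the duration is computed as a finish timestamp minus a start timestamp expressed in Unix seconds, and that the sum above consists of two affected observations (the matching `_count` is not shown). Go's zero `time.Time` is January 1, year 1 UTC, which is a large negative Unix timestamp, so an unset start time would produce a value of this size.

```python
from datetime import datetime, timezone

# Go's zero time.Time is 0001-01-01T00:00:00Z; as Unix seconds it is negative.
GO_ZERO_UNIX = int(datetime(1, 1, 1, tzinfo=timezone.utc).timestamp())

# 1.27722295721e+11, copied from the sample above.
reported_sum = 127_722_295_721

# Assumption (not confirmed): the sum holds two observations, each computed
# as finish_unix - start_unix with start left at Go's zero time.
ASSUMED_OBSERVATIONS = 2
per_observation = reported_sum / ASSUMED_OBSERVATIONS

# Undo the assumed subtraction to recover the implied finish timestamp.
implied_finish_unix = per_observation + GO_ZERO_UNIX
implied_finish = datetime.fromtimestamp(implied_finish_unix, tz=timezone.utc)

print("Go zero time as Unix seconds:", GO_ZERO_UNIX)
print("Implied finish time:", implied_finish.isoformat())
```

Under these assumptions, the implied finish time falls on the day this issue was filed, which would be consistent with canceled jobs that never received a start timestamp. This is only a plausibility check, not a confirmed root cause.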

Describe the expected behavior

I'm not sure whether this is caused only by canceled runs, but I'd expect the listener to report 0 for such runs.
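To make the expectation concrete, here is a minimal sketch of the guard I have in mind. The function name and the "unset means `None` or a non-positive timestamp" convention are hypothetical and chosen for illustration; this is not the listener's actual code.

```python
from typing import Optional


def execution_duration_seconds(
    started_at: Optional[float], finished_at: Optional[float]
) -> float:
    """Return elapsed seconds, or 0.0 when the job never really started.

    Both arguments are Unix timestamps in seconds (hypothetical convention).
    """
    # A missing timestamp means there is nothing to measure.
    if started_at is None or finished_at is None:
        return 0.0
    # Treat a non-positive start (for example an uninitialized value) or a
    # finish that precedes the start as "never started".
    if started_at <= 0 or finished_at < started_at:
        return 0.0
    return finished_at - started_at


print(execution_duration_seconds(1725551000, 1725551060))
print(execution_duration_seconds(-62135596800, 1725551060))
```

With a guard like this, a run canceled before a runner picks it up would contribute 0 to the sum instead of a multi-thousand-year duration.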

Additional Context

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-controller
  namespace: arc
spec:
  chart:
    spec:
      chart: gha-runner-scale-set-controller
      sourceRef:
        name: arc
        kind: HelmRepository
        namespace: flux-system
      version: '>=0.9.3'
  interval: 1m
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  values:
    replicaCount: 1
    image:
      repository: [redacted]
    serviceAccount:
      create: true
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 200Mi
    metrics:
      controllerManagerAddr: ':8080'
      listenerAddr: ':8080'
      listenerEndpoint: '/metrics'
    flags:
      logFormat: 'json'
      watchSingleNamespace: 'arc'
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cp-small-runner-set
  namespace: arc
spec:
  chart:
    spec:
      chart: gha-runner-scale-set
      sourceRef:
        name: arc
        kind: HelmRepository
        namespace: flux-system
      version: '>=0.9.3'
  interval: 1m
  values:
    githubConfigUrl: [redacted]
    githubConfigSecret: gh-app-secret
    maxRunners: 10
    minRunners: 0
    runnerGroup: default
    runnerScaleSetName: cp-small
    containerMode:
      type: dind
    template:
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
      spec:
        nodeSelector:
          spot: 'false'
          dedicated-for: github-actions
        tolerations:
          - effect: NoSchedule
            key: dedicated-for
            value: github-actions-2x
        containers:
          - name: runner
            image: arc-default-runner
            command: ['/home/runner/run.sh']
            resources:
              requests:
                cpu: 2
                memory: 4Gi
              limits:
                cpu: 2
                memory: 4Gi
        terminationGracePeriodSeconds: 600

Controller Logs

N/A

Runner Pod Logs

N/A

github-actions · 2024-09-05T17:32:59Z

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

Lucas-Hughes · 2024-09-09T18:13:30Z

I get the same result for canceled runs and when runner pods fail.

I implemented a somewhat hacky workaround in Grafana that ignores values above a threshold, but I agree that the metric should be 0 for those runs.
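The thresholding idea can be sketched outside Grafana as follows. The one-day cutoff, the function name, and the sample values (apart from the one quoted in the report) are assumptions for illustration only, not the actual dashboard configuration.

```python
# Assumed cutoff: anything longer than one day is treated as bogus.
MAX_PLAUSIBLE_SECONDS = 24 * 60 * 60


def drop_implausible(samples, limit=MAX_PLAUSIBLE_SECONDS):
    """Keep only values within [0, limit] seconds."""
    return [value for value in samples if 0 <= value <= limit]


# Hypothetical series values; the large one is the value from the report.
values = [42.0, 1.27722295721e+11, 12.0]
print(drop_implausible(values))
```

Because `_sum` is cumulative, a single bad observation keeps a series above the cutoff afterwards, so this hides the affected series rather than correcting it.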

laserpedro · 2024-10-25T12:30:27Z

I get the same result and, like @Lucas-Hughes, I see it when jobs are canceled. That's unfortunate, because this metric is very valuable: it lets us create alerts that detect slower-than-usual GitHub jobs.

hpedrorodrigues added the labels `bug` (Something isn't working), `gha-runner-scale-set` (Related to the gha-runner-scale-set mode), and `needs triage` (Requires review from the maintainers) on Sep 5, 2024