Replies: 5 comments
-
Hey @mwflaher, based on my own experience troubleshooting self-hosted runners, as well as the official documentation, job logs - the logs you see streamed in the web UI - can be found inside the running runner pod / container, under the runner's `_diag` directory. The logs are not a single file but are split into multiple files (by step, I believe - I can't look it up right now). They are also not updated live, but written in chunks. To ingest them into any kind of log aggregation system, you would probably need to build a custom solution; I haven't yet seen anything useful for this.
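If all you need is for those files to show up on the pod's stdout so an existing node-level log collector can pick them up, one possible starting point for such a custom solution (my sketch, not something from this thread) is a small sidecar added to the runner pod template. The volume name `runner` and the mount path are assumptions about your runner spec and will differ per setup:

```yaml
# Hypothetical extra container for the runner pod template: shares the
# runner's working volume read-only and streams the _diag page logs to
# its own stdout, where a normal log agent can collect them.
# Volume name ("runner") and paths are assumptions - adjust to your spec.
- name: diag-log-tailer
  image: debian:bookworm-slim
  command:
    - sh
    - -c
    - |
      # Wait until the runner has written its first page log, then follow
      # all page logs. GNU tail prints "==> file <==" headers when it
      # switches files, which identifies the source log.
      # Caveat: the glob is expanded once, so page files created later are
      # missed; a real implementation would rescan the directory.
      while ! ls /runner/_diag/pages/*.log >/dev/null 2>&1; do sleep 5; done
      exec tail -n +1 -F /runner/_diag/pages/*.log
  volumeMounts:
    - name: runner
      mountPath: /runner
      readOnly: true
```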
-
We are also having trouble finding the STDOUT of runner-handled processes in our kubernetes logs and OpenTelemetry, using the recommended ARC setup for ephemeral runners. Our scale set has multiple containers alongside the runner, and all container logs arrive at our downstream.
We created a test java application that only writes a line to System.out and run it with the gradle GitHub Action. The runner logs themselves arrive just fine, and the GitHub UI shows the lines written to STDOUT, but anything handled by the runner's `ProcessInvoker` never reaches the container's STDOUT; instead we see an IOException in the logs.
That exception comes from https://github.com/actions/runner/blob/main/src/Runner.Sdk/ProcessInvoker.cs#L872. We don't allow privileged containers in our kubernetes setup, but for testing we added the privileged capability and permission to elevate anyway. The IOException still shows up in the logs, even though we confirmed we are able to elevate the runner user's privileges in the container and write to the OOM file. Trying to run the container as root threw the error "Must not run interactively with sudo".
I understand the performance concerns expressed at https://github.com/actions/runner/blob/main/src/Runner.Sdk/ProcessInvoker.cs#L20, but this still feels like a broken implementation to me. Kubernetes recommends writing all container output to STDOUT for consolidation (https://kubernetes.io/docs/concepts/cluster-administration/logging/), so we should be able to see our workload logs as handled by the runner transparently. Is there any way the ProcessInvoker can be set up to avoid OOM score adjustments in a kubernetes ARC setup with scale sets? That would let us avoid running privileged workloads, which keeps us security compliant. Otherwise, is there a privileged setting the runner can leverage to properly write to STDOUT?
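One thing that might be worth trying before going fully privileged (my suggestion, not something from this thread, and it assumes the IOException really is the `oom_score_adj` write being denied) is granting the runner container only `CAP_SYS_RESOURCE` - the capability Linux checks when a process lowers an OOM score adjustment - instead of `privileged: true`. A minimal sketch, assuming the usual `gha-runner-scale-set` values layout:

```yaml
# Hypothetical runner pod template override in the scale set values file.
# Adds only CAP_SYS_RESOURCE (needed to lower /proc/<pid>/oom_score_adj)
# rather than running the whole container privileged.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
              - SYS_RESOURCE
```

Whether the write then succeeds still depends on the value kubelet already assigned to the container's `oom_score_adj`, so treat this as an experiment rather than a fix.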
-
I recently solved this with promtail. Our team runs github actions runners on our own karpenter-controlled nodegroups. Our standard promtail config uses the kubernetes service discovery functionality to read logs from the pods themselves, but github actions runners don't report anything particularly useful to stdout. Folks in this thread already know that the behind-the-scenes coordination of workers happens in log files under the runner's `_diag` directory. So we need to get access to the filesystem of the runner pods from within another pod on that host. In our particular setup, the runner pods keep that directory on an `emptyDir` volume named `runner`, and I've chosen to map the host's kubelet pods directory into the promtail pod at `/var/log/kubelet-pods` (a sketch of that mount is just below).
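For reference, that host mount could look something like the following in the promtail DaemonSet. This is an illustrative sketch rather than the poster's actual manifest, and it assumes the default kubelet data directory:

```yaml
# Hypothetical extract of a promtail DaemonSet: exposes every pod's volumes
# on the node (including the runner pods' "runner" emptyDir) to promtail,
# read-only, under the /var/log/kubelet-pods path the scrape config expects.
# Assumes the kubelet default data dir /var/lib/kubelet on the host.
spec:
  containers:
    - name: promtail
      image: grafana/promtail:3.0.0
      volumeMounts:
        - name: kubelet-pods
          mountPath: /var/log/kubelet-pods
          readOnly: true
  volumes:
    - name: kubelet-pods
      hostPath:
        path: /var/lib/kubelet/pods
        type: Directory
```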
So, now, with the actions runner pods' filesystems mounted into our promtail pod, we can configure promtail like this:

```yaml
- job_name: gha-jobs
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_pod_annotationpresent_actions_runner_id
      action: keep
      regex: 'true'
    - source_labels:
        - __meta_kubernetes_pod_uid
      action: replace
      replacement: >-
        /var/log/kubelet-pods/$1/volumes/kubernetes.io~empty-dir/runner/_diag/pages/*.log
      target_label: __path__
  pipeline_stages:
    - regex:
        expression: ^(?P<time>\S+T\S+Z?) (?P<message>.*)$
    - timestamp:
        source: time
        format: RFC3339Nano
    - output:
        source: message
    - regex:
        expression: >-
          runner/_diag/pages/(?P<timeline_id>[^_]+?)_(?P<job_id>[^_]+?)_.*.log$
        source: filename
    - labels:
        job_id: ''
        timeline_id: ''
- job_name: gha-runners
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_pod_annotationpresent_actions_runner_id
      action: keep
      regex: 'true'
    - source_labels:
        - __meta_kubernetes_pod_uid
      action: replace
      replacement: >-
        /var/log/kubelet-pods/$1/volumes/kubernetes.io~empty-dir/runner/_diag/{Worker,Runner}_*.log
      target_label: __path__
  pipeline_stages:
    - multiline:
        firstline: ^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z \w+\s+\w+\]
        max_wait_time: 3s
    - regex:
        expression: >-
          ^\[(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z
          (?P<level>\w+)\s+(?P<subsystem>\w+)\] (?P<message>(?s:.*))$
    - timestamp:
        source: time
        format: '2006-01-02 15:04:05'
    - template:
        source: message
        template: >-
          {{if .level}}[{{ .level }} {{ .subsystem }}] {{ .message
          }}{{else}}{{ .Entry }}{{end}}
    - output:
        source: message
```

I'm going to go over each segment in order:

```yaml
- job_name: gha-jobs
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_pod_annotationpresent_actions_runner_id
      action: keep
      regex: 'true'
```

Both jobs are configured to only run against github actions runner pods. Anything without that annotation gets skipped.

```yaml
    - source_labels:
        - __meta_kubernetes_pod_uid
      action: replace
      replacement: >-
        /var/log/kubelet-pods/$1/volumes/kubernetes.io~empty-dir/runner/_diag/pages/*.log
      target_label: __path__
```

By using the pod uid in our replacement template here, we're being explicit about only reading log files from github actions runner pods. And we're going to be able to associate this data with other information we have.

```yaml
  pipeline_stages:
    - regex:
        expression: ^(?P<time>\S+T\S+Z?) (?P<message>.*)$
    - timestamp:
        source: time
        format: RFC3339Nano
    - output:
        source: message
```

This extracts the timestamp from the text of the message and then retains only the message itself for the log entry. That makes logs easier to read on the other end, since it allows us to render the time in a browser-selected format when viewing.

```yaml
    - regex:
        expression: >-
          runner/_diag/pages/(?P<timeline_id>[^_]+?)_(?P<job_id>[^_]+?)_.*.log$
        source: filename
    - labels:
        job_id: ''
        timeline_id: ''
```

This pulls additional metadata from the filename itself; https://github.com/actions/runner/blob/6d7446a45ebc638a842895d5742d6cf9afa3b66d/src/Runner.Common/Logging.cs#L127 is where the runner constructs those filenames.

The first few bits of the next job are the same as the first, so I'm not going to repeat them. Here's what's next:

```yaml
  pipeline_stages:
    - multiline:
        firstline: ^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z \w+\s+\w+\]
        max_wait_time: 3s
    - regex:
        expression: >-
          ^\[(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z
          (?P<level>\w+)\s+(?P<subsystem>\w+)\] (?P<message>(?s:.*))$
```

The worker logs have a different format, and more importantly, they often contain multiline stack traces. This pair of stages extracts the timestamp and other info from each log message, and also correctly groups a stack trace into a single message.

```yaml
    - template:
        source: message
        template: >-
          {{if .level}}[{{ .level }} {{ .subsystem }}] {{ .message
          }}{{else}}{{ .Entry }}{{end}}
    - output:
        source: message
```

This reformats the log messages and drops the timestamp, for the same reasons as above. While working on this, I found some messages that did not match the regex I specified, so the template includes a failsafe that emits the original log entry if the regex did not match for some reason.

We still have a few more tweaks to apply. I'm going to work with one of my teammates to try to find some correlating labels that would let us follow a workflow or job by id more completely.

Anyway, I hope this helps someone. I realize it's fairly implementation-specific, but there are details in here that might get you over a hump.
-
@thekuffs thank you for the info - how did you manage to match logs to a particular run? I need to assign proper labels to the logs, like repository, workflow name, run, etc.
-
For the time being, I don't. My internal requirement was to get things logged and archived. I would have liked to tie that other correlating info together, but I didn't see a trivial solution. There's some information encoded in the filename, as I mentioned (https://github.com/actions/runner/blob/6d7446a45ebc638a842895d5742d6cf9afa3b66d/src/Runner.Common/Logging.cs#L127), but I wasn't able to correlate those ids with anything the user sees. I can't remember which log has it exactly, but one of them dumps a big JSON blob at the beginning of the run. It contains the uuids involved, the repository, I think the user, and a bunch of other information. The problem is that there's no way for me to configure promtail to capture that block as a sort of context for the rest of the file. I think I'd have to start exploring a custom log parser to really make it work, and I just don't have that kind of time for such a solution.
-
We're flowing logs (and metrics) from our installation into OpenTelemetry and would like to see them downstream, but we're struggling to figure out how to get the stdout/stderr that we see in the GitHub Actions UI. Is this possible, or is something redirecting logs in a way that we can't hook into? Thanks!
I did a search first but didn't find an existing discussion. This discussion confirms that the logs are not there: #2054 - but perhaps the reasoning is this redaction concern?