cached docker volume in gha-runner-scale-set #2944

JohnYoungers · 2023-09-27T15:54:39Z

JohnYoungers
Sep 27, 2023

What would you like added?

In the existing implementation of actions-runner-controller, you could optionally come up with a way to cache docker images by customizing the /var/lib/docker path as described here: https://github.com/actions/actions-runner-controller/blob/master/docs/using-custom-volumes.md

Is there a recommended approach for doing something similar in gha-runner-scale-set?

Why is this needed?

For workflows that use containers (e.g. services), run time can be greatly improved/made more consistent if images already exist locally

Additional context

In the helm chart, you can override the container configuration of the runner by including values in a template.spec.containers[] object where name: "runner". If this docker image caching were an option, I think similar logic would need to be applied to the dind container as well: https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/templates/_helpers.tpl#L98

2023-09-27T15:55:11Z

github-actions[bot]
bot Sep 27, 2023

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

0 replies

JohnYoungers · 2023-09-27T19:54:14Z

JohnYoungers
Sep 27, 2023
Author

It sounds like the recommended approach would be to unset containerMode: { type: "dind" } and then manually configure the containers based on some of the other issues/documentation.

I'll likely need to do so to remove the verbose flag from the init container (["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]) since that's blowing up our logs, but I had a question regarding the container in general: if its purpose is only to copy over files to the dind container, is there any reason that couldn't be baked into another image (e.g. ghcr.io/actions/actions-runner-dind:latest based on docker:dind)?

FROM docker:dind

COPY --from=ghcr.io/actions/actions-runner:latest /home/runner/externals/. /home/runner/tmpDir/

0 replies

Link- · 2023-09-28T08:03:58Z

Link-
Sep 28, 2023
Maintainer

> It sounds like the recommended approach would be to unset containerMode: { type: "dind" } and then manually configure the containers based on some of the other issues/documentation.

100%

> if its purpose is only to copy over files to the dind container, is there any reason that couldn't be baked into another image (e.g. ghcr.io/actions/actions-runner-dind:latest based on docker:dind)?

@JohnYoungers we want to reduce the number of container images we publish

0 replies

omri-shilton · 2023-09-28T12:42:36Z

omri-shilton
Sep 28, 2023

We also need this exact feature.

> It sounds like the recommended approach would be to unset containerMode: { type: "dind" } and then manually configure the containers based on some of the other issues/documentation.

I don't quite understand what you meant by unsetting that mode. There are only two modes for gha-runner-scale-set: dind and kubernetes. Could you explain a bit further how that would help with cached images?

From my experience with the old dynamic PVs and runnerset to control cache of docker images, we really didn't have a good experience with this feature. You can also see many issues on the subject that people aren't very happy with it.
So a better solution would be best.

0 replies

JohnYoungers · 2023-09-28T14:15:15Z

JohnYoungers
Sep 28, 2023
Author

> I don't quite understand what you meant by unsetting that mode. There are only two modes for gha-runner-scale-set: dind and kubernetes. Could you explain a bit further how that would help with cached images?

If either mode is active, the helm chart will include settings on your behalf: by unsetting, you would manually specify what the containers look like.

@Link- would it be possible to shed some insight on what's relying on those files from the initContainer? I've updated my helm chart values with the following, using the vanilla dind image: it seems to be working for my use cases, as well as using cached images from the var-lib-docker volume as was my original goal. I'm just concerned with what issues I could potentially run into in the future

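Here's the TypeScript code which generates the YAML for the helm chart values:

// Placeholder runner settings so the snippet stands alone (illustrative values, not from the original setup)
const runner = { cpu: "1", memory: "2Gi", enableDocker: true };

const runnerContainer = {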
  name: "runner",
  image: `my-custom-image:latest`,
  env: [] as any[],
  volumeMounts: [] as any[],
  resources: {
    requests: {
      cpu: runner.cpu,
      memory: runner.memory,
    },
  },
};
const containers: any[] = [runnerContainer];
const volumes: any[] = [];

if (runner.enableDocker) {
  runnerContainer.env.push({ name: "RUNNER_WAIT_FOR_DOCKER_IN_SECONDS", value: "60" });
  runnerContainer.env.push({ name: "DOCKER_HOST", value: "unix:///run/docker/docker.sock" });
  runnerContainer.volumeMounts.push({ name: "dind-sock", mountPath: "/run/docker", readOnly: true });

  containers.push({
    name: "dind",
    image: "docker:dind",
    args: ["dockerd", "--host=unix:///run/docker/docker.sock", "--group=$(DOCKER_GROUP_GID)"],
    securityContext: { privileged: true },
    env: [{ name: "DOCKER_GROUP_GID", value: "123" }],
    volumeMounts: [
      { name: "dind-sock", mountPath: "/run/docker" },
      { name: "var-lib-docker", mountPath: "/var/lib/docker" },
    ],
  });

  volumes.push({ name: "dind-sock", emptyDir: {} });
  volumes.push({
    name: "var-lib-docker",
    ephemeral: {
      volumeClaimTemplate: {
        spec: {
          accessModes: ["ReadWriteOnce"],
          resources: { requests: { storage: "20Gi" } },
          storageClassName: "gp3",
          dataSource: {
            name: `docker-snapshot-12345`,
            kind: "VolumeSnapshot",
            apiGroup: "snapshot.storage.k8s.io",
          },
        },
      },
    },
  });
}

0 replies

Link- · 2023-09-29T17:47:36Z

Link-
Sep 29, 2023
Maintainer

> @Link- would it be possible to shed some insight on what's relying on those files from the initContainer?

@JohnYoungers these contain the node runtime and other dependencies needed for the functioning of the runner. You'll need them in both containers.

https://github.com/actions/runner/blob/f57ecd8e3c618e1723cdf02565e5f4da188776a4/src/Misc/externals.sh#L36

2 replies

JohnYoungers Sep 29, 2023
Author

instead of the vanilla image, I'll use this for now:

FROM docker:dind

COPY --from=ghcr.io/actions/actions-runner:2.309.0 /home/runner/externals /home/runner/externals

bshelton229 Nov 27, 2023

@link we've removed the init container that copies these over, and we also use the RUNNER_WAIT_FOR_DOCKER_IN_SECONDS env var, in order to save container startup time. We have seen no noticeable issues running a docker:dind container without these tools in it. As far as I can tell, the only container GitHub Actions interacts with directly is the runner container, while the dind container simply provides a working docker socket.

We're already pushing a custom runner image to ECR (we're in AWS) so it's super easy to also push a custom docker:dind to ECR using a multi-stage docker build which copies the tools from the runner image. I'm just really curious as to what would use or access these tools in the docker container. We can't find anything broken after simply not copying them over.

Link- · 2023-09-29T17:48:12Z

Link-
Sep 29, 2023
Maintainer

I'm converting this to a discussion

0 replies

omri-shilton · 2023-10-03T13:29:19Z

omri-shilton
Oct 3, 2023

Any news regarding this issue? We are looking to transition to the new runners and we currently have no solution for caching docker images, specifically for the service containers.

3 replies

JohnYoungers Oct 3, 2023
Author

the manual setup outlined above (excluding containerMode: { type: "dind" } from the helm chart values) seems to be working fine

omri-shilton Oct 5, 2023

Excluding the container mode means I need to manually configure the runner and dind containers. Do you have a sample configuration that does /var/lib/docker caching?
Also, today we are using the legacy actions runner with docker image caching and it's not working great at all.
Is the solution for image caching any different with runner-scale-set?

mrclrchtr Nov 22, 2023

@JohnYoungers, it would be great if you could share your values file!

omri-shilton · 2023-10-17T09:36:51Z

omri-shilton
Oct 17, 2023

After some testing I found out that the ephemeral volumes are not what I needed. I need persistent docker image caching without creating a snapshot: the volumes should not be removed once the pods are done. How can I achieve that using runner-scale-set?
@JohnYoungers

4 replies

fw42 Jun 27, 2024

@omri-shilton: Did you ever find a solution? I'm trying to do the same thing as you.

omri-shilton Jun 27, 2024

We moved to kubernetes mode and rely on Kubernetes caching the images on the nodes.

mrclrchtr Jun 27, 2024

have you found a solution to be able to build docker images? That's currently the reason why I can't switch to kubernetes mode...

bkrein-vertex Jun 27, 2024

Check out Kaniko or Buildah

irasnyd · 2023-11-16T23:56:28Z

irasnyd
Nov 16, 2023

This use-case is important for tools that utilize a semi-persistent local cache. For the docker use-case, a cache of commonly used base images builds up on runners. This significantly helps improve build times; there is no need to repeatedly pull multi-GiB images from the centralized repository.

0 replies

mrclrchtr · 2023-11-25T09:00:48Z

mrclrchtr
Nov 25, 2023

I have now invested several hours and tried different ways... unfortunately it has not worked so far. Could you please provide an example of how to do this?

0 replies

bshelton229 · 2023-11-27T23:46:40Z

bshelton229
Nov 27, 2023

@mrclrchtr I can give a high level overview of what I spiked out and got working the other day, largely based on what I learned from @JohnYoungers in the comments above. It's not integrated into this project at all, and is quite a bit of custom work. When we go to make this more production ready for ourselves, we'll be writing a bit of glue code and schedule based scripts to seed the snapshots on a cadence that gives us the best performance improvement.

Prerequisites

The prerequisites you'll need are a storage driver in the k8s cluster that can provision persistent storage, as well as the kubernetes external snapshotter set up to be able to create VolumeSnapshot objects. I'm using EKS, so we were able to install the Amazon EBS CSI Driver and CSI Snapshot Controller EKS add-ons in order to have everything we needed in the cluster. We use

Summary

Create StorageClass and VolumeSnapshotClass objects to support snapshotting your seed data volume attached to /var/lib/docker.
Script (or initially do this by hand) the creation of a seeded VolumeSnapshot containing a snapshot of a /var/lib/docker ready volume:
1. Create a PersistentVolumeClaim volume
2. Boot up a standard docker:dind container with the PersistentVolumeClaim from the previous step mounted at /var/lib/docker
3. Inside the container, authenticate docker, and docker pull as many images as appropriate to seed your particular cache. In our case we authenticate to ghcr.io, dockerhub, and ECR, and pull several images with bigger intermediate layers in order to get a good cache. Many of our larger images have bigger layers that don't change as frequently, so even without a completely fresh image we still see significant savings in pull times from the workers.
4. Once you have the volume seeded, shut down the docker:dind container, leaving the PersistentVolumeClaim
5. Snapshot your PersistentVolumeClaim volume by creating a VolumeSnapshot object in kubernetes. You'll want to leave this VolumeSnapshot object around, but you can now delete the PersistentVolumeClaim object (which should in most cases delete the underlying volume, if set to the default). We've made sure our VolumeSnapshotClass uses AWS fast snapshotting so we get a nice quick restore.
You're now free to modify the default runner template to use ephemeral volumes based on the snapshot you've taken, mounted in /var/lib/docker. This means that every time a container comes up, the /var/lib/docker mounted volume will be unique and ephemeral, but it will start out with all the data from your snapshot procedure.
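As a rough illustration of steps 1 and 5 above, the two objects could be generated along these lines, in the same spirit as the TypeScript values generator earlier in the thread. This is only a sketch: the claim name, snapshot name, and size are made-up placeholders, while "ebs-sc" and "csi-aws-vsc" are the class names used in the assets below. The functions seedPvc and seedSnapshot are hypothetical helpers, not part of any chart or library.

```typescript
// Sketch only: builds the seed PVC (step 1) and the VolumeSnapshot (step 5)
// as plain objects. All names below are placeholders (assumptions).
interface SeedNames {
  pvcName: string;            // hypothetical claim name
  snapshotName: string;       // hypothetical snapshot name
  storageClassName: string;   // "ebs-sc" in the helm values example
  snapshotClassName: string;  // "csi-aws-vsc" in the VolumeSnapshotClass example
  storage: string;            // requested size, e.g. "20Gi"
}

// Step 1: the claim that the seeding docker:dind pod mounts at /var/lib/docker.
function seedPvc(n: SeedNames) {
  return {
    apiVersion: "v1",
    kind: "PersistentVolumeClaim",
    metadata: { name: n.pvcName },
    spec: {
      accessModes: ["ReadWriteOnce"],
      storageClassName: n.storageClassName,
      resources: { requests: { storage: n.storage } },
    },
  };
}

// Step 5: a snapshot taken from that claim once seeding is finished.
function seedSnapshot(n: SeedNames) {
  return {
    apiVersion: "snapshot.storage.k8s.io/v1",
    kind: "VolumeSnapshot",
    metadata: { name: n.snapshotName },
    spec: {
      volumeSnapshotClassName: n.snapshotClassName,
      source: { persistentVolumeClaimName: n.pvcName },
    },
  };
}

const names: SeedNames = {
  pvcName: "docker-seed-pvc",
  snapshotName: "docker-seed-snapshot",
  storageClassName: "ebs-sc",
  snapshotClassName: "csi-aws-vsc",
  storage: "20Gi",
};

// kubectl accepts JSON manifests, so each object can be written to a file and applied.
console.log(JSON.stringify(seedPvc(names), null, 2));
console.log(JSON.stringify(seedSnapshot(names), null, 2));
```

The snapshot name chosen here is what the ephemeral volume's dataSource.name would then point at.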

Assets

Example `VolumeSnapshotClass` for AWS EBS CSI supporting fast snapshots

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  fastSnapshotRestoreAvailabilityZones: "us-east-1a, us-east-1b, us-east-1c"

Example container spec for creating the seed volume

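Boot up a dind container against any PVC, for later snapshotting. Either by script or by hand, exec into the container and docker login ... and docker pull ... your way to a well-seeded volume. Note that dockerd listens on unix:///run/docker/docker.sock here, so the docker CLI inside the container needs DOCKER_HOST pointed at that socket.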

apiVersion: v1
kind: Pod
metadata:
  name: docker-seed
spec:
  containers:
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///run/docker/docker.sock
        - --group=$(DOCKER_GROUP_GID)
      env:
        - name: DOCKER_GROUP_GID
          value: "123"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
        - name: persistent-storage
          mountPath: /var/lib/docker
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: MY-PVC-TO-LATER-BE-SNAPSHOT
    - name: work
      emptyDir: {}
    - name: dind-sock
      emptyDir: {}

Example helm values override using the snapshot volume as the basis for /var/lib/docker

template:
  spec:
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: DOCKER_HOST
          value: unix:///run/docker/docker.sock
        - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
          value: "60"
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
          readOnly: true
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///run/docker/docker.sock
        - --group=$(DOCKER_GROUP_GID)
      env:
        - name: DOCKER_GROUP_GID
          value: "123"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
        - name: snapshot-docker
          mountPath: /var/lib/docker
    volumes:
    - name: snapshot-docker
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            storageClassName: ebs-sc
            dataSource:
              name: SNAPSHOT-OBJECT-CREATED-FROM-PVC-DOCKER-VOLUME
              kind: VolumeSnapshot
              apiGroup: snapshot.storage.k8s.io
            resources:
              requests:
                storage: 4Gi
    - name: work
      emptyDir: {}
    - name: dind-sock
      emptyDir: {}

Alternatives we may explore

This seems like it's going to work pretty well for us. A lot of our images are either fairly static throughout most days, and will see a fully seeded cache most of the time, or the intermediary layers are the biggest ones, and the intermediary layers change less often than the smaller layers that change more frequently. The scheduled creation of seed volumes is probably going to be enough for us.

We were brainstorming implementing something more along the lines of what this project originally had through a custom controller operating as a mutating webhook. We would maintain our own pool of PVCs and use a mutating webhook to intercept newly created runner pods, pick a free PVC volume from the pool, and just attach it. This would have the volume naturally build up caches as before. The more I thought about it, the solution of maintaining fresh snapshots of seed data is probably going to be simpler and will actually work better.
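To make the pool idea a little more concrete, here is a minimal sketch of the selection and patch logic such a webhook might contain. Nothing here was built in the thread: pickFreePvc, attachPvcPatch, and the claim names are all hypothetical, and a real mutating webhook would additionally need to track which claims are in use, base64-encode the JSON Patch into its AdmissionReview response, and ensure the dind container mounts the volume.

```typescript
// Sketch only: choose an unused claim from a fixed pool and describe the
// JSON Patch that would attach it to an incoming runner pod.
interface JsonPatchOp {
  op: "add";
  path: string;
  value: unknown;
}

// Returns the first pool member not currently attached to a pod, if any.
function pickFreePvc(pool: string[], inUse: Set<string>): string | undefined {
  return pool.find((name) => !inUse.has(name));
}

// Appends a PVC-backed volume to the pod spec.
// Assumes the pod already has a spec.volumes array ("-" appends to it).
function attachPvcPatch(claimName: string, volumeName: string): JsonPatchOp[] {
  return [
    {
      op: "add",
      path: "/spec/volumes/-",
      value: { name: volumeName, persistentVolumeClaim: { claimName } },
    },
  ];
}

// Hypothetical pool of pre-created claims.
const pool = ["docker-cache-0", "docker-cache-1", "docker-cache-2"];
const inUse = new Set(["docker-cache-0"]);

const free = pickFreePvc(pool, inUse);
if (free !== undefined) {
  console.log(JSON.stringify(attachPvcPatch(free, "var-lib-docker"), null, 2));
} else {
  console.log("no free PVC; fall back to an ephemeral volume");
}
```

The fallback branch matters: with a fixed pool, a burst of runners can exhaust the free claims, which is part of why the snapshot approach is simpler to operate.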

2 replies

JohnYoungers Nov 28, 2023
Author

The strategy we're using to create the snapshot is outlined here: #2253 (comment)

You can have a workflow populate the images using normal docker pull commands, then have it take a snapshot of its own volume for re-use
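One way a workflow could refer to its own volume, sketched under assumptions: Kubernetes names the PVC behind a generic ephemeral volume "<pod name>-<volume name>", so a snapshot manifest can be derived from the pod name. This is not necessarily what the linked comment does. The pod name and snapshot name below are invented, the volume name "var-lib-docker" comes from the TypeScript example earlier in the thread, and snapshotting while dockerd is still writing may capture an inconsistent state.

```typescript
// Sketch only: derive the PVC name of a generic ephemeral volume and build a
// VolumeSnapshot manifest that points at it.

// Generic ephemeral volumes get a PVC named "<pod name>-<volume name>".
function ephemeralPvcName(podName: string, volumeName: string): string {
  return `${podName}-${volumeName}`;
}

function selfSnapshotManifest(
  podName: string,
  volumeName: string,
  snapshotName: string,
  snapshotClassName: string,
) {
  return {
    apiVersion: "snapshot.storage.k8s.io/v1",
    kind: "VolumeSnapshot",
    metadata: { name: snapshotName },
    spec: {
      volumeSnapshotClassName: snapshotClassName,
      source: {
        persistentVolumeClaimName: ephemeralPvcName(podName, volumeName),
      },
    },
  };
}

// Hypothetical pod and snapshot names (assumptions for illustration).
const manifest = selfSnapshotManifest(
  "runner-abc12",
  "var-lib-docker",
  "docker-snapshot-next",
  "csi-aws-vsc",
);
console.log(JSON.stringify(manifest, null, 2));
```

The runner pod would also need RBAC permission to create VolumeSnapshot objects for this to work from inside a job.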

dongho-jung Dec 14, 2023

If anyone encounters this error in a dind container:

"Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

then try adding under name: DOCKER_GROUP_GID:

    - name: DOCKER_HOST
      value: unix:///run/docker/docker.sock

tuxillo · 2024-11-09T19:02:54Z

tuxillo
Nov 9, 2024

Is this still the best solution? Or are there any new developments? I think the pre-seed approach is just a workaround unless you have identified all images that you need beforehand and keep updating the pre-seed volume (which in a busy build environment might even be problematic).

0 replies
