Onboard AWS Scale tests to Boskos #33183
Comments
We have to sit on this one until we can figure out automation to bump any given AWS account to the point where it can run 5k-node tests: limits need to be increased for various services like EC2, which requires explicit bumps worked out with AWS support folks. |
I don't think so? You can add the existing account to a new boskos pool? |
We haven't done this for the GCP 5k project; we just put the one project we have into a dedicated pool containing a single project, so it can still make use of boskos's lifecycling features and be rented to multiple jobs. I recommend also putting any such jobs into a job queue that matches the boskos pool; for too long we have relied only on manual scheduling. (job_queue_name and job_queue_capacities are not very well documented at the moment, but see test-infra/config/prow/config.yaml Line 9 in 52cac28.) |
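As a rough illustration of the job queue mechanism referenced above (the queue name, capacity value, and exact placement inside config.yaml are assumptions for this sketch, not the actual test-infra configuration):

```yaml
# Central Prow config (config/prow/config.yaml), sketch only:
# job_queue_capacities caps how many jobs carrying a given
# job_queue_name may run concurrently.
plank:
  job_queue_capacities:
    aws-scale: 1   # hypothetical queue name; at most one renter of the scale pool at a time
```

Each job that rents from the scale pool would then set `job_queue_name: aws-scale` in its own definition (see the job sketch further down).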
/sig scalability k8s-infra testing |
@dims I think it should be OK to do this, since we are moving the existing scale account under a dedicated boskos resource type. |
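For illustration, a dedicated pool in a boskos-resources.yaml-style config might look roughly like the following (the type name and account entry are placeholders, not the real values):

```yaml
resources:
  # Hypothetical dedicated type holding only the quota-bumped AWS scale account.
  - type: scalability-aws-account
    state: dirty      # static entries are commonly listed as dirty so the janitor sweeps them before first use
    names:
      - "aws-scale-account-placeholder"   # placeholder for the existing 5k-node account
```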
@ameukam cool! That sounds better :) having a separate type and then making sure we use that. All I was worried about was that we can't just pick a random account and run a scale test on it. |
Yeah, we definitely don't want to make the entire main pool scale-test ready. We actually have a few pools on GCP like this, e.g. the GPU projects are also special and we're not setting up that quota for every project. We even have a secondary scale pool with a few projects for smaller scalability jobs. I think we can mimic this; we just need to add a pool definition with the AWS account, make sure the janitor is enabled for that pool, and switch the job to reference the pool. If we roll that out between scheduled runs it should just work without disruptions. |
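A hedged sketch of the job side of that plan, tying together the pool and queue names from the sketches above (image, interval, and all names are illustrative; the exact flag/env wiring the runner uses to acquire the account from Boskos is deployer-specific and omitted here):

```yaml
periodics:
  - name: ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2
    interval: 24h
    decorate: true
    job_queue_name: aws-scale      # matches the hypothetical job_queue_capacities entry above
    spec:
      containers:
        - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master   # illustrative image
          command:
            - runner.sh
          # The runner would acquire an account of type scalability-aws-account
          # from Boskos before cluster bring-up and release it dirty afterwards,
          # so the janitor cleans up anything kubetest2 teardown leaves behind.
```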
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/lifecycle frozen |
What would you like to be added:
Why is this needed:
When kubetest2 teardown doesn't fully succeed, it ends up leaking resources, and those resources are not cleaned up until the next run (24 hours later).
Leaving leaked resources around until the next run is not cost-effective.
Example run - https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/1818587581143584768