🐛[bug] Failure to delete experiments in k8s if experiment podspec was invalid #5762

Open
marcmac opened this issue Jan 18, 2023 · 2 comments

marcmac commented Jan 18, 2023

Describe the bug

After launching an experiment with an invalid podspec, e.g.:

    [2023-01-13 07:07:21] [52c09f4b] Pod "exp-180-trial-219-0-180.c5109f82-4158-4f2b-be1a-81d9c0dfae7e.1-current-pigeon" is invalid: spec.containers[1].volumeMounts[22].name: Not found: "test" <error>
    [2023-01-13 07:07:21] || ERROR: Trial (Experiment 180) was terminated: allocation failed: task failed without an associated exit code: pod actor exited while pod was running

Deletion of that experiment then fails, because the checkpoint-GC task launches a pod with the same invalid podspec:

    2023-01-18T19:04:29.849447686Z [info]: resources are requested by /delete-checkpoint-gc-2c9920e2-363f-4aaa-a728-14b2df07897c/2f662cac-2d5f-477f-a8f2-7b8e495adf45.1 (Allocation ID: 2f662cac-2d5f-477f-a8f2-7b8e495adf45.1) actor-local-addr="kubernetes" actor-system="master" go-type="kubernetesResourcePool"
    2023-01-18T19:04:29.893581143Z [error]: error creating pod gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat actor-local-addr="kubernetes-worker-3" actor-system="master" error="Pod \"gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat\" is invalid: spec.containers[1].volumeMounts[17].name: Not found: \"test\"" go-type="requestProcessingWorker" handler="/pods/pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a"
    2023-01-18T19:04:29.893627422Z [error]: pod actor notified that resource creation failed actor-local-addr="pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" error="Pod \"gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat\" is invalid: spec.containers[1].volumeMounts[17].name: Not found: \"test\"" go-type="pod" pod="gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
    2023-01-18T19:04:29.893656662Z [info]: requesting to delete kubernetes resources actor-local-addr="pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" go-type="pod" pod="gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
    2023-01-18T19:04:29.893674352Z [warning]: updating container state after pod actor exited unexpectedly actor-local-addr="pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" go-type="pod" pod="gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
    2023-01-18T19:04:29.900406597Z [error]: allocation encountered fatal error actor-local-addr="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" error="task failed without an associated exit code: pod actor exited while pod was running" go-type="Allocation" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
    2023-01-18T19:04:29.903045789Z [info]: allocation failed: task failed without an associated exit code: pod actor exited while pod was running actor-local-addr="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" go-type="Allocation" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
    2023-01-18T19:04:29.908565983Z [error]: wasn't able to delete checkpoints from checkpoint storage actor-local-addr="delete-checkpoint-gc-4c179240-6a91-410d-a7bb-71dbcfbedefc" actor-system="master" error="task failed without an associated exit code: pod actor exited while pod was running" go-type="checkpointGCTask" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
    2023-01-18T19:04:29.912193128Z [error]: deleting experiment 180 error="failed to gc checkpoints for experiment: checkpoint GC task failed because allocation failed: task failed without an associated exit code: pod actor exited while pod was running"

Reproduction Steps

  1. Launch an experiment with an invalid podspec (a minimal example follows this list)
  2. Try to delete the failed experiment
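
A minimal config that reproduces this, assuming Determined's environment.pod_spec option (the container name follows Determined's determined-container convention; the volume name and mount path here are illustrative): the volumeMount references a volume that is never defined, so the Kubernetes API server rejects the pod exactly as in the log above.

    environment:
      pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: determined-container
              volumeMounts:
                - name: test          # no volume named "test" is defined,
                  mountPath: /data    # so pod creation fails with
                                      # 'volumeMounts[...].name: Not found: "test"'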

Expected Behavior

Failed experiments should be deleted regardless of failure reason.

Screenshot

N/A

Environment

Running Determined 0.19.9 in a k8s cluster

Additional Context

No response

marcmac added the bug label on Jan 18, 2023
rb-determined-ai (Contributor) commented

Interesting situation. The checkpoint-GC task inherits the podspec from the experiment for various reasons, but if the experiment never managed to run, there can't be any checkpoints, so there shouldn't need to be a checkpoint-GC task in the first place.

Skipping GC in that case would also sidestep the other question lurking here: what do we do when there are checkpoints that can't be deleted for some reason?
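
A minimal sketch of that guard in Go, with hypothetical names rather than the master's actual internals: skip the checkpoint-GC task when the experiment never registered a checkpoint, so the inherited (and possibly invalid) podspec is never re-submitted to Kubernetes.

    package main

    import "fmt"

    // checkpoint stands in for the master's checkpoint record type.
    type checkpoint struct {
        UUID string
    }

    // shouldRunCheckpointGC returns true only when there is something to
    // garbage-collect. An experiment whose pods were never created (e.g.
    // because its podspec was invalid) has no checkpoints, so deletion can
    // proceed without launching a GC pod at all.
    func shouldRunCheckpointGC(ckpts []checkpoint) bool {
        return len(ckpts) > 0
    }

    func main() {
        // An experiment that failed at pod creation has no checkpoints:
        fmt.Println(shouldRunCheckpointGC(nil)) // false -> delete without GC
    }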

rb-determined-ai (Contributor) commented

I've made an internal ticket to address this issue.
