
When WaitForWorkersReady is enabled in the MPI operator, the MPI operator and gang scheduler are in a deadlock #608

Open
yzhao-2023 opened this issue Dec 10, 2023 · 4 comments


@yzhao-2023

If WaitForWorkersReady is enabled, the MPI operator and a gang scheduler get stuck in a deadlock:

  1. With WaitForWorkersReady enabled, the MPI operator creates a pod group containing only the worker pod specs (N being the worker pod count), but with the desired pod count set to N+1 (N workers + 1 launcher).
  2. The gang scheduler will not schedule this pod group, because there are not enough pods in the pod group.
  3. The MPI operator will not create the launcher pod spec, because the worker pods are not ready yet.

A workaround, albeit one that still violates gang scheduling's semantics, is to set runPolicy.minAvailable to the worker count, allowing the MPI operator to create a pod group with only the worker pods, and allowing the gang scheduler to proceed with scheduling the workers.
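A minimal sketch of that workaround as an MPIJob manifest. The job name and replica counts are illustrative, and the exact field path for minAvailable (here assumed to be under runPolicy.schedulingPolicy, per the v2beta1 API) should be verified against the MPIJob API reference:

```yaml
# Sketch of the workaround: pin minAvailable to the worker count so the
# PodGroup can be scheduled before the launcher pod exists.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: train-job            # hypothetical name
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 4        # = Worker replicas, excluding the launcher
  mpiReplicaSpecs:
    Worker:
      replicas: 4
```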

The problem is that the strict semantics of gang scheduling are broken, and the launcher might end up not being schedulable.

In reality, this should not be a problem: the launcher job does not consume GPUs, so resources for it should be amply available in our case.

But the doc should be updated to reflect this pitfall.

A better fix might be to change the default behavior to create a pod group with only N members (N being the worker pod count), at the risk of the launcher not being started.

A possible true fix: extend Kubernetes so that resources can be allocated without the pods immediately starting to run, so that the launcher can be executed after the workers have started.

[0] https://www.kubeflow.org/docs/components/training/mpi/#scheduling-policy
[1] https://www.alibabacloud.com/blog/the-burgeoning-kubernetes-scheduling-system-part-2-coscheduling-and-gang-scheduling-that-support-batch-jobs_597319

@alculquicondor
Collaborator

Does volcano offer an API to declare the size of the group beforehand?

Otherwise, there is nothing we can do in this repo.

You might also want to consider https://kueue.sig.k8s.io which doesn't face this issue because it's not pod-based.

@tenzen-y
Member

> If WaitForWorkersReady is enabled, MPI operator and a gang scheduler would be stuck in a deadlock

@yzhao-2023 That's right, WaitForWorkersReady potentially has the deadlock.

> But the doc should be updated to reflect this pitfall.

Anyway, we should add documentation about WaitForWorkersReady since there isn't any documentation about the feature.

> A better fix might be to change the default behavior to only create a pod group with N (N being worker pod count). Risking launcher not be started.

I don't want to add such defaulting since users might be confused by the modified input value. I believe that validation would be better.

> Does volcano offer an API to declare the size of the group beforehand?

@alculquicondor We can pass an arbitrary number to volcano via the PodGroup (runPolicy.minAvailable) here:

Spec: volcanov1beta1.PodGroupSpec{
    MinMember: *minMember,

@alculquicondor
Collaborator

What I mean is whether we can tell volcano that X pods of a shape are coming, so that it reserves the space for them.
Otherwise there is no way for mpi-operator to prevent this "race", as volcano is expecting the Pods to be created.

@tenzen-y
Member

> What I mean is whether we can tell volcano that X pods of a shape are coming, so that it reserves the space for them. Otherwise there is no way for mpi-operator to prevent this "race", as volcano is expecting the Pods to be created.

Ah, I see. Yes, that's right. We don't have any way to tell a shape to volcano/scheduler-plugins.
So I believe that validation would be worth it: users would not be able to create an MPIJob with waitForWorkersReady and minAvailable set to N, where N is the sum of all workers and the launcher.
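A hypothetical sketch of what such a validation could look like (this is not mpi-operator's actual webhook code; names and the error message are assumptions):

```go
package main

import "fmt"

// validateSchedulingPolicy rejects the combination of waitForWorkersReady
// with a minAvailable that also counts the launcher: the launcher is only
// created after the workers are ready, so the gang could never be satisfied.
func validateSchedulingPolicy(waitForWorkersReady bool, minAvailable, workerReplicas int32) error {
	if waitForWorkersReady && minAvailable > workerReplicas {
		return fmt.Errorf(
			"minAvailable (%d) must not exceed the worker count (%d) when waitForWorkersReady is set",
			minAvailable, workerReplicas)
	}
	return nil
}

func main() {
	fmt.Println(validateSchedulingPolicy(true, 5, 4)) // rejected: minAvailable counts the launcher
	fmt.Println(validateSchedulingPolicy(true, 4, 4)) // accepted: gang is just the workers
}
```

Rejecting the job at admission time surfaces the misconfiguration to the user instead of silently mutating the spec, which matches the preference for validation over defaulting expressed above.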
