Question about communication between Nvidia and AMD GPUs #1039

Open
YangZhou1997 opened this issue Oct 17, 2024 · 3 comments

Comments

@YangZhou1997

Hi UCC maintainers,

I am wondering whether UCC can support collective communication among Nvidia and AMD GPUs in one ML workload. Say, a collective ring where half of the GPUs are Nvidia and half are AMD.

Best,
Yang

@Sergei-Lebedev
Contributor

Hi @YangZhou1997

UCC can theoretically support collective communication across Nvidia and AMD GPUs in a single workload, but with some key restrictions:

  1. It will only work with TL UCP and TL SHARP. Other transports aren’t compatible due to non-homogeneous memory, which can cause deadlocks.
  2. For reduction collectives, the local source and destination buffers on each rank must have the same memory type.
  3. Deadlocks from memory mismatches could be avoided by running a small allreduce before each collective.

While possible, this setup hasn’t been tested and would require careful handling to ensure stability.
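One possible reading of restrictions 2 and 3 is sketched below in Python. This is a toy model, not UCC code: `MemType`, `check_reduction_buffers`, `allreduce_bor`, and `pick_transport` are hypothetical names, the allreduce is simulated in-process, and the idea that the small allreduce lets all ranks agree on the set of memory types in use is an assumption.

```python
from enum import IntFlag


class MemType(IntFlag):
    """Hypothetical memory-type flags, one bit per type."""
    HOST = 1
    CUDA = 2
    ROCM = 4


def check_reduction_buffers(src: MemType, dst: MemType) -> None:
    """Restriction 2: on one rank, src and dst must share a memory type."""
    if src != dst:
        raise ValueError(f"src is {src.name} but dst is {dst.name}")


def allreduce_bor(values):
    """Simulated small allreduce (bitwise OR): every rank gets the same mask."""
    combined = MemType(0)
    for value in values:
        combined |= value
    return [combined for _ in values]


def is_heterogeneous(mask: MemType) -> bool:
    """True when more than one memory-type bit is set across the team."""
    return bin(int(mask)).count("1") > 1


def pick_transport(local_mem: MemType, agreed_mask: MemType) -> str:
    """Assumed policy: mixed memory forces the common transport on all ranks."""
    if is_heterogeneous(agreed_mask):
        return "ucp"
    # Homogeneous team: a vendor library is safe because every rank picks it.
    return {MemType.HOST: "ucp", MemType.CUDA: "nccl", MemType.ROCM: "rccl"}[local_mem]


if __name__ == "__main__":
    ring = [MemType.CUDA, MemType.CUDA, MemType.ROCM, MemType.ROCM]
    masks = allreduce_bor(ring)
    print([pick_transport(mem, mask) for mem, mask in zip(ring, masks)])
```

Because every rank sees the same combined mask, every rank reaches the same decision, which is the property a pre-collective agreement step would need to provide.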

@YangZhou1997
Author

YangZhou1997 commented Oct 17, 2024 via email

@Sergei-Lebedev
Contributor

Sure. The deadlock in this case is similar to the one we fixed for weak asymmetric memory in PR #1000.
Basically, UCC tries to choose the best transport (UCP, SHM, CUDA, NCCL, RCCL, etc.) based on several factors, including the memory type of the collective. So what might happen is that one rank selects NCCL for an allreduce because it sees CUDA memory, while another rank chooses RCCL because it sees ROCm memory. This transport selection mismatch results in a deadlock.
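The mismatch can be illustrated with a small Python model. It is only a sketch: the `SCORES` table, `select_transport`, and `would_deadlock` are hypothetical and do not reflect UCC's real scoring or API. Each rank picks the highest-scoring enabled transport for its own memory type, and a deadlock is assumed whenever ranks in the same collective disagree.

```python
# Hypothetical preference scores per memory type (higher wins).
# These numbers are illustrative only, not UCC's actual scores.
SCORES = {
    "cuda": {"nccl": 3, "ucp": 1},
    "rocm": {"rccl": 3, "ucp": 1},
}


def select_transport(mem_type, enabled):
    """Pick the best enabled transport for one rank, using local info only."""
    candidates = {tl: s for tl, s in SCORES[mem_type].items() if tl in enabled}
    if not candidates:
        raise RuntimeError(f"no enabled transport supports {mem_type} memory")
    return max(candidates, key=candidates.get)


def selections(mem_types, enabled):
    """Transport chosen by each rank, in rank order."""
    return [select_transport(mem, enabled) for mem in mem_types]


def would_deadlock(mem_types, enabled):
    """Model: ranks that pick different transports wait on each other forever."""
    return len(set(selections(mem_types, enabled))) > 1


if __name__ == "__main__":
    ring = ["cuda", "cuda", "rocm", "rocm"]
    print(selections(ring, {"ucp", "nccl", "rccl"}), would_deadlock(ring, {"ucp", "nccl", "rccl"}))
    print(selections(ring, {"ucp"}), would_deadlock(ring, {"ucp"}))
```

In this model, enabling only a transport that handles both memory types removes the disagreement, which matches the restriction to TL UCP (and TL SHARP) mentioned earlier in the thread.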
