Question about communication between Nvidia and AMD GPUs #1039

YangZhou1997 · 2024-10-17T17:26:34Z

Hi UCC maintainers,

I was wondering whether UCC can support collective communication between Nvidia and AMD GPUs in a single ML workload, for example a collective ring made up of half Nvidia and half AMD GPUs.

Best,
Yang

Sergei-Lebedev · 2024-10-17T23:24:56Z

Hi @YangZhou1997

UCC can theoretically support collective communication across Nvidia and AMD GPUs in a single workload, but with key restrictions:

1. It will only work with TL UCP and TL SHARP. Other transports aren’t compatible due to non-homogeneous memory, which can cause deadlocks.
2. For reduction collectives, the local source and destination buffers on each rank must have the same memory type.
3. Deadlocks from memory mismatches could be avoided by running a small allreduce before each collective.

While possible, this setup hasn’t been tested and would require careful handling to ensure stability.
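
One way to read the small allreduce mentioned above is as an agreement step: each rank contributes its local memory type, and every rank then makes the same transport decision from the combined result. The sketch below only simulates that idea in plain Python. It does not call UCC, and the names `MemType`, `simulated_allreduce_bor`, and `agree_on_transport`, as well as the memory-type-to-transport mapping, are hypothetical and are not UCC's actual selection logic.

```python
from enum import IntFlag
from functools import reduce


class MemType(IntFlag):
    """Hypothetical memory-type flags, one bit per type."""
    HOST = 1
    CUDA = 2
    ROCM = 4


# Illustrative mapping only; UCC's real selection weighs more factors.
PREFERRED = {MemType.HOST: "ucp", MemType.CUDA: "nccl", MemType.ROCM: "rccl"}


def simulated_allreduce_bor(values):
    """Stand-in for an allreduce with bitwise OR: every rank receives the same result."""
    combined = reduce(lambda a, b: a | b, values)
    return [combined for _ in values]


def agree_on_transport(local_mem_types):
    """Return the transport each rank picks after the agreement step."""
    choices = []
    for seen in simulated_allreduce_bor(local_mem_types):
        if seen in PREFERRED:
            # Exactly one memory type across all ranks: the specialized transport is safe.
            choices.append(PREFERRED[seen])
        else:
            # Mixed memory types: every rank falls back to the same common transport.
            choices.append("ucp")
    return choices


mixed = agree_on_transport([MemType.CUDA, MemType.CUDA, MemType.ROCM, MemType.ROCM])
uniform = agree_on_transport([MemType.CUDA, MemType.CUDA])
```

Because every rank sees the same reduced value, all ranks reach the same choice, which is the property that avoids the mismatch. The extra allreduce adds latency to every collective, so it is a trade-off rather than a free fix.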

YangZhou1997 · 2024-10-17T23:28:00Z

Thank you Sergey for your quick response! That's super helpful. Could you tell me more about the deadlock, or are there any materials I can read through?

Best,
Yang


Sergei-Lebedev · 2024-10-17T23:42:28Z

Sure. The deadlock in this case is similar to what we fixed in this PR for weak asymmetric memory: #1000.
Basically, UCC tries to choose the best transport (UCP, SHM, CUDA, NCCL, RCCL, etc.) based on several factors, including the memory type of the collective. So one rank might select NCCL for an allreduce because it sees CUDA memory, while another chooses RCCL because it sees ROCm memory. This transport selection mismatch results in a deadlock.
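
The mismatch described above can be illustrated with a toy model in which each rank picks a transport from its own memory type alone. This is a simplified sketch, not UCC code: `PREFERRED_TL`, `select_transport`, and `find_mismatch` are hypothetical names, and the real selection considers more than the memory type.

```python
# Illustrative per-memory-type preference; an assumption, not UCC's scoring table.
PREFERRED_TL = {"cuda": "nccl", "rocm": "rccl", "host": "ucp"}


def select_transport(mem_type):
    """Each rank decides locally, seeing only its own buffer's memory type."""
    return PREFERRED_TL[mem_type]


def find_mismatch(rank_mem_types):
    """Return each rank's pick and whether the ranks disagree."""
    picks = {rank: select_transport(m) for rank, m in enumerate(rank_mem_types)}
    # More than one distinct transport means some ranks wait on peers
    # that never join the same collective, i.e. a deadlock.
    return picks, len(set(picks.values())) > 1


picks, mismatch = find_mismatch(["cuda", "cuda", "rocm", "rocm"])
for rank, tl in picks.items():
    print(f"rank {rank}: {tl}")
print("mismatch:", mismatch)
```

In a ring with half CUDA and half ROCm buffers, the first two ranks pick `nccl` and the last two pick `rccl`, so the two halves never meet in the same collective.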

RafalSiwek mentioned this issue Oct 31, 2024

Question about failing collective allreduce across on non-homogenous ring (NVIDIA and AMD GPU) #1043

Closed
