Question about communication between Nvidia and AMD GPUs #1039
Comments
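Hi @YangZhou1997
UCC can theoretically support collective communication across Nvidia and AMD GPUs in a single workload, but with key restrictions:
1. It will only work with TL UCP and TL SHARP. Other transports aren't compatible due to non-homogeneous memory, which can cause deadlocks.
2. For reduction collectives, the local source and destination buffers on each rank must have the same memory type.
3. Deadlocks from memory mismatches could be avoided by running a small allreduce before each collective.
While possible, this setup hasn't been tested and would require careful handling to ensure stability.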
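A rough sketch of how the restrictions on mixed Nvidia/AMD runs might be reasoned about. This is plain Python that only simulates the idea and does not call UCC. The names `MemType`, `check_reduction_buffers`, `small_allreduce_or`, `is_homogeneous` and `allowed_transports` are all hypothetical. The interpretation is also an assumption: that the small allreduce lets every rank learn which memory types are in use, so that all ranks make the same transport choice.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class MemType(Enum):
    """Hypothetical memory-type tags; each value is used as a bit position."""
    HOST = 0
    CUDA = 1  # Nvidia device memory
    ROCM = 2  # AMD device memory


# The two transports named in the thread as usable with mixed GPU memory.
MIXED_MEMORY_TLS: Tuple[str, ...] = ("ucp", "sharp")


def check_reduction_buffers(src: MemType, dst: MemType) -> None:
    """Reject a reduction whose local source and destination differ in memory type."""
    if src is not dst:
        raise ValueError(f"src is {src.name} but dst is {dst.name}")


def small_allreduce_or(local_types: Sequence[MemType]) -> int:
    """Stand-in for a tiny bitwise-OR allreduce with one contribution per rank.

    In a real job each rank would contribute only its own bit; here the
    per-rank values are passed in as a list to keep the sketch runnable.
    """
    mask = 0
    for mem_type in local_types:
        mask |= 1 << mem_type.value
    return mask


def is_homogeneous(mask: int) -> bool:
    """True when exactly one memory-type bit is set across all ranks."""
    return mask != 0 and (mask & (mask - 1)) == 0


def allowed_transports(mask: int) -> Optional[Tuple[str, ...]]:
    """Return None (no restriction) for homogeneous memory, else the safe subset."""
    return None if is_homogeneous(mask) else MIXED_MEMORY_TLS
```

Because every rank sees the same mask after the allreduce, every rank reaches the same decision, which is the property that would avoid ranks waiting on each other over different transports. For example, `allowed_transports(small_allreduce_or([MemType.CUDA, MemType.ROCM]))` yields the UCP/SHARP subset.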
Thank you Sergey for your quick response! That's super helpful. Could you tell me
more about the deadlock, or are there any materials I can read through?
Best,
Yang
On Thu, Oct 17, 2024 at 4:25 PM Sergey Lebedev ***@***.***> wrote:
Hi @YangZhou1997 <https://github.com/YangZhou1997>
UCC can theoretically support collective communication across Nvidia and
AMD GPUs in a single workload, but with key restrictions
1. It will only work with TL UCP and TL SHARP. Other transports aren’t
compatible due to non-homogeneous memory, which can cause deadlocks.
2. For reduction collectives, the local source and destination buffers
on each rank must have the same memory type.
3. Deadlocks from memory mismatches could be avoided by running a
small allreduce before each collective.
While possible, this setup hasn’t been tested and would require careful
handling to ensure stability.
Sure. The deadlock in this case is similar to what we fixed for weak asymmetric memory in PR #1000.
Hi UCC maintainers,
I was wondering whether UCC could support collective communication among Nvidia and AMD GPUs in one ML workload. Say the collective ring has half Nvidia and half AMD GPUs.
Best,
Yang