[BUG] Distributed training with NVTabular + PyTorch DDP raises: RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
#1876
Labels: bug (Something isn't working)
The whole error traceback:
When I use PyTorch DDP to drive 2 GPUs on a single node for distributed training, launching the worker processes with torchrun, the error above occurs in one of the workers. With only one GPU, the error does not occur. Does NVTabular (or Merlin) not support distributed training? My goal is to achieve multi-GPU training on a single node with PyTorch.
data loader:
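For reference, a minimal sketch of how I understand a DDP-sharded loader should be set up, assuming the merlin.loader.torch Loader API (its global_size / global_rank / device arguments) and the environment variables torchrun exports; the parquet path, columns, and batch size here are placeholders, not the ones from my actual run:

```python
import os

import torch
import torch.distributed as dist
from merlin.io import Dataset
from merlin.loader.torch import Loader

# torchrun sets these for every worker process it launches.
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])

# Pin this process to its own GPU *before* any CUDA work happens;
# two ranks implicitly sharing device 0 is a common way to end up
# with cudaErrorIllegalAddress under DDP.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Placeholder dataset; path and engine are illustrative only.
dataset = Dataset("/path/to/processed/*.parquet", engine="parquet")

# global_size / global_rank shard the dataset so each worker reads a
# disjoint slice; device keeps the loader's buffers on this rank's GPU.
loader = Loader(
    dataset,
    batch_size=65536,
    shuffle=True,
    global_size=world_size,
    global_rank=rank,
    device=local_rank,
)
```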