Makefile incorrectly finds that nccl is installed for Linux systems with libvncclient #774

Open
leiDnedyA opened this issue Oct 2, 2024 · 0 comments · May be fixed by #775

OS: Ubuntu 22.04.5 LTS

Hi all, I ran the Makefile for the first time and it failed with the following output:

---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✗ MPI not found
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DMULTI_GPU train_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml  -lnccl -o train_gpt2fp32cu
train_gpt2_fp32.cu(62): warning #550-D: variable "cublas_compute_type" was set but never used

/usr/bin/ld: cannot find -lnccl: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [Makefile:277: train_gpt2fp32cu] Error 255

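It turns out the Makefile decides whether NCCL is installed by grepping the output of dpkg -l for "nccl". This gives a false positive whenever dpkg lists any package whose entry contains the substring nccl, such as "libvncclient1" in my case. The Makefile then adds -lnccl to the link flags even though the library is not installed, which is why the linker fails above. Here's the code causing the issue (the dpkg line is the culprit):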

# Check if NCCL is available, include if so, for multi-GPU training
ifeq ($(NO_MULTI_GPU), 1)
  $(info → Multi-GPU (NCCL) is manually disabled)
else
  ifneq ($(OS), Windows_NT)
    # Detect if running on macOS or Linux
    ifeq ($(SHELL_UNAME), Darwin)
      $(info ✗ Multi-GPU on CUDA on Darwin is not supported, skipping NCCL support)
    # Culprit: grep matches any dpkg entry that contains "nccl", not just NCCL itself
    else ifeq ($(shell dpkg -l | grep -q nccl && echo "exists"), exists)
      $(info ✓ NCCL found, OK to train with multiple GPUs)
      NVCC_FLAGS += -DMULTI_GPU
      NVCC_LDLIBS += -lnccl
    else
      $(info ✗ NCCL is not found, disabling multi-GPU support)
      $(info ---> On Linux you can try install NCCL with `sudo apt install libnccl2 libnccl-dev`)
    endif
  endif
endif
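
To make the false positive concrete, here is a minimal sketch that runs the same grep against simulated dpkg -l lines and compares it with a stricter pattern. The sample lines are hypothetical (versions and descriptions are placeholders, not real dpkg output), and the anchored pattern is only one possible tightening; it is not necessarily what the linked pull request does.

```shell
#!/bin/sh
# Hypothetical `dpkg -l`-style lines; versions and descriptions are placeholders.
vnc_line='ii  libvncclient1:amd64  1.0  amd64  VNC client library'
nccl_line='ii  libnccl2  1.0  amd64  NCCL runtime library'

# Same test as the Makefile: any line containing "nccl" counts as a hit.
loose_vnc=$(printf '%s\n' "$vnc_line" | grep -q nccl && echo "exists" || echo "missing")

# Stricter sketch: require an installed ("ii") package whose name starts with "libnccl".
strict_vnc=$(printf '%s\n' "$vnc_line" | grep -Eq '^ii +libnccl' && echo "exists" || echo "missing")
strict_nccl=$(printf '%s\n' "$nccl_line" | grep -Eq '^ii +libnccl' && echo "exists" || echo "missing")

echo "loose check on libvncclient1:  $loose_vnc"
echo "strict check on libvncclient1: $strict_vnc"
echo "strict check on libnccl2:      $strict_nccl"
```

In the Makefile, the stricter pattern could presumably replace grep -q nccl inside the same $(shell ...) call. Note that an installed libnccl2 alone may still not be enough for -lnccl to link, since the linkable library typically comes from libnccl-dev, so checking for that package specifically (for example with dpkg -s libnccl-dev) might be the more robust test.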

If I have some free time, I think this would be a fun first issue and I'd be glad to contribute a fix, but if anyone knows the fix off the top of their head, that would be great as well!

leiDnedyA linked a pull request on Oct 2, 2024 that will close this issue