(maybe related to Issue #638)
The following command leads to a hang (probably a deadlock) on a DGX machine (swx-dgx02 from hpchead): mpirun -x UCC_TL_CUDA_TUNE=inf -x UCC_TL_SHARP_TUNE=0 --mca coll ^hcoll -np 8 /.autodirect/mtrsysgwork/snordmann/ucc/build/test/mpi/ucc_test_mpi -d float32 -M cuda -v --triggered 1 -o sum -t world -r single:0 -c allreduce,reduce
Using TL_UCP (i.e. removing the flag UCC_TL_CUDA_TUNE=inf) leads to the same hang. However, leaving only "reduce" or "allreduce" in the command line makes the bug disappear.
Here are the backtraces of the different processes:
#0 uct_rc_mlx5_iface_poll_tx (poll_flags=2, iface=0x2de0030) at rc/accel/rc_mlx5_iface.c:153
#1 uct_rc_mlx5_iface_progress (flags=2, arg=0x2de0030) at rc/accel/rc_mlx5_iface.c:190
#2 uct_rc_mlx5_iface_progress_cyclic (arg=0x2de0030) at rc/accel/rc_mlx5_iface.c:195
#3 0x00007f6e3c16995a in ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucs/datastruct/callbackq.h:211
#4 uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:2768
#5 ucp_worker_progress (worker=0x2a2d440) at core/ucp_worker.c:2807
#6 0x00007f6e405767ac in opal_progress () at runtime/opal_progress.c:231
#7 0x00007f6e4180a933 in ompi_request_default_test (rptr=0x7ffc316acc70, completed=0x7ffc316acc7c, status=0x0) at request/req_test.c:88
#8 0x00007f6e418303e5 in PMPI_Test (request=0x7ffc316acc70, completed=0x7ffc316acc7c, status=<optimized out>) at ptest.c:65
#9 0x000000000043872c in TestReduce::check (this=0xc0f94e0) at ../../../test/mpi/test_reduce.cc:84
#10 0x0000000000406943 in UccTestMpi::exec_tests (this=0x335a410, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#11 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x335a410, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#12 0x0000000000407c46 in UccTestMpi::run_all (this=0x335a410, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#13 0x0000000000417c3a in main (argc=16, argv=0x7ffc316ad5a8) at ../../../test/mpi/main.cc:576

#0 ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucs/datastruct/callbackq.h:211
#1 uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:2768
#2 ucp_worker_progress (worker=0x1de9420) at core/ucp_worker.c:2807
#3 0x00007f3a85ab97ac in opal_progress () at runtime/opal_progress.c:231
#4 0x00007f3a86d4d933 in ompi_request_default_test (rptr=0x7ffc66f4e290, completed=0x7ffc66f4e29c, status=0x0) at request/req_test.c:88
#5 0x00007f3a86d733e5 in PMPI_Test (request=0x7ffc66f4e290, completed=0x7ffc66f4e29c, status=<optimized out>) at ptest.c:65
#6 0x000000000043872c in TestReduce::check (this=0xbf3ccd0) at ../../../test/mpi/test_reduce.cc:84
#7 0x0000000000406943 in UccTestMpi::exec_tests (this=0x3300120, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#8 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x3300120, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#9 0x0000000000407c46 in UccTestMpi::run_all (this=0x3300120, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#10 0x0000000000417c3a in main (argc=16, argv=0x7ffc66f4ebc8) at ../../../test/mpi/main.cc:576

#0 0x00007fff4f7ec6c2 in clock_gettime ()
#1 0x00007f4a08525c6d in clock_gettime () from /lib64/libc.so.6
#2 0x00007f4a09ae80ef in ?? () from /lib64/libcuda.so.1
#3 0x00007f4a099e136b in ?? () from /lib64/libcuda.so.1
#4 0x00007f4a09d26977 in ?? () from /lib64/libcuda.so.1
#5 0x00007f4a09988ba0 in ?? () from /lib64/libcuda.so.1
#6 0x00007f4a09b071d8 in ?? () from /lib64/libcuda.so.1
#7 0x00007f4a0b772a81 in uct_cuda_ipc_map_memhandle (key=key@entry=0xc0a1870, mapped_addr=mapped_addr@entry=0x7fff4f5fdbf0) at cuda_ipc/cuda_ipc_cache.c:272
#8 0x00007f4a0b771216 in uct_cuda_ipc_post_cuda_async_copy (iov=0x7fff4f5fdc68, iov=0x7fff4f5fdc68, direction=1, comp=<optimized out>, rkey=<optimized out>, remote_addr=<optimized out>, tl_ep=<optimized out>) at cuda_ipc/cuda_ipc_ep.c:70
#9 uct_cuda_ipc_ep_get_zcopy (tl_ep=<optimized out>, iov=0x7fff4f5fdc68, iovcnt=<optimized out>, remote_addr=140647904313344, rkey=201988208, comp=0xb911350) at cuda_ipc/cuda_ipc_ep.c:146
#10 0x00007f49f674dee4 in uct_ep_get_zcopy (comp=0xb911350, rkey=201988208, remote_addr=<optimized out>, iovcnt=1, iov=0x7fff4f5fdc68, ep=0xaa987d0) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/uct/api/uct.h:2960
#11 ucp_rndv_progress_rma_zcopy_common (proto=2, uct_rkey=201988208, lane=4 '\\004', req=0xb9112c0) at rndv/rndv.c:582
#12 ucp_rndv_progress_rma_get_zcopy (self=0xb911398) at rndv/rndv.c:2271
#13 0x00007f49f675235a in ucp_request_try_send (req=0xb9112c0) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:334
#14 ucp_request_send (req=<optimized out>) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/core/ucp_request.inl:357
#15 ucp_rndv_req_send_rma_get (rkey_buf=<optimized out>, rndv_rts_hdr=0x7f49a8935d00, rreq=0xb910b40, rndv_req=0xb9112c0) at rndv/rndv.c:950
#16 ucp_rndv_receive (worker=worker@entry=0x848fcc0, rreq=rreq@entry=0xb910b40, rndv_rts_hdr=rndv_rts_hdr@entry=0x7f49a8935d00, rkey_buf=rkey_buf@entry=0x7f49a8935d29) at rndv/rndv.c:1730
#17 0x00007f49f6763991 in ucp_rndv_receive_start (rkey_length=<optimized out>, rkey_buf=0x7f49a8935d29, rndv_rts_hdr=0x7f49a8935d00, rreq=0xb910b40, worker=0x848fcc0) at /build-result/src/hpcx-gcc-redhat7/ucx-c5a185a7aeac67894abe96240f2cc52ff8df0187/src/ucp/rndv/rndv.inl:35
#18 ucp_tag_rndv_matched (worker=worker@entry=0x848fcc0, rreq=rreq@entry=0xb910b40, rts_hdr=rts_hdr@entry=0x7f49a8935d00, hdr_length=<optimized out>) at tag/tag_rndv.c:27
#19 0x00007f49f6765a7f in ucp_tag_recv_common (debug_name=<synthetic pointer>, param=0x7fff4f5fdfd0, rdesc=0x7f49a8935cd0, req=0xb910b40, tag_mask=18446744073709551615, tag=985162418749441, datatype=<optimized out>, count=<optimized out>, buffer=<optimized out>, worker=0x848fcc0) at tag/tag_recv.c:175
#20 ucp_tag_recv_nbx (worker=0x848fcc0, buffer=buffer@entry=0x7f4974800000, count=count@entry=1, tag=985162418749441, tag_mask=tag_mask@entry=18446744073709551615, param=0x7fff4f5fdfd0) at tag/tag_recv.c:249
#21 0x00007f49aa2a61c7 in ucc_tl_ucp_recv_common (cb=<optimized out>, task=0xb8070c0, team=0x9c34250, dest_group_rank=4, mtype=UCC_MEMORY_TYPE_CUDA, msglen=2097152, buffer=0x7f4974800000) at ./tl_ucp_sendrecv.h:155
#22 ucc_tl_ucp_recv_nb (task=0xb8070c0, team=0x9c34250, dest_group_rank=4, mtype=UCC_MEMORY_TYPE_CUDA, msglen=2097152, buffer=0x7f4974800000) at ./tl_ucp_sendrecv.h:166
#23 ucc_tl_ucp_reduce_knomial_progress (coll_task=<optimized out>) at reduce/reduce_knomial.c:78
#24 0x00007f4a0b526bc1 in ucc_pq_st_progress (pq=0x8cf4610) at core/ucc_progress_queue_st.c:31
#25 0x00007f4a0b52197e in ucc_progress_queue (pq=<optimized out>) at core/ucc_progress_queue.h:46
#26 ucc_context_progress (context=0x33f21d0) at core/ucc_context.c:934
#27 0x00000000004245c2 in TestCase::tc_progress_ctx (this=0xc03a1d0) at ../../../test/mpi/test_case.cc:160
#28 0x0000000000406868 in UccTestMpi::exec_tests (this=0x295d460, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:495
#29 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x295d460, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#30 0x0000000000407c46 in UccTestMpi::run_all (this=0x295d460, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#31 0x0000000000417c3a in main (argc=16, argv=0x7fff4f5fe9d8) at ../../../test/mpi/main.cc:576

#0 0x00007febb40a9901 in ucp_worker_progress (worker=0x957cce0) at core/ucp_worker.c:2803
#1 0x00007feb5af55e49 in ucc_tl_ucp_test (task=0xc8f40c0) at ./tl_ucp_coll.h:300
#2 ucc_tl_ucp_reduce_knomial_progress (coll_task=<optimized out>) at reduce/reduce_knomial.c:57
#3 0x00007febbb159bc1 in ucc_pq_st_progress (pq=0x9de1650) at core/ucc_progress_queue_st.c:31
#4 0x00007febbb15497e in ucc_progress_queue (pq=<optimized out>) at core/ucc_progress_queue.h:46
#5 ucc_context_progress (context=0x44def80) at core/ucc_context.c:934
#6 0x00000000004245c2 in TestCase::tc_progress_ctx (this=0xbe84a50) at ../../../test/mpi/test_case.cc:160
#7 0x0000000000406868 in UccTestMpi::exec_tests (this=0x3a4aa50, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:495
#8 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x3a4aa50, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#9 0x0000000000407c46 in UccTestMpi::run_all (this=0x3a4aa50, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#10 0x0000000000417c3a in main (argc=16, argv=0x7ffc1ab09748) at ../../../test/mpi/main.cc:576

#0 0x00007fccee9a72e8 in ompi_coll_libnbc_progress () at coll_libnbc_component.c:427
#1 0x00007fcd1f53c7ac in opal_progress () at runtime/opal_progress.c:231
#2 0x00007fcd207d0933 in ompi_request_default_test (rptr=0x7ffc4cbed090, completed=0x7ffc4cbed09c, status=0x0) at request/req_test.c:88
#3 0x00007fcd207f63e5 in PMPI_Test (request=0x7ffc4cbed090, completed=0x7ffc4cbed09c, status=<optimized out>) at ptest.c:65
#4 0x000000000043872c in TestReduce::check (this=0xb0ac9c0) at ../../../test/mpi/test_reduce.cc:84
#5 0x0000000000406943 in UccTestMpi::exec_tests (this=0x2edc3a0, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#6 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x2edc3a0, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#7 0x0000000000407c46 in UccTestMpi::run_all (this=0x2edc3a0, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#8 0x0000000000417c3a in main (argc=16, argv=0x7ffc4cbed9c8) at ../../../test/mpi/main.cc:576

#0 opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:42
#1 opal_timer_linux_get_cycles_sys_timer () at timer_linux_component.c:232
#2 0x00007fa248bb88c9 in opal_progress_events () at runtime/opal_progress.c:183
#3 opal_progress () at runtime/opal_progress.c:245
#4 0x00007fa249e4c933 in ompi_request_default_test (rptr=0x7fff1a594ad0, completed=0x7fff1a594adc, status=0x0) at request/req_test.c:88
#5 0x00007fa249e723e5 in PMPI_Test (request=0x7fff1a594ad0, completed=0x7fff1a594adc, status=<optimized out>) at ptest.c:65
#6 0x000000000043872c in TestReduce::check (this=0xbec94e0) at ../../../test/mpi/test_reduce.cc:84
#7 0x0000000000406943 in UccTestMpi::exec_tests (this=0x3125560, tcs=..., triggered=true, persistent=false) at ../../../test/mpi/test_mpi.cc:499
#8 0x00000000004075ab in UccTestMpi::run_all_at_team (this=0x3125560, team=..., rst=...) at ../../../test/mpi/test_mpi.cc:613
#9 0x0000000000407c46 in UccTestMpi::run_all (this=0x3125560, is_onesided=false) at ../../../test/mpi/test_mpi.cc:664
#10 0x0000000000417c3a in main (argc=16, argv=0x7fff1a595408) at ../../../test/mpi/main.cc:576