UCX BLOG
Updates by Organizations:
- ORNL has been working on GNI AMOs (work in progress), shared memory [testing](https://github.com/grahamlopez/ucx/tree/mmp), doxygen infrastructure improvements (pushed upstream), gtest improvements (pushed upstream), and the contribution agreement
- MLNX: callback interfaces (pushed upstream), memory interfaces (design), UCP interface for OpenSHMEM (pushed upstream), fixes for IB (pushed upstream), infrastructure fixes
Discussions:
- Follow-up on the discussion about coordination of memory operations across different devices and types of memory. The document was updated with an actual routing table for PUT operations.
Updates by Organizations:
- ORNL has been working on GNI Put (pushed upstream), shared memory [infrastructure](https://github.com/grahamlopez/ucx/tree/mmp), doxygen infrastructure (pushed upstream), callback interfaces (review), memory interfaces (preliminary discussion), and the contribution agreement
- MLNX: RC support done (pushed upstream), packet tracer/logging for IB (pushed upstream), QP parameters update (pushed upstream), Coverity fixes (pushed upstream), gtest enhancements (pushed upstream), callback interfaces (review/design), memory interfaces (design), UCP interface for OpenSHMEM (PR)
Discussions:
- Follow-up on the discussion about coordination of memory operations across different devices and types of memory.
- Memory proposal review. Discussion about the memory "attachment" interface: (1) use the existing rkey unpack, or (2) a separate interface for attaching the memory. Based on the initial discussion, option (1) seems preferable (see the sketch below).
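For illustration only, here is a sketch of how the two directions could differ at the API level; the type and function names below are placeholders rather than the actual UCT API under discussion.

```c
#include <stddef.h>

typedef int           status_t;   /* placeholder status type */
typedef struct pd    *pd_h;       /* placeholder protection-domain handle */
typedef unsigned long rkey_t;     /* placeholder unpacked remote key */

/* Option (1): the existing rkey-unpack call doubles as the "attach" step;
 * once the packed rkey is unpacked, the remote/device memory is usable. */
status_t rkey_unpack(pd_h pd, const void *rkey_buffer, rkey_t *rkey_p);

/* Option (2): a separate, explicit attach step after unpacking the rkey,
 * returning a locally usable mapping of the attached memory. */
status_t mem_attach(pd_h pd, rkey_t rkey, void **mapped_ptr_p);
```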
-- shamisp
Updates by Organizations:
- ORNL has been working on licensing, the contribution agreement, the uGNI transport, and shared memory
- MLNX: RC transport, testing infrastructure
Discussion about an abstraction for memory operations across different devices and types of memory. The document was updated with example code (pseudo code) for a protocol handling cross-transport memory operations. For the next call we decided to talk about the technical side of operation "chaining". We have to ensure that the proposed method is aligned with hardware requirements.
Discussion about thread safety (Yossi's proposal). It seems that everybody is OK with the proposed direction.
-- shamisp
Discussion about an abstraction for memory interfaces. One of the ideas: present GPU devices as a transport that provides put/get/registration functionality between GPU/host memories.
The put/get operations can be implemented using CUDA copy or GDRCopy.
In addition, we have to define an API for memory type identification (GPU, host, NVRAM, etc.); a small sketch follows below.
For the next call we will use the following Google doc for the API discussion (publicly accessible):
Open question: memory locality representation
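To make the memory type identification idea concrete, here is a minimal, hedged sketch of what such an API could look like; the enum values and function name are made-up placeholders, not a proposed UCX interface.

```c
#include <stddef.h>

/* Hypothetical classification of the memory kinds mentioned on the call. */
typedef enum {
    MEMTYPE_HOST,
    MEMTYPE_GPU,
    MEMTYPE_NVRAM,
    MEMTYPE_UNKNOWN
} memtype_t;

/* Hypothetical query: classify an address range.  A GPU "transport" would
 * back this with CUDA calls; this trivial fallback just assumes host memory. */
static memtype_t memtype_detect(const void *addr, size_t length)
{
    (void)addr;
    (void)length;
    return MEMTYPE_HOST;  /* placeholder: no GPU/NVRAM detection wired in */
}
```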
--shamisp
- Discussion about an abstraction for the memory interface between GPU and host memories. AR: Yossi/Pasha have to come up with an API proposal.
- GDRCopy discussion:
- MikeD (MLNX): It will be challenging to support GDRCopy as a 3rd-party component since it includes a kernel module. It is preferable to have the module installed together with the CUDA driver.
- Davide (Nvidia): In order to add the GDRCopy module to CUDA, we have to demonstrate the usefulness of this approach for real applications.
- Manju/Pasha (ORNL): ORNL will look for an application that can benefit from this optimization.
- Nvidia/Mellanox: Can provide access to a testbed platform.
--shamisp
The UCX design call is off this week due to the OpenSHMEM video conference.
Had the UCX developers call; discussed the following topics:
- Sequence number size in data structures:
- The data structures ucs_callbackq_t and ucs_frag_list_t are hard-coded to use 16-bit sequence numbers. This is suitable only for transports which have a 16-bit SN, and is not generic.
- Since it's a lot of work to make them generic (we would need something like in sglib: macros which generate functions), we'll keep it as is for now and add appropriate compile-time checks in the transports (a small example appears at the end of this list).
- User arguments for get_bcopy/atomic callbacks:
- In the initial proposal, the callbacks have only one user-defined argument.
- Pasha noted it's useful to also pass the user's buffer as an additional argument.
- Will discuss on the dev list and update here.
- Using communication functions from callbacks:
- Looks like the transport should support calling communication functions from all kinds of callbacks (send/recv/AM).
- All communication calls are simple and non-blocking so that should be doable.
- ucs_callbackq_t should support a callback which potentially pushes data to the queue.
- We cannot return a descriptor-to-be-posted like in UCCS, because in case the TL has no resources to send, it cannot queue it anywhere.
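Regarding the compile-time checks mentioned for the 16-bit sequence numbers (see the sequence-number item above), here is a minimal illustration of the idea; the transport-side constant and typedef names are made up, and UCX may provide its own static-assert macro instead.

```c
#include <stdint.h>

/* Sequence-number type used by the generic data structures
 * (ucs_callbackq_t / ucs_frag_list_t): hard-coded to 16 bits. */
typedef uint16_t sn16_t;

/* Hypothetical per-transport constant: width of the wire-level sequence
 * number this transport actually carries. */
#define MY_TL_SN_BITS 16

/* Classic compile-time check: the array size becomes -1 (a compile error)
 * if the transport's SN width does not match the 16-bit structures. */
typedef char my_tl_sn_width_check[(MY_TL_SN_BITS == 8 * sizeof(sn16_t)) ? 1 : -1];
```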
--yosefe
- Discussions about the GDRCopy library and how it can be used in the context of UCX
- Sharing ideas about an abstraction for GPU/CPU memory-to-memory transfers
- Status update and brief discussion about pending PRs #91, #92, #95
-shamisp
Resumed the weekly design and developer calls. Discussed how to quickly check whether a given address is GPU memory; it looks like there are distinct address ranges for malloc() memory and GPU memory. There should probably be an abstraction for memory management, handling both determining the type of memory and copying in/out of that memory. All complex protocols (e.g. pipeline through RAM) should be in UCP, and this abstraction layer can help define thresholds for the protocols.
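A rough, hedged sketch of what such a memory-management abstraction could look like, with made-up names: one object that can both recognize its own addresses and copy in/out of that memory, which UCP protocols could then query for thresholds.

```c
#include <stddef.h>

/* Hypothetical descriptor of a "memory domain": it knows how to recognize
 * its own addresses and how to move data in and out of that memory. */
typedef struct mem_ops {
    /* Return nonzero if addr belongs to this kind of memory
     * (e.g. it falls inside the GPU's distinct virtual address range). */
    int    (*is_owner)(const void *addr);

    /* Copy between this memory and host memory. */
    void   (*copy_in)(void *dst, const void *src, size_t len);
    void   (*copy_out)(void *dst, const void *src, size_t len);

    /* Threshold hint for UCP protocols, e.g. when to switch to a
     * pipelined-through-RAM protocol. */
    size_t pipeline_threshold;
} mem_ops_t;
```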
Also discussed descriptor allocation; agreed that it should be done in UCP, possibly caching one descriptor in case it was not consumed by UCT (e.g. shared memory RMA, which always completes immediately). The PR with the API changes is merged.
Working on RC, UD, uGNI transports, and UCP port/device/transport selection.
-yosefe
For the next 2 weeks the weekly "Design call" is off due to holidays; the "Developers call" is still on. Discussed on the list various ways to detect GPU memory; the current options are:
- use the CUDA API to check the memory type and cache the results in a local data structure (see the sketch after this list).
- handle page faults on access to GPU memory.
- extend the API to let the user pass the memory type.
- modify the CUDA driver to allocate GPU memory in a well-known virtual range.
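To make the first option concrete, here is a hedged sketch using the CUDA driver API's pointer-attribute query; it is an illustration only (a real implementation would also cache the answer in a local data structure, as noted above).

```c
#include <cuda.h>
#include <stdint.h>

/* Best-effort check of whether 'ptr' is GPU (device) memory via the CUDA
 * driver API.  Assumes a CUDA context is already current; pointers that
 * CUDA does not know about make the query fail, so we fall back to "host". */
static int is_gpu_memory(const void *ptr)
{
    CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
    CUresult     ret;

    ret = cuPointerGetAttribute(&mem_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                (CUdeviceptr)(uintptr_t)ptr);
    if (ret != CUDA_SUCCESS) {
        return 0;  /* unknown to CUDA -> treat as host memory */
    }
    return (mem_type == CU_MEMORYTYPE_DEVICE);
}
```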
In addition, UCX developers are working on the following:
- Adding UD support
- Adding tag matching support
- Adding Bcopy/Zcopy
- uGNI transport
-yosefe
Discussed a bit about build issues, and mostly about progress/completion semantics. Debated whether there should be a layer which provides pending queue support for all transports, which all protocols could use.
Approach 1: There should not be such a layer.
- Every protocol component will manage its own pending queue: for example, a p2p pending queue, a collectives pending queue, etc.
- A collective operation can be either blocking or non-blocking. If it's blocking, it can retry the operation in a tight progress loop until it's locally completed. If it's non-blocking, it will have its own descriptor for the state of the collective algorithm, which can be put on the pending queue.
- Consider adding a "blocking API": inline functions which spin in a progress loop on top of the non-blocking API (a minimal sketch appears after Approach 2). This can be either in UCP or UCT (probably UCT is better).
Approach 2: There should be a common pending queue layer.
- Either in UCT or UCP (ucpi)
- Will allow easier development of new protocols: no need to worry about managing the pending queue.
- Easier to enforce fair scheduling between different UCP components on the same resources: it will all go through one place.
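To illustrate the "blocking API" idea from Approach 1 (referenced above), a minimal sketch with placeholder names; the real wrapper would live in UCP or UCT on top of the actual non-blocking calls.

```c
#include <stddef.h>

/* Placeholder status codes and transport calls; not the actual UCT API. */
typedef int status_t;
#define STATUS_OK          0
#define STATUS_NO_RESOURCE (-1)  /* no send resources available right now */

status_t tl_put_nb(void *ep, const void *buf, size_t len);  /* non-blocking put */
void     tl_progress(void *iface);                          /* poll/progress */

/* "Blocking API" sketch: an inline wrapper that retries the non-blocking
 * operation in a tight progress loop until it is locally completed. */
static inline status_t tl_put_blocking(void *iface, void *ep,
                                       const void *buf, size_t len)
{
    status_t status;

    do {
        status = tl_put_nb(ep, buf, len);
        if (status == STATUS_NO_RESOURCE) {
            tl_progress(iface);  /* let the transport free send resources */
        }
    } while (status == STATUS_NO_RESOURCE);

    return status;
}
```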
Updating the list of conference calls: Weekly conference calls
Participants: Shreeram (Nvidia), Donald Becker (Nvidia), Davide (Nvidia), Duncan P. (Nvidia), Pasha (ORNL), Yossi (MLNX), Tony (UH), Tomas (UTK), Aurelien (UTK)
Subject: Discussion about memory allocation and mapping between host memory and GPU memory.
How do we coordinate between mmap, xpmem, knem, cma, verbs, and cuda memory allocations? It seems that the transport has to be selected based on the memory type of the source and destination. On the wiki we will create a separate page about memory management and a table that will summarize all the possible combinations.
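As a rough illustration of what such a table could capture for the intra-node PUT case, here is a hedged sketch; the memory kinds, transport names, and chosen routes are placeholders, not the table that will go on the wiki.

```c
/* Hypothetical routing table: pick a transport based on the memory type
 * of the source and the destination buffer. */
typedef enum { MEM_HOST, MEM_CUDA, MEM_LAST } mem_kind_t;
typedef enum { TL_SHM, TL_CUDA_IPC, TL_CUDA_COPY, TL_VERBS } tl_id_t;

/* put_route[src][dst]: transport used for an intra-node PUT. */
static const tl_id_t put_route[MEM_LAST][MEM_LAST] = {
    /* dst:        MEM_HOST       MEM_CUDA      */
    /* MEM_HOST */ { TL_SHM,       TL_CUDA_COPY },
    /* MEM_CUDA */ { TL_CUDA_COPY, TL_CUDA_IPC  },
};
```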
Different options were mentioned for memory allocation (a short example follows the list):
- cudaMalloc - allocation of memory on GPU
- malloc + cuda_host_register - allocation of memory on the host, which is then pinned/registered with the GPU
- Unified Memory - transparent
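For reference, a hedged example of what the three options look like with the CUDA runtime API (error checking omitted; cudaMallocManaged stands in here for the "Unified Memory" option).

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    size_t len = 1 << 20;
    void *gpu_buf, *host_buf, *managed_buf;

    /* cudaMalloc: memory allocated on the GPU. */
    cudaMalloc(&gpu_buf, len);

    /* malloc + host registration: host memory pinned for the GPU. */
    host_buf = malloc(len);
    cudaHostRegister(host_buf, len, cudaHostRegisterDefault);

    /* Unified Memory: one pointer usable from host and device,
     * migrated transparently by the driver. */
    cudaMallocManaged(&managed_buf, len, cudaMemAttachGlobal);

    cudaFree(managed_buf);
    cudaHostUnregister(host_buf);
    free(host_buf);
    cudaFree(gpu_buf);
    return 0;
}
```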
Communication between different GPUs:
- CUDA IPC communication (GPUs intranode)
- VERBS (GPUs internode)
A new wiki page was added for discussion on this subject: UCX Memory management
Participants: Pasha (ORNL), Vasily (MLNX), Alina (MLNX), Yossi (MLNX). Discussing progress, completion, and descriptor allocation. Summary:
- UCT will not allocate descriptors for pending queue (because of fragmentation).
- UCP will allocate descriptors and manage pending queue.
- Pending queue will be one per UCP-endpoint, even if it contains multiple UCT-endpoints.
- The queue will be singly linked, so it adds only sizeof(void*) overhead to each ep.
- UCP may push sends from the pending queue to multiple UCT endpoints, but force ordering if needed.
- UCT will inform UCP when there are new send resources per-destination or per-interface
- UCP will decide which endpoint to progress in case per-interface resources became available
- May need to have 2 pending queues: one per endpoint, another per-interface
- Use inline functions in UCP for the good-flow (fast-path) code, e.g.:
    ucp_put(...) {
        status = uct_put(...);
        if (ucs_likely(status != UCS_ERR_WOULD_BLOCK)) {
            return status;
        } else {
            return ucp_put_pending(...); /* Function call for pending flow */
        }
    }
- TBD: need to show a prototype (pseudo code) of the pending queue implementation
- how the notification mechanism works
- how scheduling works in UCP
- Handling non-existing UCT functions (e.g. RMA on UD/TCP); a rough sketch follows this list:
- the proto layer will "push down" its own function pointer to fill in the missing UCT function
- the call will be redirected back to the proto layer, which will use its protocol over UCT AM this time
- this will avoid an extra "if" on the fast path to check for support
- need to work out the details of how to pass these functions so it would not be ugly
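A rough sketch of the "push down" idea with made-up names: at wireup time the proto layer installs its own AM-based emulation into the endpoint's function table wherever the transport lacks a native implementation, so the fast path never has to branch on capability.

```c
#include <stddef.h>

typedef int status_t;  /* placeholder status type */

/* Simplified UCT-style endpoint operations table (names are illustrative). */
typedef struct ep_ops {
    status_t (*put_short)(void *ep, const void *buf, size_t len,
                          unsigned long remote_addr);
} ep_ops_t;

/* Proto-layer emulation: implement PUT over active messages for transports
 * that have no native RMA (e.g. UD/TCP). */
static status_t proto_put_over_am(void *ep, const void *buf, size_t len,
                                  unsigned long remote_addr)
{
    /* ... build an AM carrying (remote_addr, data) and send it ... */
    (void)ep; (void)buf; (void)len; (void)remote_addr;
    return 0;
}

/* "Push down": fill any missing entry with the proto-layer emulation, so
 * callers can always invoke ops->put_short() without a capability check. */
static void fill_missing_ops(ep_ops_t *ops)
{
    if (ops->put_short == NULL) {
        ops->put_short = proto_put_over_am;
    }
}
```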
Nov-30-2014 Work Plan Update
The Work Plan was updated based on the SC14 discussion with Mike.
Participants: Yossi, Pasha, Manju (Rich wasn't able to join)
The following topics were discussed:
- Discussion about the completion model for the UCT interface.
- Discussion about management of the pending queue. Which layer is responsible for managing the queue?
- Some operations, such as 32-bit atomic operations and SWAP, have to be implemented using a retry protocol. Where does the protocol have to be implemented?
- Memory alignment for uGNI Get also imposes special requirements. Where do we handle this?
- So far there is no solution for the above issues; this is homework for the next call.
- Nvidia proposed the C11 memory model; short discussion about this. We have to do homework on this subject.