DEVICE API
shmem_device_rma.h
Defines
- SHMEM_TYPE_FUNC(FUNC)
Standard RMA Types and Names.
NAME        TYPE
half        half
float       float
double      double
int8        int8
int16       int16
int32       int32
int64       int64
uint8       uint8
uint16      uint16
uint32      uint32
uint64      uint64
char        char
bfloat16    bfloat16
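The NAME/TYPE pairs above suggest that SHMEM_TYPE_FUNC(FUNC) applies FUNC once per pair, which is how a single definition can yield shmem_float_p, shmem_int32_p, and so on. The sketch below shows that X-macro pattern on the host. DEMO_TYPE_FUNC, DEMO_DEFINE_P, and the demo_NAME_p functions are hypothetical names, only three rows are covered, and the concrete C++ types behind each NAME are assumptions; the real macro expansion is not shown in this reference.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for SHMEM_TYPE_FUNC: applies FUNC to (NAME, TYPE) pairs.
// Only three rows of the table are listed; the concrete types are assumptions.
#define DEMO_TYPE_FUNC(FUNC) \
    FUNC(float, float)       \
    FUNC(int32, int32_t)     \
    FUNC(uint64, uint64_t)

// Hypothetical generator: stamps out one demo_NAME_p function per pair.
// It stores to a local pointer; a real put would target a remote PE.
#define DEMO_DEFINE_P(NAME, TYPE)                              \
    inline void demo_##NAME##_p(TYPE *dst, const TYPE value) { \
        assert(dst != nullptr);                                \
        *dst = value;                                          \
    }

// Expands to demo_float_p, demo_int32_p and demo_uint64_p.
DEMO_TYPE_FUNC(DEMO_DEFINE_P)
```

Adding a row to DEMO_TYPE_FUNC is enough to generate another typed function, which is the main appeal of the pattern.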
Functions
- SHMEM_DEVICE void shmem_NAME_p (__gm__ TYPE *dst, const TYPE value, int pe)
Provide a low latency put capability for single element of most basic types.
- Parameters:
dst – [in] Symmetric address of the destination data on local PE.
value – [in] The element to be put.
pe – [in] The number of the remote PE.
- SHMEM_DEVICE TYPE shmem_NAME_g (__gm__ TYPE *src, int32_t pe)
Provide a low latency get capability for single element of most basic types.
- Parameters:
src – [in] Symmetric address of the source data on local PE.
pe – [in] The number of the remote PE.
- Returns:
A single element of type specified in the input pointer.
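As a rough mental model of the single-element put and get above, the host-only sketch below gives every PE an identically laid out heap, so one index names the same slot on every PE. DemoWorld, sim_int32_p, and sim_int32_g are hypothetical names, and the PE count and heap size are arbitrary; nothing here runs on the device or calls the real API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kNumPes = 4;             // assumed PE count for the model
constexpr std::size_t kHeapWords = 8;  // assumed heap size per PE

// Host-only model: one heap per PE with the same layout everywhere, so a
// "symmetric address" is simply an index that is valid on every PE.
struct DemoWorld {
    std::array<std::array<int32_t, kHeapWords>, kNumPes> heap{};
};

// Models shmem_int32_p: store one element at the symmetric slot on PE `pe`.
inline void sim_int32_p(DemoWorld &w, std::size_t slot, int32_t value, int pe) {
    assert(pe >= 0 && pe < kNumPes && slot < kHeapWords);
    w.heap[static_cast<std::size_t>(pe)][slot] = value;
}

// Models shmem_int32_g: load one element from the symmetric slot on PE `pe`.
inline int32_t sim_int32_g(const DemoWorld &w, std::size_t slot, int pe) {
    assert(pe >= 0 && pe < kNumPes && slot < kHeapWords);
    return w.heap[static_cast<std::size_t>(pe)][slot];
}
```

A put to PE 2 followed by a get from PE 2 at the same slot returns the stored value, while the same slot on other PEs is unchanged.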
- SHMEM_DEVICE void shmem_getmem (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem (__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_putmem (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem (__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_getmem_nbi (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_putmem_signal (__gm__ void *dst, __gm__ void *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE then update sig_addr.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
sig_op – [in] Operation used to update sig_addr with signal. Supported operations: SHMEM_SIGNAL_SET/SHMEM_SIGNAL_ADD
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_putmem_signal_nbi (__gm__ void *dst, __gm__ void *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE then update sig_addr.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
sig_op – [in] Operation used to update sig_addr with signal. Supported operations: SHMEM_SIGNAL_SET/SHMEM_SIGNAL_ADD
pe – [in] PE number of the remote PE.
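The put-with-signal routines above copy the payload and then update the signal word. The host-only sketch below models just that ordering and the two update modes. demo_putmem_signal, DEMO_SIGNAL_SET, and DEMO_SIGNAL_ADD are hypothetical stand-ins (the values of the real SHMEM_SIGNAL_* constants are not given in this reference), and the size argument is taken in bytes purely for the model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for SHMEM_SIGNAL_SET / SHMEM_SIGNAL_ADD.
enum DemoSignalOp { DEMO_SIGNAL_SET = 0, DEMO_SIGNAL_ADD = 1 };

// Host-only model of the documented order: payload first, then the signal word.
// `nbytes` is a byte count here; this is an assumption made for the model.
inline void demo_putmem_signal(void *dst, const void *src, std::size_t nbytes,
                               int32_t *sig_addr, int32_t signal, int sig_op) {
    assert(dst != nullptr && src != nullptr && sig_addr != nullptr);
    std::memcpy(dst, src, nbytes);  // 1. deliver the payload
    if (sig_op == DEMO_SIGNAL_SET) {
        *sig_addr = signal;         // 2a. overwrite the signal word
    } else {
        *sig_addr += signal;        // 2b. accumulate into the signal word
    }
}
```

Because the signal is written last, a consumer that observes the new signal value in this model can rely on the payload already being in place.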
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (AscendC::GlobalTensor< TYPE > dst, AscendC::GlobalTensor< TYPE > src, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (AscendC::GlobalTensor< TYPE > dst, AscendC::GlobalTensor< TYPE > src, const non_contiguous_copy_param &copy_params, int pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (AscendC::GlobalTensor< TYPE > dst, AscendC::GlobalTensor< TYPE > src, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (AscendC::GlobalTensor< TYPE > dst, AscendC::GlobalTensor< TYPE > src, const non_contiguous_copy_param &copy_params, int pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (__ubuf__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] Pointer on local UB of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (AscendC::LocalTensor< TYPE > dst, AscendC::GlobalTensor< TYPE > src, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] LocalTensor on local UB of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (__ubuf__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] Pointer on local UB of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_get_NAME_mem_nbi (AscendC::LocalTensor< TYPE > dst, AscendC::GlobalTensor< TYPE > src, const non_contiguous_copy_param &copy_params, int pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] LocalTensor on local UB of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (__gm__ TYPE *dst, __ubuf__ TYPE *src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local UB of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (AscendC::GlobalTensor< TYPE > dst, AscendC::LocalTensor< TYPE > src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] LocalTensor on local UB of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (__gm__ TYPE *dst, __ubuf__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local UB of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE void shmem_put_NAME_mem_nbi (AscendC::GlobalTensor< TYPE > dst, AscendC::LocalTensor< TYPE > src, const non_contiguous_copy_param &copy_params, int32_t pe)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] LocalTensor on local UB of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
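The non-contiguous overloads above take a non_contiguous_copy_param whose fields are not listed in this reference. As an illustration of what such a descriptor typically encodes, the host-only sketch below copies a number of rows, each a contiguous run, with independent row pitches for source and destination. DemoStridedCopy, its field names, and demo_strided_copy are all hypothetical and should not be read as the real structure layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor; the real non_contiguous_copy_param may differ.
struct DemoStridedCopy {
    uint32_t repeat;  // number of rows to copy
    uint32_t length;  // contiguous elements per row
    uint32_t src_ld;  // elements between consecutive row starts in src
    uint32_t dst_ld;  // elements between consecutive row starts in dst
};

// Host-only model: copy `repeat` rows of `length` elements each, stepping
// through src and dst with their own leading dimensions.
template <typename T>
void demo_strided_copy(T *dst, const T *src, const DemoStridedCopy &p) {
    assert(p.length <= p.src_ld && p.length <= p.dst_ld);
    for (uint32_t r = 0; r < p.repeat; ++r) {
        for (uint32_t i = 0; i < p.length; ++i) {
            dst[r * p.dst_ld + i] = src[r * p.src_ld + i];
        }
    }
}
```

For example, with repeat = 2, length = 2, src_ld = 4, and dst_ld = 3, the left half of a 2x4 source tile lands in the first two columns of a 2x3 destination tile, and the third column is left untouched.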
Low-level Functions
- SHMEM_DEVICE __gm__ void * shmem_ptr (__gm__ void *ptr, int pe)
Translate a local symmetric address to the corresponding remote symmetric address on the specified PE.
- Parameters:
ptr – [in] Symmetric address on local PE.
pe – [in] The number of the remote PE.
- Returns:
A remote symmetric address on the specified PE that can be accessed using memory loads and stores.
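One common way to picture this translation is "same offset, different base": the offset of ptr inside the local symmetric heap is kept, and the base of the target PE's heap is substituted. The host-only sketch below models that with two plain byte arrays. DemoHeaps and demo_ptr are hypothetical names, and how the real shmem_ptr locates heap bases is not described in this reference.

```cpp
#include <cassert>
#include <cstddef>

// Host-only model: every PE's symmetric heap has the same size but its own base.
struct DemoHeaps {
    unsigned char *base[2];  // base[pe] is the start of PE pe's heap
    std::size_t size;        // heap size in bytes, identical on all PEs
};

// Keep the offset within the local heap and swap in the remote base.
inline void *demo_ptr(const DemoHeaps &h, void *ptr, int my_pe, int pe) {
    unsigned char *p = static_cast<unsigned char *>(ptr);
    assert(p >= h.base[my_pe] && p < h.base[my_pe] + h.size);
    const std::size_t offset = static_cast<std::size_t>(p - h.base[my_pe]);
    return h.base[pe] + offset;
}
```

In this model, translating an address 5 bytes into PE 0's heap yields the address 5 bytes into PE 1's heap.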
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
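The buf and ub_size parameters indicate that the transfer is staged through a temporary buffer on UB. A plausible reading is that the data moves in chunks of at most ub_size bytes, each chunk going into the buffer and then out to the destination. The host-only sketch below models that chunking with memcpy. demo_staged_copy is a hypothetical name, the mapping of the two copies to MTE2 and MTE3 is an assumption, and no event synchronization is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-only model of a staged copy. `buf` plays the role of the UB scratch
// buffer and `ub_size` is its capacity in bytes. Returns the number of chunks.
template <typename T>
uint32_t demo_staged_copy(T *dst, const T *src, T *buf,
                          uint32_t ub_size, uint32_t elem_size) {
    const uint32_t chunk = static_cast<uint32_t>(ub_size / sizeof(T));
    assert(chunk > 0);  // the buffer must hold at least one element
    uint32_t rounds = 0;
    for (uint32_t done = 0; done < elem_size; done += chunk, ++rounds) {
        const uint32_t n = std::min(chunk, elem_size - done);
        std::memcpy(buf, src + done, n * sizeof(T));  // stage in (assumed MTE2 role)
        std::memcpy(dst + done, buf, n * sizeof(T));  // stage out (assumed MTE3 role)
    }
    return rounds;
}
```

With a 16-byte buffer and ten 4-byte elements, the model needs three round trips: 4, 4, and 2 elements.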
- template<typename T> SHMEM_DEVICE void shmem_roce_get_mem_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_roce_get_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_roce_put_mem_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_roce_put_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (__ubuf__ T *dst, __gm__ T *src, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] Pointer on local UB of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (AscendC::LocalTensor< T > dst, AscendC::GlobalTensor< T > src, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] LocalTensor on local UB of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (__ubuf__ T *dst, __gm__ T *src, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] Pointer on local UB of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_get_mem_nbi (AscendC::LocalTensor< T > dst, AscendC::GlobalTensor< T > src, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] LocalTensor on local UB of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (__gm__ T *dst, __ubuf__ T *src, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local UB of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::LocalTensor< T > src, uint32_t elem_size, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] LocalTensor on local UB of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (__gm__ T *dst, __ubuf__ T *src, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local UB of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
- template<typename T> SHMEM_DEVICE void shmem_mte_put_mem_nbi (AscendC::GlobalTensor< T > dst, AscendC::LocalTensor< T > src, const non_contiguous_copy_param &copy_params, int pe, AscendC::TEventID EVENT_ID)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] LocalTensor on local UB of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
EVENT_ID – [in] ID used to Sync MTE2\MTE3 Event.
shmem_device_sync.h
Functions
- SHMEM_DEVICE void shmemx_set_ffts_config (uint64_t config)
Set runtime ffts address. Call this at MIX Kernel entry point (if the kernel contains barrier calls).
- Parameters:
config – [in] ffts config, acquired by shmemx_get_ffts_config()
- SHMEM_DEVICE void shmem_barrier (shmem_team_t tid)
shmem_barrier is a collective synchronization routine over a team. Control returns from shmem_barrier after all PEs in the team have called shmem_barrier.
shmem_barrier ensures that all previously issued stores and remote memory updates, including AMOs and RMA operations, done by any of the PEs in the active set are complete before returning. On systems with only a scale-up network (HCCS), updates are globally visible, whereas on systems with both a scale-up network (HCCS) and a scale-out network (RDMA), SHMEM only guarantees that updates to the memory of a given PE are visible to that PE.
Barrier operations issued on the CPU and the NPU only complete communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.
- Parameters:
tid – [in] team to do barrier
- SHMEM_DEVICE void shmem_barrier_all ()
shmem_barrier of all PEs.
- SHMEM_DEVICE void shmemx_barrier_vec (shmem_team_t tid)
Similar to shmem_barrier except that only vector cores participate. Useful in communication-over-compute operators. Cube cores may call the API, but the call has no effect.
- Parameters:
tid – [in] team to do barrier
- SHMEM_DEVICE void shmemx_barrier_all_vec ()
shmemx_barrier_vec of all PEs.
- SHMEM_DEVICE void shmem_quiet ()
The shmem_quiet routine ensures completion of all operations on symmetric data objects issued by the calling PE.
On systems with only a scale-up network (HCCS), updates are globally visible, whereas on systems with both a scale-up network (HCCS) and a scale-out network (RDMA), SHMEM only guarantees that updates to the memory of a given PE are visible to that PE.
Quiet operations issued on the CPU and the NPU only complete communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.
- SHMEM_DEVICE void shmem_fence ()
In the OpenSHMEM specification, shmem_fence assures ordering of delivery of Put, AMO, and memory store routines to symmetric data objects, but does not guarantee the completion of these operations.
However, due to hardware capabilities, shmem_fence is implemented the same way as shmem_quiet, ensuring both ordering and completion.
Fence operations issued on the CPU and the NPU only order communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.
- SHMEM_DEVICE void shmemx_signal_op (__gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
The shmemx_signal_op operation updates sig_addr with signal using operation sig_op on the specified PE. This operation can be used together with shmem_signal_wait_until for efficient point-to-point synchronization. WARNING: Atomicity NOT Guaranteed.
- Parameters:
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
sig_op – [in] Operation used to update sig_addr with signal. Supported operations: SHMEM_SIGNAL_SET/SHMEM_SIGNAL_ADD
pe – [in] PE number of the remote PE.
- SHMEM_DEVICE int32_t shmem_signal_wait_until (__gm__ int32_t *sig_addr, int cmp, int32_t cmp_val)
This routine can be used to implement point-to-point synchronization between PEs or between threads within the same PE. A call to this routine blocks until the value of sig_addr at the calling PE satisfies the wait condition specified by the comparison operator, cmp, and comparison value, cmp_val.
- Parameters:
sig_addr – [in] Local address of the source signal variable.
cmp – [in] The comparison operator that compares sig_addr with cmp_val. Supported operators: SHMEM_CMP_EQ/SHMEM_CMP_NE/SHMEM_CMP_GT/ SHMEM_CMP_GE/SHMEM_CMP_LT/SHMEM_CMP_LE.
cmp_val – [in] The value against which the object pointed to by sig_addr will be compared.
- Returns:
Return the contents of the signal data object, sig_addr, at the calling PE that satisfies the wait condition.
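The wait condition described above can be read as "re-read the signal word until value cmp cmp_val holds, then return the value that satisfied it". The host-only sketch below spells out that loop and the six comparison operators. The DEMO_CMP_* enumerators and both functions are hypothetical stand-ins; the values of the real SHMEM_CMP_* constants are not given in this reference.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for SHMEM_CMP_*; the numeric values are arbitrary.
enum DemoCmp { DEMO_CMP_EQ, DEMO_CMP_NE, DEMO_CMP_GT,
               DEMO_CMP_GE, DEMO_CMP_LT, DEMO_CMP_LE };

// Evaluate `value cmp cmp_val`; unknown operators never hold.
inline bool demo_cmp_holds(int32_t value, int cmp, int32_t cmp_val) {
    switch (cmp) {
        case DEMO_CMP_EQ: return value == cmp_val;
        case DEMO_CMP_NE: return value != cmp_val;
        case DEMO_CMP_GT: return value > cmp_val;
        case DEMO_CMP_GE: return value >= cmp_val;
        case DEMO_CMP_LT: return value < cmp_val;
        case DEMO_CMP_LE: return value <= cmp_val;
        default:          return false;
    }
}

// Spin until the condition holds and return the value that satisfied it.
// `volatile` only forces a fresh read per iteration in this host model; it is
// not a substitute for real device-side synchronization.
inline int32_t demo_signal_wait_until(const volatile int32_t *sig_addr,
                                      int cmp, int32_t cmp_val) {
    assert(sig_addr != nullptr);
    int32_t v = *sig_addr;
    while (!demo_cmp_holds(v, cmp, cmp_val)) {
        v = *sig_addr;
    }
    return v;
}
```

Paired with a signal update from another PE, a consumer would typically wait with a GE or EQ comparison against the value the producer is expected to write.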
shmem_device_team.h
Functions
- SHMEM_DEVICE int shmem_my_pe (void)
Returns the PE number of the local PE.
- Returns:
Integer between 0 and npes - 1
- SHMEM_DEVICE int shmem_n_pes (void)
Returns the number of PEs running in the program.
- Returns:
Number of PEs in the program.
- SHMEM_DEVICE int shmem_team_my_pe (shmem_team_t team)
Returns the number of the calling PE in the specified team.
- Parameters:
team – [in] A team handle.
- Returns:
The number of the calling PE within the specified team. If the team handle is SHMEM_TEAM_INVALID, returns -1.
- SHMEM_DEVICE int shmem_team_n_pes (shmem_team_t team)
Returns the number of PEs in the specified team.
- Parameters:
team – [in] A team handle.
- Returns:
The number of PEs in the specified team. If the team handle is SHMEM_TEAM_INVALID, returns -1.
- SHMEM_DEVICE int shmem_team_translate_pe (shmem_team_t src_team, int src_pe, shmem_team_t dest_team)
Translate a given PE number in one team into the corresponding PE number in another team.
- Parameters:
src_team – [in] A SHMEM team handle.
src_pe – [in] The PE number in src_team.
dest_team – [in] A SHMEM team handle.
- Returns:
The PE number in dest_team that corresponds to src_pe in src_team. If a team handle is SHMEM_TEAM_INVALID, returns -1.
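To make the translation concrete, the host-only sketch below describes each team by a start PE, a stride, and a size, which is how strided teams are usually defined in OpenSHMEM. It maps a team-relative PE number to a world PE and then into the destination team. DemoTeam and demo_team_translate_pe are hypothetical, and returning -1 when the PE is not a member of the destination team is an assumption borrowed from OpenSHMEM rather than something this reference states.

```cpp
#include <cassert>

// Hypothetical team description: members are world PEs
// start, start + stride, ..., start + (size - 1) * stride.
struct DemoTeam {
    int start;
    int stride;
    int size;
};

// Returns the index in `dest` of the PE that has index `src_pe` in `src`,
// or -1 if `src_pe` is out of range or that PE is not a member of `dest`.
inline int demo_team_translate_pe(const DemoTeam &src, int src_pe,
                                  const DemoTeam &dest) {
    assert(src.stride > 0 && dest.stride > 0);
    if (src_pe < 0 || src_pe >= src.size) return -1;
    const int world_pe = src.start + src_pe * src.stride;  // team -> world
    const int delta = world_pe - dest.start;
    if (delta < 0 || delta % dest.stride != 0) return -1;  // not on dest's grid
    const int dest_pe = delta / dest.stride;                // world -> team
    return dest_pe < dest.size ? dest_pe : -1;
}
```

With a world team {0, 1, 8} and an even-PE team {0, 2, 4}, index 3 of the even team translates to world PE 6, and world PE 5 has no counterpart in the even team.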