DEVICE API
shmem_device_amo.h
Defines
- ACLSHMEM_TYPE_FUNC_ATOMIC_ADD(FUNC)
Type Function Macros for Atomic Operations.
Each macro is used to generate atomic operation functions for specific data types.
| Macro Name | Used by Atomic Interfaces |
| --- | --- |
| ACLSHMEM_TYPE_FUNC_ATOMIC_ADD | atomic_add |
| ACLSHMEM_TYPE_FUNC_ATOMIC_ADD_910 | atomic_add |
| ACLSHMEM_TYPE_FUNC_ATOMIC_ADD_EXT | atomic_add |
| ACLSHMEM_TYPE_FUNC_ATOMIC_SWAP | atomic_set, atomic_swap, atomic_compare_swap |
| ACLSHMEM_TYPE_FUNC_ATOMIC_SWAP_CAST | atomic_set, atomic_swap, atomic_compare_swap (CAST types) |
| ACLSHMEM_TYPE_FUNC_ATOMIC_CAS_CAST | atomic_compare_swap (CAST types) |
| ACLSHMEM_TYPE_FUNC_ATOMIC_ADD_950 | atomic_fetch_add, atomic_inc, atomic_fetch_inc, atomic_fetch |
| ACLSHMEM_TYPE_FUNC_ATOMIC_LOGIC | atomic_and, atomic_or, atomic_xor, atomic_fetch_and, atomic_fetch_or, atomic_fetch_xor |
Standard Atomic Add Types and Names
| NAME | TYPE |
| --- | --- |
| int8 | int8_t |
| int16 | int16_t |
| int32 | int32_t |
| bfloat16 | bfloat16_t |
| half | half |
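The type-function macros above follow the X-macro pattern: a type-list macro receives a generator macro FUNC and applies it to every (NAME, TYPE) pair. The following is a minimal host-side sketch of that pattern only. DEMO_TYPE_FUNC_ADD, DEMO_ADD_TYPENAME and the generated demo_NAME_add functions are hypothetical names for illustration, not the library's definitions, and the generated body is an ordinary local add rather than a remote atomic operation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type list: applies FUNC to each (NAME, TYPE) pair.
// It mirrors only the shape of a type-function macro; the real list may differ.
#define DEMO_TYPE_FUNC_ADD(FUNC) \
    FUNC(int8, int8_t)           \
    FUNC(int16, int16_t)         \
    FUNC(int32, int32_t)

// Hypothetical generator: emits one function per pair, named demo_<NAME>_add.
// The body is a plain local add standing in for the remote atomic operation.
#define DEMO_ADD_TYPENAME(NAME, TYPE)                    \
    inline void demo_##NAME##_add(TYPE *dst, TYPE value) \
    {                                                    \
        assert(dst != nullptr);                          \
        *dst = static_cast<TYPE>(*dst + value);          \
    }

// Expands to demo_int8_add, demo_int16_add and demo_int32_add.
DEMO_TYPE_FUNC_ADD(DEMO_ADD_TYPENAME)

// Exercises each generated function once and sums the results.
inline int32_t demo_sum_after_adds()
{
    int8_t a = 1;
    int16_t b = 10;
    int32_t c = 100;
    demo_int8_add(&a, 2);
    demo_int16_add(&b, 20);
    demo_int32_add(&c, 200);
    return a + b + c;  // 3 + 30 + 300
}
```

Adding a type to the list macro is enough to obtain a new generated function, which is presumably why the supported-type tables are expressed per macro.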
- ACLSHMEM_TYPE_FUNC_ATOMIC_ADD_910(FUNC)
Ascend910 operations - supported types for atomic_add.
| NAME | TYPE |
| --- | --- |
| int32 | int32_t |
| float | float |
- ACLSHMEM_TYPE_FUNC_ATOMIC_ADD_EXT(FUNC)
Ascend950 operations - supported types for atomic_add.
| NAME | TYPE |
| --- | --- |
| uint32 | uint32_t |
| uint64 | uint64_t |
| int64 | int64_t |
- ACLSHMEM_TYPE_FUNC_ATOMIC_SWAP(FUNC)
Integer-only operations - directly supported types.
| NAME | TYPE |
| --- | --- |
| uint32 | uint32_t |
| uint64 | uint64_t |
- ACLSHMEM_TYPE_FUNC_ATOMIC_SWAP_CAST(FUNC)
Integer-only operations - types supported via CAST to an unsigned integer type.
| NAME | TYPE | CAST TYPE |
| --- | --- | --- |
| int32 | int32_t | uint32_t |
| int64 | int64_t | uint64_t |
| float | float | uint32_t |
- ACLSHMEM_TYPE_FUNC_ATOMIC_CAS_CAST(FUNC)
Integer-only operations - types supported via CAST to an unsigned integer type.
| NAME | TYPE | CAST TYPE |
| --- | --- | --- |
| int32 | int32_t | uint32_t |
| int64 | int64_t | uint64_t |
- ACLSHMEM_TYPE_FUNC_ATOMIC_ADD_950(FUNC)
Ascend950 operations - supported types for atomic_fetch_add, atomic_inc, atomic_fetch_inc, atomic_fetch.
| NAME | TYPE |
| --- | --- |
| uint32 | uint32_t |
| uint64 | uint64_t |
| int32 | int32_t |
| int64 | int64_t |
- ACLSHMEM_TYPE_FUNC_ATOMIC_LOGIC(FUNC)
Logic operations - supported types for atomic_and, atomic_or, atomic_xor, atomic_fetch_and, atomic_fetch_or, atomic_fetch_xor.
| NAME | TYPE |
| --- | --- |
| uint32 | uint32_t |
| uint64 | uint64_t |
| int32 | int32_t |
| int64 | int64_t |
- ACLSHMEM_ATOMIC_ADD_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic add functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | int8, int16, bf16, half, int32, float | Ascend910B/Ascend910C/Ascend950 |
| MTE | uint32, uint64, int64 | Ascend950 |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64, float | Ascend950 |
The implementation dispatches to MTE or UDMA based on the transport topology. Pipeline synchronization requirements differ by transport and must be ensured externally by the caller:
- MTE path: writes operands to UB via Scalar, then MTE3 reads UB and performs the remote atomic add to GM.
  - Before calling: if another unit (e.g. MTE2) is also writing to the same UB region, the caller must fence those writes before Scalar writes to UB (e.g. SetFlag/WaitFlag on the MTE2_S event); otherwise UB data may be corrupted.
  - After calling: if there is a data dependency on the atomic add result, the caller must fence MTE3 before reading GM (e.g. SetFlag/WaitFlag on the MTE3_MTE2 event). Likewise, if new values will be written to the same UB region, the caller must fence MTE3 before overwriting UB (e.g. SetFlag/WaitFlag on the MTE3_S event); otherwise UB data may be overwritten before MTE3 has finished reading.
- UDMA path: the atomic add is issued asynchronously over UDMA. The caller must call aclshmemx_udma_quiet(pe) before reading the result to guarantee the operation has completed on the target PE.
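The ordering requirement on the asynchronous path can be illustrated with a small host-side model. This is only a sketch under stated assumptions: DemoAsyncTransport, post_atomic_add and quiet are hypothetical names, and the model deliberately delays every operation until quiet is called so that the worst case is deterministic. On real hardware the update may land earlier, but the caller cannot rely on that.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Host-side model of an asynchronous transport: operations are queued when
// posted and only take effect when the queue is drained.
// All names here are hypothetical and exist only for this illustration.
struct DemoAsyncTransport {
    std::vector<std::function<void()>> pending;

    // Models an asynchronous atomic add: returns before the add is applied.
    void post_atomic_add(int32_t *dst, int32_t value)
    {
        assert(dst != nullptr);
        pending.push_back([dst, value] { *dst += value; });
    }

    // Models a quiet call: returns only after every posted operation is applied.
    void quiet()
    {
        for (auto &op : pending) {
            op();
        }
        pending.clear();
    }
};

// Reads the target before and after quiet to show the ordering requirement.
inline bool demo_result_visible_only_after_quiet()
{
    int32_t target = 5;
    DemoAsyncTransport transport;
    transport.post_atomic_add(&target, 3);
    const int32_t before_quiet = target;  // still 5 in this model
    transport.quiet();
    const int32_t after_quiet = target;   // now 8
    return before_quiet == 5 && after_quiet == 8;
}
```

In this model a read placed between the post and the quiet observes the stale value, which is the situation the quiet requirement is meant to rule out.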
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_add(__gm__ TYPE *dst, TYPE value, int32_t pe)
- Function Description
Asynchronous interface. Atomically adds value to the symmetric data object on the specified PE that corresponds to the local address dst.
The MTE UB buffer offset defaults to 0 and can be adjusted via aclshmemx_set_mte_config(offset, ub_size, sync_id).
- Parameters
dst - [in] Pointer on local device of the destination data.
value - [in] Value to atomically add to the destination.
pe - [in] PE number of the remote PE.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_ADD_910_TYPENAME(NAME, TYPE)
- ACLSHMEM_ATOMIC_ADD_EXT_TYPENAME(NAME, TYPE)
- ACLSHMEM_ATOMIC_FETCH_ADD_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic fetch add functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | int32, uint32, uint64, int64, float | Ascend950 |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_fetch_add(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically adds value to the value at dest and returns the old value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to atomically add to the destination.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the addition.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
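Because fetch-add returns the value held before the addition, concurrent callers each receive a distinct starting point. The sketch below models that contract on the host with std::atomic, whose fetch_add has the same "return the old value" behavior. demo_fetch_add and demo_reserve_slots are hypothetical helper names, and the model ignores the pe argument and the transport entirely.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side stand-in for a fetch-add on one shared variable.
// std::atomic::fetch_add has the same "return the old value" contract.
inline int32_t demo_fetch_add(std::atomic<int32_t> &dest, int32_t value)
{
    return dest.fetch_add(value);
}

// Hands out a contiguous block of `count` slots and returns its first index.
// Each caller gets a distinct range because the old value is returned atomically.
inline int32_t demo_reserve_slots(std::atomic<int32_t> &next_free, int32_t count)
{
    assert(count > 0);
    return demo_fetch_add(next_free, count);
}
```

With a counter starting at 0, reserving 4 slots yields index 0 and a following reservation of 2 slots yields index 4, leaving the counter at 6.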
- ACLSHMEM_ATOMIC_INC_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic inc functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | int32, uint32, uint64, int64 | Ascend950 |
| ROCE | int32, uint32, int64, uint64, float | Ascend950 |
| UDMA | int32, uint32, int64, uint64, float | Ascend950 |
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_inc(__gm__ TYPE *dst, int32_t pe)
- Function Description
Synchronous interface. Atomically increments by 1 the symmetric data object on the specified PE that corresponds to the local address dst.
- Parameters
dst - [in] Pointer on local device of the destination data.
pe - [in] PE number of the remote PE.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_FETCH_INC_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic fetch inc functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | int32, uint32, uint64, int64, float | Ascend950 |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_fetch_inc(__gm__ TYPE *dest, int32_t pe)
- Function Description
Synchronous interface. Atomically increments the value at dest by 1 and returns the old value.
- Parameters
dest - [in] Pointer on local device of the destination data.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the increment.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_AND_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic and functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_and(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically performs bitwise AND operation between the value at dest and the specified value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to perform bitwise AND with.
pe - [in] PE number of the remote PE.
- ACLSHMEM_ATOMIC_OR_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic or functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_or(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically performs bitwise OR operation between the value at dest and the specified value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to perform bitwise OR with.
pe - [in] PE number of the remote PE.
- ACLSHMEM_ATOMIC_XOR_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic xor functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_xor(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically performs bitwise XOR operation between the value at dest and the specified value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to perform bitwise XOR with.
pe - [in] PE number of the remote PE.
- ACLSHMEM_ATOMIC_FETCH_AND_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic fetch and functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_fetch_and(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically performs bitwise AND operation between the value at dest and the specified value, and returns the old value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to perform bitwise AND with.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the operation.
- ACLSHMEM_ATOMIC_FETCH_OR_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic fetch or functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_fetch_or(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically performs bitwise OR operation between the value at dest and the specified value, and returns the old value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to perform bitwise OR with.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the operation.
- ACLSHMEM_ATOMIC_FETCH_XOR_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic fetch xor functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| ROCE | int32, uint32, int64, uint64 | Ascend950 |
| UDMA | int32, uint32, int64, uint64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_fetch_xor(__gm__ TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically performs bitwise XOR operation between the value at dest and the specified value, and returns the old value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to perform bitwise XOR with.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the operation.
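The fetching logic operations are typically useful for shared bit flags, since the returned old value tells the caller what the state was at the moment of the update. The following host-side sketch models the three fetch variants with std::atomic, which offers fetch_or, fetch_and and fetch_xor with the same "return the old value" contract. The helper names are hypothetical and the model does not involve a remote PE.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sets the bits in `mask` and reports whether they were all set already.
inline bool demo_test_and_set(std::atomic<uint32_t> &flags, uint32_t mask)
{
    assert(mask != 0);
    const uint32_t old_value = flags.fetch_or(mask);
    return (old_value & mask) == mask;
}

// Clears the bits in `mask` and returns the value before the update.
inline uint32_t demo_clear_bits(std::atomic<uint32_t> &flags, uint32_t mask)
{
    return flags.fetch_and(~mask);
}

// Toggles the bits in `mask` and returns the value before the update.
inline uint32_t demo_toggle_bits(std::atomic<uint32_t> &flags, uint32_t mask)
{
    return flags.fetch_xor(mask);
}
```

The non-fetching variants perform the same updates but discard the old value, so they fit cases where the caller does not need to know the previous state.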
- ACLSHMEM_ATOMIC_FETCH_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic fetch functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64, float | Ascend950 |
| ROCE | uint32, uint64, int32, int64, float | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_fetch(const TYPE *source, int32_t pe)
- Function Description
Synchronous interface. Atomically reads the value from source and returns it. This operation does not modify the data at source.
- Parameters
source - [in] Pointer on local device of the source data to read.
pe - [in] PE number of the remote PE.
- Return
The value at source.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_SET_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic set functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64, float | Ascend950 |
| ROCE | uint32, uint64, int32, int64, float | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_set(TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically sets the value at dest to the specified value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to set.
pe - [in] PE number of the remote PE.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_SET_TYPENAME_CAST(NAME, TYPE, SUBTYPE)
Automatically generates aclshmem atomic set functions for types requiring CAST. The macro parameters: NAME is the function name suffix, TYPE is the operation data type, SUBTYPE is the underlying type that TYPE is cast to.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64, float | Ascend950 |
| ROCE | uint32, uint64, int32, int64, float | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_atomic_set(TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically sets the value at dest to the specified value. Types are CAST to underlying unsigned integer types for the atomic operation.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to set.
pe - [in] PE number of the remote PE.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_SWAP_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic swap functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64, float | Ascend950 |
| ROCE | uint32, uint64, int32, int64, float | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_swap(TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically swaps the value at dest with the specified value and returns the old value.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to swap.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the swap.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
- ACLSHMEM_ATOMIC_SWAP_TYPENAME_CAST(NAME, TYPE, SUBTYPE)
Automatically generates aclshmem atomic swap functions for types requiring CAST. The macro parameters: NAME is the function name suffix, TYPE is the operation data type, SUBTYPE is the underlying type that TYPE is cast to.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64, float | Ascend950 |
| ROCE | uint32, uint64, int32, int64, float | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_swap(TYPE *dest, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically swaps the value at dest with the specified value and returns the old value. Types are CAST to underlying unsigned integer types for the atomic operation.
- Parameters
dest - [in] Pointer on local device of the destination data.
value - [in] Value to swap.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the swap.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
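The CAST variants perform the atomic operation on an unsigned integer of the same width as TYPE. A minimal host-side sketch for float follows, assuming the cast is a bit-level reinterpretation (the source does not spell this out, but a value conversion would not round-trip a float). The helper names are hypothetical, and std::atomic stands in for the remote memory cell.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

// Reinterprets the bits of a float as uint32_t without changing them.
inline uint32_t demo_float_to_bits(float value)
{
    uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Reinterprets uint32_t bits as a float.
inline float demo_bits_to_float(uint32_t bits)
{
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Swap on a float stored in a 32-bit unsigned cell: the exchange itself
// only ever sees the unsigned representation.
inline float demo_float_swap(std::atomic<uint32_t> &cell, float value)
{
    const uint32_t old_bits = cell.exchange(demo_float_to_bits(value));
    return demo_bits_to_float(old_bits);
}
```

Swapping works on the raw bit pattern, so the float value is returned unchanged. This also suggests why arithmetic operations are listed separately from the CAST families: adding two bit patterns as integers would not produce a float sum.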
- ACLSHMEM_ATOMIC_COMPARE_SWAP_TYPENAME(NAME, TYPE)
Automatically generates aclshmem atomic compare swap functions for different data types. The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64 | Ascend950 |
| ROCE | uint32, uint64, int32, int64 | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_compare_swap(TYPE *dest, TYPE cond, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically compares the value at dest with cond. If they are equal, the value at dest is set to value. Returns the old value at dest.
- Parameters
dest - [in] Pointer on local device of the destination data.
cond - [in] Value to compare against.
value - [in] Value to set if comparison succeeds.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the operation.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
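Compare-swap returns the old value whether or not the swap happened, so the caller detects success by comparing the result with cond. The sketch below models this on the host with std::atomic::compare_exchange_strong. demo_compare_swap and demo_try_lock are hypothetical helpers, and the lock-word convention (0 means free) is an assumption made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side stand-in for compare-swap with the "return the old value" contract.
inline uint32_t demo_compare_swap(std::atomic<uint32_t> &dest, uint32_t cond, uint32_t value)
{
    uint32_t expected = cond;
    // On failure, compare_exchange_strong stores the current value in `expected`;
    // on success, `expected` still equals `cond`, which was the old value.
    dest.compare_exchange_strong(expected, value);
    return expected;
}

// Tries to take a lock word: 0 means free, any other value is the owner id.
inline bool demo_try_lock(std::atomic<uint32_t> &lock_word, uint32_t owner_id)
{
    assert(owner_id != 0);
    return demo_compare_swap(lock_word, 0u, owner_id) == 0u;
}
```

A second caller that attempts the same lock receives the current owner id instead of 0 and can tell that the swap did not take place.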
- ACLSHMEM_ATOMIC_COMPARE_SWAP_TYPENAME_CAST(NAME, TYPE, SUBTYPE)
Automatically generates aclshmem atomic compare swap functions for types requiring CAST. The macro parameters: NAME is the function name suffix, TYPE is the operation data type, SUBTYPE is the underlying type that TYPE is cast to.
| Path | Supported Types | Hardware Platform |
| --- | --- | --- |
| MTE | uint32, uint64, int32, int64 | Ascend950 |
| ROCE | uint32, uint64, int32, int64 | Ascend950 |
| UDMA | uint32, uint64, int32, int64 | Ascend950 |
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_atomic_compare_swap(TYPE *dest, TYPE cond, TYPE value, int32_t pe)
- Function Description
Synchronous interface. Atomically compares the value at dest with cond. If they are equal, the value at dest is set to value. Returns the old value at dest. Types are CAST to underlying unsigned integer types for the atomic operation.
- Parameters
dest - [in] Pointer on local device of the destination data.
cond - [in] Value to compare against.
value - [in] Value to set if comparison succeeds.
pe - [in] PE number of the remote PE.
- Return
The old value at dest before the operation.
Note
The MTE transport for this operation does not support cross-PCIe (inter-node) communication. Use the ROCE or UDMA transport paths for cross-PCIe scenarios.
shmem_device_cc.h
shmem device Collective Communication APIs
Functions
- ACLSHMEM_DEVICE void util_set_ffts_config (uint64_t config)
Sets the runtime FFTS address. Call this at the MIX kernel entry point if the kernel contains barrier calls.
- Parameters:
config – [in] FFTS config, acquired by util_get_ffts_config()
- ACLSHMEM_DEVICE void aclshmem_barrier (aclshmem_team_t team)
aclshmem_barrier is a collective synchronization routine over a team. Control returns from aclshmem_barrier after all PEs in the team have called aclshmem_barrier. aclshmem_barrier ensures that all previously issued stores and remote memory updates, including AMOs and RMA operations, done by any of the PEs in the active set are complete before returning. On systems with only a scale-up network (HCCS), updates are globally visible, whereas on systems with both a scale-up network (HCCS) and a scale-out network (RDMA), ACLSHMEM only guarantees that updates to the memory of a given PE are visible to that PE. Barrier operations issued on the CPU and the NPU only complete communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.
- Parameters:
team – [in] team to do barrier
- ACLSHMEM_DEVICE void aclshmem_barrier_all (void)
aclshmem_barrier of all PEs.
- ACLSHMEM_DEVICE void aclshmemx_barrier_vec (aclshmem_team_t team)
Similar to aclshmem_barrier, except that only vector cores participate. Useful in communication-over-compute operators. Cube cores may call the API, but the call has no effect.
- Parameters:
team – [in] team to do barrier
- ACLSHMEM_DEVICE void aclshmemx_barrier_all_vec (void)
aclshmemx_barrier_vec of all PEs.
- ACLSHMEM_DEVICE void aclshmem_sync (aclshmem_team_t team)
Similar to aclshmem_barrier. In contrast with the aclshmem_barrier routine, aclshmem_sync only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ACLSHMEM routines.
- Parameters:
team – [in] team to synchronize
- ACLSHMEM_DEVICE void aclshmem_sync_all (void)
aclshmem_sync of all PEs.
shmem_device_mo.h
Functions
- ACLSHMEM_DEVICE void aclshmem_quiet (void)
The aclshmem_quiet routine ensures completion of all operations on symmetric data objects issued by the calling PE. On systems with only a scale-up network (HCCS), updates are globally visible, whereas on systems with both a scale-up network (HCCS) and a scale-out network (RDMA), ACLSHMEM only guarantees that updates to the memory of a given PE are visible to that PE. Quiet operations issued on the CPU and the NPU only complete communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.
- ACLSHMEM_DEVICE void aclshmem_fence (void)
In the OpenSHMEM specification, the fence routine assures ordering of delivery of Put, AMOs, and memory store routines to symmetric data objects, but does not guarantee the completion of these operations. However, due to hardware capabilities, aclshmem_fence is implemented the same way as aclshmem_quiet, ensuring both ordering and completion. Fence operations issued on the CPU and the NPU only order communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.
shmem_device_p2p_sync.h
Defines
- ACLSHMEM_WAIT_UNTIL(NAME, TYPE)
Automatically generates aclshmem wait until functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_wait_until(__gm__ TYPE *ivar, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by blocking until the value at ivar satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivar - [in] Symmetric address of a remotely accessible data object. The type of ivar should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
cmp - [in] The comparison operator that compares ivar with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
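The wait condition amounts to evaluating "value at ivar, cmp, cmp_value" repeatedly until it holds. The sketch below is a host-side model under stated assumptions: the DemoCmp enumerators are hypothetical stand-ins for the ACLSHMEM_CMP_* operators (whose real numeric values are defined by the library), and the poll loop is bounded by a max_polls argument that the real routine does not have, so that the example always terminates.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the ACLSHMEM_CMP_* operators; the real
// constants and their numeric values are defined by the library.
enum DemoCmp { DEMO_CMP_EQ, DEMO_CMP_NE, DEMO_CMP_GT, DEMO_CMP_GE, DEMO_CMP_LT, DEMO_CMP_LE };

// Returns true when `value cmp cmp_value` holds.
template <typename T>
inline bool demo_cmp_holds(T value, DemoCmp cmp, T cmp_value)
{
    switch (cmp) {
        case DEMO_CMP_EQ: return value == cmp_value;
        case DEMO_CMP_NE: return value != cmp_value;
        case DEMO_CMP_GT: return value > cmp_value;
        case DEMO_CMP_GE: return value >= cmp_value;
        case DEMO_CMP_LT: return value < cmp_value;
        case DEMO_CMP_LE: return value <= cmp_value;
    }
    return false;
}

// Polls `ivar` until the condition holds or `max_polls` is exhausted.
// The bound is a host-side convenience; the function returns the number of
// polls used, or -1 on timeout.
inline int demo_wait_until(const std::atomic<int32_t> &ivar, DemoCmp cmp, int32_t cmp_value,
                           int max_polls)
{
    assert(max_polls > 0);
    for (int polls = 1; polls <= max_polls; ++polls) {
        if (demo_cmp_holds(ivar.load(), cmp, cmp_value)) {
            return polls;
        }
    }
    return -1;
}
```

In a typical signaling pattern, another PE updates the flag with an atomic set or add, and the waiting PE passes a "greater than or equal" operator with the expected count.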
- ACLSHMEM_WAIT(NAME, TYPE)
Automatically generates aclshmem wait functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_wait(__gm__ TYPE *ivar, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by blocking until the value of ivar is not equal to the comparison value, cmp_value.
- Parameters
ivar - [in] Symmetric address of a remotely accessible data object. The type of ivar should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- ACLSHMEM_WAIT_UNTIL_ALL(NAME, TYPE)
Automatically generates aclshmem wait until all functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_wait_until_all(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by blocking until all entries in the wait set specified by ivars and status satisfy the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; if status[i] != 0, then ivars[i] is excluded from the wait set; if status is NULL, all elements of ivars are included in the wait set.
cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value to be compared with the elements of ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- ACLSHMEM_WAIT_UNTIL_ANY(NAME, TYPE)
Automatically generates aclshmem wait until any functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_any(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by blocking until any one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; if status[i] != 0, then ivars[i] is excluded from the wait set; if status is NULL, all elements of ivars are included in the wait set.
cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value to be compared with the elements of ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the index of an element in the ivars array that satisfies the wait condition. If the wait set is empty, this routine returns SIZE_MAX.
- ACLSHMEM_WAIT_UNTIL_SOME(NAME, TYPE)
Automatically generates aclshmem wait until some functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_some(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by blocking until at least one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
indices - [out] Local address of an array of indices of length at least nelems into ivars that satisfied the wait condition.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; if status[i] != 0, then ivars[i] is excluded from the wait set; if status is NULL, all elements of ivars are included in the wait set.
cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value to be compared with the elements of ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the number of indices returned in the indices array. If the wait set is empty, this routine returns 0.
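The interaction between ivars, status and indices can be modeled with a single, non-blocking pass over the wait set. The sketch below is a host-side illustration only: demo_test_some_eq is a hypothetical helper, it hard-codes the equality operator, and it evaluates the condition once instead of blocking until at least one element satisfies it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One non-blocking evaluation of the "some" wait set for the equality operator.
// Writes the indices of satisfied, non-masked elements to `indices` and returns
// how many were written. A null `status` includes every element.
inline std::size_t demo_test_some_eq(const int32_t *ivars, std::size_t nelems,
                                     std::size_t *indices, const int *status,
                                     int32_t cmp_value)
{
    assert(ivars != nullptr && indices != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; i < nelems; ++i) {
        // status[i] == 0 means "included"; a non-zero entry masks the element out.
        const bool in_wait_set = (status == nullptr) || (status[i] == 0);
        if (in_wait_set && ivars[i] == cmp_value) {
            indices[count++] = i;
        }
    }
    return count;
}
```

Since the routine can report up to nelems indices, the indices array needs room for that many entries, which matches the "length at least nelems" requirement in the parameter description.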
- ACLSHMEM_WAIT_UNTIL_ALL_VECTOR(NAME, TYPE)
Automatically generates aclshmem wait until all vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_all_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)
- Function Description
Implements point-to-point synchronization by blocking until all entries in the wait set specified by ivars and status satisfy the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; if status[i] != 0, then ivars[i] is excluded from the wait set; if status is NULL, all elements of ivars are included in the wait set.
cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- ACLSHMEM_WAIT_UNTIL_ANY_VECTOR(NAME, TYPE)
Automatically generates aclshmem wait until any vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_any_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)
- Function Description
Implements point-to-point synchronization by blocking until any one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; if status[i] != 0, then ivars[i] is excluded from the wait set; if status is NULL, all elements of ivars are included in the wait set.
cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the index of an element in the ivars array that satisfies the wait condition. If the wait set is empty, this routine returns SIZE_MAX.
-
ACLSHMEM_WAIT_UNTIL_SOME_VECTOR(NAME, TYPE)
Automatically generates aclshmem wait until some vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_some_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)
- Function Description
Implements point-to-point synchronization by blocking until at least one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the comparison values, cmp_values.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
indices - [out] Local address of an array of length at least nelems that receives the indices into ivars of the elements that satisfied the wait condition.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; if status[i] != 0, then ivars[i] is excluded from the wait set; if status is NULL, all elements of ivars are included in the wait set.
cmp - [in] The comparison operator that compares each element of ivars with the corresponding element of cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the number of indices returned in the indices array. If the wait set is empty, this routine returns 0.
-
ACLSHMEM_TEST(NAME, TYPE)
Automatically generates aclshmem test functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE int aclshmem_NAME_test(__gm__ TYPE *ivars, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by testing whether the value at ivars satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivars - [in] Symmetric address of a remotely accessible data object. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
cmp - [in] The comparison operator that compares the value at ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value against which the object pointed to by ivars will be compared. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return 1 if the comparison (via the operator cmp) between the value at ivars and cmp_value results in true; otherwise, return 0.
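The test semantics above can be illustrated with a small host-side model. This is only a sketch: ModelCmp, model_cmp_holds and model_test are hypothetical names introduced here, the enumerators merely stand in for the ACLSHMEM_CMP_* constants (whose real values are not assumed), and the code runs on plain host memory rather than symmetric device memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for ACLSHMEM_CMP_EQ ... ACLSHMEM_CMP_LE.
// The real constants come from the library headers.
enum ModelCmp {
    MODEL_CMP_EQ, MODEL_CMP_NE, MODEL_CMP_GT,
    MODEL_CMP_GE, MODEL_CMP_LT, MODEL_CMP_LE
};

// True when "value <cmp> cmp_value" holds for the selected operator.
template <typename T>
bool model_cmp_holds(T value, ModelCmp cmp, T cmp_value) {
    switch (cmp) {
        case MODEL_CMP_EQ: return value == cmp_value;
        case MODEL_CMP_NE: return value != cmp_value;
        case MODEL_CMP_GT: return value >  cmp_value;
        case MODEL_CMP_GE: return value >= cmp_value;
        case MODEL_CMP_LT: return value <  cmp_value;
        case MODEL_CMP_LE: return value <= cmp_value;
    }
    assert(false && "unknown comparison operator");
    return false;
}

// Host-side model of aclshmem_NAME_test: a single non-blocking check
// that yields 1 when the condition holds and 0 otherwise.
template <typename T>
int model_test(const T *ivars, ModelCmp cmp, T cmp_value) {
    return model_cmp_holds(*ivars, cmp, cmp_value) ? 1 : 0;
}
```

Unlike the wait routines, the model returns immediately in both cases, so a caller can interleave other work between checks.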
-
ACLSHMEM_TEST_ANY(NAME, TYPE)
Automatically generates aclshmem test any functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_test_any(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by testing whether any one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; if status[i] != 0, then ivars[i] is excluded from the test set; if status is NULL, all elements of ivars are included in the test set.
cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value to be compared with ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the index of an element in the ivars array that satisfies the test condition. If the test set is empty or no conditions in the test set are satisfied, this routine returns SIZE_MAX.
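How the status mask and the SIZE_MAX result interact can be sketched with a host-side model. The function name model_test_any_ge is hypothetical, the comparison is fixed to "greater than or equal" for brevity, and the model reports the lowest satisfying index, whereas the description above only promises the index of some satisfying element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side model of aclshmem_NAME_test_any with cmp fixed to >=.
// Returns the index of an included element that satisfies the condition,
// or SIZE_MAX when no included element does or the test set is empty.
template <typename T>
std::size_t model_test_any_ge(const T *ivars, std::size_t nelems,
                              const int *status, T cmp_value) {
    assert(ivars != nullptr || nelems == 0);
    for (std::size_t i = 0; i < nelems; ++i) {
        // A NULL status includes every element; otherwise 0 means "included".
        const bool included = (status == nullptr) || (status[i] == 0);
        if (included && ivars[i] >= cmp_value) {
            return i;
        }
    }
    return SIZE_MAX;
}
```

Because SIZE_MAX covers both "nothing satisfied yet" and "empty test set", a polling loop that masks out every element should track that case itself rather than rely on the return value.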
-
ACLSHMEM_TEST_SOME(NAME, TYPE)
Automatically generates aclshmem test some functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_test_some(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, TYPE cmp_value)
- Function Description
Implements point-to-point synchronization by testing whether at least one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
indices - [out] Local address of an array of length at least nelems that receives the indices into ivars of the elements that satisfied the test condition.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; if status[i] != 0, then ivars[i] is excluded from the test set; if status is NULL, all elements of ivars are included in the test set.
cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_value - [in] The value to be compared with ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the number of indices returned in the indices array. If the test set is empty, this routine returns 0.
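A common way to use a "some" routine is to process the reported indices and then exclude them through status so that later calls skip them. The following host-side sketch models that pattern; model_test_some_eq and model_exclude_reported are hypothetical helpers, and the comparison is fixed to equality for brevity.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of aclshmem_NAME_test_some with cmp fixed to ==.
// Writes the indices of all included, satisfying elements to `indices`
// and returns how many were written (0 for an empty test set).
template <typename T>
std::size_t model_test_some_eq(const T *ivars, std::size_t nelems,
                               std::size_t *indices, const int *status,
                               T cmp_value) {
    assert(indices != nullptr || nelems == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < nelems; ++i) {
        const bool included = (status == nullptr) || (status[i] == 0);
        if (included && ivars[i] == cmp_value) {
            indices[count++] = i;
        }
    }
    return count;
}

// Marks every reported index as excluded so that a later call
// does not report the same element again.
inline void model_exclude_reported(int *status, const std::size_t *indices,
                                   std::size_t count) {
    for (std::size_t k = 0; k < count; ++k) {
        status[indices[k]] = 1;
    }
}
```

The indices array is sized for nelems entries because, in the worst case, every element of the test set satisfies the condition in a single call.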
-
ACLSHMEM_TEST_ALL_VECTOR(NAME, TYPE)
Automatically generates aclshmem test all vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_test_all_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)
- Function Description
Implements point-to-point synchronization by testing whether all entries in the test set specified by ivars and status satisfy the condition defined by the comparison operator, cmp, and the comparison values, cmp_values.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; if status[i] != 0, then ivars[i] is excluded from the test set; if status is NULL, all elements of ivars are included in the test set.
cmp - [in] The comparison operator that compares each element of ivars with the corresponding element of cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return 1 if all elements in the test set satisfy the test conditions or if nelems is 0; otherwise, return 0.
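The vector variants compare each element with its own entry in cmp_values instead of a single scalar. A host-side sketch of the "all" case follows; model_test_all_vector_ge is a hypothetical name, the comparison is fixed to >=, and treating a fully masked-out test set as satisfied is an assumption of this model rather than documented behaviour.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of aclshmem_NAME_test_all_vector with cmp fixed to >=.
// Element i is checked against cmp_values[i]; excluded elements are skipped.
// Returns 1 when every included element satisfies its condition
// (also when nelems is 0), otherwise 0.
template <typename T>
int model_test_all_vector_ge(const T *ivars, std::size_t nelems,
                             const int *status, const T *cmp_values) {
    assert(nelems == 0 || (ivars != nullptr && cmp_values != nullptr));
    for (std::size_t i = 0; i < nelems; ++i) {
        if (status != nullptr && status[i] != 0) {
            continue;  // excluded from the test set
        }
        if (!(ivars[i] >= cmp_values[i])) {
            return 0;
        }
    }
    return 1;
}
```

Per-element targets are useful when each slot tracks a different counter, for example one arrival count per peer with a different expected total for each.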
-
ACLSHMEM_TEST_ANY_VECTOR(NAME, TYPE)
Automatically generates aclshmem test any vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_test_any_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)
- Function Description
Implements point-to-point synchronization by testing whether any one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the comparison values, cmp_values.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; if status[i] != 0, then ivars[i] is excluded from the test set; if status is NULL, all elements of ivars are included in the test set.
cmp - [in] The comparison operator that compares each element of ivars with the corresponding element of cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the index of an element in the ivars array that satisfies the test condition. If the test set is empty or no conditions in the test set are satisfied, this routine returns SIZE_MAX.
-
ACLSHMEM_TEST_SOME_VECTOR(NAME, TYPE)
Automatically generates aclshmem test some vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE size_t aclshmem_NAME_test_some_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)
- Function Description
Implements point-to-point synchronization by testing whether at least one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the comparison values, cmp_values.
- Parameters
ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
nelems - [in] The number of elements in the ivars array.
indices - [out] Local address of an array of length at least nelems that receives the indices into ivars of the elements that satisfied the test condition.
status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; if status[i] != 0, then ivars[i] is excluded from the test set; if status is NULL, all elements of ivars are included in the test set.
cmp - [in] The comparison operator that compares each element of ivars with the corresponding element of cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
- Returns
Return the number of indices returned in the indices array. If the test set is empty, this routine returns 0.
Functions
- ACLSHMEM_DEVICE void aclshmemx_signal_op (__gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
The aclshmemx_signal_op operation updates sig_addr with signal using operation sig_op on the specified PE. This operation can be used together with aclshmem_signal_wait_until for efficient point-to-point synchronization. On Ascend950, RDMA is used as the communication engine, supporting signal_set with guaranteed atomicity.
- Parameters:
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
sig_op – [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD.
pe – [in] PE number of the remote PE.
- ACLSHMEM_DEVICE int32_t aclshmem_signal_wait_until (__gm__ int32_t *sig_addr, int cmp, int32_t cmp_val)
This routine can be used to implement point-to-point synchronization between PEs or between threads within the same PE. A call to this routine blocks until the value of sig_addr at the calling PE satisfies the wait condition specified by the comparison operator, cmp, and comparison value, cmp_val.
- Parameters:
sig_addr – [in] Local address of the source signal variable.
cmp – [in] The comparison operator that compares sig_addr with cmp_val. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.
cmp_val – [in] The value against which the object pointed to by sig_addr will be compared.
- Returns:
Return the contents of the signal data object, sig_addr, at the calling PE that satisfies the wait condition.
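The pairing of a signal update with a wait on the same word can be modelled on the host with std::atomic. This is a sketch under stated assumptions: ModelSignalOp, model_signal_op and model_signal_wait_until_ge are hypothetical names, the enumerators only stand in for ACLSHMEM_SIGNAL_SET and ACLSHMEM_SIGNAL_ADD, the wait is fixed to "greater than or equal", and the release/acquire orderings are a host analogue, not a statement about the device memory model.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for ACLSHMEM_SIGNAL_SET / ACLSHMEM_SIGNAL_ADD.
enum ModelSignalOp { MODEL_SIGNAL_SET, MODEL_SIGNAL_ADD };

// Model of the update that aclshmemx_signal_op applies to the signal word
// on the target PE: either overwrite it or add to it atomically.
inline void model_signal_op(std::atomic<std::int32_t> &sig,
                            std::int32_t signal, ModelSignalOp op) {
    if (op == MODEL_SIGNAL_SET) {
        sig.store(signal, std::memory_order_release);
    } else {
        assert(op == MODEL_SIGNAL_ADD);
        sig.fetch_add(signal, std::memory_order_release);
    }
}

// Model of aclshmem_signal_wait_until with cmp fixed to >=: spin until the
// word reaches cmp_val, then return the value that satisfied the condition.
inline std::int32_t model_signal_wait_until_ge(
        const std::atomic<std::int32_t> &sig, std::int32_t cmp_val) {
    std::int32_t seen = sig.load(std::memory_order_acquire);
    while (seen < cmp_val) {
        seen = sig.load(std::memory_order_acquire);
    }
    return seen;
}
```

With the ADD operation, several producers can each contribute to one counter and the consumer waits for the expected total; with SET, the word carries a single state value. In the real API the writer is another PE, so the model's single address stands for the symmetric signal word on the waiting PE.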
shmem_device_rma.h
Defines
-
ACLSHMEM_TYPE_FUNC(FUNC)
Standard RMA Types and Names.
NAME       TYPE
half       half
float      float
double     double
int8       int8
int16      int16
int32      int32
int64      int64
uint8      uint8
uint16     uint16
uint32     uint32
uint64     uint64
char       char
bfloat16   bfloat16
-
ACLSHMEM_TEST_TYPE_FUNC(FUNC)
Standard test Types and Names.
NAME       TYPE
float      float
int8       int8
int16      int16
int32      int32
int64      int64
uint8      uint8
uint16     uint16
uint32     uint32
uint64     uint64
char       char
-
ACLSHMEM_TYPENAME_P_AICORE(NAME, TYPE)
Automatically generates aclshmem p functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_p(__gm__ TYPE *dst, const TYPE value, int pe)
- Function Description
Provide a low latency put capability for single element of most basic types.
- Parameters
dst - [in] Symmetric address of the destination data.
value - [in] The element to be put.
pe - [in] The number of the remote PE.
-
ACLSHMEM_TYPENAME_G_AICORE(NAME, TYPE)
Automatically generates aclshmem g functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE TYPE aclshmem_NAME_g(__gm__ TYPE *src, int32_t pe)
- Function Description
Provide a low latency get capability for single element of most basic types.
- Parameters
src - [in] Symmetric address of the source data.
pe - [in] The number of the remote PE.
- Returns
A single element of the type that src points to.
-
ACLSHMEM_GET_TYPENAME_MEM(NAME, TYPE)
Automatically generates aclshmem get functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
- Function Description
Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] Pointer on local device of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_IGET_TYPENAME_MEM(NAME, TYPE)
Automatically generates aclshmem iget functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_iget(__gm__ TYPE *dest, __gm__ TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)
- Function Description
Synchronous interface. Copy strided data elements from a symmetric array from a specified remote PE to strided locations on a local array.
- Parameters
dest - [in] Pointer on local device of the destination data.
source - [in] Pointer on Symmetric memory of the source data.
dst - [in] The stride between consecutive elements of the dest array.
sst - [in] The stride between consecutive elements of the source array.
nelems - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
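The element mapping of a strided get can be written out as a host-side model. The name model_iget is hypothetical, both arrays live in ordinary host memory, and the strides are assumed to be counted in elements (not bytes) and to be at least 1; the description above does not state the unit.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of the index mapping used by aclshmem_NAME_iget:
//   dest[i * dst] = source[i * sst]   for i in [0, nelems)
// Assumption: strides are expressed in elements of type T.
template <typename T>
void model_iget(T *dest, const T *source, std::ptrdiff_t dst,
                std::ptrdiff_t sst, std::size_t nelems) {
    assert(dst >= 1 && sst >= 1);
    for (std::size_t i = 0; i < nelems; ++i) {
        const std::ptrdiff_t k = static_cast<std::ptrdiff_t>(i);
        dest[k * dst] = source[k * sst];
    }
}
```

For example, dst = 1 with sst = 2 gathers every second source element into a dense local array, while dst = 3 with sst = 1 scatters consecutive source elements into every third local slot. Under this assumption the destination must span at least (nelems - 1) * dst + 1 elements, and the source at least (nelems - 1) * sst + 1.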
-
ACLSHMEM_GET_SIZE_MEM(BITS)
Automatically generates aclshmem get functions for different bit widths (e.g., 8, 16). The macro parameter BITS is the data width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_getBITS(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
- Function Description
Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] Pointer on local device of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_IGET_SIZE_MEM(BITS)
Automatically generates aclshmem iget functions for different bit widths (e.g., 8, 16). The macro parameter BITS is the data width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_igetBITS(__gm__ void *dest, __gm__ void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)
- Function Description
Synchronous interface. Copy strided data elements from a symmetric array from a specified remote PE to strided locations on a local array.
- Parameters
dest - [in] Pointer on local device of the destination data.
source - [in] Pointer on Symmetric memory of the source data.
dst - [in] The stride between consecutive elements of the dest array.
sst - [in] The stride between consecutive elements of the source array.
nelems - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM(NAME, TYPE)
Automatically generates aclshmem put functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
- Function Description
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_IPUT_TYPENAME_MEM(NAME, TYPE)
Automatically generates aclshmem iput functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_iput(__gm__ TYPE *dest, __gm__ TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)
- Function Description
Synchronous interface. Copy strided data elements (specified by sst) of an array from a source array on the local PE to locations specified by stride dst on a dest array on specified remote PE.
- Parameters
dest - [in] Pointer on Symmetric memory of the destination data.
source - [in] Pointer on local device of the source data.
dst - [in] The stride between consecutive elements of the dest array.
sst - [in] The stride between consecutive elements of the source array.
nelems - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_SIZE_MEM(BITS)
Automatically generates aclshmem put functions for different bit widths (e.g., 8, 16). The macro parameter BITS is the data width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_putBITS(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
- Function Description
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_IPUT_SIZE_MEM(BITS)
Automatically generates aclshmem iput functions for different bit widths (e.g., 8, 16). The macro parameter BITS is the data width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_iputBITS(__gm__ void *dest, __gm__ void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)
- Function Description
Synchronous interface. Copy strided data elements (specified by sst) of an array from a source array on the local PE to locations specified by stride dst on a dest array on specified remote PE.
- Parameters
dest - [in] Pointer on Symmetric memory of the destination data.
source - [in] Pointer on local device of the source data.
dst - [in] The stride between consecutive elements of the dest array.
sst - [in] The stride between consecutive elements of the source array.
nelems - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_GET_TYPENAME_MEM_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
- Function Description
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] Pointer on local device of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_GET_SIZE_MEM_NBI(BITS)
Automatically generates aclshmem get nbi functions for different bit widths (e.g., 8, 16). The macro parameter BITS is the data width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_getBITS_nbi(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
- Function Description
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, or SDMA as the underlying transport.
- Parameters
dst - [in] Pointer on local device of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_GET_TYPENAME_MEM_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters
dst - [in] Pointer on local device of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_GET_TYPENAME_MEM_TENSOR_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, uint32_t elem_size, int pe)
- Function Description
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] GlobalTensor on local device of the destination data.
src - [in] GlobalTensor on Symmetric memory of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_GET_TYPENAME_MEM_TENSOR_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters
dst - [in] GlobalTensor on local device of the destination data.
src - [in] GlobalTensor on Symmetric memory of the source data.
copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)
- Function Description
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_PUT_SIZE_MEM_NBI(BITS)
Automatically generates aclshmem put nbi functions for different bit widths (e.g., 8, 16). The macro parameter BITS is the data width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_putBITS_nbi(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
- Function Description
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, or SDMA as the underlying transport.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_PUT_TYPENAME_MEM_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_TENSOR_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, uint32_t elem_size, int pe)
- Function Description
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
- Parameters
dst - [in] GlobalTensor on Symmetric memory of the destination data.
src - [in] GlobalTensor on local device of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
-
ACLSHMEM_PUT_TYPENAME_MEM_TENSOR_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters
dst - [in] GlobalTensor on Symmetric memory of the destination data.
src - [in] GlobalTensor on local device of the source data.
copy_params - [in] Parameters describing how the non-contiguous data is laid out in src and dst.
pe - [in] PE number of the remote PE.
Functions
- ACLSHMEM_DEVICE void aclshmem_getmem (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- ACLSHMEM_DEVICE void aclshmem_putmem (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, SDMA, or UDMA as the underlying transport.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- ACLSHMEM_DEVICE void aclshmem_getmem_nbi (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. Supports MTE, RDMA, or SDMA as the underlying transport.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- ACLSHMEM_DEVICE void aclshmem_putmem_nbi (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. Supports MTE, RDMA, or SDMA as the underlying transport.
Warning
Concurrent RMA/AMO operations to the same PE are NOT supported when using RDMA as the underlying transport. When using RDMA or SDMA, the corresponding sync_id from device_state’s rdma_config or sdma_config is used for pipeline synchronization.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
pe – [in] PE number of the remote PE.
- ACLSHMEM_DEVICE void aclshmemx_set_mte_config (uint64_t offset, uint32_t ub_size, uint32_t sync_id)
Set necessary parameters for put or get.
- Parameters:
offset – [in] The start address on UB.
ub_size – [in] Size of the temporary UB buffer.
sync_id – [in] Sync ID for put or get.
Defines
-
ACLSHMEM_TYPE_FUNC(FUNC)
Standard RMA Types and Names.
NAME
TYPE
half
half
float
float
double
double
int8
int8_t
int16
int16_t
int32
int32_t
int64
int64_t
uint8
uint8_t
uint16
uint16_t
uint32
uint32_t
uint64
uint64_t
char
char
bfloat16
bfloat16_t
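A type table like the one above is commonly consumed as an X-macro: the list macro takes a generator macro `FUNC` and applies it once per `(NAME, TYPE)` pair. The following host-side sketch illustrates that pattern only. `DEMO_TYPE_FUNC`, `DEMO_DEFINE_COPY`, and the generated `demo_NAME_copy` functions are hypothetical names for illustration, not part of the aclshmem headers, and the actual macro bodies may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type list: applies FUNC to each (NAME, TYPE) pair.
#define DEMO_TYPE_FUNC(FUNC) \
    FUNC(float, float)       \
    FUNC(int8, int8_t)       \
    FUNC(uint32, uint32_t)

// Hypothetical generator: defines demo_<NAME>_copy for one TYPE.
#define DEMO_DEFINE_COPY(NAME, TYPE)                                   \
    inline void demo_##NAME##_copy(TYPE *dst, const TYPE *src,         \
                                   std::size_t elem_size)              \
    {                                                                  \
        assert(dst != nullptr && src != nullptr);                      \
        for (std::size_t i = 0; i < elem_size; ++i) {                  \
            dst[i] = src[i];                                           \
        }                                                              \
    }

// Expands to demo_float_copy, demo_int8_copy, and demo_uint32_copy.
DEMO_TYPE_FUNC(DEMO_DEFINE_COPY)
```

Adding a row to the list macro is enough to generate one more typed function, which is why the NAME column (function name part) and the TYPE column (C++ type) are kept separate.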
-
ACLSHMEM_GET_TYPENAME_MEM_UB_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__ubuf__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int pe)
- Function Description
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters
dst - [in] Pointer on local UB of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_GET_TYPENAME_MEM_UB_TENSOR_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::LocalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, uint32_t elem_size, int pe)
- Function Description
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters
dst - [in] LocalTensor on local UB of the destination data.
src - [in] GlobalTensor on Symmetric memory of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_GET_TYPENAME_MEM_UB_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__ubuf__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters
dst - [in] Pointer on local UB of the destination data.
src - [in] Pointer on Symmetric memory of the source data.
copy_params - [in] Parameters describing how the non-contiguous data is laid out in src and dst.
pe - [in] PE number of the remote PE.
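To make the idea of a non-contiguous copy concrete, here is a host-side sketch of a row-wise strided copy. The fields of `non_contiguous_copy_param` are not listed in this section, so the sketch uses its own hypothetical descriptor, `strided_copy_desc`, with assumed fields (row count, row length, and source/destination strides, all in elements). It models the data movement only, not the UB transfer or the remote access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor for a strided (non-contiguous) copy.
// All counts are in elements, not bytes.
struct strided_copy_desc {
    uint32_t repeat;      // number of rows to copy
    uint32_t length;      // contiguous elements per row
    uint32_t src_stride;  // distance between row starts in src
    uint32_t dst_stride;  // distance between row starts in dst
};

// Copy `repeat` rows of `length` elements, stepping src and dst by their strides.
template <typename T>
void strided_copy(T *dst, const T *src, const strided_copy_desc &p)
{
    // Rows must not overlap within either buffer.
    assert(p.length <= p.src_stride && p.length <= p.dst_stride);
    for (uint32_t r = 0; r < p.repeat; ++r) {
        const T *src_row = src + static_cast<std::size_t>(r) * p.src_stride;
        T *dst_row = dst + static_cast<std::size_t>(r) * p.dst_stride;
        for (uint32_t i = 0; i < p.length; ++i) {
            dst_row[i] = src_row[i];
        }
    }
}
```

For example, with 2 rows, a row length of 2, a source stride of 4, and a destination stride of 2, the first two elements of each 4-element source row are packed into a compact 2x2 destination.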
-
ACLSHMEM_GET_TYPENAME_MEM_UB_TENSOR_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::LocalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters
dst - [in] LocalTensor on local UB of the destination data.
src - [in] GlobalTensor on Symmetric memory of the source data.
copy_params - [in] Parameters describing how the non-contiguous data is laid out in src and dst.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_UB_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __ubuf__ TYPE *src, uint32_t elem_size, int32_t pe)
- Function Description
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local UB of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_UB_TENSOR_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::LocalTensor<TYPE> src, uint32_t elem_size, int32_t pe)
- Function Description
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters
dst - [in] GlobalTensor on Symmetric memory of the destination data.
src - [in] LocalTensor on local UB of the source data.
elem_size - [in] Number of elements in the destination and source arrays.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_UB_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __ubuf__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local UB of the source data.
copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_UB_TENSOR_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::LocalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int32_t pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters
dst - [in] GlobalTensor on Symmetric memory of the destination data.
src - [in] LocalTensor on local UB of the source data.
copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.
pe - [in] PE number of the remote PE.
Functions
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__ubuf__ T *dst, __gm__ T *src, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] Pointer on local UB of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::LocalTensor< T > dst, AscendC::GlobalTensor< T > src, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] LocalTensor on local UB of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__ubuf__ T *dst, __gm__ T *src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] Pointer on local UB of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::LocalTensor< T > dst, AscendC::GlobalTensor< T > src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.
- Parameters:
dst – [in] LocalTensor on local UB of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __ubuf__ T *src, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local UB of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::LocalTensor< T > src, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] LocalTensor on local UB of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __ubuf__ T *src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local UB of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::LocalTensor< T > src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] LocalTensor on local UB of the source data.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
Functions
- ACLSHMEM_DEVICE __gm__ void * aclshmem_ptr (__gm__ void *ptr, int pe)
Translate a local symmetric address to the corresponding remote symmetric address on the specified PE.
- Parameters:
ptr – [in] Symmetric address on local PE.
pe – [in] The number of the remote PE.
- Returns:
A remote symmetric address on the specified PE that can be accessed using memory loads and stores.
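One way to picture this translation: if every PE lays out its symmetric heap identically, the remote address is the remote heap base plus the offset of `ptr` within the local heap. The host-side sketch below assumes exactly that model. `symmetric_heaps` and `translate_symmetric` are hypothetical names, and the library's actual mapping mechanism is not described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view of the symmetric heaps: one base address per PE,
// all heaps having the same size and layout.
struct symmetric_heaps {
    unsigned char *const *bases;  // bases[pe] is the heap base of that PE
    int n_pes;                    // number of PEs
    std::size_t heap_size;        // size of each heap in bytes
    int my_pe;                    // index of the calling PE
};

// Map a local symmetric address to the address of the same object on `pe`.
// Returns nullptr if pe is out of range or ptr is outside the local heap.
inline void *translate_symmetric(const symmetric_heaps &h, void *ptr, int pe)
{
    if (pe < 0 || pe >= h.n_pes) {
        return nullptr;
    }
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    const auto local = reinterpret_cast<std::uintptr_t>(h.bases[h.my_pe]);
    if (addr < local || addr - local >= h.heap_size) {
        return nullptr;
    }
    // Same offset, different heap base.
    return h.bases[pe] + (addr - local);
}
```

Translating to the calling PE's own index returns the original pointer, which is a convenient sanity check for this model.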
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
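The `buf` and `ub_size` parameters imply that a GM-to-GM transfer is staged through a bounded UB buffer. A plausible way to picture this is a chunked loop in which each chunk is at most `ub_size / sizeof(T)` elements. The host-side sketch below models only that chunking. `staged_copy` is a hypothetical function, plain memory stands in for GM and UB, and the pipeline synchronization that `sync_id` controls is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy elem_size elements from src to dst through a staging buffer of
// ub_size bytes. Returns the number of chunks that were needed.
template <typename T>
std::size_t staged_copy(T *dst, const T *src, T *buf, uint32_t ub_size,
                        uint32_t elem_size)
{
    const uint32_t chunk_elems = static_cast<uint32_t>(ub_size / sizeof(T));
    assert(chunk_elems > 0);  // the buffer must hold at least one element

    std::size_t chunks = 0;
    uint32_t done = 0;
    while (done < elem_size) {
        const uint32_t n = std::min(chunk_elems, elem_size - done);
        std::copy(src + done, src + done + n, buf);  // stage in (source to buffer)
        std::copy(buf, buf + n, dst + done);         // stage out (buffer to destination)
        done += n;
        ++chunks;
    }
    return chunks;
}
```

With a 16-byte buffer and ten `int32_t` elements, each chunk holds four elements, so the loop runs three times (4 + 4 + 2). A larger `ub_size` reduces the number of round trips through the buffer.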
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
copy_params – [in] Parameters describing how the non-contiguous data is laid out in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] The size of temp Buffer on UB. (In Bytes)
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB.
copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync pipeline.
- ACLSHMEM_DEVICE void aclshmemx_mte_quiet ()
Asynchronous interface. Clear instruction pipes and flush data cache to GM.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_mte_atomic_fetch (__gm__ T *src, int32_t pe)
Atomic fetch operation. Returns the value at the source address on the specified PE.
Supported hardware platform: Ascend950.
Warning
Use sync_id in device_state.mte_config for pipeline synchronization.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports int32_t/uint32_t/float/int64_t/uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
src – [in] Symmetric address of the source data.
pe – [in] PE number of the remote PE.
- Returns:
The value at the source address.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_atomic_set (__gm__ T *dst, T value, int32_t pe)
Atomic set operation. Sets the value at the destination address on the specified PE.
Supported hardware platform: Ascend950.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports uint32_t and uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be set.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_mte_atomic_compare_swap (__gm__ T *dst, T cond, T value, int32_t pe)
Atomic compare-and-swap operation. Writes value to the destination address on the specified PE only if the current value there equals cond.
Supported hardware platform: Ascend950.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports uint32_t and uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
cond – [in] Value to compare against.
value – [in] Value to be written if comparison succeeds.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address.
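The compare-and-swap contract (write only when the current value equals `cond`, always return the original value) can be modelled on the host with `std::atomic`. This sketch shows the semantics only. `model_compare_swap` and `try_acquire` are hypothetical helpers, and nothing here touches remote memory or the Ascend Scalar unit.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model of compare-and-swap: if dst holds cond, store value.
// Always returns the value observed at dst before the operation.
template <typename T>
T model_compare_swap(std::atomic<T> &dst, T cond, T value)
{
    T expected = cond;
    // On failure, compare_exchange_strong writes the observed value into
    // `expected`; on success, `expected` still equals the old value (cond).
    dst.compare_exchange_strong(expected, value);
    return expected;
}

// Example use: try to take a lock word (0 = free, 1 = held).
// The caller owns the lock only if the returned original value was 0.
inline bool try_acquire(std::atomic<uint32_t> &lock_word)
{
    return model_compare_swap<uint32_t>(lock_word, 0u, 1u) == 0u;
}
```

Comparing the return value with `cond` is how a caller tells whether its write took effect, which is the usual building block for locks and lock-free updates.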
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_mte_atomic_swap (__gm__ T *dst, T value, int32_t pe)
Atomic swap operation. Writes value to the destination address on the specified PE and returns the value previously stored there.
Supported hardware platform: Ascend950.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports uint32_t and uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be swapped.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_atomic_inc (__gm__ T *dst, int32_t pe)
Atomic increment operation. Increments the value at the destination address by 1.
Supported hardware platform: Ascend950.
Warning
Use sync_id in device_state.mte_config for pipeline synchronization.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports int32_t/uint32_t/float/int64_t/uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_atomic_add (__gm__ T *dst, T value, int32_t pe)
Atomic add operation. Adds value to the data at the destination address on the specified PE.
Warning
Use sync_id in device_state.mte_config for pipeline synchronization.
Note
This is an asynchronous interface. The execution path depends on the platform and data type:
On Ascend910B/Ascend910C: the atomic add is performed by the MTE2 unit.
On Ascend950:
uint32_t/int64_t/uint64_t: scalar AtomicAdd (Scalar unit).
Other types: UB + MTE3 (MTE3 atomic add).
If there are data dependencies, insert synchronization as required: a Scalar write to UB followed by an MTE3 read requires S_MTE3 event synchronization, and an MTE3 write to GM followed by a read requires MTE3_MTE2 event synchronization.
Data Type
Ascend910B/Ascend910C
Ascend950
int8_t/int16_t/half/bfloat16_t
✓
✓
int32_t/float
✓
✓
uint32_t/int64_t/uint64_t
✗
✓
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be added.
pe – [in] PE number of the remote PE.
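The support table above can be encoded as a compile-time predicate, which is handy for guarding calls in generic code. The sketch below is a host-side illustration. `platform` and `atomic_add_supported` are hypothetical names, `half` and `bfloat16_t` are omitted because they are not standard C++ types, and types not listed in the table are treated as unsupported.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical platform tag mirroring the columns of the support table.
enum class platform { ascend910b, ascend910c, ascend950 };

// True if atomic add on T is listed as supported for platform p.
template <typename T>
constexpr bool atomic_add_supported(platform p)
{
    return (std::is_same<T, int8_t>::value || std::is_same<T, int16_t>::value ||
            std::is_same<T, int32_t>::value || std::is_same<T, float>::value)
               ? true  // rows marked as supported on every listed platform
               : ((std::is_same<T, uint32_t>::value ||
                   std::is_same<T, int64_t>::value ||
                   std::is_same<T, uint64_t>::value)
                      ? p == platform::ascend950  // Ascend950-only row
                      : false);                   // not in the table
}
```

A caller could combine this with `static_assert` when the target platform is known at compile time, so that an unsupported type fails the build instead of failing on the device.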
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_mte_atomic_fetch_inc (__gm__ T *dst, int32_t pe)
Atomic fetch-and-increment operation. Increments the value at the destination address by 1 and returns the old value.
Supported hardware platform: Ascend950.
Warning
Use sync_id in device_state.mte_config for pipeline synchronization.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports int32_t/uint32_t/float/int64_t/uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address before increment.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_mte_atomic_fetch_add (__gm__ T *dst, T value, int32_t pe)
Atomic fetch-and-add operation. Adds value to the data at the destination address and returns the old value.
Supported hardware platform: Ascend950.
Warning
Use sync_id in device_state.mte_config for pipeline synchronization.
Note
This is an asynchronous interface. Atomic operations involve scalar computation on the Scalar unit. If the Scalar unit and the data-movement units (MTE2/MTE3) have data dependencies when reading or writing GM, insert the synchronization that the dependency requires.
Note
T only supports int32_t/uint32_t/float/int64_t/uint64_t.
Note
The MTE transport operates over the in-die interconnect and does not support cross-PCIe (inter-node) communication.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be added.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address before addition.
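Because fetch-and-add returns the value from before the addition, each caller gets a unique, non-overlapping range. That makes it a natural fit for reserving slots in a shared buffer or handing out tickets. The host-side sketch below models this with `std::atomic`. `reserve_slots` and `next_ticket` are hypothetical helpers and do not call the aclshmem API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Reserve `count` consecutive slots; returns the index of the first one.
// The old cursor value is the start of the range owned by this caller.
inline uint32_t reserve_slots(std::atomic<uint32_t> &cursor, uint32_t count)
{
    return cursor.fetch_add(count);
}

// Fetch-and-increment is fetch-and-add with a step of 1.
inline uint32_t next_ticket(std::atomic<uint32_t> &counter)
{
    return counter.fetch_add(1u);
}
```

Starting from a cursor of 0, reserving 4 slots returns 0, a following reservation of 2 slots returns 4, and the next ticket is 6.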
Functions
- ACLSHMEM_DEVICE __gm__ void * aclshmem_roce_ptr (__gm__ void *ptr, int pe)
Translate a local symmetric address to the corresponding remote symmetric address on the specified PE, for use by RDMA.
- Parameters:
ptr – [in] Symmetric address on local PE.
pe – [in] The number of the remote PE.
- Returns:
A remote symmetric address on the specified PE that can be accessed using memory loads and stores.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to synchronize the S_MTE3 event.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync the S/MTE3 event.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync the S/MTE3 event.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync the S/MTE3 event.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_quiet (uint32_t pe, __ubuf__ T *buf, uint32_t sync_id)
RDMA Quiet function. This synchronous function ensures all previous RDMA WQEs are completed (data has arrived at the destination NIC).
- Parameters:
pe – [in] PE number of the remote PE.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
sync_id – [in] ID used to sync the S/MTE3 event.
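The relationship between the non-blocking put/get calls and the quiet call can be pictured with a small host-side model. This is only a sketch of the ordering guarantee described above, not the device implementation: `ModelTransport`, `put_nbi`, `quiet`, and `pending` are hypothetical names, and the queue stands in for the RDMA work queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Host-side model only: none of these names belong to the aclshmem API.
class ModelTransport {
public:
    // Queue a copy of elem_size elements; nothing moves until quiet() runs.
    template <typename T>
    void put_nbi(T *dst, const T *src, std::size_t elem_size) {
        assert(dst != nullptr && src != nullptr);
        pending_.push_back(
            [=] { std::memcpy(dst, src, elem_size * sizeof(T)); });
    }

    // Complete every queued operation, mirroring "all previous WQEs are completed".
    void quiet() {
        for (auto &op : pending_) {
            op();
        }
        pending_.clear();
    }

    // Number of operations issued but not yet completed.
    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};
```

In this model the destination buffer is only guaranteed to hold the data after `quiet()` returns, which is the reason a quiet call normally follows a batch of non-blocking transfers.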
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_fetch (__gm__ T *src, int32_t pe)
Atomic fetch operation. Returns the value at the source address on the specified PE. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit data types.
- Parameters:
src – [in] Symmetric address of the source data.
pe – [in] PE number of the remote PE.
- Returns:
The value at the source address.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_atomic_set (__gm__ T *dst, T value, int32_t pe)
Atomic set operation. Sets the value at the destination address on the specified PE. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit data types.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be set.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_compare_swap (__gm__ T *dst, T cond, T value, int32_t pe)
Atomic compare and swap operation. Conditionally updates the value at the destination address. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers.
- Parameters:
dst – [in] Symmetric address of the destination data.
cond – [in] Value to compare against.
value – [in] Value to be written if comparison succeeds.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_swap (__gm__ T *dst, T value, int32_t pe)
Atomic swap operation. Swaps the value at the destination address. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be swapped.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_atomic_inc (__gm__ T *dst, int32_t pe)
Atomic increment operation. Increments the value at the destination address by 1. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers.
- Parameters:
dst – [in] Symmetric address of the destination data.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_atomic_add (__gm__ T *dst, T value, int32_t pe)
Atomic add operation. Adds the value to the destination address. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be added.
pe – [in] PE number of the remote PE.
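The Notes above restrict T to 32-bit and 64-bit integers. One way a caller could reject other types at compile time is a small type trait. The sketch below is an assumption about how such a guard might look in user code; `is_amo_integer` and `checked_operand` are hypothetical names, and nothing here describes how the library itself enforces the restriction.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait, not part of the library:
// true only for integer types that are 4 or 8 bytes wide.
template <typename T>
struct is_amo_integer
    : std::integral_constant<bool,
                             std::is_integral<T>::value &&
                                 (sizeof(T) == 4 || sizeof(T) == 8)> {};

// Hypothetical wrapper showing where such a check would sit in caller code.
template <typename T>
T checked_operand(T value) {
    static_assert(is_amo_integer<T>::value,
                  "atomic operand must be a 32-bit or 64-bit integer");
    return value;
}
```

With this guard, `checked_operand(int32_t{1})` compiles, while `checked_operand(1.0f)` or `checked_operand(int16_t{1})` fails at compile time with the message above.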
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_atomic_and (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise AND operation on dst (remote symmetric address) on the specified PE pe with the operand value, without returning a value. Supported types: int32, uint32, int64, uint64. Supported hardware platform: Ascend950. The function returns after the remote atomic operation has completed and is visible on the remote PE. An internal quiet operation is performed before returning. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers. Using unsupported types or platforms results in undefined behavior.
- Parameters:
dst – [in] Symmetric address of the destination data. Must be a valid symmetric address.
value – [in] Operand of bitwise AND operation.
pe – [in] PE number of the remote PE. Must be a valid PE number within the active set.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_atomic_or (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise OR operation on dst (remote symmetric address) on the specified PE pe with the operand value, without returning a value. Supported types: int32, uint32, int64, uint64. Supported hardware platform: Ascend950. The function returns after the remote atomic operation has completed and is visible on the remote PE. An internal quiet operation is performed before returning. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers. Using unsupported types or platforms will result in a compile-time error.
- Parameters:
dst – [in] Symmetric address of the destination data. Must be a valid symmetric address.
value – [in] Operand of bitwise OR operation.
pe – [in] PE number of the remote PE. Must be a valid PE number within the active set.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_atomic_xor (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise XOR operation on dst (remote symmetric address) on the specified PE pe with the operand value, without returning a value. Supported types: int32, uint32, int64, uint64. Supported hardware platform: Ascend950. The function returns after the remote atomic operation has completed and is visible on the remote PE. An internal quiet operation is performed before returning. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers. Using unsupported types or platforms results in undefined behavior.
- Parameters:
dst – [in] Symmetric address of the destination data. Must be a valid symmetric address.
value – [in] Operand of bitwise XOR operation.
pe – [in] PE number of the remote PE. Must be a valid PE number within the active set.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_fetch_inc (__gm__ T *dst, int32_t pe)
Atomic fetch increment operation. Increments the value at the destination address by 1 and returns the old value. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers.
- Parameters:
dst – [in] Symmetric address of the destination data.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address before increment.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_fetch_add (__gm__ T *dst, T value, int32_t pe)
Atomic fetch add operation. Adds the value to the destination address and returns the old value. Supported hardware platform: Ascend950. WARNING: Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers.
- Parameters:
dst – [in] Symmetric address of the destination data.
value – [in] Value to be added.
pe – [in] PE number of the remote PE.
- Returns:
The original value at the destination address before addition.
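The fetch variants return the value that was stored before the update. A host-side model using `std::atomic` illustrates that contract; `model_fetch_add` and `model_fetch_inc` are hypothetical names, and the local `std::atomic` object stands in for the remote symmetric word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model: add value to dst and return what dst held before the add.
template <typename T>
T model_fetch_add(std::atomic<T> &dst, T value) {
    return dst.fetch_add(value);
}

// Host-side model: increment dst by one and return the previous value.
template <typename T>
T model_fetch_inc(std::atomic<T> &dst) {
    return dst.fetch_add(T{1});
}
```

A typical use of this pattern is handing out unique slots: each caller that performs a fetch-increment on a shared counter receives a distinct previous value.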
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_fetch_and (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise AND operation on dst (remote symmetric address) on the specified PE pe with the operand value, and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. Supported hardware platform: Ascend950. The function returns after the remote atomic operation has completed and is visible on the remote PE. An internal quiet operation is performed before returning. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers. Using unsupported types or platforms will result in a compile-time error.
- Parameters:
dst – [in] Symmetric address of the destination data. Must be a valid symmetric address.
value – [in] Operand of bitwise AND operation.
pe – [in] PE number of the remote PE. Must be a valid PE number within the active set.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_fetch_or (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise OR operation on dst (remote symmetric address) on the specified PE pe with the operand value, and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. Supported hardware platform: Ascend950. The function returns after the remote atomic operation has completed and is visible on the remote PE. An internal quiet operation is performed before returning. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers. Using unsupported types or platforms results in undefined behavior.
- Parameters:
dst – [in] Symmetric address of the destination data. Must be a valid symmetric address.
value – [in] Operand of bitwise OR operation.
pe – [in] PE number of the remote PE. Must be a valid PE number within the active set.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_roce_atomic_fetch_xor (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise XOR operation on dst (remote symmetric address) on the specified PE pe with the operand value, and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. Supported hardware platform: Ascend950. The function returns after the remote atomic operation has completed and is visible on the remote PE. An internal quiet operation is performed before returning. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.
Note
T only supports 32-bit and 64-bit integers. Using unsupported types or platforms will result in a compile-time error.
- Parameters:
dst – [in] Symmetric address of the destination data. Must be a valid symmetric address.
value – [in] Operand of bitwise XOR operation.
pe – [in] PE number of the remote PE. Must be a valid PE number within the active set.
- Returns:
Return the previous contents of dst.
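The bitwise fetch operations all follow the same shape: apply the operator to the remote word and return its previous contents. The following host-side model shows this on a local `std::atomic`; `BitOp` and `model_fetch_bitop` are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical selector for the three bitwise operations.
enum class BitOp { And, Or, Xor };

// Host-side model: apply op to dst with value and return the previous contents.
template <typename T>
T model_fetch_bitop(std::atomic<T> &dst, BitOp op, T value) {
    switch (op) {
        case BitOp::And: return dst.fetch_and(value);
        case BitOp::Or:  return dst.fetch_or(value);
        case BitOp::Xor: return dst.fetch_xor(value);
    }
    return dst.load();  // not reached for valid BitOp values
}
```

For example, starting from 0xC, an OR with 0x3 returns 0xC and leaves 0xF; a following AND with 0x5 returns 0xF and leaves 0x5.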
Functions
- ACLSHMEM_DEVICE void aclshmemx_set_sdma_config (uint64_t offset, uint32_t ub_size, uint32_t sync_id)
Set necessary parameters for SDMA operations.
- Parameters:
offset – [in] The start address on UB.
ub_size – [in] Size of the temporary UB buffer in bytes; at least 64 bytes and 64-byte aligned.
sync_id – [in] Sync ID for put or get.
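The ub_size constraint can be checked before calling the configuration function. The helpers below are a sketch under one assumption: that "64-byte aligned" means the size is a multiple of 64. `kMinUbBytes`, `is_valid_ub_size`, and `round_up_ub_size` are hypothetical names and are not provided by the library.

```cpp
#include <cassert>
#include <cstdint>

// Minimum size and granularity taken from the ub_size description above.
constexpr uint32_t kMinUbBytes = 64;

// True when ub_size is at least 64 bytes and a multiple of 64 (assumed reading).
constexpr bool is_valid_ub_size(uint32_t ub_size) {
    return ub_size >= kMinUbBytes && ub_size % kMinUbBytes == 0;
}

// Round a requested size up to the next value accepted by is_valid_ub_size.
// Overflow for requests close to UINT32_MAX is not handled in this sketch.
constexpr uint32_t round_up_ub_size(uint32_t requested) {
    return requested <= kMinUbBytes
               ? kMinUbBytes
               : ((requested + kMinUbBytes - 1) / kMinUbBytes) * kMinUbBytes;
}
```

A caller could pass `round_up_ub_size(n)` as ub_size whenever the amount of scratch space it wants is not already a multiple of 64.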
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] Size of the temporary buffer on UB, in bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_get_nbi (AscendC::GlobalTensor< T > &dst, AscendC::GlobalTensor< T > &src, AscendC::LocalTensor< T > &buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.
- Parameters:
dst – [in] AscendC::GlobalTensor on local device of the destination data.
src – [in] AscendC::GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB.
ub_size – [in] Size of the temporary buffer on UB, in bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_put_nbi (AscendC::GlobalTensor< T > &dst, AscendC::GlobalTensor< T > &src, AscendC::LocalTensor< T > &buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.
- Parameters:
dst – [in] AscendC::GlobalTensor on Symmetric memory of the destination data.
src – [in] AscendC::GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_cmo_nbi (__gm__ T *src, uint32_t elem_size, ACLSHMEMCMOTYPE cmo_type, __ubuf__ T *buf, uint32_t ub_size, uint32_t sync_id)
Asynchronous interface. Performs an L2 cache manager operation on device global memory. WARNING: Currently, cmo_type only supports CMO_TYPE_PREFETCH.
- Parameters:
src – [in] Pointer to device global memory to operate on.
elem_size – [in] Number of elements (of type T) to operate on.
cmo_type – [in] Cache operation type.
buf – [in] Pointer on local UB.
ub_size – [in] Size of the temporary buffer on UB, in bytes.
sync_id – [in] ID used to sync.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_cmo_nbi (AscendC::GlobalTensor< T > &src, uint32_t elem_size, ACLSHMEMCMOTYPE cmo_type, AscendC::LocalTensor< T > &buf, uint32_t sync_id)
Asynchronous interface. Performs an L2 cache manager operation on device global memory. WARNING: Currently, cmo_type only supports CMO_TYPE_PREFETCH.
- Parameters:
src – [in] AscendC::GlobalTensor of device memory to operate on.
elem_size – [in] Number of elements.
cmo_type – [in] Cache operation type.
buf – [in] LocalTensor on local UB.
sync_id – [in] ID used to sync.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_quiet (AscendC::LocalTensor< T > &buf, uint32_t sync_id)
SDMA Quiet function. This synchronous function ensures all previous SDMA SQEs are completed.
- Parameters:
buf – [in] Temporary UB LocalTensor of uint32_t used as workspace.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_quiet (__ubuf__ T *buf, uint32_t ub_size, uint32_t sync_id)
SDMA Quiet function. This synchronous function ensures all previous SDMA SQEs are completed.
- Parameters:
buf – [in] Pointer on local UB.
ub_size – [in] Size of the temporary buffer on UB, in bytes.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_notify_record (AscendC::LocalTensor< T > &buf, uint32_t sync_id)
AIV direct STARS helper function for notify record.
- Parameters:
buf – [in] Temporary UB LocalTensor of uint32_t used as workspace.
sync_id – [in] ID used to sync pipeline.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_notify_record (__ubuf__ T *buf, uint32_t ub_size, uint32_t sync_id)
AIV direct STARS helper function for notify record.
- Parameters:
buf – [in] Pointer on local UB.
ub_size – [in] Size of the temporary buffer on UB, in bytes.
sync_id – [in] ID used to sync pipeline.
Functions
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on local device of the destination data.
src – [in] Pointer on Symmetric memory of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_get_nbi (const AscendC::GlobalTensor< T > &dst, const AscendC::GlobalTensor< T > &src, const AscendC::LocalTensor< T > &buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] GlobalTensor on local device of the destination data.
src – [in] GlobalTensor on Symmetric memory of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
buf – [in] Pointer on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_put_nbi (const AscendC::GlobalTensor< T > &dst, const AscendC::GlobalTensor< T > &src, const AscendC::LocalTensor< T > &buf, uint32_t elem_size, int pe, uint32_t sync_id)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] GlobalTensor on Symmetric memory of the destination data.
src – [in] GlobalTensor on local device of the source data.
buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.
elem_size – [in] Number of elements in the destination and source arrays.
pe – [in] PE number of the remote PE.
sync_id – [in] ID used to sync the S/MTE3 event.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_put_signal_nbi (__gm__ T *dst, __gm__ T *src, uint32_t elem_size, __gm__ uint64_t *sig_addr, uint64_t signal, int pe)
Asynchronous interface. Using UDMA, copy contiguous data from the local PE to a symmetric address on the specified PE, then update a remote signal flag on completion. Template function for different data types. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the destination and source arrays.
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
pe – [in] PE number of the remote PE.
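The put-with-signal pattern pairs a data transfer with a flag update so that the receiver can tell when the data is ready. A host-side model of that ordering is sketched below. It assumes the signal update is a plain store (the update operation is not specified above), and `model_put_signal` and `model_signal_arrived` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-side model: copy the payload first, then publish the signal value.
template <typename T>
void model_put_signal(T *dst, const T *src, std::size_t elem_size,
                      std::atomic<uint64_t> &sig, uint64_t signal) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, elem_size * sizeof(T));
    // Release ordering: a reader that observes the signal also observes the data.
    sig.store(signal, std::memory_order_release);
}

// Receiver-side check: true once the expected signal value has been published.
inline bool model_signal_arrived(const std::atomic<uint64_t> &sig,
                                 uint64_t expected) {
    return sig.load(std::memory_order_acquire) == expected;
}
```

The receiver reads the payload only after `model_signal_arrived` returns true, which avoids a separate barrier between the two PEs for this transfer.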
- ACLSHMEM_DEVICE void aclshmemx_udma_quiet (int pe)
UDMA Quiet function. This synchronous function ensures all previous UDMA WQEs are completed (data has arrived at the destination PE).
- Parameters:
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_atomic_add (__gm__ T *dst, T value, int32_t pe)
Asynchronous interface. Add value to dst (remote symmetric address) on the specified PE pe, and atomically update the dst without returning the value. Supported types: int32, uint32, int64, uint64, float. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of atomic add.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_fetch_add (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Add value to dst (remote symmetric address) on the specified PE pe, and return the previous content of dst. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of atomic add.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous content of dst.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_compare_swap (__gm__ T *dst, T cond, T value, int32_t pe)
Synchronous interface. Conditionally update dst (remote symmetric address) on the specified PE pe and return the previous content of dst. If cond equals the remote dst value, value is swapped into the remote dst; otherwise, the remote dst is unchanged. In either case, the old value of the remote dst is returned. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
cond – [in] Value compared against the remote dst.
value – [in] Value written to the remote dst if the comparison succeeds.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous content of dst.
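The compare-and-swap rule described above (write value only when dst equals cond, and return the old value in both cases) can be modeled on the host with `std::atomic`. This is an illustration of the documented semantics only; `model_compare_swap` is a hypothetical name and the local atomic stands in for the remote symmetric word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model of the documented compare-and-swap contract.
template <typename T>
T model_compare_swap(std::atomic<T> &dst, T cond, T value) {
    T expected = cond;
    // On success dst becomes value and expected still equals the old value (cond).
    // On failure dst is unchanged and expected is overwritten with the old value.
    dst.compare_exchange_strong(expected, value);
    return expected;
}
```

A caller can detect success by comparing the return value with cond: equal means the swap happened, different means another value was already stored.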
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_fetch (__gm__ T *dst, int32_t pe)
Synchronous interface. Fetch the contents of dst (remote symmetric address) on the specified PE pe and return the contents. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the data to fetch on the remote PE.
pe – [in] PE number of the remote PE.
- Returns:
Return the contents of dst.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_atomic_set (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Set value to dst (remote symmetric address) on the specified PE pe without returning a value. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Value to be atomically written to the remote PE.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_swap (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Swap value to dst (remote symmetric address) on the specified PE pe and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Value to be atomically written to the remote PE.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_fetch_inc (__gm__ T *dst, int32_t pe)
Synchronous interface. Increment dst (remote symmetric address) on the specified PE pe by one and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_atomic_inc (__gm__ T *dst, int32_t pe)
Synchronous interface. Increment dst (remote symmetric address) on the specified PE pe by one without returning a value. Supported types: int32, uint32, int64, uint64, float. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_fetch_and (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise AND operation on dst (remote symmetric address) on the specified PE pe with the operand value, and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of bitwise AND operation.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_atomic_and (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise AND operation on dst (remote symmetric address) on the specified PE pe with the operand value, without returning a value. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of bitwise AND operation.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_fetch_or (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise OR operation on dst (remote symmetric address) on the specified PE pe with the operand value, and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of bitwise OR operation.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_atomic_or (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise OR operation on dst (remote symmetric address) on the specified PE pe with the operand value, without returning a value. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of bitwise OR operation.
pe – [in] PE number of the remote PE.
- template<typename T> ACLSHMEM_DEVICE T aclshmemx_udma_atomic_fetch_xor (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise XOR operation on dst (remote symmetric address) on the specified PE pe with the operand value, and return the previous contents of dst. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of bitwise XOR operation.
pe – [in] PE number of the remote PE.
- Returns:
Return the previous contents of dst.
- template<typename T> ACLSHMEM_DEVICE void aclshmemx_udma_atomic_xor (__gm__ T *dst, T value, int32_t pe)
Synchronous interface. Perform a bitwise XOR operation on dst (remote symmetric address) on the specified PE pe with the operand value, without returning a value. Supported types: int32, uint32, int64, uint64. WARNING: When using UDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.
- Parameters:
dst – [in] Symmetric address of the destination data on the remote PE.
value – [in] Operand of bitwise XOR operation.
pe – [in] PE number of the remote PE.
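A host-side sketch of the XOR semantics, again using std::atomic as a stand-in for the remote symmetric word rather than the ACLSHMEM device API. It shows that XOR with a mask toggles exactly the masked bits, so applying the same mask twice restores the original contents. The helper names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model only: returns the contents held before the XOR.
template <typename T>
T model_fetch_xor(std::atomic<T> &dst, T value) {
    return dst.fetch_xor(value);
}

// Hypothetical helper: toggle the bits selected by `mask` and return the new contents.
inline uint32_t toggle_bits(std::atomic<uint32_t> &word, uint32_t mask) {
    assert(mask != 0);  // a zero mask would leave the word unchanged
    const uint32_t previous = model_fetch_xor<uint32_t>(word, mask);
    return previous ^ mask;
}
```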
shmem_device_so.h
Defines
-
ACLSHMEM_TYPE_FUNC(FUNC)
Standard RMA Types and Names.
Copyright (c) 2025 Huawei Technologies Co., Ltd. This program is free software; you can redistribute it and/or modify it under the terms and conditions of CANN Open Software License Agreement Version 2.0 (the “License”). Please refer to the License for details. You may not use this file except in compliance with the License. THIS SOFTWARE IS PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. See LICENSE in the root of the software repository for the full text of the License.
NAME        TYPE
half        half
float       float
double      double
int8        int8_t
int16       int16_t
int32       int32_t
int64       int64_t
uint8       uint8_t
uint16      uint16_t
uint32      uint32_t
uint64      uint64_t
char        char
bfloat16    bfloat16_t
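The NAME/TYPE list works as an X-macro: the list macro applies FUNC to every pair, and FUNC is a generator macro that defines one function per pair. The following host-side sketch shows the pattern with a shortened, hypothetical list (DEMO_TYPE_FUNC) and a hypothetical generator (DEMO_DEFINE_COPY); neither is part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type list in the style of ACLSHMEM_TYPE_FUNC (three entries only).
#define DEMO_TYPE_FUNC(FUNC) \
    FUNC(float, float)       \
    FUNC(int8, int8_t)       \
    FUNC(uint32, uint32_t)

// Hypothetical generator: defines demo_<NAME>_copy for one (NAME, TYPE) pair.
#define DEMO_DEFINE_COPY(NAME, TYPE)                                            \
    inline void demo_##NAME##_copy(TYPE *dst, const TYPE *src, std::size_t n) { \
        assert(n == 0 || (dst != nullptr && src != nullptr));                   \
        for (std::size_t i = 0; i < n; ++i) {                                   \
            dst[i] = src[i];                                                    \
        }                                                                       \
    }

// Expands to demo_float_copy, demo_int8_copy and demo_uint32_copy.
DEMO_TYPE_FUNC(DEMO_DEFINE_COPY)
```

Adding a type to the list then adds the matching function for every generator that is passed to it, which is presumably why the header keeps the list in one place.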
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL(NAME, TYPE)
Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(__gm__ TYPE *dst, __gm__ TYPE *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Synchronous interface. Copy contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
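The put-with-signal contract can be sketched on the host as two ordered steps: deliver the payload, then update the signal word with either a set or an add. The code below is a model under stated assumptions, not the device implementation: plain memory stands in for symmetric memory, and DemoSignalOp is a hypothetical stand-in for ACLSHMEM_SIGNAL_SET and ACLSHMEM_SIGNAL_ADD, whose real values are not given here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for ACLSHMEM_SIGNAL_SET / ACLSHMEM_SIGNAL_ADD.
enum DemoSignalOp { DEMO_SIGNAL_SET, DEMO_SIGNAL_ADD };

// Host-side model: dst and sig stand in for symmetric memory on the target PE.
template <typename T>
void model_put_signal(T *dst, const T *src, std::size_t nelems,
                      std::atomic<int32_t> &sig, int32_t signal, DemoSignalOp op) {
    assert(op == DEMO_SIGNAL_SET || op == DEMO_SIGNAL_ADD);
    std::memcpy(dst, src, nelems * sizeof(T));  // 1. deliver the payload
    if (op == DEMO_SIGNAL_SET) {                // 2. only then publish the signal
        sig.store(signal, std::memory_order_release);
    } else {
        sig.fetch_add(signal, std::memory_order_release);
    }
}
```

A receiver that observes the updated signal word can then read the payload, because the update is the last step of the operation.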
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR(NAME, TYPE)
Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Synchronous interface. Copy contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] GlobalTensor on symmetric memory holding the destination data.
src - [in] GlobalTensor on the local device holding the source data.
elem_size - [in] Number of elements in the dest and source arrays.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_DETAILED(NAME, TYPE)
Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Synchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
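A non-contiguous copy of this kind is typically a strided, row-by-row copy. The sketch below models that on the host. The field layout of non_contiguous_copy_param is not shown in this reference, so DemoCopyParam and its four fields are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout descriptor; the real non_contiguous_copy_param may differ.
struct DemoCopyParam {
    uint32_t repeat;  // number of rows to copy
    uint32_t length;  // elements per row
    uint32_t src_ld;  // distance between row starts in src, in elements
    uint32_t dst_ld;  // distance between row starts in dst, in elements
};

// Host-side model of a strided copy: `repeat` rows of `length` elements each.
template <typename T>
void model_strided_copy(T *dst, const T *src, const DemoCopyParam &p) {
    assert(p.length <= p.src_ld && p.length <= p.dst_ld);  // rows must not overlap
    for (std::size_t r = 0; r < p.repeat; ++r) {
        for (std::size_t i = 0; i < p.length; ++i) {
            dst[r * p.dst_ld + i] = src[r * p.src_ld + i];
        }
    }
}
```

For example, copying the first three columns of a 2x4 row-major matrix into a packed 2x3 buffer uses repeat = 2, length = 3, src_ld = 4, dst_ld = 3.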
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR_DETAILED(NAME, TYPE)
Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Synchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] GlobalTensor on symmetric memory holding the destination data.
src - [in] GlobalTensor on the local device holding the source data.
copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_SIZE_MEM_SIGNAL_DETAIL(BITS)
Automatically generates aclshmem put signal functions for different element widths in bits (e.g., 8, 16). The macro parameter BITS is the element width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_putBITS_signal(void *dst, void *src, size_t nelems, int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Synchronous interface. Copy contiguous data from local memory to the symmetric address on the specified PE, and update a remote signal flag on completion.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
nelems - [in] Number of elements in the dest and source arrays.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
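A size-parameterized generator can be sketched the same way as the type-parameterized ones: the macro pastes BITS into the function name and derives the byte count from it. The sketch assumes that nelems counts elements of BITS bits each, as in OpenSHMEM's sized put routines; this reference does not state that explicitly. DEMO_DEFINE_PUT_SIZE and the demo_putN functions are hypothetical, and the signal update is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical generator: defines demo_put<BITS> copying nelems elements of BITS bits.
#define DEMO_DEFINE_PUT_SIZE(BITS)                                               \
    inline void demo_put##BITS(void *dst, const void *src, std::size_t nelems) { \
        static_assert(BITS % 8 == 0, "BITS must be a whole number of bytes");    \
        assert(nelems == 0 || (dst != nullptr && src != nullptr));               \
        std::memcpy(dst, src, nelems * (BITS / 8));                              \
    }

// Expands to demo_put8, demo_put16 and demo_put32.
DEMO_DEFINE_PUT_SIZE(8)
DEMO_DEFINE_PUT_SIZE(16)
DEMO_DEFINE_PUT_SIZE(32)
```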
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_NBI(NAME, TYPE)
Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Asynchronous interface. Copy contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
elem_size - [in] Number of elements in the dest and source arrays.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
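The difference between the synchronous and the nbi variants can be modeled on the host with a queue of deferred operations: the nbi call only records the transfer and returns, and the data and signal become visible when the transfer later completes. Everything here is a hypothetical model (DemoNbiQueue, complete_all); in the real runtime completion is driven by the hardware and the library, not by a user-visible queue. The model also reflects the usual assumption for non-blocking puts that the source buffer is left untouched until completion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical host-side model of a non-blocking put-with-signal.
class DemoNbiQueue {
public:
    // Returns immediately; neither dst nor *sig has changed when this returns.
    void put_signal_nbi(int32_t *dst, const int32_t *src, std::size_t nelems,
                        int32_t *sig, int32_t signal) {
        assert(dst != nullptr && src != nullptr && sig != nullptr);
        pending_.push_back([=] {
            std::memcpy(dst, src, nelems * sizeof(int32_t));  // payload first
            *sig = signal;                                    // then the signal
        });
    }

    // Stand-in for whatever completes outstanding transfers in the real runtime.
    void complete_all() {
        for (auto &op : pending_) {
            op();
        }
        pending_.clear();
    }

    std::size_t outstanding() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};
```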
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR_NBI(NAME, TYPE)
Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Asynchronous interface. Copy contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] GlobalTensor on symmetric memory holding the destination data.
src - [in] GlobalTensor on the local device holding the source data.
elem_size - [in] Number of elements in the dest and source arrays.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR_DETAILED_NBI(NAME, TYPE)
Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.
Remark
ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to the symmetric address on the specified PE, then update sig_addr.
- Parameters
dst - [in] GlobalTensor on symmetric memory holding the destination data.
src - [in] GlobalTensor on the local device holding the source data.
copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
-
ACLSHMEM_PUT_SIZE_MEM_SIGNAL_DETAILED_NBI(BITS)
Automatically generates aclshmem put signal nbi functions for different element widths in bits (e.g., 8, 16). The macro parameter BITS is the element width in bits.
Remark
ACLSHMEM_DEVICE void aclshmem_putBITS_signal_nbi(void *dst, void *src, size_t nelems, int32_t *sig_addr, int32_t signal, int sig_op, int pe)
- Function Description
Asynchronous interface. Copy contiguous data from local memory to the symmetric address on the specified PE, and update a remote signal flag on completion.
- Parameters
dst - [in] Pointer on Symmetric memory of the destination data.
src - [in] Pointer on local device of the source data.
nelems - [in] Number of elements in the dest and source arrays.
sig_addr - [in] Symmetric address of the signal word to be updated.
signal - [in] The value used to update sig_addr.
sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe - [in] PE number of the remote PE.
Functions
- ACLSHMEM_DEVICE void aclshmem_putmem_signal (__gm__ void *dst, __gm__ void *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE then update sig_addr.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
sig_op – [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe – [in] PE number of the remote PE.
- ACLSHMEM_DEVICE void aclshmem_putmem_signal_nbi (__gm__ void *dst, __gm__ void *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)
Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE then update sig_addr.
- Parameters:
dst – [in] Pointer on Symmetric memory of the destination data.
src – [in] Pointer on local device of the source data.
elem_size – [in] Number of elements in the dest and source arrays.
sig_addr – [in] Symmetric address of the signal word to be updated.
signal – [in] The value used to update sig_addr.
sig_op – [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD
pe – [in] PE number of the remote PE.
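With the ADD operation, one signal word can count arrivals from several senders. The host-side sketch below models only the signal side of that pattern: each sender's put-with-signal is reduced to its final step (adding to the signal word), and the receiver checks whether the expected number of payloads has landed. The function names are hypothetical, and std::atomic stands in for the symmetric signal word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model: `sig` stands in for the signal word on the receiving PE.
// This is the last step of a put-with-signal that uses the ADD operation.
inline void model_signal_add(std::atomic<int32_t> &sig, int32_t signal) {
    sig.fetch_add(signal, std::memory_order_release);
}

// Receiver-side check, assuming every sender adds 1 after delivering its payload.
inline bool all_arrived(const std::atomic<int32_t> &sig, int32_t expected_senders) {
    assert(expected_senders >= 0);
    return sig.load(std::memory_order_acquire) >= expected_senders;
}
```

With the SET operation, by contrast, the signal word carries a single value, which suits a one-sender, one-receiver handshake.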
shmem_device_team.h
Functions
- ACLSHMEM_DEVICE int aclshmem_my_pe (void)
Returns the PE number of the local PE.
- Returns:
Integer between 0 and npes - 1
- ACLSHMEM_DEVICE int aclshmem_n_pes (void)
Returns the number of PEs running in the program.
- Returns:
Number of PEs in the program.
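A typical use of these two values is computing communication partners. The sketch below derives ring neighbors from a PE number and a PE count. It is plain host C++ that takes both values as arguments (on the device they would come from aclshmem_my_pe() and aclshmem_n_pes()); RingNeighbors and ring_neighbors are hypothetical names.

```cpp
#include <cassert>

struct RingNeighbors {
    int left;   // PE that precedes my_pe on the ring
    int right;  // PE that follows my_pe on the ring
};

// my_pe must lie in [0, n_pes - 1], matching the documented return range.
inline RingNeighbors ring_neighbors(int my_pe, int n_pes) {
    assert(n_pes > 0 && my_pe >= 0 && my_pe < n_pes);
    return RingNeighbors{(my_pe + n_pes - 1) % n_pes, (my_pe + 1) % n_pes};
}
```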
- ACLSHMEM_DEVICE int aclshmem_team_my_pe (aclshmem_team_t team)
Returns the number of the calling PE in the specified team.
- Parameters:
team – [in] A team handle.
- Returns:
The number of the calling PE within the specified team. If the team handle is ACLSHMEM_TEAM_INVALID, returns -1.
- ACLSHMEM_DEVICE int aclshmem_team_n_pes (aclshmem_team_t team)
Returns the number of PEs in the specified team.
- Parameters:
team – [in] A team handle.
- Returns:
The number of PEs in the specified team. If the team handle is ACLSHMEM_TEAM_INVALID, returns -1.
- ACLSHMEM_DEVICE int aclshmem_team_translate_pe (aclshmem_team_t src_team, int src_pe, aclshmem_team_t dest_team)
Translate a given PE number in one team into the corresponding PE number in another team.
- Parameters:
src_team – [in] An ACLSHMEM team handle.
src_pe – [in] The PE number in src_team.
dest_team – [in] An ACLSHMEM team handle.
- Returns:
The corresponding PE number in dest_team. If a team handle is ACLSHMEM_TEAM_INVALID, returns -1.
- ACLSHMEM_DEVICE int aclshmem_team_pe_mapping (aclshmem_team_t team, int pe)
Translate a given PE number in a team into the corresponding PE number in the global team.
- Parameters:
team – [in] An ACLSHMEM team handle.
pe – [in] The PE number in team.
- Returns:
The PE number in the global team. If the team handle is ACLSHMEM_TEAM_INVALID, returns -1.
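The two translation functions can be pictured with a simple team model. The sketch below assumes a team is a strided subset of the global PEs (start, stride, size); that representation is an assumption for illustration, not the library's internal layout, and DemoTeam and the demo_ functions are hypothetical. Unlike the documented functions, which return -1 for an invalid team handle, this model returns -1 when the PE is not a member of the team involved.

```cpp
#include <cassert>

// Hypothetical team model: global PEs start, start + stride, ... (size members).
struct DemoTeam {
    int start;
    int stride;
    int size;
};

// Model of aclshmem_team_pe_mapping: team-local PE number -> global PE number.
inline int demo_pe_mapping(const DemoTeam &team, int pe) {
    assert(team.stride > 0);
    if (pe < 0 || pe >= team.size) {
        return -1;
    }
    return team.start + pe * team.stride;
}

// Model of aclshmem_team_translate_pe: go through the global PE number.
inline int demo_translate_pe(const DemoTeam &src, int src_pe, const DemoTeam &dest) {
    assert(dest.stride > 0);
    const int global_pe = demo_pe_mapping(src, src_pe);
    if (global_pe < 0) {
        return -1;
    }
    const int offset = global_pe - dest.start;
    if (offset < 0 || offset % dest.stride != 0) {
        return -1;  // the PE is not a member of dest
    }
    const int index = offset / dest.stride;
    return index < dest.size ? index : -1;
}
```

For instance, with a team of the even PEs {0, 2, 4, 6} and a team of PEs {4, 5, 6, 7}, PE 2 of the first team is global PE 4, which is PE 0 of the second team.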