DEVICE API

shmem_device_amo.h

Defines

ACLSHMEM_TYPE_FUNC_ATOMIC_ADD(FUNC)

Standard Atomic Add Types and Names.

NAME        TYPE
int8        int8
int16       int16
int32       int32
half        half
bfloat16    bfloat16
float       float

ACLSHMEM_ATOMIC_ADD_TYPENAME(NAME, TYPE)

Automatically generates aclshmem atomic add functions for different data types (e.g., int8, int16, int32, float, half, bfloat16). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_atomic_add(__gm__ TYPE *dst, TYPE value, int32_t pe)

Function Description

Asynchronous interface. Atomically adds value to the symmetric data object at dst on the specified PE.

Parameters

  • dst - [in] Pointer on local device of the destination data.

  • value - [in] Value to be atomically added to the destination.

  • pe - [in] PE number of the remote PE.
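
The type table and the generator macro appear to combine in the familiar X-macro pattern: the list macro invokes FUNC once per (NAME, TYPE) row. The following host-side C++ sketch models only that expansion mechanism. MOCK_TYPE_FUNC_ATOMIC_ADD, MOCK_ATOMIC_ADD_TYPENAME, and the mock_ function prefix are hypothetical names, not part of ACLSHMEM, and the generated body is a plain local add that ignores pe, so it is neither atomic nor remote.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the type list: one FUNC(NAME, TYPE) per row.
// half and bfloat16 are omitted because standard C++ has no portable host type for them.
#define MOCK_TYPE_FUNC_ATOMIC_ADD(FUNC) \
    FUNC(int8, int8_t)                  \
    FUNC(int16, int16_t)                \
    FUNC(int32, int32_t)                \
    FUNC(float, float)

// Hypothetical generator: emits mock_<NAME>_atomic_add for one (NAME, TYPE) pair.
// This model performs a plain local add; it is neither atomic nor remote.
#define MOCK_ATOMIC_ADD_TYPENAME(NAME, TYPE)                                \
    inline void mock_##NAME##_atomic_add(TYPE *dst, TYPE value, int32_t pe) \
    {                                                                       \
        assert(dst != nullptr);                                             \
        (void)pe; /* a real implementation would target dst on this PE */   \
        *dst = static_cast<TYPE>(*dst + value);                             \
    }

// Expands to four functions: mock_int8_atomic_add ... mock_float_atomic_add.
MOCK_TYPE_FUNC_ATOMIC_ADD(MOCK_ATOMIC_ADD_TYPENAME)
```

After expansion, a call such as mock_int32_atomic_add(&counter, 2, 0) adds 2 to counter. Adding a row to the list macro is enough to generate another typed variant.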

shmem_device_cc.h

shmem device Collective Communication APIs

Functions

ACLSHMEM_DEVICE void util_set_ffts_config (uint64_t config)

Set runtime ffts address. Call this at MIX Kernel entry point (if the kernel contains barrier calls).

Parameters:

config – [in] ffts config, acquired by util_get_ffts_config()

ACLSHMEM_DEVICE void aclshmem_barrier (aclshmem_team_t team)

aclshmem_barrier is a collective synchronization routine over a team. Control returns from aclshmem_barrier after all PEs in the team have called aclshmem_barrier. aclshmem_barrier ensures that all previously issued stores and remote memory updates, including AMOs and RMA operations, done by any of the PEs in the active set are complete before returning. On systems with only a scale-up network (HCCS), updates are globally visible, whereas on systems with both a scale-up network (HCCS) and a scale-out network (RDMA), ACLSHMEM only guarantees that updates to the memory of a given PE are visible to that PE. Barrier operations issued on the CPU and the NPU only complete communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.

Parameters:

team – [in] team to do barrier

ACLSHMEM_DEVICE void aclshmem_barrier_all (void)

aclshmem_barrier of all PEs.

ACLSHMEM_DEVICE void aclshmemx_barrier_vec (aclshmem_team_t team)

Similar to aclshmem_barrier except that only vector cores participate. Useful in communication-over-compute operators. Cube cores may call this API, but the call has no effect.

Parameters:

team – [in] team to do barrier

ACLSHMEM_DEVICE void aclshmemx_barrier_all_vec (void)

aclshmemx_barrier_vec of all PEs.

ACLSHMEM_DEVICE void aclshmem_sync (aclshmem_team_t team)

Similar to aclshmem_barrier. In contrast to the aclshmem_barrier routine, aclshmem_sync only ensures completion and visibility of previously issued memory stores and does not ensure completion of remote memory updates issued via ACLSHMEM routines.

Parameters:

team – [in] team to synchronize

ACLSHMEM_DEVICE void aclshmem_sync_all (void)

aclshmem_sync of all PEs.

shmem_device_mo.h

Functions

ACLSHMEM_DEVICE void aclshmem_quiet (void)

The aclshmem_quiet routine ensures completion of all operations on symmetric data objects issued by the calling PE. On systems with only a scale-up network (HCCS), updates are globally visible, whereas on systems with both a scale-up network (HCCS) and a scale-out network (RDMA), ACLSHMEM only guarantees that updates to the memory of a given PE are visible to that PE. Quiet operations issued on the CPU and the NPU only complete communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.

ACLSHMEM_DEVICE void aclshmem_fence (void)

In the OpenACLSHMEM specification, aclshmem_fence assures ordering of delivery of Put, AMO, and memory store routines to symmetric data objects, but does not guarantee the completion of these operations. However, due to hardware capabilities, aclshmem_fence is implemented the same way as aclshmem_quiet, ensuring both ordering and completion. Fence operations issued on the CPU and the NPU only order communication operations that were issued from the CPU and the NPU, respectively. To ensure completion of NPU-side operations from the CPU, use aclrtSynchronizeStream/aclrtDeviceSynchronize or a stream-based API.

shmem_device_p2p_sync.h

Defines

ACLSHMEM_WAIT_UNTIL(NAME, TYPE)

Automatically generates aclshmem wait until functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_wait_until(__gm__ TYPE *ivar, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by blocking until the value at ivar satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivar - [in] Symmetric address of a remotely accessible data object. The type of ivar should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • cmp - [in] The comparison operator that compares ivar with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.
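
The condition that the wait and test routines check can be modeled on the host as a single predicate over the six operators. This is a sketch under stated assumptions: MockCmp, the MOCK_CMP_* enumerators, and mock_cmp_satisfied are hypothetical names, and the numeric values of the enumerators are illustrative rather than taken from the library.

```cpp
#include <cassert>

// Hypothetical enumerators: the names mirror the documented ACLSHMEM_CMP_* operators,
// but the numeric values here are illustrative, not taken from the library.
enum MockCmp : int {
    MOCK_CMP_EQ = 0,
    MOCK_CMP_NE,
    MOCK_CMP_GT,
    MOCK_CMP_GE,
    MOCK_CMP_LT,
    MOCK_CMP_LE
};

// Evaluates "ivar <cmp> cmp_value", the condition a wait or test routine checks.
template <typename T>
bool mock_cmp_satisfied(T ivar, int cmp, T cmp_value)
{
    switch (cmp) {
        case MOCK_CMP_EQ: return ivar == cmp_value;
        case MOCK_CMP_NE: return ivar != cmp_value;
        case MOCK_CMP_GT: return ivar > cmp_value;
        case MOCK_CMP_GE: return ivar >= cmp_value;
        case MOCK_CMP_LT: return ivar < cmp_value;
        case MOCK_CMP_LE: return ivar <= cmp_value;
        default:
            assert(false && "unsupported comparison operator");
            return false;
    }
}
```

A blocking wait can be read as polling this predicate until it becomes true, while a test routine evaluates it once and returns immediately.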

ACLSHMEM_WAIT(NAME, TYPE)

Automatically generates aclshmem wait functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_wait(__gm__ TYPE *ivar, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by blocking until the value of ivar is not equal to comparison value, cmp_value.

Parameters

  • ivar - [in] Symmetric address of a remotely accessible data object. The type of ivar should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

ACLSHMEM_WAIT_UNTIL_ALL(NAME, TYPE)

Automatically generates aclshmem wait until all functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_wait_until_all(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by blocking until all entries in the wait set specified by ivars and status satisfy the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; If status[i] != 0, then ivars[i] is excluded from the wait set; If status is NULL, all elements of ivars are included in the wait set.

  • cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

ACLSHMEM_WAIT_UNTIL_ANY(NAME, TYPE)

Automatically generates aclshmem wait until any functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_any(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by blocking until any one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; If status[i] != 0, then ivars[i] is excluded from the wait set; If status is NULL, all elements of ivars are included in the wait set.

  • cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the index of an element in the ivars array that satisfies the wait condition. If the wait set is empty, this routine returns SIZE_MAX.

ACLSHMEM_WAIT_UNTIL_SOME(NAME, TYPE)

Automatically generates aclshmem wait until some functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_some(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by blocking until at least one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • indices - [out] Local address of an array of indices of length at least nelems into ivars that satisfied the wait condition.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; If status[i] != 0, then ivars[i] is excluded from the wait set; If status is NULL, all elements of ivars are included in the wait set.

  • cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value to be compared with ivar. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the number of indices returned in the indices array. If the wait set is empty, this routine returns 0.

ACLSHMEM_WAIT_UNTIL_ALL_VECTOR(NAME, TYPE)

Automatically generates aclshmem wait until all vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_all_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)

Function Description

Implements point-to-point synchronization by blocking until all entries in the wait set specified by ivars and status satisfy the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; If status[i] != 0, then ivars[i] is excluded from the wait set; If status is NULL, all elements of ivars are included in the wait set.

  • cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

ACLSHMEM_WAIT_UNTIL_ANY_VECTOR(NAME, TYPE)

Automatically generates aclshmem wait until any vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_any_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)

Function Description

Implements point-to-point synchronization by blocking until any one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; If status[i] != 0, then ivars[i] is excluded from the wait set; If status is NULL, all elements of ivars are included in the wait set.

  • cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the index of an element in the ivars array that satisfies the wait condition. If the wait set is empty, this routine returns SIZE_MAX.

ACLSHMEM_WAIT_UNTIL_SOME_VECTOR(NAME, TYPE)

Automatically generates aclshmem wait until some vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_wait_until_some_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)

Function Description

Implements point-to-point synchronization by blocking until at least one entry in the wait set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • indices - [out] Local address of an array of indices of length at least nelems into ivars that satisfied the wait condition.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the wait set; If status[i] != 0, then ivars[i] is excluded from the wait set; If status is NULL, all elements of ivars are included in the wait set.

  • cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the number of indices returned in the indices array. If the wait set is empty, this routine returns 0.

ACLSHMEM_TEST(NAME, TYPE)

Automatically generates aclshmem test functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE int aclshmem_NAME_test(__gm__ TYPE *ivars, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by testing whether the value of ivar satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivars - [in] Symmetric address of a remotely accessible data object. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value against which the object pointed to by ivars will be compared. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return 1 if the comparison (via the operator cmp) between the ivar and cmp_value results in true; otherwise, return 0.

ACLSHMEM_TEST_ANY(NAME, TYPE)

Automatically generates aclshmem test any functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_test_any(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by testing whether any one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; if status[i] != 0, then ivars[i] is excluded from the test set; if status is NULL, all elements of ivars are included in the test set.

  • cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value to be compared with ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the index of an element in the ivars array that satisfies the test condition. If the test set is empty or no conditions in the test set are satisfied, this routine returns SIZE_MAX.
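
The status mask and the SIZE_MAX sentinel can be illustrated with a small host-side model. It is only a sketch: mock_test_any_ge is a hypothetical name, the comparison is fixed to ">=" for brevity, and returning the lowest satisfying index is a choice of this model, since the text above only promises some satisfying index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side model of the test_any selection rule, fixed to a ">=" comparison.
// mock_test_any_ge is a hypothetical name; it reads ordinary host memory.
template <typename T>
std::size_t mock_test_any_ge(const T *ivars, std::size_t nelems, const int *status,
                             T cmp_value)
{
    assert(nelems == 0 || ivars != nullptr);
    for (std::size_t i = 0; i < nelems; ++i) {
        // status == nullptr includes every element; otherwise 0 means "in the test set".
        const bool in_test_set = (status == nullptr) || (status[i] == 0);
        if (in_test_set && ivars[i] >= cmp_value) {
            return i;  // index of a satisfying element (this model picks the lowest)
        }
    }
    return SIZE_MAX;  // empty test set, or no element satisfied the condition
}
```

Callers would typically compare the result against SIZE_MAX before using it as an index.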

ACLSHMEM_TEST_SOME(NAME, TYPE)

Automatically generates aclshmem test some functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_test_some(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, TYPE cmp_value)

Function Description

Implements point-to-point synchronization by testing whether at least one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and comparison value, cmp_value.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • indices - [out] Local address of an array of indices of length at least nelems into ivars that satisfied the test condition.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; If status[i] != 0, then ivars[i] is excluded from the test set; If status is NULL, all elements of ivars are included in the test set.

  • cmp - [in] The comparison operator that compares ivars with cmp_value. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_value - [in] The value to be compared with ivars. The type of cmp_value should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the number of indices returned in the indices array. If the test set is empty, this routine returns 0.
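
How the indices output array and the return count relate can be sketched on the host as follows. The function name mock_test_some_eq is hypothetical, the comparison is fixed to "==", and recording indices in ascending order is an assumption of this model rather than a documented guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of the test_some result convention, fixed to an "==" comparison.
// mock_test_some_eq is a hypothetical name and reads ordinary host memory.
template <typename T>
std::size_t mock_test_some_eq(const T *ivars, std::size_t nelems, std::size_t *indices,
                              const int *status, T cmp_value)
{
    assert(nelems == 0 || (ivars != nullptr && indices != nullptr));
    std::size_t count = 0;
    for (std::size_t i = 0; i < nelems; ++i) {
        const bool in_test_set = (status == nullptr) || (status[i] == 0);
        if (in_test_set && ivars[i] == cmp_value) {
            indices[count++] = i;  // record each satisfying index
        }
    }
    return count;  // 0 when the test set is empty or nothing is satisfied
}
```

Only the first count entries of indices are meaningful after the call, which is why the array must be able to hold nelems entries.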

ACLSHMEM_TEST_ALL_VECTOR(NAME, TYPE)

Automatically generates aclshmem test all vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_test_all_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)

Function Description

Implements point-to-point synchronization by testing whether all entries in the test set specified by ivars and status satisfy the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; If status[i] != 0, then ivars[i] is excluded from the test set; If status is NULL, all elements of ivars are included in the test set.

  • cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return 1 if all elements in ivars satisfy the test conditions or if nelems is 0, otherwise this routine returns 0.
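
The difference between the scalar and the vector variants is that element i is compared against cmp_values[i] instead of one shared value. A host-side sketch of that rule is given here; mock_test_all_vector_ge is a hypothetical name, the comparison is fixed to ">=", and treating masked-out elements as ignored follows the status description above but is otherwise an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of the "vector" variants: element i is compared with cmp_values[i].
// Fixed to a ">=" comparison; mock_test_all_vector_ge is a hypothetical name.
template <typename T>
std::size_t mock_test_all_vector_ge(const T *ivars, std::size_t nelems, const int *status,
                                    const T *cmp_values)
{
    assert(nelems == 0 || (ivars != nullptr && cmp_values != nullptr));
    for (std::size_t i = 0; i < nelems; ++i) {
        const bool in_test_set = (status == nullptr) || (status[i] == 0);
        if (in_test_set && !(ivars[i] >= cmp_values[i])) {
            return 0;  // one included element fails its own threshold
        }
    }
    return 1;  // every included element passed, or nelems is 0
}
```

This shape suits cases where each slot has its own target, for example per-peer progress counters with different expected values.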

ACLSHMEM_TEST_ANY_VECTOR(NAME, TYPE)

Automatically generates aclshmem test any vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_test_any_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)

Function Description

Implements point-to-point synchronization by testing whether any one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; If status[i] != 0, then ivars[i] is excluded from the test set; If status is NULL, all elements of ivars are included in the test set.

  • cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the index of an element in the ivars array that satisfies the test condition. If the test set is empty or no conditions in the test set are satisfied, this routine returns SIZE_MAX.

ACLSHMEM_TEST_SOME_VECTOR(NAME, TYPE)

Automatically generates aclshmem test some vector functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE size_t aclshmem_NAME_test_some_vector(__gm__ TYPE *ivars, size_t nelems, __gm__ size_t *indices, __gm__ const int *status, int cmp, __gm__ TYPE *cmp_values)

Function Description

Implements point-to-point synchronization by testing whether at least one entry in the test set specified by ivars and status satisfies the condition defined by the comparison operator, cmp, and the per-element comparison values, cmp_values.

Parameters

  • ivars - [in] Symmetric address of an array of remotely accessible data objects. The type of ivars should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

  • nelems - [in] The number of elements in the ivars array.

  • indices - [out] Local address of an array of indices of length at least nelems into ivars that satisfied the test condition.

  • status - [in] Local address of an optional mask array of length nelems. If status[i] == 0, then ivars[i] is included in the test set; If status[i] != 0, then ivars[i] is excluded from the test set; If status is NULL, all elements of ivars are included in the test set.

  • cmp - [in] The comparison operator that compares ivars with cmp_values. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_values - [in] Local address of an array of length nelems containing values to be compared with the respective value in ivars. The type of cmp_values should match that implied in the ACLSHMEM_P2P_SYNC_TYPE_FUNC.

Returns

Return the number of indices returned in the indices array. If the test set is empty, this routine returns 0.

Functions

ACLSHMEM_DEVICE void aclshmemx_signal_op (__gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

The aclshmemx_signal_op operation updates sig_addr with signal using operation sig_op on the specified PE. This operation can be used together with aclshmem_signal_wait_until for efficient point-to-point synchronization. WARNING: Atomicity NOT Guaranteed.

Parameters:
  • sig_addr – [in] Symmetric address of the signal word to be updated.

  • signal – [in] The value used to update sig_addr.

  • sig_op – [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe – [in] PE number of the remote PE.

ACLSHMEM_DEVICE int32_t aclshmem_signal_wait_until (__gm__ int32_t *sig_addr, int cmp, int32_t cmp_val)

This routine can be used to implement point-to-point synchronization between PEs or between threads within the same PE. A call to this routine blocks until the value of sig_addr at the calling PE satisfies the wait condition specified by the comparison operator, cmp, and comparison value, cmp_val.

Parameters:
  • sig_addr – [in] Local address of the source signal variable.

  • cmp – [in] The comparison operator that compares sig_addr with cmp_val. Supported operators: ACLSHMEM_CMP_EQ/ACLSHMEM_CMP_NE/ACLSHMEM_CMP_GT/ACLSHMEM_CMP_GE/ACLSHMEM_CMP_LT/ACLSHMEM_CMP_LE.

  • cmp_val – [in] The value against which the object pointed to by sig_addr will be compared.

Returns:

Return the contents of the signal data object, sig_addr, at the calling PE that satisfies the wait condition.
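
The pairing of a signal update with a signal wait can be modeled on the host in a few lines. This is a single-threaded sketch, not the library's implementation: MockSignalOp, mock_signal_op, and mock_signal_wait_until_ge are hypothetical names, the enumerator values are illustrative, the update touches ordinary host memory instead of a remote PE, and the wait is fixed to a ">=" condition.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for ACLSHMEM_SIGNAL_SET / ACLSHMEM_SIGNAL_ADD; values are illustrative.
enum MockSignalOp : int { MOCK_SIGNAL_SET = 0, MOCK_SIGNAL_ADD = 1 };

// Models the update rule of a signal operation on one signal word.
// The real routine targets sig_addr on a remote PE; this model updates host memory.
inline void mock_signal_op(int32_t *sig_addr, int32_t signal, int sig_op)
{
    assert(sig_addr != nullptr);
    if (sig_op == MOCK_SIGNAL_SET) {
        *sig_addr = signal;
    } else if (sig_op == MOCK_SIGNAL_ADD) {
        *sig_addr += signal;
    } else {
        assert(false && "unsupported signal operation");
    }
}

// Models a wait with a ">=" condition: polls until satisfied, then returns the observed value.
// volatile forces a fresh read on each poll. In this single-threaded model the caller must
// make the condition true before calling, otherwise the loop never ends.
inline int32_t mock_signal_wait_until_ge(const volatile int32_t *sig_addr, int32_t cmp_val)
{
    int32_t observed = *sig_addr;
    while (observed < cmp_val) {
        observed = *sig_addr;
    }
    return observed;
}
```

Note that the returned value is whatever satisfied the condition, which may be larger than cmp_val. Because the documented signal operation does not guarantee atomicity, concurrent ADD updates from several PEs to one signal word should not be assumed to accumulate reliably.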

shmem_device_rma.h

Defines

ACLSHMEM_TYPE_FUNC(FUNC)

Standard RMA Types and Names.

NAME        TYPE
half        half
float       float
double      double
int8        int8
int16       int16
int32       int32
int64       int64
uint8       uint8
uint16      uint16
uint32      uint32
uint64      uint64
char        char
bfloat16    bfloat16

ACLSHMEM_TEST_TYPE_FUNC(FUNC)

Standard test Types and Names.

NAME        TYPE
float       float
int8        int8
int16       int16
int32       int32
int64       int64
uint8       uint8
uint16      uint16
uint32      uint32
uint64      uint64
char        char

ACLSHMEM_TYPENAME_P_AICORE(NAME, TYPE)

Automatically generates aclshmem p functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_p(__gm__ TYPE *dst, const TYPE value, int pe)

Function Description

Provide a low latency put capability for single element of most basic types.

Parameters

  • dst - [in] Symmetric address of the destination data on local PE.

  • value - [in] The element to be put.

  • pe - [in] The number of the remote PE.

ACLSHMEM_TYPENAME_G_AICORE(NAME, TYPE)

Automatically generates aclshmem g functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE TYPE aclshmem_NAME_g(__gm__ TYPE *src, int32_t pe)

Function Description

Provide a low latency get capability for single element of most basic types.

Parameters

  • src - [in] Symmetric address of the source data on local PE.

  • pe - [in] The number of the remote PE.

Returns

A single element of type specified in the input pointer.

ACLSHMEM_GET_TYPENAME_MEM(NAME, TYPE)

Automatically generates aclshmem get functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)

Function Description

Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.

Parameters

  • dst - [in] Pointer on local device of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_IGET_TYPENAME_MEM(NAME, TYPE)

Automatically generates aclshmem iget (strided get) functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_iget(__gm__ TYPE *dest, __gm__ TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)

Function Description

Synchronous interface. Copy strided data elements from a symmetric array from a specified remote PE to strided locations on a local array.

Parameters

  • dest - [in] Pointer on local device of the destination data.

  • source - [in] Pointer on Symmetric memory of the source data.

  • dst - [in] The stride between consecutive elements of the dest array.

  • sst - [in] The stride between consecutive elements of the source array.

  • nelems - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.
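
The addressing rule behind a strided copy can be shown with a host-side model. It is a sketch under assumptions: mock_strided_copy is a hypothetical name, both arrays live in ordinary host memory, and the strides are taken to count elements rather than bytes, as they do for shmem_iget in OpenSHMEM.

```cpp
#include <cassert>
#include <cstddef>

// Host-side model of the strided addressing used by iget/iput-style copies.
// Assumption: strides count elements, not bytes, as in OpenSHMEM's shmem_iget.
template <typename T>
void mock_strided_copy(T *dest, const T *source, std::ptrdiff_t dst, std::ptrdiff_t sst,
                       std::size_t nelems)
{
    assert(nelems == 0 || (dest != nullptr && source != nullptr));
    for (std::size_t i = 0; i < nelems; ++i) {
        const std::ptrdiff_t k = static_cast<std::ptrdiff_t>(i);
        dest[k * dst] = source[k * sst];  // i-th element lands dst apart, read sst apart
    }
}
```

For instance, dst = 1 and sst = 2 gathers every second source element into a packed destination, so the source array must span at least (nelems - 1) * sst + 1 elements.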

ACLSHMEM_GET_SIZE_MEM(BITS)

Automatically generates aclshmem get functions for different element widths in bits (e.g., 8, 16). The macro parameter BITS is the element width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_getBITS(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Function Description

Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.

Parameters

  • dst - [in] Pointer on local device of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • pe - [in] PE number of the remote PE.
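
Because the sized variants take void pointers, the transfer length follows from the element count and the width in the function name. The host-side model below illustrates that arithmetic only. mock_get_bits is a hypothetical name, and the sketch assumes that elem_size counts elements of BITS bits each and that BITS is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-side model of a sized contiguous copy: elem_size elements of BITS bits each.
// Assumption: elem_size is an element count, so the byte length is elem_size * BITS / 8.
template <unsigned BITS>
void mock_get_bits(void *dst, const void *src, uint32_t elem_size)
{
    static_assert(BITS % 8 == 0, "element width must be a whole number of bytes");
    assert(elem_size == 0 || (dst != nullptr && src != nullptr));
    std::memcpy(dst, src, static_cast<std::size_t>(elem_size) * (BITS / 8));
}
```

Under this reading, mock_get_bits<16>(dst, src, 3) moves 6 bytes, regardless of the declared type of the underlying buffers.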

ACLSHMEM_IGET_SIZE_MEM(BITS)

Automatically generates aclshmem iget (strided get) functions for different element widths in bits (e.g., 8, 16). The macro parameter BITS is the element width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_igetBITS(__gm__ void *dest, __gm__ void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)

Function Description

Synchronous interface. Copy strided data elements of a symmetric array on the specified remote PE to strided locations in a local array.

Parameters

  • dest - [in] Pointer on local device of the destination data.

  • source - [in] Pointer on Symmetric memory of the source data.

  • dst - [in] The stride between consecutive elements of the dest array.

  • sst - [in] The stride between consecutive elements of the source array.

  • nelems - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM(NAME, TYPE)

Automatically generates aclshmem put functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)

Function Description

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_IPUT_TYPENAME_MEM(NAME, TYPE)

Automatically generates aclshmem strided put (iput) functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_iput(__gm__ TYPE *dest, __gm__ TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)

Function Description

Synchronous interface. Copy elements taken at stride sst from a source array on the local PE to locations at stride dst in a dest array on the specified remote PE.

Parameters

  • dest - [in] Pointer on Symmetric memory of the destination data.

  • source - [in] Pointer on local device of the source data.

  • dst - [in] The stride between consecutive elements of the dest array.

  • sst - [in] The stride between consecutive elements of the source array.

  • nelems - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.
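One common use of a strided put is writing a packed vector into a column of a row-major matrix. The sketch below models that mapping on the host, assuming strides are counted in elements. model_iput and demo_iput_column are hypothetical names, and the "remote" matrix is an ordinary local vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the strided put index mapping (no remote access).
template <typename T>
void model_iput(T *dest, const T *source, std::ptrdiff_t dst, std::ptrdiff_t sst,
                std::size_t nelems)
{
    for (std::size_t i = 0; i < nelems; ++i) {
        const std::ptrdiff_t k = static_cast<std::ptrdiff_t>(i);
        dest[k * dst] = source[k * sst];
    }
}

// Write a packed 3-element vector into column 1 of a 3x4 row-major matrix.
inline std::vector<int> demo_iput_column()
{
    constexpr std::ptrdiff_t cols = 4;
    std::vector<int> remote_matrix(3 * cols, 0);
    const std::vector<int> local_column = {7, 8, 9};
    // dst = cols steps one row at a time; sst = 1 reads the packed source.
    model_iput(remote_matrix.data() + 1, local_column.data(), cols, 1, local_column.size());
    return remote_matrix;
}
```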

ACLSHMEM_PUT_SIZE_MEM(BITS)

Automatically generates aclshmem put functions for different element widths (e.g., 8, 16). The macro parameter BITS is the element width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_putBITS(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Function Description

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_IPUT_SIZE_MEM(BITS)

Automatically generates aclshmem strided put (iput) functions for different element widths (e.g., 8, 16). The macro parameter BITS is the element width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_iputBITS(__gm__ void *dest, __gm__ void *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe)

Function Description

Synchronous interface. Copy elements taken at stride sst from a source array on the local PE to locations at stride dst in a dest array on the specified remote PE.

Parameters

  • dest - [in] Pointer on Symmetric memory of the destination data.

  • source - [in] Pointer on local device of the source data.

  • dst - [in] The stride between consecutive elements of the dest array.

  • sst - [in] The stride between consecutive elements of the source array.

  • nelems - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_GET_TYPENAME_MEM_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)

Function Description

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters

  • dst - [in] Pointer on local device of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_GET_SIZE_MEM_NBI(BITS)

Automatically generates aclshmem get nbi functions for different element widths (e.g., 8, 16). The macro parameter BITS is the element width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_getBITS_nbi(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Function Description

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters

  • dst - [in] Pointer on local device of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • pe - [in] PE number of the remote PE.
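The difference between the synchronous and asynchronous (nbi) interfaces can be illustrated with a small host-side model. It assumes the general OpenSHMEM-style rule that a non-blocking call may return before the data has arrived, so the destination must not be read until the transfer has been completed. This section does not name the call that completes outstanding transfers, so ModelNbiQueue and its complete() method are purely hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical host-side model of non-blocking ("nbi") semantics.
// It defers every copy to show the latest point at which data may arrive.
class ModelNbiQueue {
public:
    // Record the copy and return immediately; dst is not written yet.
    void get_nbi(void *dst, const void *src, std::size_t bytes)
    {
        pending_.push_back([=] { std::memcpy(dst, src, bytes); });
    }

    // Hypothetical completion point: perform all recorded copies.
    void complete()
    {
        for (auto &op : pending_) {
            op();
        }
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};

inline bool demo_nbi()
{
    const int remote[2] = {5, 6};
    int local[2] = {0, 0};
    ModelNbiQueue q;
    q.get_nbi(local, remote, sizeof(remote));
    const bool untouched_before = (local[0] == 0 && local[1] == 0);
    q.complete();
    const bool filled_after = (local[0] == 5 && local[1] == 6);
    return untouched_before && filled_after && q.pending() == 0;
}
```

A real transport may deliver the data at any time between the call and the completion point; the model only shows why reading the destination early is unsafe.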

ACLSHMEM_GET_TYPENAME_MEM_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.

Parameters

  • dst - [in] Pointer on local device of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe - [in] PE number of the remote PE.
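A non-contiguous copy of this kind is typically a set of equal-length contiguous runs placed at a fixed stride, for example a tile of a larger matrix. The layout of non_contiguous_copy_param is not given in this section, so the sketch below uses a hypothetical descriptor, model_copy_param, whose field names are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor; NOT the real non_contiguous_copy_param layout.
struct model_copy_param {
    std::uint32_t repeat;      // number of contiguous runs
    std::uint32_t length;      // elements per run
    std::uint32_t src_stride;  // elements between run starts in src
    std::uint32_t dst_stride;  // elements between run starts in dst
};

// Host-side model: copy `repeat` runs of `length` elements each.
template <typename T>
void model_block_copy(T *dst, const T *src, const model_copy_param &p)
{
    for (std::uint32_t r = 0; r < p.repeat; ++r) {
        for (std::uint32_t i = 0; i < p.length; ++i) {
            dst[static_cast<std::size_t>(r) * p.dst_stride + i] =
                src[static_cast<std::size_t>(r) * p.src_stride + i];
        }
    }
}

// Pull a 2x3 tile (starting at row 1, column 2) out of a 4x8 row-major
// matrix into a packed 2x3 buffer.
inline std::vector<int> demo_tile()
{
    std::vector<int> src(4 * 8);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<int>(i);
    }
    std::vector<int> dst(2 * 3, -1);
    const model_copy_param p{2, 3, 8, 3};
    model_block_copy(dst.data(), src.data() + 1 * 8 + 2, p);
    return dst;
}
```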

ACLSHMEM_GET_TYPENAME_MEM_TENSOR_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, uint32_t elem_size, int pe)

Function Description

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters

  • dst - [in] GlobalTensor on local device of the destination data.

  • src - [in] GlobalTensor on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_GET_TYPENAME_MEM_TENSOR_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.

Parameters

  • dst - [in] GlobalTensor on local device of the destination data.

  • src - [in] GlobalTensor on Symmetric memory of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int32_t pe)

Function Description

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_SIZE_MEM_NBI(BITS)

Automatically generates aclshmem put nbi functions for different element widths (e.g., 8, 16). The macro parameter BITS is the element width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_putBITS_nbi(__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Function Description

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_TENSOR_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, uint32_t elem_size, int pe)

Function Description

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] GlobalTensor on local device of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_TENSOR_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] GlobalTensor on local device of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe - [in] PE number of the remote PE.

Functions

ACLSHMEM_DEVICE void aclshmem_getmem (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Synchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • elem_size – [in] Number of elements in the dest and source arrays.

  • pe – [in] PE number of the remote PE.

ACLSHMEM_DEVICE void aclshmem_putmem (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • elem_size – [in] Number of elements in the dest and source arrays.

  • pe – [in] PE number of the remote PE.

ACLSHMEM_DEVICE void aclshmem_getmem_nbi (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • elem_size – [in] Number of elements in the dest and source arrays.

  • pe – [in] PE number of the remote PE.

ACLSHMEM_DEVICE void aclshmem_putmem_nbi (__gm__ void *dst, __gm__ void *src, uint32_t elem_size, int32_t pe)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • elem_size – [in] Number of elements in the dest and source arrays.

  • pe – [in] PE number of the remote PE.

ACLSHMEM_DEVICE void aclshmemx_set_mte_config (uint64_t offset, uint32_t ub_size, uint32_t sync_id)

Set necessary parameters for put or get.

Parameters:
  • offset – [in] The start address on UB.

  • ub_size – [in] Size of the temporary UB buffer.

  • sync_id – [in] Sync ID for put or get.

Defines

ACLSHMEM_TYPE_FUNC(FUNC)

Standard RMA Types and Names.

NAME        TYPE
half        half
float       float
double      double
int8        int8
int16       int16
int32       int32
int64       int64
uint8       uint8
uint16      uint16
uint32      uint32
uint64      uint64
char        char
bfloat16    bfloat16
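A type list such as ACLSHMEM_TYPE_FUNC(FUNC) combined with a per-type generator macro is commonly implemented with the X-macro pattern: the list macro invokes FUNC(NAME, TYPE) once per row, and the generator pastes NAME into the function name. The sketch below shows that pattern on the host with hypothetical DEMO_ macros and a plain memcpy body. It is an illustration of the technique, not the library's actual macro definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical type list in the style of a NAME/TYPE table:
// invokes FUNC(NAME, TYPE) once per supported type.
#define DEMO_TYPE_FUNC(FUNC) \
    FUNC(float, float) \
    FUNC(int8, std::int8_t) \
    FUNC(int32, std::int32_t)

// Hypothetical generator: defines demo_<NAME>_get for one TYPE.
#define DEMO_GET_TYPENAME_MEM(NAME, TYPE) \
    inline void demo_##NAME##_get(TYPE *dst, const TYPE *src, std::uint32_t elem_size) \
    { \
        std::memcpy(dst, src, sizeof(TYPE) * elem_size); \
    }

// Expands to demo_float_get, demo_int8_get and demo_int32_get.
DEMO_TYPE_FUNC(DEMO_GET_TYPENAME_MEM)

inline std::int32_t demo_generated_call()
{
    const std::int32_t src[3] = {1, 2, 3};
    std::int32_t dst[3] = {0, 0, 0};
    demo_int32_get(dst, src, 3);
    return dst[0] + dst[1] + dst[2];
}
```

Adding a row to the list macro is then enough to generate the whole family of typed functions for a new type.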

ACLSHMEM_GET_TYPENAME_MEM_UB_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__ubuf__ TYPE *dst, __gm__ TYPE *src, uint32_t elem_size, int pe)

Function Description

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters

  • dst - [in] Pointer on local UB of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_GET_TYPENAME_MEM_UB_TENSOR_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::LocalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, uint32_t elem_size, int pe)

Function Description

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters

  • dst - [in] LocalTensor on local UB of the destination data.

  • src - [in] GlobalTensor on Symmetric memory of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_GET_TYPENAME_MEM_UB_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(__ubuf__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, int pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters

  • dst - [in] Pointer on local UB of the destination data.

  • src - [in] Pointer on Symmetric memory of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_GET_TYPENAME_MEM_UB_TENSOR_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem get nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_get_nbi(AscendC::LocalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters

  • dst - [in] LocalTensor on local UB of the destination data.

  • src - [in] GlobalTensor on Symmetric memory of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_UB_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __ubuf__ TYPE *src, uint32_t elem_size, int32_t pe)

Function Description

Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local UB of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_UB_TENSOR_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::LocalTensor<TYPE> src, uint32_t elem_size, int32_t pe)

Function Description

Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] LocalTensor on local UB of the source data.

  • elem_size - [in] Number of elements in the destination and source arrays.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_UB_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(__gm__ TYPE *dst, __ubuf__ TYPE *src, const non_contiguous_copy_param &copy_params, int32_t pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local UB of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_UB_TENSOR_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem put nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::LocalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, int32_t pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] LocalTensor on local UB of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe - [in] PE number of the remote PE.

Functions

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__ubuf__ T *dst, __gm__ T *src, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters:
  • dst – [in] Pointer on local UB of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::LocalTensor< T > dst, AscendC::GlobalTensor< T > src, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters:
  • dst – [in] LocalTensor on local UB of the destination data.

  • src – [in] GlobalTensor on Symmetric memory of the source data.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__ubuf__ T *dst, __gm__ T *src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters:
  • dst – [in] Pointer on local UB of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::LocalTensor< T > dst, AscendC::GlobalTensor< T > src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local UB.

Parameters:
  • dst – [in] LocalTensor on local UB of the destination data.

  • src – [in] GlobalTensor on Symmetric memory of the source data.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __ubuf__ T *src, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local UB of the source data.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::LocalTensor< T > src, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local UB to symmetric address on the specified PE.

Parameters:
  • dst – [in] GlobalTensor on Symmetric memory of the destination data.

  • src – [in] LocalTensor on local UB of the source data.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __ubuf__ T *src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local UB of the source data.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::LocalTensor< T > src, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local UB to symmetric address on the specified PE.

Parameters:
  • dst – [in] GlobalTensor on Symmetric memory of the destination data.

  • src – [in] LocalTensor on local UB of the source data.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

Functions

ACLSHMEM_DEVICE __gm__ void * aclshmem_ptr (__gm__ void *ptr, int pe)

Translate a local symmetric address to the corresponding remote symmetric address on the specified PE.

Parameters:
  • ptr – [in] Symmetric address on local PE.

  • pe – [in] The number of the remote PE.

Returns:

A remote symmetric address on the specified PE that can be accessed using memory loads and stores.
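Address translation of this kind is usually offset-based: a symmetric object sits at the same offset in every PE's symmetric heap, so the remote address is the remote heap base plus the local offset. The host-side sketch below models that idea with ordinary vectors standing in for per-PE heaps. ModelHeaps and translate are hypothetical names, and the real library's heap layout is not described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: every PE owns one equal-sized symmetric region.
struct ModelHeaps {
    std::vector<std::vector<std::uint8_t>> heap;  // heap[pe] is that PE's region

    ModelHeaps(int n_pes, std::size_t bytes)
        : heap(static_cast<std::size_t>(n_pes), std::vector<std::uint8_t>(bytes, 0))
    {
    }

    // Map an address inside my_pe's heap to the same offset inside pe's heap.
    void *translate(const void *ptr, int my_pe, int pe)
    {
        const auto *p = static_cast<const std::uint8_t *>(ptr);
        const std::size_t offset = static_cast<std::size_t>(p - heap[my_pe].data());
        if (offset >= heap[my_pe].size()) {
            return nullptr;  // not a symmetric address in this model
        }
        return heap[pe].data() + offset;
    }
};

// Store through the translated pointer and read the value back from PE 1's heap.
inline int demo_translate()
{
    ModelHeaps heaps(2, 64);
    void *local = heaps.heap[0].data() + 16;
    auto *remote = static_cast<std::uint8_t *>(heaps.translate(local, 0, 1));
    *remote = 42;
    return heaps.heap[1][16];
}
```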

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device.

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • buf – [in] Pointer on local UB.

  • ub_size – [in] Size of the temporary buffer on UB, in bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.
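When buf is smaller than the total payload, a transfer that is staged through it has to proceed in rounds of at most ub_size bytes. The host-side sketch below models that chunking, assuming the data really is staged through buf one chunk at a time (this section implies it but does not spell it out). model_staged_copy and demo_staged are hypothetical names, and plain memcpy stands in for the device data movement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side model of staging a copy through a bounded buffer.
// ub_size is in bytes, elem_size in elements. Returns the number of rounds.
template <typename T>
std::uint32_t model_staged_copy(T *dst, const T *src, T *buf, std::uint32_t ub_size,
                                std::uint32_t elem_size)
{
    const std::size_t chunk = ub_size / sizeof(T);  // elements that fit in buf
    if (chunk == 0) {
        return 0;  // buffer too small to hold a single element
    }
    std::uint32_t rounds = 0;
    for (std::size_t done = 0; done < elem_size; done += chunk) {
        const std::size_t n = std::min(chunk, static_cast<std::size_t>(elem_size) - done);
        std::memcpy(buf, src + done, n * sizeof(T));  // source -> staging buffer
        std::memcpy(dst + done, buf, n * sizeof(T));  // staging buffer -> destination
        ++rounds;
    }
    return rounds;
}

// Copy 10 floats through a buffer that holds only 4 of them.
inline std::uint32_t demo_staged()
{
    std::vector<float> src(10), dst(10, 0.0f), buf(4);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<float>(i);
    }
    const std::uint32_t ub_bytes = static_cast<std::uint32_t>(buf.size() * sizeof(float));
    const std::uint32_t rounds =
        model_staged_copy(dst.data(), src.data(), buf.data(), ub_bytes, 10);
    assert(dst == src);
    return rounds;
}
```

Ten elements through a four-element buffer take three rounds (4 + 4 + 2), which is why a larger ub_size generally means fewer rounds for the same payload.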

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • buf – [in] Pointer on local UB.

  • ub_size – [in] Size of the temporary buffer on UB, in bytes.

  • copy_params – [in] Params to describe how non-contiguous data is managed in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE.

Parameters:
  • dst – [in] GlobalTensor on local device of the destination data.

  • src – [in] GlobalTensor on Symmetric memory of the source data.

  • buf – [in] LocalTensor on local UB.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on symmetric memory from the specified PE to address on the local device.

Parameters:
  • dst – [in] GlobalTensor on local device of the destination data.

  • src – [in] GlobalTensor on Symmetric memory of the source data.

  • buf – [in] LocalTensor on local UB.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • buf – [in] Pointer on local UB.

  • ub_size – [in] Size of the temporary buffer on UB, in bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • buf – [in] Pointer on local UB.

  • ub_size – [in] Size of the temporary buffer on UB, in bytes.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE.

Parameters:
  • dst – [in] GlobalTensor on Symmetric memory of the destination data.

  • src – [in] GlobalTensor on local device of the source data.

  • buf – [in] LocalTensor on local UB.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_mte_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, const non_contiguous_copy_param &copy_params, int pe, uint32_t sync_id)

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE.

Parameters:
  • dst – [in] GlobalTensor on Symmetric memory of the destination data.

  • src – [in] GlobalTensor on local device of the source data.

  • buf – [in] LocalTensor on local UB.

  • copy_params – [in] Params to describe how non-contiguous data is organized in src and dst.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync pipeline.
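The fields of non_contiguous_copy_param are not listed in this section. As a hedged host-side sketch of what a non-contiguous (strided) copy means, the example below assumes a hypothetical descriptor, demo_copy_param, with a row count, a row length, and separate leading dimensions for src and dst. The real structure may use different names or fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical layout descriptor; the real non_contiguous_copy_param
// is not shown in this reference and may differ.
struct demo_copy_param {
    uint32_t repeat;  // number of contiguous rows to copy
    uint32_t length;  // elements per row
    uint32_t src_ld;  // distance between row starts in src, in elements
    uint32_t dst_ld;  // distance between row starts in dst, in elements
};

// Host-side model: copy `repeat` rows of `length` elements each.
template <typename T>
void strided_copy(T *dst, const T *src, const demo_copy_param &p)
{
    assert(dst != nullptr && src != nullptr);
    for (uint32_t r = 0; r < p.repeat; ++r) {
        std::memcpy(dst + static_cast<std::size_t>(r) * p.dst_ld,
                    src + static_cast<std::size_t>(r) * p.src_ld,
                    p.length * sizeof(T));
    }
}
```

For example, {repeat = 2, length = 3, src_ld = 4, dst_ld = 3} packs the top-left 2x3 tile of a row-major matrix with 4 columns into a dense 2x3 destination.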

Functions

ACLSHMEM_DEVICE __gm__ void * aclshmem_roce_ptr (__gm__ void *ptr, int pe)

Translate a local symmetric address to the corresponding remote symmetric address on the specified PE, for use by RDMA.

Parameters:
  • ptr – [in] Symmetric address on local PE.

  • pe – [in] The number of the remote PE.

Returns:

A remote symmetric address on the specified PE that can be accessed using memory loads and stores.
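How the library derives the remote address is not described here. One common scheme for symmetric heaps, shown below as a host-side sketch under that assumption, is to keep one heap base per PE and reuse the local offset on the remote side. demo_heaps and demo_remote_ptr are hypothetical names, not part of the API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model: every PE has a heap of the same size, and a symmetric
// object sits at the same offset in each heap (an assumption of this sketch).
struct demo_heaps {
    std::vector<unsigned char *> base;  // base[pe] = start of that PE's heap
    std::size_t heap_bytes;             // size of each heap
};

// Returns the address on `pe` that corresponds to `local_ptr` on `my_pe`,
// or nullptr if a PE index is out of range or the pointer is not in the heap.
inline void *demo_remote_ptr(const demo_heaps &h, int my_pe, void *local_ptr, int pe)
{
    const int n = static_cast<int>(h.base.size());
    if (my_pe < 0 || my_pe >= n || pe < 0 || pe >= n) {
        return nullptr;
    }
    const auto addr = reinterpret_cast<std::uintptr_t>(local_ptr);
    const auto local_base = reinterpret_cast<std::uintptr_t>(h.base[my_pe]);
    if (addr < local_base || addr - local_base >= h.heap_bytes) {
        return nullptr;  // not a symmetric address
    }
    return h.base[pe] + (addr - local_base);
}
```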

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • buf – [in] Pointer on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • buf – [in] Pointer on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync the S/MTE3 event.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.

Parameters:
  • dst – [in] GlobalTensor on local device of the destination data.

  • src – [in] GlobalTensor on Symmetric memory of the source data.

  • buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_get_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters:
  • dst – [in] GlobalTensor on local device of the destination data.

  • src – [in] GlobalTensor on Symmetric memory of the source data.

  • buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync the S/MTE3 event.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • buf – [in] Pointer on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • buf – [in] Pointer on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync the S/MTE3 event.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported. Use sync_id in device_state.rdma_config for pipeline synchronization.

Parameters:
  • dst – [in] GlobalTensor on Symmetric memory of the destination data.

  • src – [in] GlobalTensor on local device of the source data.

  • buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_roce_put_nbi (AscendC::GlobalTensor< T > dst, AscendC::GlobalTensor< T > src, AscendC::LocalTensor< T > buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using RDMA as the underlying transport, concurrent RMA/AMO operations to the same PE are not supported.

Parameters:
  • dst – [in] GlobalTensor on Symmetric memory of the destination data.

  • src – [in] GlobalTensor on local device of the source data.

  • buf – [in] LocalTensor on local UB, available space larger than 64 Bytes.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync the S/MTE3 event.

Functions

ACLSHMEM_DEVICE void aclshmemx_set_sdma_config (uint64_t offset, uint32_t ub_size, uint32_t sync_id)

Set necessary parameters for SDMA operations.

Parameters:
  • offset – [in] The start address on UB.

  • ub_size – [in] The size of the temp buffer on UB, in bytes. Must be at least 64 bytes and 64-byte aligned.

  • sync_id – [in] Sync ID for put or get.
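The ub_size constraint above (at least 64 bytes and 64-byte aligned) can be checked before the configuration call. The helpers below are a small sketch of such a check; their names are hypothetical and they are not part of the API.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kSdmaUbAlign = 64;  // taken from the ub_size description above

// True if ub_size satisfies "at least 64 bytes and 64-byte aligned".
constexpr bool sdma_ub_size_ok(uint32_t ub_size)
{
    return ub_size >= kSdmaUbAlign && ub_size % kSdmaUbAlign == 0;
}

// Largest valid size not exceeding `available`; 0 if no valid size fits.
constexpr uint32_t sdma_ub_size_floor(uint32_t available)
{
    return available - available % kSdmaUbAlign;
}
```

A caller with 200 free bytes on UB could pass sdma_ub_size_floor(200), which is 192, as ub_size.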

template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_get_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local device. WARNING: When using SDMA as the underlying transport method, the number of AIV cores invoked must not exceed 40 (ACLSHMEM_SDMA_MAX_CHAN).

Parameters:
  • dst – [in] Pointer on local device of the destination data.

  • src – [in] Pointer on Symmetric memory of the source data.

  • buf – [in] Pointer on local UB.

  • ub_size – [in] The size of temp Buffer on UB. (In Bytes)

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_get_nbi (AscendC::GlobalTensor< T > &dst, AscendC::GlobalTensor< T > &src, AscendC::LocalTensor< T > &buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on symmetric memory from the specified PE to address on the local PE. WARNING: When using SDMA as the underlying transport method, the number of AIV cores invoked must not exceed 40 (ACLSHMEM_SDMA_MAX_CHAN).

Parameters:
  • dst – [in] AscendC::GlobalTensor on local device of the destination data.

  • src – [in] AscendC::GlobalTensor on Symmetric memory of the source data.

  • buf – [in] LocalTensor on local UB.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_put_nbi (__gm__ T *dst, __gm__ T *src, __ubuf__ T *buf, uint32_t ub_size, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using SDMA as the underlying transport method, the number of AIV cores invoked must not exceed 40 (ACLSHMEM_SDMA_MAX_CHAN).

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • buf – [in] Pointer on local UB.

  • ub_size – [in] The size of temp Buffer on UB. (In Bytes)

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync.

template<typename T> ACLSHMEM_DEVICE void aclshmemx_sdma_put_nbi (AscendC::GlobalTensor< T > &dst, AscendC::GlobalTensor< T > &src, AscendC::LocalTensor< T > &buf, uint32_t elem_size, int pe, uint32_t sync_id)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE. WARNING: When using SDMA as the underlying transport method, the number of AIV cores invoked must not exceed 40 (ACLSHMEM_SDMA_MAX_CHAN).

Parameters:
  • dst – [in] AscendC::GlobalTensor on Symmetric memory of the destination data.

  • src – [in] AscendC::GlobalTensor on local device of the source data.

  • buf – [in] LocalTensor on local UB.

  • elem_size – [in] Number of elements in the destination and source arrays.

  • pe – [in] PE number of the remote PE.

  • sync_id – [in] ID used to sync.
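The SDMA warnings above cap the number of AIV cores at 40. One way to respect that cap, sketched here on the host side, is to clamp the launch width and give each core a contiguous slice of the elements. The constant mirrors the value quoted in the warning; the helper names and the even-split policy are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint32_t kDemoSdmaMaxChan = 40;  // value quoted in the SDMA warning

struct demo_slice {
    uint32_t begin;  // first element handled by the core
    uint32_t count;  // number of elements handled by the core
};

// Number of AIV cores to launch: never more than the SDMA channel limit.
inline uint32_t sdma_core_count(uint32_t requested)
{
    return std::min(requested, kDemoSdmaMaxChan);
}

// Slice of [0, elem_size) handled by `core` out of `cores`.
inline demo_slice sdma_slice(uint32_t elem_size, uint32_t cores, uint32_t core)
{
    assert(cores > 0 && core < cores);
    const uint32_t base = elem_size / cores;
    const uint32_t extra = elem_size % cores;  // the first `extra` cores take one more
    const uint32_t begin = core * base + std::min(core, extra);
    return {begin, base + (core < extra ? 1u : 0u)};
}
```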

shmem_device_so.h

Defines

ACLSHMEM_TYPE_FUNC(FUNC)

Standard RMA Types and Names.

Copyright (c) 2025 Huawei Technologies Co., Ltd. This program is free software, you can redistribute it and/or modify it under the terms and conditions of CANN Open Software License Agreement Version 2.0 (the “License”). Please refer to the License for details. You may not use this file except in compliance with the License. THIS SOFTWARE IS PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. See LICENSE in the root of the software repository for the full text of the License.

NAME        TYPE
half        half
float       float
double      double
int8        int8
int16       int16
int32       int32
int64       int64
uint8       uint8
uint16      uint16
uint32      uint32
uint64      uint64
char        char
bfloat16    bfloat16

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL(NAME, TYPE)

Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(__gm__ TYPE *dst, __gm__ TYPE *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.
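The NAME/TYPE table and the generator macro above follow the familiar X-macro pattern: a list macro applies a per-type macro to every row, producing one function per data type. The host-side sketch below reproduces that pattern with hypothetical DEMO_ names, a reduced type list, and a plain memcpy standing in for the remote copy. It is not the library's definition, and the pe argument is omitted because the model has a single address space.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum demo_sig_op { DEMO_SIGNAL_SET, DEMO_SIGNAL_ADD };

// Reduced stand-in for the NAME/TYPE table: FUNC is applied to each row.
#define DEMO_TYPE_FUNC(FUNC) \
    FUNC(float, float)       \
    FUNC(int32, int32_t)     \
    FUNC(uint8, uint8_t)

// Generates demo_NAME_put_signal for one (NAME, TYPE) row:
// copy the payload first, then update the signal word.
#define DEMO_PUT_TYPENAME_MEM_SIGNAL(NAME, TYPE)                            \
    inline void demo_##NAME##_put_signal(TYPE *dst, const TYPE *src,        \
                                         std::size_t elem_size,             \
                                         int32_t *sig_addr, int32_t signal, \
                                         int sig_op)                        \
    {                                                                       \
        std::memcpy(dst, src, elem_size * sizeof(TYPE));                    \
        if (sig_op == DEMO_SIGNAL_SET) {                                    \
            *sig_addr = signal;                                             \
        } else {                                                            \
            *sig_addr += signal;                                            \
        }                                                                   \
    }

// Expands to demo_float_put_signal, demo_int32_put_signal, demo_uint8_put_signal.
DEMO_TYPE_FUNC(DEMO_PUT_TYPENAME_MEM_SIGNAL)
```

Adding a row to the list macro is enough to get a new typed function, which is why the reference documents the pattern once rather than each generated function.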

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR(NAME, TYPE)

Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] GlobalTensor on local device of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_DETAILED(NAME, TYPE)

Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Synchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR_DETAILED(NAME, TYPE)

Automatically generates aclshmem put signal functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Synchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] GlobalTensor on local device of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_SIZE_MEM_SIGNAL_DETAIL(BITS)

Automatically generates aclshmem put signal functions for different element widths in bits (e.g., 8, 16). The macro parameter BITS is the width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_putBITS_signal(void *dst, void *src, size_t nelems, int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE, then update a remote signal flag on completion.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • nelems - [in] Number of elements in the dest and source arrays.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.
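For the size-based variants, this reference gives only the BITS parameter and an element count. Assuming the usual OpenSHMEM-style reading, in which nelems counts elements that are each BITS wide, the number of bytes moved is nelems * BITS / 8. The host-side sketch below illustrates that arithmetic; demo_put_bits is a hypothetical name and the signal update is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-side model of a size-generic put: elements are BITS wide,
// regardless of their C++ type (an assumption of this sketch).
template <unsigned BITS>
void demo_put_bits(void *dst, const void *src, std::size_t nelems)
{
    static_assert(BITS != 0 && BITS % 8 == 0, "BITS must be a whole number of bytes");
    std::memcpy(dst, src, nelems * (BITS / 8));
}
```

Under this reading, demo_put_bits<16>(dst, src, 4) moves 8 bytes and demo_put_bits<8>(dst, src, 4) moves 4.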

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_NBI(NAME, TYPE)

Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR_NBI(NAME, TYPE)

Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] GlobalTensor on local device of the source data.

  • elem_size - [in] Number of elements in the dest and source arrays.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(__gm__ TYPE *dst, __gm__ TYPE *src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_TYPENAME_MEM_SIGNAL_TENSOR_DETAILED_NBI(NAME, TYPE)

Automatically generates aclshmem put signal nbi functions for different data types (e.g., float, int8_t). The macro parameters: NAME is the function name suffix, TYPE is the operation data type.

Remark

ACLSHMEM_DEVICE void aclshmem_NAME_put_signal_nbi(AscendC::GlobalTensor<TYPE> dst, AscendC::GlobalTensor<TYPE> src, const non_contiguous_copy_param &copy_params, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Asynchronous interface. Provide a high-performance way to copy non-contiguous data on local PE to symmetric address on the specified PE, then update sig_addr.

Parameters

  • dst - [in] GlobalTensor on Symmetric memory of the destination data.

  • src - [in] GlobalTensor on local device of the source data.

  • copy_params - [in] Params to describe how non-contiguous data is organized in src and dst.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

ACLSHMEM_PUT_SIZE_MEM_SIGNAL_DETAILED_NBI(BITS)

Automatically generates aclshmem put signal nbi functions for different element widths in bits (e.g., 8, 16). The macro parameter BITS is the width in bits.

Remark

ACLSHMEM_DEVICE void aclshmem_putBITS_signal_nbi(void *dst, void *src, size_t nelems, int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Function Description

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE, then update a remote signal flag on completion.

Parameters

  • dst - [in] Pointer on Symmetric memory of the destination data.

  • src - [in] Pointer on local device of the source data.

  • nelems - [in] Number of elements in the dest and source arrays.

  • sig_addr - [in] Symmetric address of the signal word to be updated.

  • signal - [in] The value used to update sig_addr.

  • sig_op - [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe - [in] PE number of the remote PE.

Functions

ACLSHMEM_DEVICE void aclshmem_putmem_signal (__gm__ void *dst, __gm__ void *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Synchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE then update sig_addr.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • elem_size – [in] Number of elements in the dest and source arrays.

  • sig_addr – [in] Symmetric address of the signal word to be updated.

  • signal – [in] The value used to update sig_addr.

  • sig_op – [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe – [in] PE number of the remote PE.

ACLSHMEM_DEVICE void aclshmem_putmem_signal_nbi (__gm__ void *dst, __gm__ void *src, size_t elem_size, __gm__ int32_t *sig_addr, int32_t signal, int sig_op, int pe)

Asynchronous interface. Copy contiguous data on local PE to symmetric address on the specified PE then update sig_addr.

Parameters:
  • dst – [in] Pointer on Symmetric memory of the destination data.

  • src – [in] Pointer on local device of the source data.

  • elem_size – [in] Number of elements in the dest and source arrays.

  • sig_addr – [in] Symmetric address of the signal word to be updated.

  • signal – [in] The value used to update sig_addr.

  • sig_op – [in] Operation used to update sig_addr with signal. Supported operations: ACLSHMEM_SIGNAL_SET/ACLSHMEM_SIGNAL_ADD

  • pe – [in] PE number of the remote PE.
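A typical use of the put-with-signal routines is arrival counting: several PEs write their slice into one target and each adds 1 to the target's signal word, so the target knows all data has landed once the signal reaches the sender count. The single-threaded host-side model below sketches that protocol. All sim_ names are hypothetical, sizes are in bytes for simplicity, and the waiting side is reduced to a plain comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum sim_sig_op { SIM_SIGNAL_SET, SIM_SIGNAL_ADD };

// Host-side model of one PE: its symmetric memory plus one signal word.
struct sim_pe {
    std::vector<unsigned char> heap;
    int32_t signal = 0;
};

// Models a put with signal: the payload lands first, then the signal is updated.
inline void sim_putmem_signal(sim_pe &target, std::size_t dst_off, const void *src,
                              std::size_t nbytes, int32_t signal, sim_sig_op op)
{
    assert(dst_off + nbytes <= target.heap.size());
    std::memcpy(target.heap.data() + dst_off, src, nbytes);
    if (op == SIM_SIGNAL_SET) {
        target.signal = signal;
    } else {
        target.signal += signal;
    }
}

// Receiver-side check: have `expected` senders delivered their slices?
inline bool sim_all_arrived(const sim_pe &pe, int32_t expected)
{
    return pe.signal >= expected;
}
```

In this model ACLSHMEM_SIGNAL_ADD corresponds to SIM_SIGNAL_ADD and suits many-to-one counting, while the SET form suits a single producer publishing a sequence number or resetting the flag.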

shmem_device_team.h

Functions

ACLSHMEM_DEVICE int aclshmem_my_pe (void)

Returns the PE number of the local PE.

Returns:

An integer between 0 and npes - 1, where npes is the total number of PEs.

ACLSHMEM_DEVICE int aclshmem_n_pes (void)

Returns the number of PEs running in the program.

Returns:

Number of PEs in the program.
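The two values returned by aclshmem_my_pe and aclshmem_n_pes are commonly combined to pick communication partners. As a small sketch, the helpers below compute ring neighbours from those two integers; ring_next and ring_prev are hypothetical names, and the arguments stand in for the values the device functions would return.

```cpp
#include <cassert>

// Next PE in a ring of n_pes PEs; wraps from the last PE back to 0.
inline int ring_next(int my_pe, int n_pes)
{
    assert(n_pes > 0 && my_pe >= 0 && my_pe < n_pes);
    return (my_pe + 1) % n_pes;
}

// Previous PE in the ring; wraps from 0 back to the last PE.
inline int ring_prev(int my_pe, int n_pes)
{
    assert(n_pes > 0 && my_pe >= 0 && my_pe < n_pes);
    return (my_pe + n_pes - 1) % n_pes;
}
```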

ACLSHMEM_DEVICE int aclshmem_team_my_pe (aclshmem_team_t team)

Returns the number of the calling PE in the specified team.

Parameters:

team – [in] A team handle.

Returns:

The number of the calling PE within the specified team. If the team handle is ACLSHMEM_TEAM_INVALID, returns -1.

ACLSHMEM_DEVICE int aclshmem_team_n_pes (aclshmem_team_t team)

Returns the number of PEs in the specified team.

Parameters:

team – [in] A team handle.

Returns:

The number of PEs in the specified team. If the team handle is ACLSHMEM_TEAM_INVALID, returns -1.

ACLSHMEM_DEVICE int aclshmem_team_translate_pe (aclshmem_team_t src_team, int src_pe, aclshmem_team_t dest_team)

Translate a given PE number in one team into the corresponding PE number in another team.

Parameters:
  • src_team – [in] An ACLSHMEM team handle.

  • src_pe – [in] The PE number in src_team.

  • dest_team – [in] An ACLSHMEM team handle.

Returns:

The PE number in dest_team that corresponds to src_pe in src_team. If a team handle is ACLSHMEM_TEAM_INVALID, returns -1.

ACLSHMEM_DEVICE int aclshmem_team_pe_mapping (aclshmem_team_t team, int pe)

Translate a given PE number in one team into the corresponding PE number in the global team.

Parameters:
  • team – [in] An ACLSHMEM team handle.

  • pe – [in] The PE number in team.

Returns:

The corresponding PE number in the global team. If the team handle is ACLSHMEM_TEAM_INVALID, returns -1.
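How teams are represented internally is not described here. Assuming OpenSHMEM-style strided teams (a start PE, a stride, and a size), the two translation routines above can be modelled on the host as follows: map the team-local number to a global PE number, then look that global number up in the destination team. The demo_ names are hypothetical, and returning -1 for a PE that is not a member of the destination team is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical strided team: members are start, start + stride, ...
struct demo_team {
    int start;   // global PE number of team member 0
    int stride;  // distance between consecutive members, >= 1
    int size;    // number of members
};

// Team-local PE number -> global PE number, or -1 if out of range.
inline int demo_team_pe_mapping(const demo_team &t, int pe)
{
    if (pe < 0 || pe >= t.size) {
        return -1;
    }
    return t.start + pe * t.stride;
}

// Global PE number -> team-local PE number, or -1 if not a member.
inline int demo_team_index_of(const demo_team &t, int global_pe)
{
    const int off = global_pe - t.start;
    if (t.stride <= 0 || off < 0 || off % t.stride != 0) {
        return -1;
    }
    const int idx = off / t.stride;
    return idx < t.size ? idx : -1;
}

// PE number in `src` -> PE number in `dest`, going through the global number.
inline int demo_team_translate_pe(const demo_team &src, int src_pe, const demo_team &dest)
{
    const int global_pe = demo_team_pe_mapping(src, src_pe);
    return global_pe < 0 ? -1 : demo_team_index_of(dest, global_pe);
}
```

With team A = {0, 2, 4} (global PEs 0, 2, 4, 6) and team B = {2, 4, 2} (global PEs 2, 6), PE 1 of A is PE 0 of B, and PE 2 of A (global PE 4) has no counterpart in B.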