pub unsafe extern "C" fn cuTensorMapEncodeTiled(
tensorMap: *mut CUtensorMap,
tensorDataType: CUtensorMapDataType,
tensorRank: cuuint32_t,
globalAddress: *mut c_void,
globalDim: *const cuuint64_t,
globalStrides: *const cuuint64_t,
boxDim: *const cuuint32_t,
elementStrides: *const cuuint32_t,
interleave: CUtensorMapInterleave,
swizzle: CUtensorMapSwizzle,
l2Promotion: CUtensorMapL2promotion,
oobFill: CUtensorMapFloatOOBfill,
) -> CUresult
\brief Create a tensor map descriptor object representing a tiled memory region
Creates a descriptor for a Tensor Memory Access (TMA) object as specified by the parameters describing a tiled region, and returns it in \p tensorMap.
Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value, and, as such, should only be accessed through CUDA API calls.
The parameters passed are bound to the following requirements:
- \p tensorMap address must be aligned to 64 bytes.
- \p tensorDataType has to be an enum from ::CUtensorMapDataType, which is defined as:
\code
typedef enum CUtensorMapDataType_enum {
    CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
    CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
    CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
    CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
    CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ     // 4 bytes
} CUtensorMapDataType;
\endcode
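The per-type sizes above drive the alignment and box-size requirements below. As a minimal sketch of the size mapping that the stride pseudocode further down refers to as elementSizeInBytes (the implementation here is illustrative, not part of the driver API; only the sizes come from the enum comments above):
\code
// Illustrative helper, not a driver API function: maps each
// ::CUtensorMapDataType value to its element size in bytes.
static size_t elementSizeInBytes(CUtensorMapDataType t) {
    switch (t) {
        case CU_TENSOR_MAP_DATA_TYPE_UINT8:        return 1;
        case CU_TENSOR_MAP_DATA_TYPE_UINT16:
        case CU_TENSOR_MAP_DATA_TYPE_FLOAT16:
        case CU_TENSOR_MAP_DATA_TYPE_BFLOAT16:     return 2;
        case CU_TENSOR_MAP_DATA_TYPE_UINT32:
        case CU_TENSOR_MAP_DATA_TYPE_INT32:
        case CU_TENSOR_MAP_DATA_TYPE_FLOAT32:
        case CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ:
        case CU_TENSOR_MAP_DATA_TYPE_TFLOAT32:
        case CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ: return 4;
        case CU_TENSOR_MAP_DATA_TYPE_UINT64:
        case CU_TENSOR_MAP_DATA_TYPE_INT64:
        case CU_TENSOR_MAP_DATA_TYPE_FLOAT64:      return 8;
        default:                                   return 0; // unknown type
    }
}
\endcode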
- \p tensorRank must be non-zero and less than or equal to the maximum supported dimensionality of 5. If \p interleave is not ::CU_TENSOR_MAP_INTERLEAVE_NONE, then \p tensorRank must additionally be greater than or equal to 3.
- \p globalAddress, which specifies the starting address of the memory region described, must be 32-byte aligned when \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_32B and 16-byte aligned otherwise.
- \p globalDim array, which specifies the tensor size along each of the \p tensorRank dimensions, must be non-zero and less than or equal to 2^32.
- \p globalStrides array, which specifies the tensor stride of each of the lower \p tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the stride must be a multiple of 32 when \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_32B. Each subsequent dimension's stride includes the previous dimension's stride:
\code
globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
for (i = 1; i < tensorRank - 1; i++) {
    globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
    assert(globalStrides[i] >= globalDim[i]);
}
\endcode
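For example (hypothetical shape, not from the driver docs): a 2D row-major FLOAT32 tensor with globalDim = {1024, 512} and no padding needs a single stride entry:
\code
// Hypothetical 2D FLOAT32 tensor, globalDim = {1024, 512}, no padding:
// byte stride between consecutive rows.
globalStrides[0] = 1024 * 4;   // = 4096 bytes, a multiple of 16 as required
\endcode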
- \p boxDim array, which specifies the number of elements to be traversed along each of the \p tensorRank dimensions, must be non-zero and less than or equal to 256. When \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_NONE, { \p boxDim[0] * elementSizeInBytes( \p tensorDataType ) } must be a multiple of 16 bytes.
- \p elementStrides array, which specifies the iteration step along each of the \p tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn't support a stride for dimension zero. When all elements of the \p elementStrides array are one, \p boxDim specifies the number of elements to load. However, if \p elementStrides[i] is not equal to one, then TMA loads ceil( \p boxDim[i] / \p elementStrides[i]) elements along the i-th dimension. To load N elements along the i-th dimension, \p boxDim[i] must be set to N * \p elementStrides[i].
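For instance (a hypothetical configuration, only to illustrate the formula above), loading every other element along some dimension i:
\code
// Hypothetical: load N = 64 elements with a step of 2 along dimension i.
elementStrides[i] = 2;
boxDim[i]         = 64 * 2;   // N * elementStrides[i] = 128 (<= 256, as required)
// TMA then loads ceil(128 / 2) = 64 elements along dimension i.
\endcode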
- \p interleave specifies the interleaved layout of type ::CUtensorMapInterleave, which is defined as:
\code
typedef enum CUtensorMapInterleave_enum {
    CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
    CU_TENSOR_MAP_INTERLEAVE_16B,
    CU_TENSOR_MAP_INTERLEAVE_32B
} CUtensorMapInterleave;
\endcode
TMA supports interleaved layouts like NC/8HWC8, where C8 utilizes 16 bytes in memory assuming 2 bytes per channel, or NC/16HWC16, where C16 uses 32 bytes. When \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_NONE and \p swizzle is not ::CU_TENSOR_MAP_SWIZZLE_NONE, the bounding box inner dimension (computed as \p boxDim[0] multiplied by the element size derived from \p tensorDataType) must be less than or equal to the swizzle size (see the sketch after the list below):
- CU_TENSOR_MAP_SWIZZLE_32B implies the bounding box inner dimension will be <= 32.
- CU_TENSOR_MAP_SWIZZLE_64B implies the bounding box inner dimension will be <= 64.
- CU_TENSOR_MAP_SWIZZLE_128B implies the bounding box inner dimension will be <= 128.
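A minimal validation sketch of this constraint, reusing the illustrative elementSizeInBytes helper from above; the function name and the swizzleSizeInBytes parameter (32, 64, or 128, per the list above) are hypothetical, not driver API:
\code
// Illustrative check, not a driver API function: the bounding box inner
// dimension must fit within the selected swizzle span.
static int innerDimFitsSwizzle(const cuuint32_t *boxDim,
                               CUtensorMapDataType dtype,
                               unsigned swizzleSizeInBytes /* 32, 64, or 128 */) {
    return boxDim[0] * elementSizeInBytes(dtype) <= swizzleSizeInBytes;
}
\endcode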
- \p swizzle, which specifies the shared memory bank swizzling pattern, has to be of type ::CUtensorMapSwizzle, which is defined as:
\code
typedef enum CUtensorMapSwizzle_enum {
    CU_TENSOR_MAP_SWIZZLE_NONE = 0,
    CU_TENSOR_MAP_SWIZZLE_32B,
    CU_TENSOR_MAP_SWIZZLE_64B,
    CU_TENSOR_MAP_SWIZZLE_128B
} CUtensorMapSwizzle;
\endcode
Data are organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This mismatch can cause bank conflicts when shared memory is accessed. To avoid this problem, data can be loaded into shared memory shuffled across shared memory banks. When \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_32B, \p swizzle must be ::CU_TENSOR_MAP_SWIZZLE_32B. Other interleave modes can use any swizzling pattern.
- \p l2Promotion specifies the L2 fetch size, which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type ::CUtensorMapL2promotion, which is defined as:
\code
typedef enum CUtensorMapL2promotion_enum {
    CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
    CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
    CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
    CU_TENSOR_MAP_L2_PROMOTION_L2_256B
} CUtensorMapL2promotion;
\endcode
- \p oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type ::CUtensorMapFloatOOBfill, which is defined as:
\code
typedef enum CUtensorMapFloatOOBfill_enum {
    CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
    CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
} CUtensorMapFloatOOBfill;
\endcode
Note that ::CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when \p tensorDataType represents a floating-point data type.
\param tensorMap - Tensor map object to create
\param tensorDataType - Tensor data type
\param tensorRank - Dimensionality of tensor
\param globalAddress - Starting address of memory region described by tensor
\param globalDim - Array containing tensor size (number of elements) along each of the \p tensorRank dimensions
\param globalStrides - Array containing stride size (in bytes) along each of the \p tensorRank - 1 dimensions
\param boxDim - Array containing traversal box size (number of elements) along each of the \p tensorRank dimensions; specifies how many elements to traverse along each tensor dimension
\param elementStrides - Array containing traversal stride in each of the \p tensorRank dimensions
\param interleave - Type of interleaved layout the tensor addresses
\param swizzle - Bank swizzling pattern inside shared memory
\param l2Promotion - L2 promotion size
\param oobFill - Indicates whether zero or a special NaN constant must be used to fill out-of-bound elements
\return ::CUDA_SUCCESS, ::CUDA_ERROR_DEINITIALIZED, ::CUDA_ERROR_NOT_INITIALIZED, ::CUDA_ERROR_INVALID_CONTEXT, ::CUDA_ERROR_INVALID_VALUE
\sa ::cuTensorMapEncodeIm2col, ::cuTensorMapReplaceAddress
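Putting the requirements together, below is a minimal end-to-end sketch in C against the driver API, assuming CUDA 12 or newer and a device of compute capability 9.0 or higher (required for tensor maps, per the description above). The tensor shape, tile size, and option choices are hypothetical, picked only to satisfy the constraints listed above; error handling is reduced to a single check for brevity:
\code
// Hedged usage sketch: encode a 64x64-element FLOAT32 tile over a
// hypothetical 512x1024 row-major tensor in device memory.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr gmem;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    const cuuint64_t height = 512, width = 1024;   // hypothetical shape
    cuMemAlloc(&gmem, height * width * sizeof(float));

    // Dimension 0 is the fastest-varying one; globalStrides has rank-1 entries.
    cuuint64_t globalDim[2]      = { width, height };
    cuuint64_t globalStrides[1]  = { width * sizeof(float) }; // multiple of 16
    cuuint32_t boxDim[2]         = { 64, 64 }; // 64 * 4 bytes is a multiple of 16
    cuuint32_t elementStrides[2] = { 1, 1 };

    CUtensorMap tensorMap;  // cuda.h declares this type with 64-byte alignment
    CUresult res = cuTensorMapEncodeTiled(
        &tensorMap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
        2,                      // tensorRank
        (void *)gmem,           // cuMemAlloc alignment satisfies the 16-byte rule
        globalDim,
        globalStrides,
        boxDim,
        elementStrides,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_NONE,
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    printf("cuTensorMapEncodeTiled: %s\n",
           res == CUDA_SUCCESS ? "CUDA_SUCCESS" : "error");

    cuMemFree(gmem);
    cuCtxDestroy(ctx);
    return res == CUDA_SUCCESS ? 0 : 1;
}
\endcode
The choices satisfy the rules above: globalDim entries are non-zero and below 2^32, the single stride (4096 bytes) is a multiple of 16 and below 2^40, each boxDim entry is at most 256 with boxDim[0] * 4 a multiple of 16 bytes, and with ::CU_TENSOR_MAP_SWIZZLE_NONE no inner-dimension swizzle bound applies.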